
Research Seminar 2010

This Research Seminar starts January 15, 2010. It takes place at 14:00 in Appelstr. 9a, 15th floor (unless stated otherwise).

January 15, 2010

organized by: NN

Speakers: Sergey, Kerstin D.

Personalized Social Search Based on the User's Social Network - Sergey

This work investigates personalized social search based on the user's social relations: search results are re-ranked according to their relations with individuals in the user's social network. We study the effectiveness of several social network types for personalization: (1) Familiarity-based network of people related to the user through explicit familiarity connections; (2) Similarity-based network of people “similar” to the user as reflected by their social activity; (3) Overall network that provides both relationship types. For comparison we also experiment with Topic-based personalization that is based on the user's related terms, aggregated from several social applications. We evaluate the contribution of the different personalization strategies by an offline study and by a user survey within our organization. In the offline study we apply bookmark-based evaluation, suggested recently, that exploits data gathered from a social bookmarking system to evaluate personalized retrieval. In the user survey we analyze the feedback of 240 employees exposed to the alternative personalization approaches. Our main results show that both in the offline study and in the user survey, social network based personalization significantly outperforms non-personalized social search. Additionally, as reflected by the user survey, all three SN-based strategies significantly outperform the Topic-based strategy.
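The core re-ranking step can be illustrated with a small sketch: blend each result's textual relevance with the connection strength between the searcher and the people related to that result. This is a hypothetical illustration, not the paper's implementation; the linear blend, the `alpha` weight, and the strength table are assumptions.

<code python>
# Hypothetical sketch of social re-ranking (not the paper's implementation).
def rerank(results, network_strength, alpha=0.5):
    """results: list of (doc_id, text_score, related_people);
    network_strength: person -> connection strength to the searcher."""
    def social(people):
        return sum(network_strength.get(p, 0.0) for p in people)

    scored = [(alpha * text + (1 - alpha) * social(people), doc)
              for doc, text, people in results]
    return [doc for _, doc in sorted(scored, reverse=True)]

# A result bookmarked by a close colleague moves up despite lower text score.
results = [("d1", 0.9, {"stranger"}), ("d2", 0.7, {"colleague"})]
print(rerank(results, {"colleague": 0.8}))  # ['d2', 'd1']
</code>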

The M-Eco Project: Personalized Event-Based Surveillance - Kerstin D.

In this talk, the M-Eco project, which has just started, will be introduced. M-Eco deals with event extraction for epidemic intelligence. Public health officials are faced with new challenges for outbreak alert and response due to the continuous emergence of infectious diseases and contributing factors such as demographic change or globalization. Only the early detection of disease activity, followed by a rapid response, can reduce the impact of epidemics. However, the time it takes for information to propagate through traditional channels can undermine time-sensitive response strategies. Faced with these limitations, the M-Eco project will help to complement traditional systems with additional approaches for the early detection of emerging threats.

January 22, 2010

Web Cleansing

organized by: NN

Speakers: Christian

Boilerplate Detection using Shallow Text Features - Christian

In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text is typically not related to the main content, may deteriorate search precision, and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straightforward heuristics, achieving a remarkable detection accuracy.

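To make "shallow text features" concrete: such a classifier can decide per text block using only quantities like word count and link density (the fraction of words inside anchor tags). The rule and thresholds below are invented for illustration; the paper learns its classifier from data.

<code python>
# Hypothetical shallow-feature boilerplate rule (thresholds invented).
def link_density(text, num_link_words):
    words = text.split()
    return num_link_words / len(words) if words else 1.0

def is_main_content(text, num_link_words):
    # Long blocks with little anchor text tend to be content;
    # short, link-heavy blocks tend to be navigation, templates, or ads.
    return len(text.split()) > 15 and link_density(text, num_link_words) < 0.3

print(is_main_content("Home | About | Contact", 3))                   # False
print(is_main_content("The actual article text goes on... " * 5, 0))  # True
</code>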

January 29, 2010

Latent Topic Models

organized by: Peter

Speakers: Marko, Ralf

Blog Analysis with Latent Dirichlet Allocation - A Study on the Quality of Topic Detection - Marko

(Master's thesis talk; Slides)

In my MSc thesis, a modified version of Latent Dirichlet Allocation (LDA) is applied to assign topics at the sentence level. The quality of the sentence-level LDA is comprehensively assessed. The experiments in this work are performed on three different real-world datasets (blog posts, comments on blog posts, product reviews). The evaluation of the results shows that sentence-level topics can be successfully obtained and that the quality of the derived topics is good in terms of assignment to sentences and topic characteristics.

Latent Dirichlet Allocation for Tag Recommendation - Ralf

Tagging systems have become major infrastructures on the Web. They allow users to create tags that annotate and categorize content and to share them with other users, which is particularly helpful for searching multimedia content. However, as tagging is not constrained by a controlled vocabulary and annotation guidelines, tags tend to be noisy and sparse. Especially new resources annotated by only a few users often have rather idiosyncratic tags that do not reflect a common perspective useful for search. In this paper we introduce an approach based on Latent Dirichlet Allocation (LDA) for recommending tags of resources in order to improve search. Resources annotated by many users, and thus equipped with a fairly stable and complete tag set, are used to elicit latent topics to which new resources with only a few tags are mapped. Based on this, other tags belonging to a topic can be recommended for the new resource. Our evaluation shows that the approach achieves significantly better precision and recall than the use of association rules, suggested in previous work, and also recommends more specific tags. Moreover, extending resources with these recommended tags significantly improves search for new resources.
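A minimal sketch of the general approach, treating each resource's tag set as a document for LDA (using scikit-learn is our assumption; the paper's implementation and data differ): fit topics on well-tagged resources, infer the topic mixture of a sparsely tagged resource, and recommend high-probability tags from its dominant topics.

<code python>
# Illustrative LDA-based tag recommendation (not the authors' code).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Well-annotated resources: one line of tags per resource.
corpus = [
    "python programming tutorial code",
    "python scripting code software",
    "travel beach holiday photos",
    "travel mountains hiking photos",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# New resource with a single tag: map it to the latent topics,
# then rank all tags by their weight under that topic mixture.
topic_mix = lda.transform(vec.transform(["python"]))[0]
tag_scores = topic_mix @ lda.components_
tags = np.array(vec.get_feature_names_out())
recommended = [t for t in tags[np.argsort(-tag_scores)] if t != "python"]
print(recommended[:3])  # e.g. ['code', 'programming', 'tutorial']
</code>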


February 5

Misc

organized by: NN

Speakers: Fan, Mohammad

Efficient Algorithms Mining Topic Diversities and Application on Computer Science Papers - Fan

Motivated by the application of discovering the topic diversity of papers in computer science, we propose a general diversity index: the average similarity of all object pairs, to quantify the diversity of a set of objects with respect to certain aspects. We focus on Jaccard similarity and term sets such as paper titles and keywords in this paper. We argue that the index is a simple and reasonable statistic for measuring diversity. However, it is computationally expensive to obtain the index value even for medium-sized data sets due to the quadratic time complexity of all-pair similarity comparisons. Accordingly, we study two fast approximation techniques, Random-Sampling and TrackDJ, both of which have guaranteed accuracy. We test the algorithms on real and synthetic data sets and verify our theoretical results. Experiments on computer science papers indicate that the topic diversity of computer science conference papers tends to increase over time. Also, our results show that curiosity-driven conferences have higher topic diversities than application-driven conferences.
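The index itself is just the mean Jaccard similarity over all pairs; writing it down directly makes the quadratic cost that motivates the two approximation techniques obvious. A minimal sketch:

<code python>
# Exact diversity index: average pairwise Jaccard similarity of term sets.
# O(n^2) comparisons -- the cost Random-Sampling and TrackDJ approximate away.
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def avg_pairwise_similarity(term_sets):
    pairs = list(combinations(term_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

titles = [{"mining", "graphs"}, {"mining", "text"}, {"ranking", "web"}]
print(avg_pairwise_similarity(titles))  # lower similarity = higher diversity
</code>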

Selecting Skyline Services for QoS-based Web Service Composition - Mohammad

Here is the abstract of the paper that I will present in this talk:

“Web service composition enables seamless and dynamic integration of business applications on the web. The performance of the composed application is determined by the performance of the involved web services. Therefore, non-functional, quality of service (QoS) aspects (e.g. response time, availability, etc.) are crucial for selecting the web services to take part in the composition. The problem of identifying the best candidate web services from a set of functionally-equivalent services is a multi-criteria decision making problem. The selected services should optimize the overall QoS of the composed application, while satisfying all the constraints specified by the client on individual QoS parameters. In this paper, we propose an approach based on the notion of skyline to effectively and efficiently select services for composition, reducing the number of candidate services to be considered. We also discuss how a provider can improve its service to become more competitive and increase its potential of being included in composite applications. We evaluate our approach experimentally using both real and synthetically generated datasets.”
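The skyline mentioned in the abstract is the set of Pareto-optimal candidates: a service is pruned if some other service is at least as good in every QoS dimension and strictly better in at least one. A naive sketch, assuming all dimensions are normalized so that smaller is better:

<code python>
# Naive skyline over QoS vectors (illustrative; assumes smaller = better,
# e.g. response time and price; availability would be inverted first).
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(services):
    return [s for s in services
            if not any(dominates(t, s) for t in services if t is not s)]

# (response_time_ms, price): the third service is dominated by the second.
print(skyline([(120, 0.05), (200, 0.01), (210, 0.02)]))
# [(120, 0.05), (200, 0.01)]
</code>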

February 12

Entity Search

organized by: NN

Speakers: Gianluca, Tereza

Towards Estimation of Public Opinions from the Blogosphere - Gianluca

In recent years, the blogosphere has become a vital part of the Web, covering a variety of different points of view and opinions on political and event-related topics such as immigration, election campaigns, or economic developments. Tracking public opinion is usually done by conducting surveys, resulting in significant costs both for interviewers and persons consulted. In this paper, we propose a method for automatically extracting public opinion from the blogosphere. To this end, we apply sentiment analysis techniques and forecasting models for time series, in combination with aggregation methods, on blog data to estimate the temporal development of opinions on polarizing topics. We consider both supervised scenarios where a limited amount of “traditional” poll results is given and purely data-driven, unsupervised settings. Our experiments on the TREC Blog 2009 data collection, focusing on the US election campaign in 2008 and using professional opinion polls as ground truth, indicate that trends for opinions can be estimated from the blogosphere.

February 19

Event Detection

organized by: NN

Speakers: Mihai

Event Detection - Mihai

This talk gives a short survey of existing approaches to event detection. Starting from the TDT initiative in '98 and clustering documents with a threshold, going through the detection of bursty features and the identification of events associated with them, introduced in 2002 by Kleinberg, and finishing with a few present-day approaches that apply news-stream event detection techniques to social media, we will try to get the big picture of state-of-the-art event detection techniques.

Slides: georgescu_event_detection_survey.pptx

May 7

organized by: Ralf Krestel

Speakers: Alex Wall, Juri de Coi

Analysis of Political Leanings in Newspaper Articles (original title: “Analyse von politischen Neigungen in Zeitungsartikeln”) - Alex Wall

This Bachelor's thesis investigates political orientations in online newspaper articles using computational methods. Two different methods are employed for this task: one based on computing cosine similarity, the other on support vector machines. It is shown that these methods provide a relatively simple way of pre-selecting large amounts of data, for example in order to carry out further semantic analyses.

(Bachelor's thesis talk; slides will be in English, talk in German)

Security control brought back to the user - Juri de Coi

Policy languages have lately emerged as a means to formally define (among other things) access control policies, security policies, and business rules. The expressiveness of the policy languages proposed by the scientific community has increased over time, which came along with a reduction in their user-friendliness for common users. This talk will introduce

  • the policy language Protune, designed in order to meet the requirements up-to-date policy languages have to fulfill
  • the controlled natural language front-end of Protune, designed in order to increase its user-friendliness for common users
  • applications of Protune to real-world scenarios (namely, the protection of metadata stores and RDF repositories) in order to test its feasibility

(PhD defense rehearsal, 40 min.)

*Wednesday*, May 12

Due to 'Himmelfahrt' (Ascension Day) on Thursday, this research seminar takes place on Wednesday instead of Friday.

organized by: Jana Westendorff

time: 4pm

Speaker: Vinay Setty

Efficiently Identifying Interesting Time-points in Web Archives

Large-scale text archives are increasingly becoming available on the Web. Exploring their evolving contents along both the text and temporal dimensions enables us to realize their full potential. Standard keyword queries facilitate exploration along the text dimension only. Recently proposed “time-travel keyword queries” enable query processing along both dimensions, but require the user to be aware of the exact time point of interest. This may be impractical if the user does not know the history of the query within the collection or is not familiar with the topic.

In this work, our aim is to efficiently identify interesting time points in Web archives, with the assumption that we receive a result list for a given query in standard relevance order from an existing retrieval system. We consider two forms of Web archives: (i) archives where documents have a publication time-stamp and never change (such as news archives), and (ii) archives where documents undergo revisions and are thus versioned. In both settings, we define interestingness as the change in the top-k result set between two consecutive time points. The key step in our solution is the maintenance of the top-k results valid at all time points, which can then be used to compute the interestingness scores for the time points. We propose two techniques to realize efficient identification of interesting time points: for the case where documents once published never change, we have a simple but effective technique. For the more general case with versioned documents, we develop an extension to the segment tree which makes it rank-aware and dynamic. For further improvement in efficiency, we propose an early termination technique which is proven to be very effective. Our methods are shown to be effective in efficiently finding interesting time points in a set of experiments using the New York Times news archive and the Wikipedia versioned archive.
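The interestingness notion from the abstract, the change between the top-k result sets of two consecutive time points, can be written down directly. Instantiating "change" as the Jaccard distance between the two sets is our assumption for illustration; the paper's exact measure may differ.

<code python>
# Illustrative interestingness: set change between consecutive top-k results.
def interestingness(topk_prev, topk_curr):
    union = topk_prev | topk_curr
    if not union:
        return 0.0
    return 1.0 - len(topk_prev & topk_curr) / len(union)

# A large reshuffle of the top-k marks an interesting time point.
print(interestingness({"d1", "d2", "d3"}, {"d1", "d2", "d3"}))  # 0.0
print(interestingness({"d1", "d2", "d3"}, {"d3", "d7", "d8"}))  # 0.8
</code>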

*Tuesday*, May 18

organized by: Jana Westendorff

time: 4pm

Speaker: Chide Groenouwe, VU University Amsterdam, Web & Media group

The SWiFT way to human fluency in Semantic Web writing - Chide Groenouwe

The Semantic Web is still far from realising its full potential, in part because it lacks sufficient high-quality Semantic Web representations of information. Therefore, in my research I focus on fostering people's fluency in creating such representations. For this purpose I designed and applied the online game SWiFT, in which teams compete in fluency. The teams are supported by a constitution: a comprehensive set of guidelines and a tool. In this presentation I will share, among other things: (1) the current state of affairs regarding this capability; (2) how to improve the current constitution; and (3) how to improve the game design to facilitate further evolution of the constitution.

May 21

Entity Resolution for the Web of Data

organized by: Claudia (Slides)

Speakers: Katerina, George

Blocking meets Large Heterogeneous Information Spaces - George (Slides)

Entity Resolution constitutes a challenging problem that lies at the core of information integration research. A plethora of approaches has been proposed for addressing this problem, many of which employ blocking techniques to reduce the required number of expensive pair-wise comparisons. Though quite efficient, the existing blocking approaches are inapplicable to the wider area of the Web of Data, which encompasses constantly growing, noisy data sets without strong schema binding.

In this paper, we present an approach to entity resolution within such large, noisy, and heterogeneous information spaces. In particular, we develop a novel blocking technique based on a minimal set of assumptions, namely that duplicates have at least one value in common, independently of the associated attribute names. Combining this principle with effective scheduling, propagation, and pruning strategies, the approach becomes robust with respect to noise and heterogeneity, while limiting the number of required pair-wise comparisons and making the method scalable. Our experimental evaluation with large real-world data sets verifies the advantages of the approach.
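The stated principle (duplicates share at least one value, independently of attribute names) leads directly to attribute-agnostic token blocking: place every entity in one block per token occurring in any of its values, and compare only within blocks. A minimal sketch under that reading, without the paper's scheduling, propagation, and pruning strategies:

<code python>
# Minimal attribute-agnostic token blocking (illustrative sketch only).
from collections import defaultdict

def token_blocks(entities):
    """entities: entity_id -> {attribute: value string}; names are ignored."""
    blocks = defaultdict(set)
    for eid, attrs in entities.items():
        for value in attrs.values():
            for token in value.lower().replace(",", " ").split():
                blocks[token].add(eid)
    # Only blocks with 2+ entities produce candidate comparisons.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

entities = {
    "e1": {"name": "John Smith", "city": "Hannover"},
    "e2": {"fullName": "Smith, John"},   # different schema, same person
    "e3": {"label": "Mary Jones"},
}
print(token_blocks(entities))  # e.g. {'john': {'e1','e2'}, 'smith': {'e1','e2'}}
</code>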

Entity-Aware Query Processing On-the-Fly in the Presence of Linkage - Katerina (Slides)

Central to almost every data integration or data cleaning scenario is entity linkage. Traditional entity linkage techniques use some computed similarity among data structures to perform merges and then answer queries on the merged data. Probabilistic databases, on the other hand, incorporate uncertainty into the data and return probabilistic answers.

We describe a novel framework for entity linkage with uncertainty. Instead of using the linkage information to merge structures a priori, possible linkages are stored alongside the data with their belief value. We use a new probabilistic query answering technique to take the probabilistic linkage into consideration. The framework has a number of advantages: (i) data merges are done at run time, so that they depend not only on the linkages but also on the query; (ii) the returned results may contain structures that were not explicitly represented in the data, generated as a result of some reasoning on the linkages; (iii) query condition evaluation spans linked structures, offering functionality that cannot be simulated by traditional probabilistic databases. We formally define the semantics of our techniques, describe their efficient implementation, and report on the findings of their experimental evaluation.

June 11

organized by: Thomas

Speakers: Nina, Thomas Theuerkauf, Kai Niklas

Using Word Sense Discrimination on Historic Document Collections - Nina

Word sense discrimination is the first, important step towards automatic detection of language evolution within large historic document collections. By comparing the word senses found over time, we can reveal and use important information that will improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed with today's language in mind and have thus been evaluated on well-selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because accessibility of digitized historic collections is also influenced by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper articles from 1785 to 1985.

Discovery of descriptive words for sets of semantically related nouns - Thomas

Word sense discrimination algorithms extract clusters of nouns from text, where each cluster represents a single word sense or semantic meaning. However, these algorithms fall short when it comes to labeling their clusters with descriptive words. For clusters containing nouns of the same semantic class, e.g. {banana, apple, orange}, a descriptive word is the name of this class, e.g. “fruit”. Knowing these descriptive words can help to improve search or the automatic detection of language evolution. Previous approaches to labeling such clusters use either external information sources such as WordNet or sophisticated grammatical parsers to gather grammatical relationships from text. Neither approach is well suited for clusters extracted from historic document collections. In this presentation an alternative approach is proposed. It uses semantic relations, in particular hyponymy relations (“banana IS-A fruit”), to build a so-called relation graph for each cluster. One way to extract such relations with lexico-syntactic patterns is presented. Afterwards, the procedure for building the relation graph and using it to discover descriptive words is described. Finally, the results of a manual evaluation of the discovered descriptive words will be presented.
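A common way to extract such IS-A relations is via lexico-syntactic patterns in the style of Hearst (1992), e.g. "<hypernym> such as <hyponym>". A small regex sketch follows; the thesis' actual pattern set and preprocessing are likely richer.

<code python>
# Tiny Hearst-style hyponymy extractor (illustrative; real extractors
# use more patterns plus noun-phrase chunking).
import re

SUCH_AS = re.compile(r"(\w+) such as (\w+)")   # "fruit such as banana"
IS_A    = re.compile(r"(\w+) is a (\w+)")      # "banana is a fruit"

def extract_isa(text):
    text = text.lower()
    rels = [(hypo, hyper) for hyper, hypo in SUCH_AS.findall(text)]
    rels += [(hypo, hyper) for hypo, hyper in IS_A.findall(text)]
    return rels

print(extract_isa("Fruit such as banana is cheap. A banana is a fruit."))
# [('banana', 'fruit'), ('banana', 'fruit')]
</code>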

Unsupervised Post-Correction of OCR Errors - Kai

The trend to digitize (historic) paper-based archives has emerged in recent years. The advantages of digital archives are easy access, searchability, and machine readability. These advantages can only be ensured if few or no OCR errors are present. These errors are the result of misrecognized characters during the OCR process. Large archives make it unreasonable to correct errors manually. Therefore, an unsupervised, fully automatic approach for correcting OCR errors is proposed. The approach combines several methods for retrieving the best correction proposal for a misspelled word: a general spelling correction (Anagram Hash), a new OCR-adapted method based on the shape of characters (OCR-Key), and context information (bigrams). A manual evaluation of the approach has been performed on The Times Archive.
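As a rough illustration of the combination step: each component proposes scored correction candidates per token, and the proposals are merged into one ranked list. The component names mirror the abstract, but the scores, weights, and merge rule below are invented placeholders, not the thesis' actual scoring.

<code python>
# Illustrative merge of correction proposals (weights/scores invented).
from collections import defaultdict

def best_correction(token, proposals, weights):
    """proposals: component -> list of (candidate, score)."""
    combined = defaultdict(float)
    for component, cands in proposals.items():
        for candidate, score in cands:
            combined[candidate] += weights[component] * score
    return max(combined, key=combined.get, default=token)

proposals = {
    "anagram_hash": [("example", 0.9), ("exempla", 0.4)],
    "ocr_key":      [("example", 0.8)],  # character-shape evidence, 'rn'~'m'
    "bigram":       [("example", 0.7)],  # agreement with surrounding context
}
weights = {"anagram_hash": 0.4, "ocr_key": 0.4, "bigram": 0.2}
print(best_correction("exarnple", proposals, weights))  # 'example'
</code>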

June 18, **11am**

organized by: NN

Speakers: Ivana Marenzi

Who are you working with? - Visualizing TEL Research Communities - Ivana

Author Co-Citation Analysis (ACA) provides a principled way of analyzing research communities, based on how often authors are cited together in scientific publications. In this paper, we present preliminary results based on ACA to analyze and visualize research communities in the area of technology-enhanced learning (TEL), focusing on publicly available citation and conference information provided through CiteseerX and DBLP. We describe our approach to collecting, organizing, and analyzing appropriate data, as well as the problems which have to be solved in this process. We also provide a thorough interpretation of the TEL research clusters obtained, which provides insights into these research communities. The results are promising and show the method's potential for mapping and visualizing TEL research communities, making researchers aware of the different communities relevant for technology-enhanced learning, and thus better able to bridge communities wherever needed.
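The counting step behind ACA is simple: two authors are co-cited whenever one paper cites works by both, and the resulting counts feed a similarity matrix for clustering and visualization. A sketch of that step (illustrative, not the paper's pipeline):

<code python>
# Illustrative co-citation counting for author co-citation analysis.
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers):
    """citing_papers: list of sets of cited author names."""
    counts = Counter()
    for cited in citing_papers:
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

papers = [{"Duval", "Koper", "Nejdl"}, {"Koper", "Nejdl"}, {"Duval", "Koper"}]
print(cocitation_counts(papers))
# Counter({('Duval', 'Koper'): 2, ('Koper', 'Nejdl'): 2, ('Duval', 'Nejdl'): 1})
</code>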

*Tuesday*, July 6, 14:00

organized by: NN

Speakers: Christine Preisach

Semi-Supervised Relational Classification using Multiple Relations - Christine Preisach

Nowadays, new technologies make it possible to store vast amounts of data, and hence the amount of collected data grows steadily. This trend is positive on the one hand, since more knowledge can be extracted from the data, but on the other hand it may lead to information overload on the side of the user. This means we need to provide the user with facilities that help organizing the collected data. For this purpose, supervised classification algorithms are usually applied, under the assumption that data instances are independent and identically distributed (iid). Thus, only inherent attributes of the instance itself are taken into account. Using standard supervised classification methods may lead to less accurate results because of the following three issues: first, the iid assumption may not always hold, i.e., often relations and dependencies among data instances exist but are ignored; second, if relations are taken into account, often only one is considered even if multiple exist; and third, the labeled data required for supervised classification is scarce and costly to obtain. Examples of data where the iid assumption does not hold are web pages connected by hyperlinks and scientific publications, which are related by common authors, venue, or citations.

Apart from text documents, relations can also be observed in other domains like social tagging systems, where users are related to each other by sharing the same resources. We also consider situations where a relation is not explicitly given; in these cases a relation can be constructed using similarities, for example in the medical domain, where patients could be connected to each other if they have similar measurements (time series of blood pressure, heart rate, etc.).

In each of the mentioned domains, labeled data is scarce while the cost of expert annotation is high, and multiple relations among data instances exist; thus we address all three issues in this thesis. We propose and analyze several semi-supervised graph-based relational algorithms using multiple relations. We investigate their benefits in different domains and show that, independent of the type of data or the area of application, semi-supervised graph-based relational methods exploiting multiple relations are highly predictive and mostly outperform state-of-the-art algorithms.

*Tuesday*, July 13, 14:00

Speakers: Kaweh Djafari Naini

Complexity results of Description Logic (DL)

In this presentation we describe some basics of DL and present two web applications. One of the applications is called DL_Navigator, in which one can choose a DL and see its complexity results, properties, and references. For the database management of the DL_Navigator, we will see another web application, into which the most important results for DLs can be entered.

Friday, September 24, 14:00

Speaker: Fan Deng

GroupMail: An Anonymous Mailing List System with Personal Filters

A mailing list is very useful for communication among a group of people, e.g. at L3S. But in some cases, people may hesitate to use (or frequently use) a mailing list to communicate with other group members. For example, sometimes people ask questions on a mailing list, but asking questions too often can be perceived as abusing resources or bothering others.

GroupMail is an anonymous mailing list system. Being anonymous may encourage the usage of mailing lists. However, it may also bring some “side effects”, e.g. the appearance of annoying mails or large mail volumes at peak times. The system provides functionalities for users and group managers to control the quality and quantity of mails appearing in mailboxes. For example, users can rate (like/dislike) every message they get; if some people often send annoying messages, their mails will have a higher chance of being filtered. Group managers can also control messages, e.g. by kicking “bad users” out, based only on what they say rather than who they are.

This system is related to but different from existing platforms such as Google Groups, Yahoo Groups, forum service websites and software, Yahoo Answers and so on.

**Wednesday**, October 13, **16:00**

Speaker: Mohamed Yahya

RDRD: Accelerating Rule-Based Query Processing in Disk-Resident RDF Knowledge Bases - Mohamed Yahya

Collections of tens of millions of automatically extracted facts represented using the subject-predicate-object RDF model are available for several domains. As big as these collections are, they are unable to capture all information about a domain, simply because the sources from which they were extracted are incomplete. This can be tackled by creating knowledge bases where facts are enriched with rules showing how new facts can be generated from existing ones, and with constraints which must hold in the relevant domain. Querying such knowledge bases is expensive for two main reasons. First, data is disk-resident, which makes access to it slow. Second, rule definitions can be recursive, which requires special query evaluation techniques and renders traditional cost-based query optimization and join-ordering techniques less effective.

This talk presents the implementation of a query processor for such a setting. We show how we integrated the RDF-3X RDF query engine into our query processor. We also present optimizations along several dimensions: (i) query evaluation techniques, (ii) caching to reduce both disk access and rule evaluation, (iii) a classification of predicates which allows better utilization of the underlying storage engine's ability to optimize traditional relational queries and (iv) a probabilistic way of looking at join ordering and cost estimation in this context.

Friday, October 15, 14:00

Speakers: Claudia Orellana, Mihai Georgescu

Journalism 2.0: A Personalized News and Social Media Monitor - Claudia Orellana

Slides in PDF [1.2 MB]

Nowadays, information is produced communally, breaking classic paradigms where news production was concentrated in a few sources. This information explosion imposes a challenge on users who want to be informed and form their own opinion with respect to the reported events. On the one hand, it is very difficult and time-consuming to explore all the options by manually grouping similar items in order to compare them and obtain a neutral point of view regarding an event. On the other hand, information is highly dynamic but necessary for optimal decision making. In this talk, Claudia Orellana will present her Bachelor's thesis, “A Personalized News and Social Media Monitor”. The implemented monitor collects news articles from user-defined sources and processes them, automatically identifying topics, entities, relevant tags, and “feelings”, to finally present different graphical results to the users in a way that lets them easily visualize changes over time, and filter or search for what is interesting to them. The design, architecture, and concrete implementation will be discussed and demonstrated in the presentation, as well as future extensions.

Bringing Order to Your Photos: Event-Driven Classification of Flickr Images Based on Social Knowledge - Mihai Georgescu

With the rapidly increasing popularity of social media sites, a lot of user-generated content has been injected into the Web, resulting in a large amount of both multimedia items (music, pictures, videos) and textual data (tags and other text-based documents). As a consequence, especially for multimedia content, it has become more and more difficult to find exactly the objects that best match the users' information needs. The methods we propose in this paper try to alleviate this problem; we focus on the domain of pictures, in particular on a subset of Flickr data. Many of the photos posted by users on Flickr have been shot during events, and our methods aim to allow browsing and organization of picture collections in a natural way, by events. The algorithms we introduce in this paper exploit the social information produced by users in the form of tags, titles, and photo descriptions for classifying pictures into different event categories. Extensive automated experiments demonstrate that our approach is very effective and opens new possibilities for multimedia retrieval, in particular image search. Moreover, the direct comparison with previous event detection algorithms confirms once more the quality of our methods.

Oct 29

organized by:

Speakers:

Topic(s)

Nov 5

organized by:

Speakers: Katerina

Data Cleaning for data integration (45 mins)

Overview of literature relevant to Entity Resolution, i.e., the task of identifying and merging data that refer to or describe the same real-world object, such as a location, a person, or a conference. Existing approaches are presented and discussed, grouped into four categories: atomic similarity methods for comparing strings, similarity methods for sets of strings, methods facilitating inner-relationships, and methods related to uncertain data management.

Nov 12

organized by:

Speakers: Ivana Marenzi

Topic(s) Collaborative Web - Google Wave

Overview and discussion of the collaboration platform Google Wave (www.googlewave.com)

Since I'm preparing a lecture on this topic for the WebScience course in two weeks, the idea is to involve the participants in a preliminary discussion on current communication and collaboration tools, describe the most relevant functionalities of Google Wave, and give possible examples. The final goal of the WebScience course lecture will be to collect students' ideas about new scenarios in which Google Wave could be useful.

Nov 19

organized by: Ernesto Diaz-Aviles

Speaker: Zeno Gantner

Learning Attribute-to-Feature Mappings for Cold-Start Recommendations

Cold-start scenarios in recommender systems are situations in which no prior events, like ratings or clicks, are known for certain users or items. To compute predictions in such cases, additional information about users (user attributes, e.g. gender, age, geographical location, occupation) and items (item attributes, e.g. genres, product categories, keywords) must be used. We describe a method that maps such entity (e.g. user or item) attributes to the latent features of a matrix (or higher-dimensional) factorization model. With such mappings, the factors of an MF model trained by standard techniques can be applied to the new-user and the new-item problem, while retaining its advantages, in particular speed and predictive accuracy. We use the mapping concept to construct an attribute-aware matrix factorization model for item recommendation from implicit, positive-only feedback. Experiments on the new-item problem show that this approach provides good predictive accuracy, while the prediction time only grows by a constant factor.
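The mapping idea can be sketched compactly: after factorizing the interaction matrix, learn a regression from item attributes to the learned item factors; a brand-new item's factors are then predicted from its attributes alone. The sketch below uses random matrices as stand-ins for trained factors and a ridge regression as the mapping, both assumptions for illustration, not the paper's actual setup.

<code python>
# Toy sketch of attribute-to-feature mapping for new-item cold start.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_users, n_items, k = 50, 40, 4
U = rng.normal(size=(n_users, k))          # stand-in for trained user factors
V = rng.normal(size=(n_items, k))          # stand-in for trained item factors
A = rng.integers(0, 2, size=(n_items, 6))  # binary item attributes (genres)

# Learn a linear map: attributes -> latent item factors.
mapper = Ridge(alpha=1.0).fit(A, V)

# New item with no interactions: estimate its factors from attributes only.
a_new = rng.integers(0, 2, size=(1, 6))
v_new = mapper.predict(a_new)[0]

scores = U @ v_new                         # predicted affinity per user
print("users most likely to like the new item:", np.argsort(-scores)[:5])
</code>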


MyMediaLite: a recommender system algorithm library

MyMediaLite is a lightweight, multi-purpose library of recommender system algorithms. It addresses the two most common scenarios in collaborative filtering: rating prediction (e.g. on a scale of 1 to 5 stars) and item prediction from implicit feedback (e.g. from clicks or purchase actions).

The library is open source/free software, distributed under the terms of the GNU General Public License.

Nov 26

organized by: Eelco Herder

Speakers: Ricardo Kawase, George Papadakis

The Art of Multi-faceted Tagging (Ricardo)

TagMe! is a social tagging front-end for Flickr images that provides multi-faceted tagging functionality: it enables users to attach tag assignments to a specific area within an image and to categorize tag assignments. Moreover, TagMe! maps tags and categories to DBpedia URIs to clearly define the meaning of freely chosen words. The experiments reveal the benefits of these additional tagging facets. For example, the exploitation of the facets significantly improves the performance of FolkRank-based search. Further, we demonstrate the benefits of TagMe! tagging facets for learning semantics within folksonomies.

Incorporating Context Into Real-Time Prediction of Revisitation (George)

Users frequently return to Web pages they have visited in the past for various reasons. Apart from backtracking, they revisit a number of favorite or important pages that they monitor, as well as pages that pertain to tasks reoccurring on an infrequent basis. In this paper, we introduce a collection of methods that effectively facilitate revisitation by predicting the next page request based on the contextual information they incorporate. Unlike existing approaches, our methods are real-time, since they do not require any training or configuration of machine learning algorithms. We evaluate them over a large, real-world dataset, and the outcomes suggest a significant improvement over established prediction methods that do not take context into account.

**Monday**, Nov 29, **17:00**

organized by: Gideon Zenz

Speaker: Daniel Wichert

User interface for interactive and iterative search in structured data

Nowadays more and more information is stored in huge databases or other structured data formats like ontologies in OWL. To get the right information from a database it is necessary to know a specific query language like SQL. Most users are not familiar with these languages and the systems behind them. The QUICK system developed at L3S starts a search process with a common Google-like search query and then finds the right information in an iterative way. However, the current version of QUICK has only a simple and rather inflexible user interface.

The motivation of my master's thesis was to develop a new, improved user interface for the successor system, which supports different user search strategies through the most expedient design of the user interface components. To this end, I evaluated the components with a framework by Max Wilson. This evaluation framework reviews the support of user tactics and search strategies by counting the steps that are necessary to reach the user's aim. Based on this framework I optimised the search interface and compared different approaches. The result is one approach that is similar to faceted browsing and a second one that uses a 2D graph representation. In my talk I will present these results and demonstrate a prototype that implements both solutions, with the possibility to switch between different representations of the current search iteration.

**Wednesday**, Dec 1, **16:00**

Speaker: Julia Preusse

Analysis of the WebUni Online Student Community

Nowadays, online social networks present a huge opportunity to gather information about topics such as communication patterns, the structure of social networks, and the flow of information. Despite the popularity of large-scale online social networks, smaller local platforms such as WebUni Magdeburg maintain their attractiveness. We believe that this is the first study to examine the complete data of a smaller social network that has existed for more than seven years.

In our study, we show that WebUni is a scale-free small-world network, based on an analysis of the social network graph and the guestbook network. Surprisingly, we find that the rating network, which is based on users' hidden ratings, is also a scale-free small-world network, even though, to the best of our knowledge, state-of-the-art theories cannot explain this fact.

The WebUni database contains quantitative information on private user interactions as well as on public ones. We use these data to compute the ratio of public to private communication for outgoing and incoming interactions of a user. We observe that users tend to have a similar ratio for outgoing and incoming interactions, although outgoing communication is slightly more private. Comparing the overall public and private interactions of a user, we notice that active users have a balanced ratio of public and private interaction or are biased towards private interactions. Network newbies and inactive users, on the other hand, are biased towards either completely public or solely private interactions.

To overcome friendship inflation, we make use of a well-known concept from sociology, Granovetter's strength of ties. It enables us to measure the strength of different ties, ranging from friendship, mutual rating, and mutual guestbook writing to combinations of each of them. We discover that friendship is a solid foundation of a strong tie, but not sufficient. Friendship paired with mutual rating and guestbook writing improves the strength of ties, whereas constraints such as mutual positive rating unexpectedly do not. We are finally able to verify that the strength-of-ties theory holds for WebUni. A combination of a minimum number of reciprocal guestbook postings, friendship, and a minimum number of reciprocal ratings returns up to 46 strong ties that satisfy Granovetter's definition.

Dec 17

organized by: Ivana

Speakers: Armin Doroudian

Topic(s)

Presentation (30 min) about “Evaluation of Search User Interfaces” based on the chapter http://searchuserinterfaces.com/book/sui_ch2_evaluation.html

The Evaluation of Search User Interfaces

“What should be measured when assessing a search interface? Traditional information retrieval research focuses on evaluating the proportion of relevant documents retrieved in response to a query. In evaluating search user interfaces, this kind of measure can also be used, but is just one component within broader usability measures. Usable interfaces are defined in terms of learnability, efficiency, memorability, error reduction, and user satisfaction (Nielsen, 2003b, Shneiderman and Plaisant, 2004). However, search interfaces are usually evaluated in terms of three main aspects of usability: effectiveness, efficiency, and satisfaction.

This presentation summarizes some major methods for evaluating user interfaces, followed by a set of guidelines about special considerations to ensure successful search usability studies and avoid common pitfalls. The presentation concludes with general recommendations for search interface evaluation.”

Jan 7

organized by:

Speakers: Elena

Topic(s) DivQ: Diversification for Keyword Search over Structured Databases

Keyword queries over structured databases are notoriously ambiguous. No single interpretation of a keyword query can satisfy all users, and multiple interpretations may yield overlapping results. This paper proposes a scheme to balance the relevance and novelty of keyword search results over structured databases. First, we present a probabilistic model which effectively ranks the possible interpretations of a keyword query over structured data. Then, we introduce a scheme to diversify the search results by re-ranking query interpretations, taking into account the redundancy of query results. Finally, we propose α-nDCG-W and WS-recall, adaptations of the α-nDCG and S-recall metrics that take into account the graded relevance of subtopics. Our evaluation on two real-world datasets demonstrates that search results obtained using the proposed diversification algorithms better characterize the possible answers available in the database than the results of the initial relevance ranking.
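For reference, the underlying α-DCG gain (Clarke et al., 2008) discounts a subtopic's contribution geometrically each time it reappears in the ranking; α-nDCG normalizes by the ideal ranking, and the talk's α-nDCG-W additionally weights subtopics by graded relevance. A sketch of plain α-DCG (the weighted variant is the paper's extension):

<code python>
# Plain alpha-DCG (Clarke et al., 2008); illustrative implementation.
import math

def alpha_dcg(ranking, alpha=0.5):
    """ranking: list of sets -- the subtopics covered by each ranked result."""
    seen = {}                # subtopic -> times already returned
    score = 0.0
    for i, subtopics in enumerate(ranking, start=1):
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
        score += gain / math.log2(i + 1)
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
    return score

# A diversified ranking beats one repeating the same query interpretation.
print(alpha_dcg([{"q1"}, {"q2"}, {"q3"}]))  # ~2.13
print(alpha_dcg([{"q1"}, {"q1"}, {"q1"}]))  # ~1.44
</code>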

Jan 14

organized by: Dimitris

Speakers: Dimitris, Julien, Marco

Topic(s)

Efficient Discovery of Frequent Subgraph Patterns in Uncertain Graph Databases (Dimitris)

Mining frequent subgraph patterns in graph databases is a challenging and important problem with applications in several domains. Recently, there has been growing interest in generalizing the problem to uncertain graphs, which can model the inherent uncertainty in the data of many applications. The main difficulty in solving this problem results from the large number of candidate subgraph patterns to be examined and the large number of subgraph isomorphism tests required to find the graphs that contain a given pattern. The latter becomes even more challenging when dealing with uncertain graphs. In this paper, we propose a method that uses an index of the uncertain graph database to reduce the number of comparisons needed to find frequent subgraph patterns. The proposed algorithm relies on the Apriori property for enumerating candidate subgraph patterns efficiently. Then, the index is used to reduce the number of comparisons required for computing the expected support of each candidate pattern. It also enables additional optimizations with respect to scheduling and early termination, which further increase the efficiency of the method. The evaluation of our approach on three real-world datasets as well as on synthetic uncertain graph databases demonstrates significant cost savings with respect to the state-of-the-art approach.

Time-Aware Entity-Based Multi-Document Summarisation (Julien)

Automatic multi-document news summarisation has received increased attention lately, to cope with the growing number of news articles and sources. Summarisation of news articles has the additional challenge that the documents are timestamped and often describe events which are themselves situated in time.

We propose three contributions which we believe will help improve summarisation quality:

  1. Considering named entities in news article
  2. Considering time for summarisation and for summary layout
  3. Considering time references in the text in addition to article timestamps

For this we augment a state-of-the-art summarisation technique with named entities and time references, and adapt a state-of-the-art news event detection approach to cluster sentences, in order to improve the summarisation of news articles.

This work is in progress, and I will present the general approach and ideas, as well as the current status of the work.

Detecting Health Events on the Social Web to Enable Epidemic Intelligence (Marco)

Content analysis and clustering of natural language documents is becoming crucial in various domains, including public health. Recent pandemics such as Swine Flu have caused concern for public health officials. Given the ever-increasing pace at which infectious diseases can spread globally, officials must be prepared to react sooner and with greater epidemic intelligence gathering capabilities. There is a need to allow for information gathering from a broader range of sources, including the Web, which in turn requires more robust processing capabilities. To address this limitation, in this paper we propose a new approach to detect public health events in an unsupervised manner. We address the problems associated with adapting an unsupervised learner to the medical domain and, in doing so, propose an approach which combines aspects of different feature-based event detection methods. We evaluate our approach on a real-world dataset with respect to the quality of article clusters. Our results show that we are able to achieve a precision of 62% and a recall of 75%, evaluated using manually annotated, real-world data.

Jan 21

organized by:

Speakers:

Topic(s)

Jan 28

organized by:

Speakers:

Topic(s)

tbd.

organized by: Kerstin D.

Speakers: Avaré, Ernesto

Topic(s)

tbd.

*postponed*

Service Oriented Architectures

organized by: Ralf Groeper

Speakers: Patrick, NN

tbd.

Chemical Information Systems

organized by: Wolf-Tilo

Speakers: Benjamin, Sascha

tbd.

LivingKnowledge: Methods for Handling Bias and Diversity - Dimitris

tbd.

From Living Web Archives to Community Memories

organized by: Thomas

Speakers: Nina, Gideon
