
Research Seminar Winter Semester 2011/2012

The Research Seminar takes place on Friday at 14:00 in our Multimedia Room (1526), Appelstr. 9a, 15th floor (unless stated otherwise).

Oct 14

organized by:

Speaker: Sotiris Gkountelitsas

Toward Alternative Measures for Ranking Venues: A Case of Database Research Community

Ranking of publication venues is often closely related to important issues such as evaluating the contributions of individual scholars/research groups, or subscription decision making. The development of large-scale digital libraries and the availability of various metadata provide the possibility of building new measures more efficiently and accurately. In this work, we propose two novel measures for ranking the impact of academic venues: an easy-to-implement seed-based measure that does not use citation analysis, and a realistic browsing-based measure that takes an article reader's behavior into account.

This is a student seminar talk about the paper Su Yan and Dongwon Lee. Toward alternative measures for ranking venues: a case of database research community. JCDL 2007. http://doi.acm.org/10.1145/1255175.1255221

Oct 21

organized by:

Speaker: Mohamadou Nassourou

Towards a Knowledge-Based Teaching and Learning System for the Quranic Text

Understanding religious texts such as the Quran is not a trivial task. Religious texts usually encompass a lot of hidden knowledge, possess peculiar style of narration, and are sometimes confusing. Almost every religious text has got two sides: the text itself and the information surrounding the text. Referring to the Quran particularly, despite the availability of computer programs for linguistically (POS, NE tagging…) analyzing the Arabic Quranic text, many people still encounter difficulties in properly comprehending the text, simply because the background information about some parts of the text is missing, and the learning methodology is in many cases inappropriate. The Quran is an early medieval book consisting of 6236 verses with almost half of the verses similar to each other, and 98 verses repeated 181 times. Based on this similarity evidence, machine learning techniques namely text mining methods could help deriving in a generic manner missing background information for each chapter and verse, and generating concise summaries wherever needed.

In this research an attempt to create a knowledge-based teaching and learning system for the Quranic text has been performed. The knowledge base is made up of the Quranic text along with detailed information about each chapter and verse, and some rules. The system offers the possibility to study the Quran through web-based interfaces implementing novel visualization techniques for browsing, querying, consulting, and testing the acquired knowledge. Additionally the system possesses knowledge acquisition facilities for maintaining the knowledge base. From the design of an explicit representation scheme covering all pieces of the domain knowledge, to the acquisition and manipulation of the knowledge, the system assists users in their efforts to analyze, understand, and memorize religious texts such as the Quran. Knowledge collected from experts, literature review, and deductive machine learning classifiers, is represented using interconnected frames, and then automatically converted to XML model, which is finally stored in the knowledge base. Intuitive and user-friendly graphical interfaces guide users throughout the learning and consultation processes.

Oct 28, 14:30

organized by:

Speaker: Gerhard Gossen

Classifying Emails by Purpose using Multi-Type n-Grams

Email clients that know the purpose (in contrast to the topic) of an email message can help users become more efficient in their use of email. However, it is hard to classify these messages automatically, because standard text classification methods perform poorly at this task. Thus, different methods need to be developed.

In this work we propose to use an extension of word n-grams as features in the classification. The extension is called multi-type n-grams because it allows the n-grams to contain values from multiple different sources of information (types). This allows us to work around some problems of standard n-grams. The most prominent advantage is that multi-type n-grams allow us to find patterns even in smaller datasets because infrequent items can be replaced by generalizations as needed.

The complexity of finding such extended n-grams is higher than that of finding standard n-grams. We use a sequence mining algorithm to keep this complexity at a manageable level. An evaluation of the classification performance of multi-type n-grams shows that they perform slightly worse than standard word n-grams. We are still able to train a Support Vector Machine classifier that can classify emails according to their purpose with an F1 measure of 0.87 for multi-type n-grams and 0.9 for word n-grams. We can also show that the classifier created from multi-type n-grams reaches this performance using only 25% of the number of features necessary for the standard n-gram classifier, which suggests that multi-type n-grams enable us to compactly capture the “essence” of a text class.
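The back-off idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function name, the `min_count` threshold, and the use of POS tags as the single generalization type are all assumptions.

```python
from collections import Counter

def multi_type_ngrams(tagged_tokens, n=2, min_count=2):
    """Extract word n-grams, backing off rare words to a generalization.

    tagged_tokens: list of (word, pos) pairs; words occurring fewer than
    min_count times are replaced by their POS tag, so recurring patterns
    survive even when the concrete words are infrequent.
    """
    counts = Counter(word for word, _ in tagged_tokens)
    items = [word if counts[word] >= min_count else pos
             for word, pos in tagged_tokens]
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Two request emails that share a pattern but differ in the rare noun:
tagged = [("please", "ADV"), ("send", "VERB"), ("me", "PRON"),
          ("the", "DET"), ("report", "NOUN"), ("please", "ADV"),
          ("send", "VERB"), ("me", "PRON"), ("the", "DET"),
          ("slides", "NOUN")]
grams = multi_type_ngrams(tagged, n=2, min_count=2)
```

Because "report" and "slides" are each seen only once, both back off to NOUN, so the bigram ("the", "NOUN") occurs twice where the standard word bigrams ("the", "report") and ("the", "slides") would each occur once.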

Nov 4

organized by: Mohammad Alrifai

Speaker: Mohammad Alrifai and Giang Tran

Tag Clouds Revisited

Abstract:

Tagging has become a very common feature in Web 2.0 applications, providing a simple and effective way for users to freely annotate resources to facilitate their discovery and management. Subsequently, tag clouds have become popular as a summarized representation of a collection of tagged resources. A tag cloud is typically a visualization of the top-k most frequent tags in the underlying collection. In this paper, we revisit tag clouds, to examine whether frequency is the most suitable criterion for tag ranking. We propose alternative tag ranking strategies, based on methods for random walk on graphs, diversification, and rank aggregation. To enable the comparison of different tag selection and ranking methods, we propose a set of evaluation metrics that consider the use of tag clouds for search, navigation and recommendations. We apply these tag ranking methods and evaluation metrics to empirically compare alternative tag clouds in a dataset obtained from Flickr, comprising 488,112 tagged photos organized in 451 groups, and 112,514 distinct tags.
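The top-k frequency baseline that the paper revisits can be sketched as follows; this is an illustrative reconstruction, and the function name and toy data are assumptions:

```python
from collections import Counter

def frequency_tag_cloud(tag_lists, k=5):
    """Baseline tag cloud: the k most frequent tags across all resources."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return [tag for tag, _ in counts.most_common(k)]

# Each inner list is the tag set of one photo in the collection.
photos = [["sunset", "beach", "sky"], ["beach", "sea"],
          ["sunset", "beach"], ["sky", "sunset"]]
cloud = frequency_tag_cloud(photos, k=3)
```

The proposed alternatives replace the `most_common` ranking with random-walk, diversification, or rank-aggregation scores while keeping the same top-k selection step.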

Multimodal Distributional Semantics

Abstract:

Distributional semantic models (DSMs; Turney and Pantel 2010) approximate the meaning of words with vectors that keep track of the patterns of co-occurrence of the words in a corpus, under the hypothesis that semantically related words should occur in similar contexts (the distributional hypothesis of Harris, 1954). Despite their impressive empirical successes, DSMs are not entirely satisfactory as psychological models of how we humans acquire and use semantic knowledge, since it is obvious that we can rely not only on linguistic context, but also on our rich perceptual experience (Louwerse, 2011). In our current research, we adopt a broader view of distributional semantics. We hypothesize that word meaning can be largely captured by vectors summarizing co-occurrence patterns, but observe that co-occurrence need not be limited to linguistic contexts. In particular, our multimodal distributional semantic model (MDSM) exploits both co-occurrence with words (from a standard text corpus) and co-occurrence with visual features extracted using computer vision techniques from collections of labeled images.

We evaluate our MDSM on the tasks of predicting semantic similarity judgments, concept categorization and capturing semantic neighbours of different classes. A cautious interpretation of our results is that adding image-based features is at least not damaging, when compared to adding further text-based features, and possibly beneficial. Importantly, in all experiments we find that image-based features lead to interesting qualitative differences in performance. The MDSM is better at capturing similarities between concrete concepts and focuses on their more imageable properties (such as colour), whereas a comparable text-based DSM is more geared towards abstract concepts and properties.
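One simple way to realize the multimodal idea described above is to L2-normalize the text-based and image-based co-occurrence vectors, weight the two channels, and concatenate them before computing similarities. This is a hedged illustration, not the authors' model: the `alpha` channel weight and all names and toy vectors are assumptions.

```python
import math

def normalize(v):
    """L2-normalize a vector (leave zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def multimodal_vector(text_vec, image_vec, alpha=0.5):
    """Concatenate normalized text and image co-occurrence vectors,
    weighting the two channels by alpha and (1 - alpha)."""
    t = [alpha * x for x in normalize(text_vec)]
    i = [(1 - alpha) * x for x in normalize(image_vec)]
    return t + i

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy co-occurrence counts: 3 text contexts, 3 visual-word features.
cat = multimodal_vector([3, 1, 0], [1, 0, 2])
dog = multimodal_vector([2, 1, 1], [1, 1, 2])
sim = cosine(cat, dog)
```

Normalizing each channel before concatenation prevents the channel with larger raw counts from dominating the similarity.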

Nov 11

organized by:

Speakers: Makni Bassem, Thaer Samar

Semantic integration of TV data and services - Makni Bassem

In this talk, I present the impact of semantic Web and semantic Web services on enabling novel television features. Many research efforts have contributed to extending different aspects of television delivery and consumption, with respect to content production, metadata handling, semantic enrichment and recommendation. They adhere to the semantic Web vision for two goals: the seamless integration of their data and the automation of their Web services interoperation and composition. I will also present a showcase for automated TV services invocation.

Scalable Distributed Time-Travel Text Search - Thaer Samar

Web archives play an important role in preserving born-digital content; archived data is important for future generations, researchers, historians, and the public. Time-travel text search addresses the limited access to web archives by extending regular text search with time-travel functionality. It combines Boolean queries (e.g., mpi AND saarland) and keyword queries (e.g., mpi saarland) with a time point of interest (e.g., 2011/02/01) or a time interval of interest (e.g., [2010/01/01, 2010/12/31]). Only documents that match the query and whose valid-time interval overlaps with the given query time interval should be retrieved in response to the query. Time-travel text search has to be highly scalable to cope with the huge size of web archives. Hadoop and HBase, as open-source implementations of Google's MapReduce and BigTable, have recently become popular as tools for dealing with massive datasets in a distributed environment.

In my talk, I describe a scalable distributed implementation of time-travel text search on top of Hadoop and HBase. I will present experiments showing the performance and scalability of our indexing approaches on different collections, and the query processing performance for the indexes resulting from these approaches.
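The core retrieval predicate described above (a keyword match combined with a valid-time overlap test) can be illustrated with a small in-memory sketch. The actual system distributes the index over HBase; the names and the toy index below are assumptions.

```python
from datetime import date

def overlaps(doc_valid, query_interval):
    """True iff the document's valid-time interval overlaps the query interval."""
    (d_start, d_end), (q_start, q_end) = doc_valid, query_interval
    return d_start <= q_end and q_start <= d_end

def time_travel_search(index, terms, query_interval):
    """Conjunctive keyword search filtered by valid-time overlap.

    index: dict mapping term -> list of (doc_id, (valid_start, valid_end)).
    Returns the doc_ids that match ALL terms and overlap the interval.
    """
    candidates = None
    for term in terms:
        postings = {doc for doc, valid in index.get(term, [])
                    if overlaps(valid, query_interval)}
        candidates = postings if candidates is None else candidates & postings
    return candidates or set()

# Toy index: d1 is valid during 2010, d2 only during 2011.
index = {
    "mpi": [("d1", (date(2010, 3, 1), date(2010, 9, 1))),
            ("d2", (date(2011, 1, 1), date(2011, 6, 1)))],
    "saarland": [("d1", (date(2010, 3, 1), date(2010, 9, 1)))],
}
hits = time_travel_search(index, ["mpi", "saarland"],
                          (date(2010, 1, 1), date(2010, 12, 31)))
```

Only d1 satisfies both the Boolean condition and the time-interval overlap; d2 matches the keywords but falls outside the query interval.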

Nov 18

organized by:

Speaker:

Title

Abstract

Nov 25

organized by:

Speakers: Susanne Oetzmann (uni-transfer), Thomas Gutsche, Matthias Pfau, Arne Möhle

SecureMail, an EXIST funded startup

The SecureMail project is a startup funded by the EXIST scholarship, the BMWi and the EU. We work in cooperation with the L3S to develop our product SecureMail, the most secure, flexible and easy-to-use webmail system. All data is stored fully encrypted on cloud servers. SecureMail combines innovative features, complete security and the highest cost efficiency.

We will present our project and provide information for potential entrepreneurs. This includes the EXIST Gründerstipendium (scholarship), the EXIST Forschungstransfer (research scholarship) and the outstanding founding support provided by uni-transfer.

Dec 13

(note that this talk takes place on Tuesday at 2pm)

organized by:

Speaker: Clare J. Hooper

Teasing Apart and Piecing Together: Understanding and Redesigning User Experiences for New Contexts

Teasing Apart and Piecing Together (TAPT) is a method for understanding experiences and redesigning them for new contexts, through deconstruction and reconstruction. TAPT's development was motivated by issues of accessibility, such as the lack of access to web-based social tools by people who are offline for any of a number of reasons, from disability to poverty to cultural factors. TAPT has been used to explore a variety of scenarios, such as how we might rebuild wiki technology in museums, or social networks in rural India. TAPT has also been used to evaluate technological systems. This talk will introduce the method and its uses.

Clare Hooper is a postdoctoral fellow in the User Centred Engineering research group in the Department of Industrial Design at the Eindhoven University of Technology. She completed her EngD in Computer Science at the University of Southampton, where she developed and evaluated TAPT. In addition to HCI and user experience, Clare's research interests include Web Science and Hypertext. She is intrigued by the challenges of interdisciplinary work, and combining qualitative and quantitative methods.

Jan 13

organized by:

Speakers: Natalia Prytkova, Irina Oelze, Felix Nüsser

Modelling and Evaluation of Co-Evolution of Collective Web Memories

Speaker: Natalia Prytkova

The constantly evolving Web reflects the evolution of society in cyberspace. Projects like the Open Directory Project (dmoz.org) can be understood as a collective memory of society on the Web. The main assumption is that such collective Web memories evolve when a certain cognition level about a concept has been exceeded. In my talk I will introduce the notion of concepts, show how to trace them in news articles, and explain how these dynamics can be employed to predict changes in the category system of DMOZ.

Integration of YAGO Ontology in the IQP Query Construction System to support efficient Query Construction over a large-scale relational Database

Speaker: Irina Oelze (Presentation of the Master's thesis)

The IQP query construction system empowers naive database users to create their own structured queries in an interactive way, starting from simple keywords and refining the initial query using options automatically suggested by the system. The efficiency and usability of IQP were experimentally confirmed for medium-sized datasets such as IMDB and Lyrics.

Freebase is a large-scale open-world database currently containing more than 20 million entities. Deploying IQP on Freebase faces additional challenges, as the schema of Freebase is big and the query construction options derived from this schema alone are not informative enough to enable an efficient query construction process. In this presentation we discuss the challenges and our solutions for integrating the YAGO ontology into the IQP system to summarize the database schema of Freebase and optimize the query construction process. Our evaluation results confirm that YAGO-based options significantly improve the efficiency of the query construction process.

A Music Search Engine Built upon Audio-based and Web-based Similarity Measures

Speaker: Felix Nüsser (Seminar Wissensbasierte Systeme)

An approach is presented to automatically build a search engine for large-scale music collections that can be queried through natural language. While existing approaches depend on explicit manual annotations and meta-data assigned to the individual audio pieces, we automatically derive descriptions by making use of methods from Web Retrieval and Music Information Retrieval. Based on the ID3 tags of a collection of mp3 files, we retrieve relevant Web pages via Google queries and use the contents of these pages to characterize the music pieces and represent them by term vectors. By incorporating complementary information about acoustic similarity we are able to both reduce the dimensionality of the vector space and improve the performance of retrieval, i.e. the quality of the results. Furthermore, the usage of audio similarity allows us to also characterize audio pieces when there is no associated information found on the Web.

Jan 20

organized by:

Speakers: Nils Schreiber, Nattiya Kanhabua

ATAM+: Extraction Of Ailment Rates From Twitter

Speaker: Nils Schreiber (Seminar Wissensbasierte Systeme)

This talk gives an in-depth presentation of TAM, ATAM and ATAM+.

It is based on the following paper:

You Are What You Tweet: Analyzing Twitter for Public Health

Analyzing user messages in social media can measure different population characteristics, including public health measures. For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health-related tweets and discover mentions of over a dozen ailments, including allergies, obesity and insomnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over time (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.

Time-aware Approaches to Information Retrieval

Speaker: Nattiya Kanhabua

In this thesis, we address major challenges in searching temporal document collections. In such collections, documents are created and/or edited over time. Examples of temporal document collections are web archives, news archives, blogs, personal emails and enterprise documents. Unfortunately, traditional IR approaches based only on term matching can give unsatisfactory results when searching temporal document collections. The reason for this is twofold: the contents of documents are strongly time-dependent, i.e., documents are about events that happened in particular time periods, and a query representing an information need can be time-dependent as well, i.e., a temporal query.

Our contributions in this thesis are different time-aware approaches within three topics in IR: content analysis, query analysis, and retrieval and ranking models. In particular, we aim at improving the retrieval effectiveness by 1) analyzing the contents of temporal document collections, 2) performing an analysis of temporal queries, and 3) explicitly modeling the time dimension into retrieval and ranking.

Leveraging the time dimension in ranking can improve the retrieval effectiveness if information about the creation or publication time of documents is available. In this thesis, we analyze the contents of documents in order to determine the time of non-timestamped documents using temporal language models. We subsequently employ the temporal language models for determining the time of implicit temporal queries, and the determined time is used for re-ranking search results in order to improve the retrieval effectiveness.
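The idea of dating non-timestamped documents with temporal language models can be illustrated as follows. This is a simplified sketch, not the thesis implementation: one add-one-smoothed unigram model per time partition, with all names and the toy corpus assumed.

```python
import math
from collections import Counter

def train_partition_lms(timestamped_docs):
    """Build one unigram language model per time partition.

    timestamped_docs: dict mapping period label -> list of documents.
    Returns period -> (term counts, total tokens, smoothing vocab size).
    """
    models = {}
    for period, docs in timestamped_docs.items():
        counts = Counter(w for doc in docs for w in doc.split())
        total = sum(counts.values())
        vocab = len(counts) + 1  # crude add-one smoothing denominator
        models[period] = (counts, total, vocab)
    return models

def most_likely_period(models, doc):
    """Assign a non-timestamped document to the partition whose
    smoothed language model gives it the highest log-likelihood."""
    def loglik(model):
        counts, total, vocab = model
        return sum(math.log((counts[w] + 1) / (total + vocab))
                   for w in doc.split())
    return max(models, key=lambda p: loglik(models[p]))

# Toy corpus: vocabulary shifts between the two time partitions.
corpus = {
    "2004": ["tsunami relief aid", "tsunami wave indian ocean"],
    "2010": ["earthquake haiti relief", "haiti aid earthquake"],
}
models = train_partition_lms(corpus)
period = most_likely_period(models, "tsunami relief")
```

The undated text mentioning "tsunami" scores highest under the 2004 partition model; the same estimated time can then be used to re-rank results for implicit temporal queries.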

We study the effect of terminology changes over time and propose an approach to handling terminology changes using time-based synonyms. In addition, we propose different methods for predicting the effectiveness of temporal queries, so that a particular query enhancement technique can be performed to improve the overall performance. When the time dimension is incorporated into ranking, documents will be ranked according to both textual and temporal similarity. In this case, time uncertainty should also be taken into account. Thus, we propose a ranking model that considers the time uncertainty, and improve ranking by combining multiple features using learning-to-rank techniques.

Through extensive evaluation, we show that our proposed time-aware approaches outperform traditional retrieval methods and improve the retrieval effectiveness in searching temporal document collections.

This is a defense rehearsal talk.

Jan 27

Organized by: Ernesto Diaz-Aviles

Speakers: Ernesto Diaz-Aviles & Avaré Stewart

Personalized Online Ranking for Social Media Streams

Speaker: Ernesto Diaz-Aviles

The wide adoption and continued growth of the Social Web has resulted in ever-increasing volumes of data created, consumed and shared by users. Web 2.0 applications such as blogs, multimedia sharing systems and streams of status information from applications such as Facebook and Twitter are among the most popular. This vast amount of exchanged information and the temporal dynamics of the Social Web pose new challenges in terms of ranking and collaborative filtering (CF). While it is good to have more training data, it is challenging for many existing CF algorithms to handle it. Furthermore, traditional CF techniques are trained in batch mode, which has limited ability to track the temporal dynamics of users' preferences. In this seminar talk we introduce RankMF-Online, a novel approach for learning a matrix factorization model online from a stream of social media data. In particular, we address the item recommendation task and focus on predicting a personalized ranking from stream data. The proposed learning method is directly optimized for this task and based on stochastic gradient descent, which makes it capable of handling large-scale data. We present initial experimental results on a collection of millions of tweets and discuss current and future directions.
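The abstract does not give the exact update rule; a common way to optimize a matrix factorization model for personalized ranking with stochastic gradient descent is a BPR-style pairwise update, sketched below. All names, hyperparameters and the toy data are assumptions, not the RankMF-Online algorithm itself.

```python
import math
import random

def sgd_rank_update(U, V, user, pos, neg, lr=0.05, reg=0.01):
    """One pairwise-ranking SGD step: push the score of the consumed
    item `pos` above an unobserved item `neg` for `user`."""
    x = sum(a * (b - c) for a, b, c in zip(U[user], V[pos], V[neg]))
    g = 1.0 / (1.0 + math.exp(x))  # sigmoid(-x): weight of this step
    for f in range(len(U[user])):
        u, vp, vn = U[user][f], V[pos][f], V[neg][f]
        U[user][f] += lr * (g * (vp - vn) - reg * u)
        V[pos][f] += lr * (g * u - reg * vp)
        V[neg][f] += lr * (-g * u - reg * vn)

random.seed(0)
k = 8  # number of latent factors
U = {"u1": [random.gauss(0, 0.1) for _ in range(k)]}
V = {i: [random.gauss(0, 0.1) for _ in range(k)] for i in ("a", "b")}

def score(user, item):
    return sum(x * y for x, y in zip(U[user], V[item]))

# Simulate a stream in which u1 repeatedly interacts with item "a".
for _ in range(200):
    sgd_rank_update(U, V, "u1", "a", "b")
```

Because each event updates only one user row and two item rows, the model can be refreshed per observation as the stream arrives, instead of retraining in batch mode.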


x-SiTE Trigger Sentence Transfer for Epidemic Intelligence

Speaker: Avaré Stewart

Event-Based Epidemic Intelligence (e-EI) has arisen as a body of work that relies upon different forms of pattern recognition to detect disease reporting events in the unstructured text present on the Web. Current supervised approaches to e-EI suffer from both high initial and high maintenance costs, due to the need to manually label examples to train and update a classifier for detecting disease reporting events across sites.

In this paper, we propose a new method for the supervised detection of disease reporting events. We tackle the burden of manually labeling data by exploiting the linguistic structures in the large, open source repositories of outbreak reports. The linguistic structures in the outbreak reports act as a type of interlingua, which constrains the pattern a disease-reporting sentence can have within target sites.

The weakly labeled sentences of the outbreak reports are propagated, across sites, to detect relevant trigger sentences in blogs and news. Our experiments show that with trigger sentence transfer, we are able to overcome the burden of manually labeling the sentences of multiple target sites; showing promising results for e-EI that are comparable with state-of-the-art methods.

Feb 3

organized by: Julien Gaugaz

Speaker: Maximilian Peters (Seminar Wissensbasierte Systeme)

Who Will Follow You Back? Reciprocal Relationship Prediction

This is a student seminar talk about a third-party paper recently presented at CIKM 2011: Hopcroft, J., Lou, T., & Tang, J. Who Will Follow You Back? Reciprocal Relationship Prediction. CIKM 2011.

We study the extent to which the formation of a two-way relationship can be predicted in a dynamic social network. A two-way (called reciprocal) relationship, usually developed from a one-way (parasocial) relationship, represents a more trustful relationship between people. Understanding the formation of two-way relationships can provide us with insights into the micro-level dynamics of the social network, such as what the underlying community structure is and how users influence each other.

Employing Twitter as a source for our experimental data, we propose a learning framework that formulates the problem of reciprocal relationship prediction as a graphical model. The framework incorporates social theories into a machine learning model. We demonstrate that it is possible to accurately infer 90% of reciprocal relationships in a dynamic network. Our study provides strong evidence of the existence of structural balance among reciprocal relationships. In addition, we have some interesting findings, e.g., the likelihood of two “elite” users creating a reciprocal relationship is nearly 8 times higher than that of two ordinary users. More importantly, our findings have potential implications such as how social structures can be inferred from individuals' behaviors.

March 2

organized by: Eelco Herder

Speaker: Jessica Emma Clark (Bachelor Thesis)

Impact of Differences in Sentiment introduced by Automatic Translation

This thesis deals with changes in sentiment introduced by automatic translation. Sentiment analysis is the process of automatically determining the sentiment expressed in natural language. The problem I study is to determine the overall sentiment expressed in texts written in natural language and its possible changes after machine translation using a state-of-the-art statistical machine translation system named Moses. The sentiment analysis is performed with the text analysis software LIWC. The results show that both the positive and negative sentiment of all five analysed texts are slightly stronger after translating from English into German and back again.

April 27

Organized by: Ernesto Diaz-Aviles, Elena Demidova, Stefan Dietze, and Wolfgang Nejdl

As the World Wide Web Conference is a core venue for our research, this research seminar will center around it.

WWW2012 participants will provide oral feedback and report on the World Wide Web Conference 2012 held in Lyon last week. We will highlight the main technical sessions we attended and briefly discuss some key aspects of the papers that caught our attention during the conference. The objective is to help identify new creative ideas to be developed within the L3S towards WWW 2013 in Rio.

Slides: Operation Rio


l3sintern/research_seminar_1112.txt · Last modified: 2012/04/27 14:29 by diaz