Big Data and the Lost Web

The Internet doesn’t forget. Nevertheless, much content cannot be found at the first attempt, old web pages for example. There are organizations that preserve old pages and make them available for posterity, most notably the Internet Archive in California. Since 2014, researchers at the L3S have been working intensively on web archives and developing new use cases, access methods and analysis methods for these interesting data collections. The L3S has access to a local copy of the entire German web under the domain .de, which the Internet Archive has been archiving since 1996.

One question the researchers are concerned with is: How can access to these archived pages become easier for everyone? The temporal aspect in particular, which hardly plays a role on the current web but has the highest priority in an archive, places new requirements on search engines. It is no longer enough to find a page that is as relevant as possible; the search must also find a specific version of a page that may have changed in the meantime or may no longer be available. Two search engines were developed as prototypes against this background and are still being further developed and improved. They can already be tried out: ArchiveSearch and Tempas (Temporal Archive Search).
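The idea of looking up a specific version of a page can be illustrated with a small sketch. It is not how ArchiveSearch or Tempas work internally; it only shows, under simple assumptions, how the archived capture closest to a requested date could be selected. The function name `closest_capture` and the example captures are hypothetical and made up for illustration.

```python
from bisect import bisect_left
from datetime import datetime


def closest_capture(captures, target):
    """Return the capture whose timestamp is nearest to `target`.

    `captures` is a list of (datetime, url) tuples sorted by time.
    Returns None if the list is empty.
    """
    if not captures:
        return None
    times = [t for t, _ in captures]
    # Position of the first capture that is not earlier than the target.
    i = bisect_left(times, target)
    # Only the neighbours just before and just after the target can be nearest.
    candidates = captures[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda c: abs(c[0] - target))


# Hypothetical captures of one page (made-up example data).
captures = [
    (datetime(1999, 5, 1), "example.de/v1"),
    (datetime(2004, 8, 15), "example.de/v2"),
    (datetime(2012, 1, 10), "example.de/v3"),
]

print(closest_capture(captures, datetime(2005, 1, 1)))
```

For a request dated 1 January 2005, the sketch selects the capture from August 2004, because it lies closer in time than the one from 2012. A real temporal search engine would additionally have to weigh relevance against time, which this sketch leaves out.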

Interest in web archives is also increasing in other scientific disciplines. For historians, political scientists and others who used to work largely with analogue data, the web is becoming more and more important, and so are archived websites. However, the huge amount of data makes it impossible to read all documents, so new data processing methods are needed. The L3S is researching this as well and offers ArchiveSpark, one of the most frequently used tools for efficient access to “historical” web collections. The software is developed by the scientists at the L3S together with the Internet Archive in order to make data analyses of any kind in web archives as easy as possible.

Another project, SoBigData, also deals with large amounts of data of other kinds. Together with scientists from Italy, Great Britain, the Netherlands, Estonia, Finland and Switzerland, the L3S is developing a European research infrastructure for Big Data. Data sets from different sources, as well as different tools for working with these data, are integrated in an open platform. In addition, the SoBigData platform will share guidelines for practical work with the data, with a special focus on legal and data protection aspects, along with comprehensive examples and templates. In this way, the L3S also makes the above-mentioned work on web archives available to other scientists from all over Europe.

The growing role of data and big data is noticeable not only in research, but also in business and industry. With its work in the ALEXANDRIA and SoBigData projects, the L3S research center offers an optimal basis for an easy entry into this complex field. As the web increasingly becomes the primary medium for sharing news, information and data, the importance of web archives as witnesses of this development will continue to grow and will extend into areas where this cannot be foreseen today. The goal of ALEXANDRIA is to pave this way. The SoBigData infrastructure, as a central point of contact for all questions concerning Big Data, makes it possible to make the research results accessible to others and to jointly develop approaches for working with this and other data in Europe.

Contact

Prof. Dr. Avishek Anand

anand@l3s.de

Avishek Anand heads the ALEXANDRIA and SoBigData projects at L3S.