Christian Kohlschütter

Web Crawling

The L3S currently runs several web crawls for research purposes. It is possible that our crawlers index your website, too.

The crawlers are designed to respect the robots.txt exclusion directives as well as <META robots> tags.
For most tasks, we use "Heritrix", the Internet Archive's well-tested crawler.
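
If you would like to check how a robots.txt file is interpreted before a crawler visits your site, the following minimal sketch may help. It uses Python's standard urllib.robotparser module and is not the code our crawlers run. The user-agent name "ExampleCrawler", the host example.org, and the paths are hypothetical placeholders chosen for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleCrawler" is a placeholder name,
# not the actual user-agent string of any particular crawler.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/page.html"):
        url = "http://example.org" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleCrawler", url))
```

In this sketch, the first rule group keeps the placeholder crawler out of /private/ while leaving the rest of the site open, and the empty Disallow line in the second group allows all other robots everywhere.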

Currently, we are running crawlers on different machines in the subnet
The crawlers may connect to HTTP services on your machine (TCP ports 80, 8080, etc.).

We always try to minimize the additional load that our crawlers place on your servers. However, if you notice that our crawlers behave poorly, please send me an email so that you can help us track down the problem. In the worst case, we can simply exclude your website from being crawled.

If you get too many hits from our crawlers, or if you have questions regarding our projects, please contact me by email: Christian Kohlschütter