
Title: Web crawling, libraries/APIs

Date: 24.10.2008, 17:00

People: Rasmus Buchmann (firstname.lastname(at)gmx.de)

      Oleksandr Druzhynin (druzhynin at gmail.com)

Tutor: Avaré Stewart


Presentation

Material

Some Questions:

Do you need content from a single site; does the site provide an API?

Do you need content from multiple sites; is there an API?

Does the website offer an RSS feed? (See the feed-discovery sketch after this list.)

What type of web page do you have?

Do you have many pages with the *same* structure?

Do you have many pages with *different* structures?

Do you have to be selective about the content that you extract?

Do you have to preserve the structure/type of content on the page: timestamps, tags, etc.?

What other criteria are important in selecting a tool? http://www.dia.uniroma3.it/~vldbproc/015_109.pdf
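
One quick check for the RSS question above: most sites that offer a feed advertise it with an auto-discovery <link> element in the page head. A minimal Java sketch, with a placeholder URL and a regex standing in for a real HTML parser:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class FeedCheck {
      public static void main(String[] args) throws Exception {
          // Placeholder URL -- substitute the site you want to inspect.
          URL url = new URL("http://example.com/");
          StringBuilder html = new StringBuilder();
          BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
          for (String line; (line = in.readLine()) != null; ) {
              html.append(line).append('\n');
          }
          in.close();
          // Feed auto-discovery: sites advertise feeds via
          // <link rel="alternate" type="application/rss+xml" href="..."> in the head.
          Pattern p = Pattern.compile(
                  "<link[^>]+type=\"application/(rss|atom)\\+xml\"[^>]*>",
                  Pattern.CASE_INSENSITIVE);
          Matcher m = p.matcher(html);
          while (m.find()) {
              System.out.println("Feed advertised: " + m.group());
          }
      }
  }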


Some Tools

Site-Specific APIs:

Technorati: http://technorati.com/developers/api/

Digg API: http://apidoc.digg.com/

Last.fm API: http://www.programmableweb.com/api/last.fm
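
All three of these are REST-style APIs: you fetch a URL (usually including an application key) and parse the XML response. A minimal Java sketch of such a call; the endpoint, parameters, and key below are hypothetical stand-ins, so take the real ones from the respective API documentation:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class ApiCall {
      public static void main(String[] args) throws Exception {
          // Hypothetical endpoint and key -- the real URL and parameters
          // come from the respective site's API documentation.
          URL endpoint = new URL("http://api.example.com/stories?appkey=YOUR_KEY&count=10");
          HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
          // Many APIs expect an identifying User-Agent.
          conn.setRequestProperty("User-Agent", "sp08-student-crawler/0.1");
          BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
          for (String line; (line = in.readLine()) != null; ) {
              System.out.println(line); // raw XML response, ready for an XML parser
          }
          in.close();
      }
  }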

Multi-Site APIs and Tools:

Spinn3r: “We crawl the web, so you don't have to” http://spinn3r.com/documentation

RoadRunner: abstracts from the HTML and creates XML with the text marked up using a single tag name http://www.dia.uniroma3.it/db/roadRunner/

WWW::Mechanize: emulates a browser (Perl) http://search.cpan.org/dist/WWW-Mechanize/

HTML Parser: fine-grained manipulation of HTML markup (Java) http://htmlparser.sourceforge.net/
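
For illustration, link extraction with HTML Parser might look roughly like the sketch below (class and method names taken from the org.htmlparser documentation as we recall it; the URL is a placeholder):

  import org.htmlparser.Parser;
  import org.htmlparser.filters.NodeClassFilter;
  import org.htmlparser.tags.LinkTag;
  import org.htmlparser.util.NodeList;

  public class LinkExtractor {
      public static void main(String[] args) throws Exception {
          // Placeholder URL -- point this at the page you want to scrape.
          Parser parser = new Parser("http://example.com/");
          // Collect every <a> tag in the page.
          NodeList links = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
          for (int i = 0; i < links.size(); i++) {
              LinkTag link = (LinkTag) links.elementAt(i);
              System.out.println(link.getLink() + "  " + link.getLinkText());
          }
      }
  }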

RSS: Curn automatically collects XML-formatted feed data from web pages and stores it locally http://www.clapper.org/software/java/curn/
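
If a full aggregator like Curn is more than you need, the feed XML is straightforward to read with the DOM parser bundled with Java. A minimal sketch, with a placeholder feed URL, that prints the title of each RSS <item>:

  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;

  public class FeedTitles {
      public static void main(String[] args) throws Exception {
          // Placeholder feed URL -- any RSS 2.0 feed will do.
          Document doc = DocumentBuilderFactory.newInstance()
                  .newDocumentBuilder()
                  .parse("http://example.com/feed.rss");
          // RSS 2.0 wraps each entry in an <item> element with a <title> child.
          NodeList items = doc.getElementsByTagName("item");
          for (int i = 0; i < items.getLength(); i++) {
              Element item = (Element) items.item(i);
              String title = item.getElementsByTagName("title").item(0).getTextContent();
              System.out.println(title);
          }
      }
  }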

A variety of other RSS/RDF tools, depending on purpose: http://java-source.net/open-source/rss-rdf-tools/rss-reader
