LUBMft - The RDF Fulltext Benchmark

This project extends the Lehigh University Benchmark (LUBM) with fulltext content and queries. The generated dataset contains realistic person names and publication content. The additional queries target the fulltext search capabilities of RDF stores. LUBM was chosen as the basis because of its wide acceptance, frequent usage, and familiar ontology domain. Other existing or future benchmarks can be extended similarly.

The LUBMft extension consists of two parts: the data generator UBAft and the benchmark tester UBTft. The data generator additionally generates realistic names for all persons and realistic content for all publications. The benchmark tester has improved benchmarking capabilities and contains new queries targeting fulltext search and IR features.

Publication

E. Minack, W. Siberski and W. Nejdl. "Benchmarking Fulltext Search Performance of RDF Stores", in Proceedings of the 6th European Semantic Web Conference (ESWC), pp. 81-95, Heraklion, Crete, Greece, May 31-June 4, 2009.

@inproceedings{DBLP:conf/esws/MinackSN09,
  author    = {Enrico Minack and Wolf Siberski and Wolfgang Nejdl},
  title     = {{B}enchmarking {F}ulltext {S}earch {P}erformance of {RDF} {S}tores},
  booktitle = {Proceedings of the 6th European Semantic Web Conference (ESWC)},
  address   = {Heraklion, Crete, Greece},
  month     = {May 31--June 4},
  year      = {2009},
  pages     = {81--95},
  ee        = {http://dx.doi.org/10.1007/978-3-642-02121-3_10},
  isbn      = {978-3-642-02120-6}
}

Downloads

Since the original LUBM code is released under the GNU General Public License (GPL), this license also applies to the following source code.

To run the benchmarks, you need UBTft and at least one of the RDF store UBTft wrappers. All of these projects can be run from the command line using the Java interpreter. They can be compiled with Eclipse (Eclipse project files are included). Use the *.jardesc files to update the jar files.

Running

  1. Generate LUBMft(N) dataset(s):
            java -jar ubaft-1.0.0.jar -univ N -onto "http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl" -names -docs
  2. Convert OWL files into RDF serializations supported by your target system using RDF2RDF (optional)
            java -jar rdf2rdf-VERSION.jar University*.owl .n3
    This converts all University*.owl files into N3 notation. RDF2RDF supports other output notations as well.
  3. Download and extract the UBTft project, as well as some UBTft Wrapper project archives
  4. To let UBTft flush the filesystem cache, create a link from /usr/sbin/flush-fs-cache.sh to UBT/flush-fs-cache.sh:
            cd /usr/sbin
            sudo ln -s PATH/UBT/flush-fs-cache.sh
  5. Add the following line to your /etc/sudoers file using sudo sudoedit /etc/sudoers
            YOURUSERNAME ALL=(root) NOPASSWD: /usr/sbin/flush-fs-cache.sh
    so that UBTft is allowed to flush the filesystem cache.
  6. For UBTWrapperVirtuoso5.0, download Virtuoso 5.0.9, compile it if necessary, and place the whole application into a folder called
            virtuoso
    in the project folder.
  7. In each UBTWrapper* project folder,
    1. edit the config.kb.* files so that the data variable points to the folder containing all data set files of the desired benchmark size.
    2. run
              sh ./load_ubt_FLAVOUR.sh
      to load the benchmark data into the persistent RDF repository. FLAVOUR stands for any of the existing config.kb.* configuration files (usually different backends).
    3. then, you can perform the benchmark by running
              sh ./query_ubt_FLAVOUR.sh
      or run the benchmark several times against the current RDF store and all its flavours using
              sh ./evaluate_all.sh
      In the former case, the statistics are printed on screen; in the latter case, they are stored in querylog*.txt files. Please adapt evaluate_all.sh to your needs.
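
Steps 1 and 2 can also be scripted when datasets of several sizes are needed. The following Python sketch only assembles the two command lines shown in those steps. The function names, the workdir parameter, and the assumption that UBAft writes its University*.owl files into the current working directory are illustrative and not part of LUBMft. The RDF2RDF jar name is left to the caller because its version is not fixed here.

```python
import subprocess
from pathlib import Path

ONTOLOGY_URL = "http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl"


def generate_command(universities, ubaft_jar="ubaft-1.0.0.jar"):
    """Build the UBAft command line from step 1 for LUBMft(N)."""
    return ["java", "-jar", ubaft_jar,
            "-univ", str(universities),
            "-onto", ONTOLOGY_URL,
            "-names", "-docs"]


def convert_command(owl_files, rdf2rdf_jar, target_suffix=".n3"):
    """Build the RDF2RDF command line from step 2 for the given OWL files."""
    return ["java", "-jar", rdf2rdf_jar, *owl_files, target_suffix]


def build_dataset(universities, workdir, ubaft_jar="ubaft-1.0.0.jar", rdf2rdf_jar=None):
    """Generate LUBMft(N) inside workdir and optionally convert it to N3.

    Assumption: UBAft writes University*.owl into the working directory.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # Resolve jar paths first, because the commands run inside workdir.
    ubaft = str(Path(ubaft_jar).resolve())
    subprocess.run(generate_command(universities, ubaft), cwd=workdir, check=True)
    if rdf2rdf_jar is not None:
        # The shell would expand University*.owl; here it is done explicitly.
        owl_files = sorted(p.name for p in workdir.glob("University*.owl"))
        rdf2rdf = str(Path(rdf2rdf_jar).resolve())
        subprocess.run(convert_command(owl_files, rdf2rdf), cwd=workdir, check=True)
```

For example, build_dataset(5, "lubmft-5", rdf2rdf_jar="rdf2rdf-VERSION.jar") would generate LUBMft(5) in the folder lubmft-5 and convert it to N3, with VERSION replaced by the RDF2RDF version actually downloaded.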

Queries

The fulltext queries are provided as SPARQL templates in which certain macros have to be replaced with the fulltext query syntax of the specific RDF store. These macros are:
  Macro
            Description
  %%FULLTEXT_SEARCH_PREFIX%%
            namespace declarations used by the fulltext queries
  %%FULLTEXT_SEARCH(?X, "keyword")%%
            keyword search that binds variable ?X to resources matching the given keyword
  %%FULLTEXT_SEARCH(?X, ub:publicationText, "keyword")%%
            keyword search that binds variable ?X to resources matching the given keyword only in the given predicate
  %%FULLTEXT_SEARCH(?X, ub:publicationText, "keyword", ?score)%%
            additionally returns the relevance score of the matching resource
  %%FULLTEXT_SEARCH(?X, ub:publicationText, "keyword", ?snippet)%%
            additionally returns a snippet of the matching content
  %%FULLTEXT_SEARCH(?X, ub:publicationText, "keyword", ?score, k)%%
            restricts the number of matching resources to the top k
  %%FULLTEXT_SEARCH(?X, ub:publicationText, "keyword", ?score, l)%%
            restricts the matching resources to those whose score exceeds the given limit l
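
To make the macro forms concrete, the following Python sketch extracts the arguments of every %%FULLTEXT_SEARCH(...)%% macro from a query template. It is not the parser used by UBTft. The names FulltextMacro and parse_macros are hypothetical, and the sketch assumes well-formed macros whose keywords are double-quoted and do not themselves contain double quotes.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Matches one macro and captures its raw argument list.
MACRO = re.compile(r"%%FULLTEXT_SEARCH\((.*?)\)%%")
# An argument is either a double-quoted keyword or a run without commas/spaces.
ARGUMENT = re.compile(r'"[^"]*"|[^,\s]+')


class FulltextMacro(NamedTuple):
    variable: str              # e.g. "?X"
    predicate: Optional[str]   # e.g. "ub:publicationText", or None
    keyword: str               # keyword without the surrounding quotes
    extras: Tuple[str, ...]    # remaining arguments such as "?score", "10"


def parse_macros(template):
    """Return one FulltextMacro per %%FULLTEXT_SEARCH(...)%% in the template."""
    macros = []
    for match in MACRO.finditer(template):
        tokens = ARGUMENT.findall(match.group(1))
        variable = tokens[0]
        if tokens[1].startswith('"'):
            # Two-argument form: no predicate restriction.
            predicate, rest = None, tokens[1:]
        else:
            predicate, rest = tokens[1], tokens[2:]
        keyword = rest[0].strip('"')
        macros.append(FulltextMacro(variable, predicate, keyword, tuple(rest[1:])))
    return macros
```

The %%FULLTEXT_SEARCH_PREFIX%% macro takes no arguments and is deliberately ignored by this parser, since it can be replaced by plain string substitution.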

For the following RDF stores, the macros have to be replaced according to these examples:

Jena + LARQ
  Namespace:
            PREFIX arq: <http://jena.hpl.hp.com/ARQ/property#>
  Examples:
            ?lit arq:textMatch "keyword" .
            ?X ub:publicationText ?lit .

            (?lit ?score) arq:textMatch ("keyword" 10) .
            ?X ub:publicationText ?lit .

            (?lit ?score) arq:textMatch ("keyword" 0.75) .
            ?X ub:publicationText ?lit .

Sesame2 + LuceneSail
  Namespace:
            PREFIX ls: <http://www.openrdf.org/contrib/lucenesail#>
  Example:
            ?X ls:matches [
             rdf:type ls:LuceneQuery ;
             ls:query "keyword" ;
             ls:property ub:publicationText ;
             ls:score ?score ;
             ls:snippet ?snippet
            ]

Virtuoso5
  Namespace:
            the bif namespace is built in
  Example:
            ?X ?p ?lit .
            ?lit bif:contains '"keyword"' .

YARS
  Namespace:
            PREFIX yars: <http://sw.deri.org/2004/06/yars#>
  Example:
            ?lit yars:keyword "keyword" .
            ?X ub:publicationText ?lit .
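
Putting a template and the store-specific patterns together, macro replacement can be sketched in Python as follows. This is a minimal illustration and not the UBTft implementation: it only handles the three-argument form restricted to a predicate, covers only the Jena + LARQ and YARS patterns listed above, and the names PREFIXES, PATTERNS, and expand are hypothetical.

```python
import re

# Namespace declarations, copied from the store examples above.
PREFIXES = {
    "larq": "PREFIX arq: <http://jena.hpl.hp.com/ARQ/property#>",
    "yars": "PREFIX yars: <http://sw.deri.org/2004/06/yars#>",
}

# Graph patterns for %%FULLTEXT_SEARCH(?X, predicate, "keyword")%%.
# Note: ?lit is fixed, so a template using the macro twice would need
# distinct literal variables (not handled in this sketch).
PATTERNS = {
    "larq": '?lit arq:textMatch "{keyword}" .\n{var} {predicate} ?lit .',
    "yars": '?lit yars:keyword "{keyword}" .\n{var} {predicate} ?lit .',
}

MACRO = re.compile(
    r'%%FULLTEXT_SEARCH\(\s*(\?\w+)\s*,\s*([\w:]+)\s*,\s*"([^"]*)"\s*\)%%'
)


def expand(template, store):
    """Replace the prefix macro and three-argument search macros for one store."""
    query = template.replace("%%FULLTEXT_SEARCH_PREFIX%%", PREFIXES[store])

    def substitute(match):
        var, predicate, keyword = match.groups()
        return PATTERNS[store].format(var=var, predicate=predicate, keyword=keyword)

    return MACRO.sub(substitute, query)
```

Keeping the store-specific text in two small dictionaries means that supporting another store in this sketch only requires adding its prefix and pattern, provided its fulltext syntax fits a single graph pattern.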