# Multi Tier Annotation Search

In recent years, multiple solutions have become available that provide search on huge amounts of plain text and metadata. Scalable search on annotated text, however, still appears to be problematic. With Mtas, we not only take advantage of the strengths of Lucene and Solr, but also extend queries with CQL conditions on annotated text:


[pos="LID"] [pos="ADJ"]? [lemma="amsterdam"]

<entity="location"/> within (<s/> containing [lemma="utrecht"])
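To make the semantics of the first query concrete, here is a minimal toy sketch in Python. It is not how Mtas evaluates CQL; the token data, the `matches` and `find_matches` helpers, and the `TOKENS` and `PATTERN` names are all hypothetical and exist only for illustration. It assumes each token carries one value per annotation layer and that `?` marks an optional token position.

```python
# Toy illustration only: this is NOT the Mtas implementation.
# Hypothetical tokens, each a dict with one value per annotation layer.
TOKENS = [
    {"word": "Het", "pos": "LID", "lemma": "het"},
    {"word": "oude", "pos": "ADJ", "lemma": "oud"},
    {"word": "Amsterdam", "pos": "N", "lemma": "amsterdam"},
    {"word": "en", "pos": "VG", "lemma": "en"},
    {"word": "het", "pos": "LID", "lemma": "het"},
    {"word": "Amsterdam", "pos": "N", "lemma": "amsterdam"},
]

# [pos="LID"] [pos="ADJ"]? [lemma="amsterdam"] as (condition, optional) pairs.
PATTERN = [
    ({"pos": "LID"}, False),
    ({"pos": "ADJ"}, True),
    ({"lemma": "amsterdam"}, False),
]


def matches(token, condition):
    """Return True if the token satisfies every key/value pair in the condition."""
    return all(token.get(key) == value for key, value in condition.items())


def find_matches(tokens, pattern):
    """Yield (start, end) spans where the pattern matches; end is exclusive."""

    def match_from(pos, idx):
        # Yield every end position at which pattern[idx:] matches from tokens[pos].
        if idx == len(pattern):
            yield pos
            return
        condition, optional = pattern[idx]
        if pos < len(tokens) and matches(tokens[pos], condition):
            yield from match_from(pos + 1, idx + 1)
        if optional:
            # Skip the optional position without consuming a token.
            yield from match_from(pos, idx + 1)

    for start in range(len(tokens)):
        for end in sorted(set(match_from(start, 0))):
            yield (start, end)


spans = list(find_matches(TOKENS, PATTERN))
for start, end in spans:
    print(" ".join(token["word"] for token in TOKENS[start:end]))
# prints "Het oude Amsterdam" and "het Amsterdam"
```

The optional adjective is why both the three-token and the two-token phrase match. A real index resolves such conditions without scanning every token position, which this sketch does not attempt.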


Parsers for several document formats are provided, each with extensive configuration options, and advanced query features such as statistics, termvectors and kwic are available.

Source code and releases are available on GitHub; see the installation instructions to get started.


## Nederlab

One of the primary use cases for Mtas, the Nederlab project, currently¹ provides access, both in terms of metadata and 
annotated text, to over 15 million items for search and analysis as specified below. 


| | Total | Mean | Min | Max |
|---|---:|---:|---:|---:|
| Solr index size | 1,146 G | 49.8 G | 268 k | 163 G |
| Solr documents | 15,859,099 | 689,526 | 201 | 3,616,544 |


Collections are added and updated regularly by adding new cores, replacing cores and/or merging new cores with existing ones. Currently, the data is divided over 23 separate cores. For 14,663,457 of these documents, annotated text varying in size from 1 to over 3.5 million words is included:


| | Total | Mean | Min | Max |
|---|---:|---:|---:|---:|
| Words | 9,584,448,067 | 654 | 1 | 3,537,883 |
| Annotations | 36,486,292,912 | 2,488 | 4 | 23,589,831 |
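
The means reported above can be reproduced from the totals. The following sketch assumes that the Solr figures are averaged over the 23 cores and that the word and annotation figures are averaged over the 14,663,457 documents with annotated text. The source does not state this explicitly, but the numbers are consistent with it. The annotations-per-word ratio is a derived figure that does not appear in the source.

```python
# Totals as reported for the Nederlab index (situation January 2017).
CORES = 23
INDEX_TOTAL_G = 1_146
DOCS_TOTAL = 15_859_099
ANNOTATED_DOCS = 14_663_457
WORDS_TOTAL = 9_584_448_067
ANNOTATIONS_TOTAL = 36_486_292_912

# Assumption: Solr means are per core.
index_per_core = round(INDEX_TOTAL_G / CORES, 1)
docs_per_core = round(DOCS_TOTAL / CORES)

# Assumption: word and annotation means are per annotated document.
words_per_doc = round(WORDS_TOTAL / ANNOTATED_DOCS)
annotations_per_doc = round(ANNOTATIONS_TOTAL / ANNOTATED_DOCS)

# Derived figure, not stated in the source.
annotations_per_word = ANNOTATIONS_TOTAL / WORDS_TOTAL

print(index_per_core, docs_per_core)       # 49.8 689526
print(words_per_doc, annotations_per_doc)  # 654 2488
print(f"{annotations_per_word:.1f}")       # 3.8
```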


¹ Situation as of January 2017.