Corpus Query Language
Within Lucene and Solr, each field containing tokenized text can be regarded as a set of tokens, where each token is associated with a position and its value can be seen as a word from the original text. Mtas extends this concept in two ways: a token can be associated with multiple positions, and each token carries a prefix and an optional postfix instead of a single value. This makes it possible to store multiple tokens at the same position, to distinguish annotations by using a unique prefix for each type, and to represent structures such as sentences, paragraphs or entities that consist of multiple adjacent or non-adjacent positions.
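The token model described above can be pictured with a small data structure. This is only an illustrative sketch in Python, not how Mtas stores tokens internally; the `Token` class, the `tokens_at` helper and the sample annotations are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical model of one token: prefix, optional postfix, positions."""
    prefix: str                 # annotation type, e.g. "t", "pos" or "s"
    postfix: Optional[str]      # optional value, e.g. "de" or "ADJ"
    positions: FrozenSet[int]   # one or more positions, not necessarily adjacent


def tokens_at(tokens: List[Token], position: int) -> List[Token]:
    """Return every token that covers the given position."""
    return [t for t in tokens if position in t.positions]


# Made-up sample: two words, their part-of-speech tags, and one sentence.
tokens = [
    Token("t", "de", frozenset({0})),
    Token("pos", "LID", frozenset({0})),
    Token("t", "kat", frozenset({1})),
    Token("pos", "N", frozenset({1})),
    Token("s", None, frozenset({0, 1})),   # multi-position token without postfix
]

# Several tokens share position 0; the prefix tells them apart.
print([t.prefix for t in tokens_at(tokens, 0)])  # ['t', 'pos', 's']
```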
To describe sets of tokens matching some condition, a query language is needed. Mtas supports CQL, a query language based on the Corpus Query Language introduced by the Corpus WorkBench and supported by the Lexicom Sketch Engine.
Prefix
For each field containing Mtas-tokenized text, every token is associated with a prefix. Within the field, only a limited set of prefixes is used to distinguish between the different types of annotation. A prefix query produces the full list of prefixes in use.
Value
The optional postfix associated with a token can be queried within CQL by providing a value. This value is a regular expression; the supported syntax is documented in the RegExp class provided by Lucene. A termvector query produces, for each prefix, a list of postfix values.
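To give a feel for querying postfix values with a regular expression, the sketch below filters a made-up list of values. It uses Python's `re` module purely as a stand-in: the Lucene RegExp syntax differs from Python's in several details, so this is an approximation and not the matching Mtas performs. The `matching_values` helper and the sample values are hypothetical.

```python
import re
from typing import List


def matching_values(values: List[str], pattern: str) -> List[str]:
    """Return the values that the pattern matches in full.

    The pattern has to cover the complete value, so fullmatch is used
    rather than search.
    """
    compiled = re.compile(pattern)
    return [v for v in values if compiled.fullmatch(v)]


# Made-up postfix values for a "pos" prefix.
pos_values = ["ADJ", "ADV", "N", "LID", "WW"]

print(matching_values(pos_values, "AD."))    # ['ADJ', 'ADV']
print(matching_values(pos_values, "N|WW"))   # ['N', 'WW']
```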
Variable
The optional postfix associated with a token can also be queried within CQL by providing a variable. Each variable may occur only once in a CQL query and must be supplied, as a comma-separated list, together with the query. Each provided variable has to occur in the query.
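The two rules for variables (at most one occurrence per query, and every provided variable must be used) can be checked before a query is sent. The sketch below is a hypothetical client-side check, not part of Mtas. It assumes that a variable is written as `$` followed by a name; that notation is an assumption made for illustration, so adjust the `VARIABLE` pattern to the notation your Mtas version actually accepts.

```python
import re
from typing import Dict, List

# Assumed notation for a variable reference: "$" followed by a name.
VARIABLE = re.compile(r"\$(\w+)")


def check_variables(query: str, provided: Dict[str, str]) -> List[str]:
    """Return a list of rule violations; an empty list means the query is fine."""
    used = VARIABLE.findall(query)
    errors = []
    # Rule 1: each variable may occur only once in the query.
    for name in sorted(set(used)):
        if used.count(name) > 1:
            errors.append(f"variable {name} occurs more than once")
    # Rule 2: each provided variable has to occur in the query.
    for name in provided:
        if name not in used:
            errors.append(f"provided variable {name} does not occur in the query")
    return errors


# Values are passed as comma-separated lists (sample data, made up).
print(check_variables("[t=$1][pos=$2]", {"1": "de,het", "2": "N"}))  # []
print(check_variables("[t=$1][t=$1]", {"1": "de"}))
```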
CQL
Syntax | Description | Example
token | Matches a single position token | [t="de"]
multi-position | Matches a (single or) multi position token | <s/>
sequence | Matches a sequence | [pos="ADJ"]{2}[pos="N"]
Syntax | Description | Example
cql { <number> } | Matches exactly the given number of occurrences of cql | [pos="ADJ"]{2}
cql { <number> , <number> } | Matches any number of occurrences of cql from the given minimum up to the given maximum | [pos="ADJ"]{2,3}
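The effect of the repetition syntax can be illustrated on a plain list of part-of-speech tags. The sketch below is a simplified model, not the Mtas implementation: `repeated_spans` is a hypothetical helper, spans are (start, end) pairs with an exclusive end, and reporting every overlapping match is an assumption.

```python
from typing import List, Tuple


def repeated_spans(tags: List[str], wanted: str,
                   minimum: int, maximum: int) -> List[Tuple[int, int]]:
    """Find spans of minimum..maximum consecutive positions tagged `wanted`."""
    spans = []
    for start in range(len(tags)):
        length = 0
        # Extend the run one position at a time, up to the maximum length.
        while (start + length < len(tags)
               and tags[start + length] == wanted
               and length < maximum):
            length += 1
            if length >= minimum:
                spans.append((start, start + length))
    return spans


tags = ["ADJ", "ADJ", "ADJ", "N"]   # made-up sample

# Corresponds to [pos="ADJ"]{2}
print(repeated_spans(tags, "ADJ", 2, 2))   # [(0, 2), (1, 3)]
# Corresponds to [pos="ADJ"]{2,3}
print(repeated_spans(tags, "ADJ", 2, 3))   # [(0, 2), (0, 3), (1, 3)]
```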
Syntax | Description | Example
( cql ) within ( cql ) | Matches CQL expression within another CQL expression | ([t="de"]) within (<s/>)
( cql ) !within ( cql ) | Matches CQL expression not within another CQL expression | ([t="de"]) !within (<s/>)
( cql ) containing ( cql ) | Matches CQL expression containing another CQL expression | (<s/>) containing ([t="de"])
( cql ) !containing ( cql ) | Matches CQL expression not containing another CQL expression | (<s/>) !containing ([t="de"])
( cql ) intersecting ( cql ) | Matches CQL expression intersecting another CQL expression | (<s/>) intersecting (<div/>)
( cql ) !intersecting ( cql ) | Matches CQL expression not intersecting another CQL expression | (<s/>) !intersecting (<div/>)
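The six relations in the table above can be modelled on simple intervals. The following is a minimal sketch under stated assumptions: each match is reduced to a (start, end) span with an exclusive end, which ignores that Mtas tokens may cover non-adjacent positions, and all function names and sample spans are hypothetical rather than taken from Mtas.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end exclusive


def is_within(inner: Span, outer: Span) -> bool:
    """True if inner lies completely inside outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersects(a: Span, b: Span) -> bool:
    """True if a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def within(left: List[Span], right: List[Span], negate: bool = False) -> List[Span]:
    # Keep left spans that are (or, if negated, are not) inside some right span.
    return [a for a in left if any(is_within(a, b) for b in right) != negate]


def containing(left: List[Span], right: List[Span], negate: bool = False) -> List[Span]:
    # Keep left spans that do (or do not) contain some right span.
    return [a for a in left if any(is_within(b, a) for b in right) != negate]


def intersecting(left: List[Span], right: List[Span], negate: bool = False) -> List[Span]:
    # Keep left spans that do (or do not) overlap some right span.
    return [a for a in left if any(intersects(a, b) for b in right) != negate]


# Made-up sample: two sentences, two hits for [t="de"], one div.
sentences = [(0, 4), (4, 9)]
de_hits = [(1, 2), (10, 11)]
divs = [(2, 4)]

print(within(de_hits, sentences))                  # [(1, 2)]
print(within(de_hits, sentences, negate=True))     # [(10, 11)]
print(containing(sentences, de_hits))              # [(0, 4)]
print(intersecting(sentences, divs, negate=True))  # [(4, 9)]
```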
Token
Single Position Expression
Multi-position
Multi Position Expression
Sequence
Syntax | Description | Example
cql cql cql... | A sequence of cql | [t="de"][pos="ADJ"]{2}[pos="N"]
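How the parts of a sequence combine can be sketched with a small backtracking matcher. This is an illustration only and not how Mtas evaluates queries: each position is modelled as a dictionary from prefix to postfix, each sequence element as a hypothetical (prefix, value, minimum, maximum) tuple, and values are compared by plain equality instead of as regular expressions.

```python
from typing import Dict, Iterator, List, Tuple

Element = Tuple[str, str, int, int]  # (prefix, value, minimum, maximum)


def match_sequence(positions: List[Dict[str, str]],
                   elements: List[Element], start: int) -> Iterator[int]:
    """Yield every end index (exclusive) at which the sequence matches from start."""
    if not elements:
        yield start
        return
    (prefix, value, minimum, maximum), rest = elements[0], elements[1:]
    end, count = start, 0
    while True:
        if count >= minimum:
            # Enough repetitions of this element: try the remaining elements.
            yield from match_sequence(positions, rest, end)
        if (count == maximum or end >= len(positions)
                or positions[end].get(prefix) != value):
            return
        end += 1
        count += 1


def find_sequence(positions: List[Dict[str, str]],
                  elements: List[Element]) -> List[Tuple[int, int]]:
    """Return all (start, end) spans matching the sequence."""
    return [(s, e) for s in range(len(positions))
            for e in match_sequence(positions, elements, s)]


# Made-up sample text with word ("t") and part-of-speech ("pos") annotations.
positions = [
    {"t": "de", "pos": "LID"},
    {"t": "grote", "pos": "ADJ"},
    {"t": "zwarte", "pos": "ADJ"},
    {"t": "kat", "pos": "N"},
]

# Corresponds to [t="de"][pos="ADJ"]{2}[pos="N"]
query = [("t", "de", 1, 1), ("pos", "ADJ", 2, 2), ("pos", "N", 1, 1)]
print(find_sequence(positions, query))  # [(0, 4)]
```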