Corpus Query Language

Within Lucene and Solr, each field containing tokenized text can be regarded as a set of tokens, where each token is associated with a position and its value can be seen as a word from the original text. Mtas extends this concept in two ways: one token can be associated with multiple positions, and each token carries a prefix and an optional postfix instead of a single value. This makes it possible to place multiple tokens on the same position, to distinguish annotations by giving each type a unique prefix, and to represent structures such as sentences, paragraphs or entities that consist of multiple adjacent or non-adjacent positions.
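
The token model described above can be pictured with a small sketch. This is only an illustration in Python: the `Token` class, its field names, and the sample prefixes `t`, `pos`, `s` and `entity` are assumptions made for this example and are not part of the Mtas API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Illustrative stand-in for an Mtas token (not the Mtas class itself)."""
    prefix: str                     # annotation type, e.g. "t", "pos" or "s"
    positions: Tuple[int, ...]      # one or more positions, adjacent or not
    postfix: Optional[str] = None   # optional value, e.g. "de" or "ADJ"


# Several tokens can share a position; the prefix tells them apart.
tokens = [
    Token("t", (0,), "de"),
    Token("pos", (0,), "LID"),
    Token("t", (1,), "kat"),
    Token("pos", (1,), "N"),
    Token("s", (0, 1)),             # a sentence covering both positions
    Token("entity", (0, 3), "x"),   # a token on non-adjacent positions
]


def tokens_at(position, items=tokens):
    """Return every token that covers the given position."""
    return [tok for tok in items if position in tok.positions]


prefixes_at_0 = sorted(tok.prefix for tok in tokens_at(0))
```

Here position 0 carries four tokens at once, which is exactly what a single value per position could not express.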

To describe sets of tokens matching some condition, a query language is needed. Mtas supports a CQL dialect based on the Corpus Query Language introduced by the Corpus WorkBench and supported by the Lexicom Sketch Engine.


Prefix

For each field containing Mtas tokenized text, every token is associated with a prefix. Within a field, only a limited set of prefixes is used to distinguish between the different types of annotation. A prefix query produces the full list of prefixes in use.


Value

The optional postfix associated with a token can be queried within CQL by providing a value. This value is a regular expression; the supported syntax is documented in the RegExp class provided by Lucene. By using a termvector query, a list of postfix values can be produced for each prefix.
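
As a rough illustration of matching a value against a postfix, the sketch below uses Python's `re` module. This is an approximation only: Lucene's RegExp syntax is not the same as Python's, so the patterns are restricted to constructs both share, and the helper name `value_matches` is hypothetical. The sketch assumes the expression has to match the whole postfix, not a substring.

```python
import re


def value_matches(pattern, postfix):
    """Return True if the pattern matches the complete postfix.

    Python's `re` stands in for Lucene's RegExp here; the two syntaxes
    overlap only partly, so this is an approximation for simple patterns.
    """
    return re.fullmatch(pattern, postfix) is not None


exact = value_matches("de", "de")            # plain value
partial = value_matches("de", "deze")        # substring only, so no match
wildcard = value_matches("de.*", "deze")     # any continuation after "de"
choice = value_matches("A(DJ|DV)", "ADV")    # alternation
```

Under that assumption, `[t="de"]` would select only tokens whose postfix is exactly `de`, while `[t="de.*"]` would also select `deze`.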


Variable

The optional postfix associated with a token can also be queried within CQL by providing a variable. Each variable may occur only once in a CQL query and should be provided as a comma-separated list together with the query. Each provided variable has to occur in the query.
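
The two constraints on variables can be checked before a query is sent. The following is a minimal sketch, not Mtas code: the function `check_variables`, the assumption that variable names consist of letters, digits and underscores, and the sample values are all hypothetical. The scan is naive and does not skip `$` characters that appear inside quoted values.

```python
import re

# Assumed shape of a variable reference: "$" followed by a simple name.
VARIABLE = re.compile(r"\$([A-Za-z0-9_]+)")


def check_variables(cql, provided):
    """List violations of the rules for variables in a CQL query.

    `provided` maps a variable name to the string supplied for it.
    """
    problems = []
    used = VARIABLE.findall(cql)
    for name in sorted(set(used)):
        if used.count(name) > 1:
            problems.append(f"variable ${name} occurs more than once")
        if name not in provided:
            problems.append(f"variable ${name} has no value")
    for name in sorted(provided):
        if name not in used:
            problems.append(f"variable ${name} is provided but not used")
    return problems


ok = check_variables('[t=$1][pos="N"]', {"1": "de,het"})
repeated = check_variables("[t=$1][t=$1]", {"1": "de"})
unused = check_variables('[t="de"]', {"2": "x"})
```

An empty result means the query and the supplied variables are consistent with the rules stated above.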


CQL


| Syntax | Description | Example |
| --- | --- | --- |
| token | Matches a single position token | `[t="de"]` |
| multi-position | Matches a (single or) multi position token | `<s/>` |
| sequence | Matches a sequence | `[pos="ADJ"]{2}[pos="N"]` |


| Syntax | Description | Example |
| --- | --- | --- |
| `cql { <number> }` | Matches exactly the given number of occurrences of cql | `[pos="ADJ"]{2}` |
| `cql { <number> , <number> }` | Matches any number of occurrences of cql from the given start up to the given end | `[pos="ADJ"]{2,3}` |


| Syntax | Description | Example |
| --- | --- | --- |
| `( cql ) within ( cql )` | Matches CQL expression within another CQL expression | `([t="de"]) within (<s/>)` |
| `( cql ) !within ( cql )` | Matches CQL expression not within another CQL expression | `([t="de"]) !within (<s/>)` |
| `( cql ) containing ( cql )` | Matches CQL expression containing another CQL expression | `(<s/>) containing ([t="de"])` |
| `( cql ) !containing ( cql )` | Matches CQL expression not containing another CQL expression | `(<s/>) !containing ([t="de"])` |
| `( cql ) intersecting ( cql )` | Matches CQL expression intersecting another CQL expression | `(<s/>) intersecting (<div/>)` |
| `( cql ) !intersecting ( cql )` | Matches CQL expression not intersecting another CQL expression | `(<s/>) !intersecting (<div/>)` |
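
The six operators can be read as filters on the matches of the left expression. The sketch below illustrates that reading on plain position ranges. It is an assumption-laden simplification, not the Mtas implementation: matches are modelled as contiguous `(start, end)` pairs with both ends inclusive, and the helper names are hypothetical.

```python
def within(a, b):
    """True if span a lies completely inside span b."""
    return b[0] <= a[0] and a[1] <= b[1]


def containing(a, b):
    """True if span a completely covers span b."""
    return within(b, a)


def intersecting(a, b):
    """True if spans a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def select(left, right, relation, negate=False):
    """Keep the spans in `left` that relate to some span in `right`.

    With negate=True, keep the spans that relate to no span in `right`,
    mirroring the "!" variants of the operators.
    """
    return [a for a in left if any(relation(a, b) for b in right) != negate]


sentences = [(0, 4), (5, 9)]   # stand-in for matches of <s/>
words_de = [(2, 2), (12, 12)]  # stand-in for matches of [t="de"]
divs = [(3, 6)]                # stand-in for matches of <div/>

de_within_s = select(words_de, sentences, within)
de_not_within_s = select(words_de, sentences, within, negate=True)
s_containing_de = select(sentences, words_de, containing)
s_not_containing_de = select(sentences, words_de, containing, negate=True)
s_intersecting_div = select(sentences, divs, intersecting)
```

Note that the result always consists of matches of the left expression; the right expression only decides which of them are kept.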


Token


| Syntax | Description | Example |
| --- | --- | --- |
| `[ ]` | Matches each single position token | `[]` |
| `" value "` | Matches a single position token with the condition defined by a basic single-position-expression, where the prefix is the default prefix provided with the query | `"de"` |
| `[ single-position-expression ]` | Matches a single position token with the condition defined by a single-position-expression | `[t="de"]` |


Single Position Expression


| Expression | Syntax | Example |
| --- | --- | --- |
| basic | `prefix = "value"` | `t="de"` |
| variable | `prefix = $[variable-name]` | `t=$1` |
| not | `! single-position-expression` | `!t="de"` |
| and | `( single-position-expression & single-position-expression & ... )` | `t="de" & pos="LID"` |
| or | `( single-position-expression \| single-position-expression \| ... )` | |
| position | `# <position>` | `#100` |
| range | `# <position> - <position>` | `#100-110` |
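
To show how these expressions combine, the following sketch evaluates them against a toy index. It is a simplified model rather than Mtas behaviour: each position holds at most one postfix per prefix, values are matched with Python's `re` instead of Lucene's RegExp, a negated expression is assumed to match positions that lack the prefix altogether, and all function names are hypothetical.

```python
import re

# Toy index: position -> {prefix: postfix}.
index = {
    0: {"t": "de", "pos": "LID"},
    1: {"t": "grote", "pos": "ADJ"},
    2: {"t": "kat", "pos": "N"},
}


def basic(prefix, value):
    """prefix = "value": the postfix for this prefix matches the value."""
    def test(pos):
        postfix = index.get(pos, {}).get(prefix)
        return postfix is not None and re.fullmatch(value, postfix) is not None
    return test


def not_(expr):
    return lambda pos: not expr(pos)


def and_(*exprs):
    return lambda pos: all(expr(pos) for expr in exprs)


def or_(*exprs):
    return lambda pos: any(expr(pos) for expr in exprs)


def position(start, end=None):
    """#start or #start-end: match on the position number itself."""
    last = start if end is None else end
    return lambda pos: start <= pos <= last


def matches(expr):
    """Return all indexed positions that satisfy the expression."""
    return [pos for pos in sorted(index) if expr(pos)]


de_and_lid = matches(and_(basic("t", "de"), basic("pos", "LID")))
not_de = matches(not_(basic("t", "de")))
adj_or_n = matches(or_(basic("pos", "ADJ"), basic("pos", "N")))
in_range = matches(position(1, 2))
```

Each expression is evaluated for one position at a time, which is what distinguishes it from the multi-position expressions described next.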


Multi-position


| Syntax | Description | Example |
| --- | --- | --- |
| `< multi-position-expression />` | Matches (single and) multi position tokens with condition defined by multi-position-expression | `<s/>` |
| `< multi-position-expression >` | Matches start of (single and) multi position tokens with condition defined by multi-position-expression | `<s>` |
| `</ multi-position-expression >` | Matches end of (single and) multi position tokens with condition defined by multi-position-expression | `</s>` |


Multi Position Expression


| Expression | Syntax |
| --- | --- |
| prefix | `prefix` |
| basic | `prefix = "value"` |


Sequence


| Syntax | Description | Example |
| --- | --- | --- |
| `cql cql cql...` | A sequence of cql | `[t="de"][pos="ADJ"]{2}[pos="N"]` |
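
A sequence with occurrence counts can be understood as consecutive steps, each consuming a bounded number of adjacent positions. The sketch below illustrates this on a five-word toy text. It is not how Mtas evaluates queries: the text layout, the `has` helper and `find_sequences` are hypothetical, values are compared literally, not as regular expressions, and every step is assumed to require at least one occurrence.

```python
def has(prefix, value):
    """Condition for one position: the prefix carries exactly this value."""
    return lambda token: token.get(prefix) == value


def find_sequences(text, steps):
    """Return all (start, end) ranges, inclusive, that match the steps.

    `steps` is a list of (condition, minimum, maximum) triples, so
    [pos="ADJ"]{2} becomes (has("pos", "ADJ"), 2, 2).
    """
    results = set()

    def walk(start, pos, step_index):
        if step_index == len(steps):
            results.add((start, pos - 1))
            return
        test, low, high = steps[step_index]
        count = 0
        while True:
            # Enough occurrences consumed: try the remaining steps from here.
            if count >= low:
                walk(start, pos + count, step_index + 1)
            at = pos + count
            if count == high or at >= len(text) or not test(text[at]):
                break
            count += 1

    for start in range(len(text)):
        walk(start, start, 0)
    return sorted(results)


text = [
    {"t": "de", "pos": "LID"},
    {"t": "grote", "pos": "ADJ"},
    {"t": "zwarte", "pos": "ADJ"},
    {"t": "kat", "pos": "N"},
    {"t": "slaapt", "pos": "V"},
]

# [t="de"][pos="ADJ"]{2}[pos="N"]
exact = find_sequences(text, [
    (has("t", "de"), 1, 1),
    (has("pos", "ADJ"), 2, 2),
    (has("pos", "N"), 1, 1),
])

# [pos="ADJ"]{1,2}[pos="N"]
ranged = find_sequences(text, [
    (has("pos", "ADJ"), 1, 2),
    (has("pos", "N"), 1, 1),
])
```

With a range such as `{1,2}`, the same noun can end more than one match, because each allowed number of adjectives yields its own result.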