# Corpus Query Language
Within Lucene and Solr, each field containing tokenized text can be regarded as a set of tokens, where each token is associated with a position and its value can be seen as a word from the original text. Mtas extends this concept in two ways: a token can be associated with multiple positions, and each token carries a prefix and an optional postfix instead of a single value. This makes it possible to place multiple tokens on the same position, to distinguish annotations by using a unique prefix for each type, and to represent structures such as sentences, paragraphs or entities that span multiple adjacent or non-adjacent positions.
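
The token model described above can be pictured with a small sketch. This is only an illustration in Python, not the Mtas implementation: the `Token` class and the `tokens_at` helper are hypothetical names, the prefixes `t`, `pos` and `s` are borrowed from the examples further down this page, and the postfix `kat` is made-up sample data.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical model of a token: a prefix, an optional postfix, one or more positions."""
    prefix: str
    positions: Tuple[int, ...]
    postfix: Optional[str] = None


# Two words, each annotated with a word form (t) and a part of speech (pos),
# plus one sentence token (s) that spans both positions and has no postfix.
tokens = [
    Token("t", (0,), "de"),
    Token("pos", (0,), "LID"),
    Token("t", (1,), "kat"),
    Token("pos", (1,), "N"),
    Token("s", (0, 1)),
]


def tokens_at(position, tokens):
    """Return every token that covers the given position."""
    return [tok for tok in tokens if position in tok.positions]


print([tok.prefix for tok in tokens_at(0, tokens)])  # ['t', 'pos', 's']
```

Position 0 carries three tokens at once, and the prefix is what tells the word form, the part of speech and the sentence annotation apart.
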
To describe sets of tokens matching some condition, a query language is needed. Mtas supports CQL, a query language based on the Corpus Query Language introduced by the [Corpus WorkBench](http://cwb.sourceforge.net/files/CQP_Tutorial/) and supported by the Lexicom [Sketch Engine](http://www.sketchengine.co.uk/documentation/wiki/SkE/CorpusQuerying).
<a name="prefix"></a>
#### Prefix
For each field containing Mtas tokenized text, every token is associated with a prefix. Within the field, only a limited set of prefixes is used to distinguish between the different types of annotation. A [prefix query](search_component_prefix.html) produces a full list of the prefixes in use.
<a name="value"></a>
#### Value
The optional postfix associated with a token can be queried within CQL by providing a *value*. This value is a regular expression; the supported syntax is documented in the RegExp class provided by Lucene. A [termvector query](search_component_termvector.html) can produce a list of postfix values for each [prefix](#prefix).
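
To get a feel for how a *value* selects postfixes, the sketch below filters a list of sample postfix values with a regular expression. It rests on two assumptions: Python's `re` module stands in for Lucene's RegExp class, which has its own syntax and only overlaps with `re` for simple patterns like these, and the expression is taken to match the whole postfix rather than a substring. The `matching_values` helper and the sample values are hypothetical.

```python
import re


def matching_values(pattern, values):
    """Return the values that the pattern matches in full (assumed whole-term matching)."""
    compiled = re.compile(pattern)
    return [value for value in values if compiled.fullmatch(value)]


# Made-up postfix values, as a termvector query might list them for one prefix.
postfixes = ["de", "het", "een", "deze"]

print(matching_values("de", postfixes))      # ['de']
print(matching_values("de.*", postfixes))    # ['de', 'deze']
print(matching_values("de|het", postfixes))  # ['de', 'het']
```
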
<a name="variable"></a>
#### Variable
The optional postfix associated with a token can also be queried within CQL by providing a *variable*. Each variable may occur only once in a CQL query, and the variables should be provided as a comma-separated list together with the query. Each provided variable has to occur in the query.
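
The two rules on variables can be checked before a query is sent. The following is a minimal sketch, assuming that variables appear in the query as `$name` (as in `t=$1`) and that the caller holds the provided variable names as a list; `check_variables` is a hypothetical helper, and its scan is naive in that it does not skip `$` characters inside quoted values.

```python
import re

# Assumed variable notation: a dollar sign followed by a name, as in t=$1.
VARIABLE = re.compile(r"\$(\w+)")


def check_variables(query, provided):
    """Return a list of rule violations; an empty list means the query passes."""
    used = VARIABLE.findall(query)
    problems = []
    # Rule 1: each variable may occur only once in the query.
    for name in sorted(set(used)):
        if used.count(name) > 1:
            problems.append(f"variable ${name} occurs more than once")
    # Rule 2: each provided variable has to occur in the query.
    for name in provided:
        if name not in used:
            problems.append(f"provided variable ${name} does not occur in the query")
    return problems


print(check_variables("[t=$1][pos=$2]", ["1", "2"]))  # []
print(check_variables("[t=$1][t=$1]", ["1"]))
print(check_variables("[t=$1]", ["1", "2"]))
```
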
<a name="cql"></a>
## CQL
| Syntax | Description | Example |
|---------------------------------------|----------------------------------|--------------|
| [token](#token) | Matches a single position token | `[t="de"]` |
| [multi-position](#multi-position) | Matches a (single or) multi position token | `<s/>` |
| [sequence](#sequence) | Matches a sequence | `[pos="ADJ"]{2}[pos="N"]` |

| Syntax | Description | Example |
|---------------------------------------|----------------------------------|--------------|
| [cql](#cql) **{** \<number\> **}** | Matches the given number of occurrences of [cql](#cql) | `[pos="ADJ"]{2}` |
| [cql](#cql) **{** \<number\> , \<number\> **}** | Matches any number of occurrences of [cql](#cql) between the given minimum and maximum | `[pos="ADJ"]{2,3}` |

| Syntax | Description | Example |
|---------------------------------------|-------------------------------------------------|---------|
| **\(** [cql](#cql) **\) within \(** [cql](#cql) **\)** | Matches CQL expression within another CQL expression | `([t="de"]) within (<s/>)` |
| **\(** [cql](#cql) **\) !within \(** [cql](#cql) **\)** | Matches CQL expression not within another CQL expression | `([t="de"]) !within (<s/>)` |
| **\(** [cql](#cql) **\) containing \(** [cql](#cql) **\)** | Matches CQL expression containing another CQL expression | `(<s/>) containing ([t="de"])` |
| **\(** [cql](#cql) **\) !containing \(** [cql](#cql) **\)** | Matches CQL expression not containing another CQL expression | `(<s/>) !containing ([t="de"])` |
| **\(** [cql](#cql) **\) intersecting \(** [cql](#cql) **\)** | Matches CQL expression intersecting another CQL expression | `(<s/>) intersecting (<div/>)` |
| **\(** [cql](#cql) **\) !intersecting \(** [cql](#cql) **\)** | Matches CQL expression not intersecting another CQL expression | `(<s/>) !intersecting (<div/>)` |
| **\(** [cql](#cql) **\) fullyalignedwith \(** [cql](#cql) **\)** | Matches CQL expression fully aligned with another CQL expression | `(<s/>) fullyalignedwith (<div/>)` |
| **\(** [cql](#cql) **\) !fullyalignedwith \(** [cql](#cql) **\)** | Matches CQL expression not fully aligned with another CQL expression | `(<s/>) !fullyalignedwith (<div/>)` |
| **\(** [cql](#cql) **\) followedby \(** [cql](#cql) **\)** | Matches CQL expression followed by another CQL expression | `([t="de"]) followedby ([pos="ADJ"])` |
| **\(** [cql](#cql) **\) !followedby \(** [cql](#cql) **\)** | Matches CQL expression not followed by another CQL expression | `([t="de"]) !followedby ([pos="ADJ"])` |
| **\(** [cql](#cql) **\) precededby \(** [cql](#cql) **\)** | Matches CQL expression preceded by another CQL expression | `([pos="ADJ"]) precededby ([t="de"])` |
| **\(** [cql](#cql) **\) !precededby \(** [cql](#cql) **\)** | Matches CQL expression not preceded by another CQL expression | `([pos="ADJ"]) !precededby ([t="de"])` |
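
The operators in the table above all share one shape: two parenthesized CQL expressions joined by a keyword, optionally negated with `!`. A small string builder makes that pattern explicit. This is only a sketch; `combine` and `OPERATORS` are hypothetical names, and the function assembles query text without checking that the operands are valid CQL.

```python
# Keywords taken from the table above.
OPERATORS = {
    "within",
    "containing",
    "intersecting",
    "fullyalignedwith",
    "followedby",
    "precededby",
}


def combine(left, operator, right, negate=False):
    """Build '(left) operator (right)', prefixing the operator with '!' when negated."""
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    keyword = f"!{operator}" if negate else operator
    return f"({left}) {keyword} ({right})"


print(combine('[t="de"]', "within", "<s/>"))
# ([t="de"]) within (<s/>)
print(combine("<s/>", "containing", '[t="de"]', negate=True))
# (<s/>) !containing ([t="de"])
```
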
<a name="token"></a>
## Token
| Syntax | Description | Example |
|-------------------------------------|-------------------------------------------------|---------|
| **\[ \]** | Matches each single position token | `[]` |
| **"** [value](#value) **"** | Matches a single position token with condition defined by a basic [single-position-expression](#single-position-expression), where the prefix is the default prefix provided with the query | `"de"` |
| **\[** [single-position-expression](#single-position-expression) **\]** | Matches single position token with condition defined by a [single-position-expression](#single-position-expression) | `[t="de"]` |
<a name="single-position-expression"></a>
#### Single Position Expression
| Expression | Syntax | Example |
|-------------|---------------------------------------------|---------|
| basic | [prefix](#prefix) **= \"**[value](#value)**\"** | `t="de"` |
| variable | [prefix](#prefix) **= $**[variable-name] | `t=$1` |
| not | **\!** [single-position-expression](#single-position-expression) | `!t="de"` |
| and | **\(** [single-position-expression](#single-position-expression) **\&** [single-position-expression](#single-position-expression) **\&** ... **\)** | `t="de" & pos="LID"`|
| or | **\(** [single-position-expression](#single-position-expression) **\|** [single-position-expression](#single-position-expression) **\|** ... **\)** | `t="de" \| t="het"` |
| position | **\#** \<position\> | `#100` |
| range | **\#** \<position\> **-** \<position\> | `#100-110` |
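
Single position expressions nest, so composing them from small functions can be easier than writing the string by hand. The helpers below (`basic`, `negate`, `all_of`, `any_of`, `token`) are hypothetical and follow the syntax column of the table above, including the parentheses around *and* and *or*; they do not escape quotes or other special characters inside a value.

```python
def basic(prefix, value):
    """prefix="value"; the value is inserted as-is, without escaping."""
    return f'{prefix}="{value}"'


def negate(expression):
    """!expression"""
    return f"!{expression}"


def all_of(*expressions):
    """(a & b & ...)"""
    return "(" + " & ".join(expressions) + ")"


def any_of(*expressions):
    """(a | b | ...)"""
    return "(" + " | ".join(expressions) + ")"


def token(expression=""):
    """Wrap an expression in square brackets; no expression gives []."""
    return f"[{expression}]"


print(token(basic("t", "de")))
# [t="de"]
print(token(all_of(basic("t", "de"), basic("pos", "LID"))))
# [(t="de" & pos="LID")]
print(token(any_of(basic("t", "de"), negate(basic("pos", "N")))))
# [(t="de" | !pos="N")]
```
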
<a name="multi-position"></a>
## Multi-position
| Syntax | Description | Example |
|-------------------------------------|-------------------------------------------------|---------|
| **\<** [multi-position-expression](#multi-position-expression) **/\>** | Matches (single and) multi position tokens with condition defined by [multi-position-expression](#multi-position-expression) | `<s/>` |
| **\<** [multi-position-expression](#multi-position-expression) **\>** | Matches start of (single and) multi position tokens with condition defined by [multi-position-expression](#multi-position-expression) | `<s>` |
| **\</** [multi-position-expression](#multi-position-expression) **\>** | Matches end of (single and) multi position tokens with condition defined by [multi-position-expression](#multi-position-expression) | `</s>` |
<a name="multi-position-expression"></a>
#### Multi Position Expression
| Expression | Syntax |
|-------------|---------------------------------------------------|
| prefix | [prefix](#prefix) |
| basic | [prefix](#prefix) **= \"**[value](#value)**\"** |
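
The three multi-position forms differ only in where the slash goes, which a single helper can capture. This sketch is an assumption-laden illustration: `multi_position` and its `part` argument are hypothetical, the `entity` prefix with value `loc` is made-up sample data, and the combination of the *basic* expression with the `<.../>` form is derived from the two tables above rather than from a documented example.

```python
def multi_position(prefix, value=None, part="whole"):
    """Build a multi-position query for the whole token, its start, or its end."""
    # 'prefix' expression when no value is given, 'basic' expression otherwise.
    expression = prefix if value is None else f'{prefix}="{value}"'
    if part == "whole":
        return f"<{expression}/>"
    if part == "start":
        return f"<{expression}>"
    if part == "end":
        return f"</{expression}>"
    raise ValueError(f"unknown part: {part}")


print(multi_position("s"))                # <s/>
print(multi_position("s", part="start"))  # <s>
print(multi_position("s", part="end"))    # </s>
print(multi_position("entity", "loc"))    # <entity="loc"/>
```
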
<a name="sequence"></a>
## Sequence
| Syntax | Description | Example |
|---------------------------------------|----------------------------------|--------------|
| [cql](#cql) [cql](#cql) [cql](#cql)... | A sequence of [cql](#cql) | `[t="de"][pos="ADJ"]{2}[pos="N"]` |
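
A sequence is written by placing CQL expressions directly after one another, and a repetition by appending a count in braces. The sketch below reproduces the example from the table; `repeat` and `sequence` are hypothetical helper names that only assemble query text.

```python
def repeat(cql, minimum, maximum=None):
    """Append {minimum} or {minimum,maximum} to a CQL expression."""
    if maximum is None:
        return f"{cql}{{{minimum}}}"
    return f"{cql}{{{minimum},{maximum}}}"


def sequence(*parts):
    """Concatenate CQL expressions into a sequence."""
    return "".join(parts)


query = sequence('[t="de"]', repeat('[pos="ADJ"]', 2), '[pos="N"]')
print(query)                          # [t="de"][pos="ADJ"]{2}[pos="N"]
print(repeat('[pos="ADJ"]', 2, 3))    # [pos="ADJ"]{2,3}
```
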