src/site/markdown/search_cql.md 8.3 KB
Matthijs Brouwer authored
# Corpus Query Language

Within Lucene and Solr, each field containing tokenized text can be considered a set of tokens, where each token is associated with a position and its value can be seen as a word from the original text. Mtas extends this concept in two ways: a token can be associated with multiple positions, and each token carries a prefix and an optional postfix instead of a single value. This makes it possible to place multiple tokens on the same position, to distinguish annotations by using a unique prefix for each type, and to represent structures such as sentences, paragraphs or entities that consist of multiple adjacent or non-adjacent positions.
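The token model described above can be pictured with a small Python sketch. The `Token` class, the `entity` prefix and the sample data are illustrative assumptions and do not mirror the actual Mtas classes; the sketch only shows that several tokens may share a position and that one token may cover several, possibly non-adjacent, positions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical model of one token (not the real Mtas class)."""
    prefix: str                 # annotation type, e.g. "t" or "pos"
    postfix: Optional[str]      # optional value, e.g. "de"; None if absent
    positions: Tuple[int, ...]  # one or more positions, possibly non-adjacent


# Made-up sample data for a three-position text
tokens = [
    Token("t", "de", (0,)),
    Token("pos", "LID", (0,)),     # second token on the same position
    Token("t", "kat", (1,)),
    Token("s", None, (0, 1)),      # sentence covering two adjacent positions
    Token("entity", "x", (0, 2)),  # token covering non-adjacent positions
]


def tokens_at(position, prefix=None):
    """Return tokens covering a position, optionally limited to one prefix."""
    return [t for t in tokens
            if position in t.positions and (prefix is None or t.prefix == prefix)]
```

Calling `tokens_at(0)` returns all four tokens that cover position 0, while `tokens_at(1, "t")` returns only the word token on position 1.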

To describe sets of tokens matching some condition, a query language is needed. Mtas supports CQL based on the Corpus Query Language introduced by the [Corpus WorkBench](http://cwb.sourceforge.net/files/CQP_Tutorial/) and supported by the Lexicom [Sketch Engine](http://www.sketchengine.co.uk/documentation/wiki/SkE/CorpusQuerying).

<a name="prefix"></a>

#### Prefix
For each field containing Mtas tokenized text, every token is associated with a prefix. Within the field, only a limited set of prefixes is used to distinguish between the different types of annotation. By using a [prefix query](search_component_prefix.html) a full list of used prefixes can be produced.

<a name="value"></a>

#### Value
The optional postfix associated with a token can be queried within CQL by providing a *value*. The value is a regular expression; the supported syntax is documented in the `RegExp` class provided by Lucene. A [termvector query](search_component_termvector.html) can produce a list of postfix values for each [prefix](#prefix).
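As a rough illustration of how a value selects postfixes, the sketch below filters a made-up list of postfix values with Python's `re` module. Two assumptions apply: the expression has to match the whole postfix rather than a part of it, and Python's regular expression syntax is only a stand-in for the Lucene `RegExp` syntax, which differs in several details.

```python
import re

# Hypothetical postfix values for the prefix "t"
postfixes = ["de", "den", "het", "de-facto"]


def matching_values(pattern, values):
    """Return the values that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.fullmatch(v)]


exact = matching_values("de", postfixes)      # only the exact value "de"
starts = matching_values("de.*", postfixes)   # every value starting with "de"
```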
<a name="variable"></a>

#### Variable

The optional postfix associated with a token can also be queried within CQL by providing a *variable*. Each variable may occur only once in a CQL query, and its values should be provided as a comma-separated list together with the query. Every provided variable has to occur in the query.
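The two rules above can be checked before a query is sent. The following sketch is a hypothetical client-side helper, not part of Mtas: it assumes variables appear in the query as `$name` and that the comma separated list holds the values for a variable.

```python
import re


def check_variables(cql, variables):
    """Validate the two rules above; raise ValueError when one is violated.

    `variables` maps a variable name to its comma separated values.
    """
    used = re.findall(r"\$(\w+)", cql)
    repeated = {name for name in used if used.count(name) > 1}
    if repeated:
        raise ValueError(f"variable used more than once: {sorted(repeated)}")
    unused = set(variables) - set(used)
    if unused:
        raise ValueError(f"variable not in query: {sorted(unused)}")
    # Split each provided list into its separate values
    return {name: variables[name].split(",") for name in used if name in variables}
```

For example, `check_variables('[t=$1]', {"1": "de,het"})` returns the two values for variable `1`, whereas a query that uses `$1` twice, or a provided variable that is missing from the query, raises a `ValueError`.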

<a name="cql"></a>

## CQL

| Syntax                                | Description                      | Example      |
|---------------------------------------|----------------------------------|--------------|
| [token](#token)                       | Matches a single position token  | `[t="de"]` |
| [multi-position](#multi-position)     | Matches a (single or) multi position token   | `<s/>`       |
| [sequence](#sequence)                 | Matches a sequence               | `[pos="ADJ"]{2}[pos="N"]` |

| Syntax                                | Description                      | Example      |
|---------------------------------------|----------------------------------|--------------|
| [cql](#cql) **{** \<number\> **}**     | Matches exactly the provided number of occurrences of [cql](#cql) | `[pos="ADJ"]{2}` |
| [cql](#cql) **{** \<number\> , \<number\> **}** | Matches any number of occurrences of [cql](#cql) between the provided minimum and maximum | `[pos="ADJ"]{2,3}` |
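One way to read the repetition syntax is as shorthand for writing the same expression several times in a row. The hypothetical helper below lists the explicit sequences a repetition stands for; it assumes that both bounds are inclusive, which the table does not state explicitly.

```python
def expand_repetition(cql, minimum, maximum=None):
    """List the explicit sequences that cql{minimum,maximum} stands for."""
    if maximum is None:
        maximum = minimum  # cql{n} is treated as cql{n,n}
    if minimum < 0 or maximum < minimum:
        raise ValueError("expected 0 <= minimum <= maximum")
    return [cql * count for count in range(minimum, maximum + 1)]


two_or_three = expand_repetition('[pos="ADJ"]', 2, 3)
```

Here `two_or_three` holds two alternatives: the expression written twice and the expression written three times.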



| Syntax                                | Description                                     | Example |
|---------------------------------------|-------------------------------------------------|---------|
| **\(** [cql](#cql) **\) within \(** [cql](#cql) **\)**  | Matches CQL expression within another CQL expression   | `([t="de"]) within (<s/>)` |
| **\(** [cql](#cql) **\) !within \(** [cql](#cql) **\)**  | Matches CQL expression not within another CQL expression   | `([t="de"]) !within (<s/>)` |
| **\(** [cql](#cql) **\) containing \(** [cql](#cql) **\)**  | Matches CQL expression containing another CQL expression   | `(<s/>) containing ([t="de"])` |
| **\(** [cql](#cql) **\) !containing \(** [cql](#cql) **\)**  | Matches CQL expression not containing another CQL expression   | `(<s/>) !containing ([t="de"])` |
| **\(** [cql](#cql) **\) intersecting \(** [cql](#cql) **\)**  | Matches CQL expression intersecting another CQL expression   | `(<s/>) intersecting (<div/>)` |
| **\(** [cql](#cql) **\) !intersecting \(** [cql](#cql) **\)**  | Matches CQL expression not intersecting another CQL expression   | `(<s/>) !intersecting (<div/>)` |
| **\(** [cql](#cql) **\) fullyalignedwith \(** [cql](#cql) **\)**  | Matches CQL expression fully aligned with another CQL expression   | `(<s/>) fullyalignedwith (<div/>)` |
| **\(** [cql](#cql) **\) !fullyalignedwith \(** [cql](#cql) **\)**  | Matches CQL expression not fully aligned with another CQL expression   | `(<s/>) !fullyalignedwith (<div/>)` |
| **\(** [cql](#cql) **\) followedby \(** [cql](#cql) **\)**  | Matches CQL expression followed by another CQL expression   | `([t="de"]) followedby ([pos="ADJ"])` |
| **\(** [cql](#cql) **\) !followedby \(** [cql](#cql) **\)**  | Matches CQL expression not followed by another CQL expression   | `([t="de"]) !followedby ([pos="ADJ"])` |
| **\(** [cql](#cql) **\) precededby \(** [cql](#cql) **\)**  | Matches CQL expression preceded by another CQL expression   | `([pos="ADJ"]) precededby ([t="de"])` |
| **\(** [cql](#cql) **\) !precededby \(** [cql](#cql) **\)**  | Matches CQL expression not preceded by another CQL expression   | `([pos="ADJ"]) !precededby ([t="de"])` |
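The operators in this table relate the matches of two expressions by their positions. The sketch below models a match as an inclusive `(start, end)` pair and gives one plausible reading of each operator. These definitions are assumptions for illustration, in particular that `followedby` and `precededby` require directly adjacent matches and that `fullyalignedwith` requires identical start and end; the Mtas implementation may differ in detail.

```python
def within(a, b):
    return b[0] <= a[0] and a[1] <= b[1]


def containing(a, b):
    return within(b, a)


def intersecting(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def fullyalignedwith(a, b):
    return a == b


def followedby(a, b):
    return b[0] == a[1] + 1


def precededby(a, b):
    return b[1] + 1 == a[0]


def select(left, right, relation, negate=False):
    """Keep matches from `left` that have (or, when negated, lack) a partner in `right`."""
    return [a for a in left if any(relation(a, b) for b in right) != negate]


# Made-up matches: two occurrences of "de", one sentence, one adjective
de = [(0, 0), (7, 7)]
sentences = [(0, 4)]
adjectives = [(1, 1)]

inside = select(de, sentences, within)                # ([t="de"]) within (<s/>)
outside = select(de, sentences, within, negate=True)  # ([t="de"]) !within (<s/>)
```

With this data, `inside` keeps the first occurrence of "de" and `outside` keeps the second one.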

<a name="token"></a>

## Token

| Syntax                              | Description                                     | Example |
|-------------------------------------|-------------------------------------------------|---------|
| **\[ \]**                               | Matches each single position token | `[]` |
| **"** [value](#value) **"** | Matches a single position token with condition defined by a basic [single-position-expression](#single-position-expression), where the prefix is the default prefix provided with the query | `"de"` |
| **\[** [single-position-expression](#single-position-expression) **\]**  | Matches single position token with condition defined by a [single-position-expression](#single-position-expression)   | `[t="de"]` |

<a name="single-position-expression"></a>

#### Single Position Expression

| Expression  | Syntax                                      | Example |
|-------------|---------------------------------------------|---------|
| basic       | [prefix](#prefix) **= \"**[value](#value)**\"** | `t="de"` |
| variable    | [prefix](#prefix) **= $**[variable](#variable) | `t=$1` |
| not         | **\!** [single-position-expression](#single-position-expression) | `!t="de"` |
| and         | **\(** [single-position-expression](#single-position-expression) **\&** [single-position-expression](#single-position-expression) **\&** ... **\)** | `t="de" & pos="LID"`|
| or          | **\(** [single-position-expression](#single-position-expression) **\|** [single-position-expression](#single-position-expression) **\|** ... **\)** | `t="de" \| t="het"` |
| position    | **\#** \<position\> | `#100` |
| range       | **\#** \<position\> **-** \<position\>   | `#100-110` |
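To show how these expressions combine, the sketch below evaluates them against a made-up document in which each position maps prefixes to postfix values. All function names are hypothetical, Python's `re` stands in for the Lucene regular expression syntax, and the range is assumed to include both end positions.

```python
import re


def basic(prefix, value):
    """prefix="value": some token with this prefix has a fully matching postfix."""
    pattern = re.compile(value)
    return lambda pos, anns: any(pattern.fullmatch(v) for v in anns.get(prefix, ()))


def not_(expr):
    return lambda pos, anns: not expr(pos, anns)


def and_(*exprs):
    return lambda pos, anns: all(e(pos, anns) for e in exprs)


def or_(*exprs):
    return lambda pos, anns: any(e(pos, anns) for e in exprs)


def position_range(start, end=None):
    """#start or #start-end, with both ends taken as inclusive (an assumption)."""
    end = start if end is None else end
    return lambda pos, anns: start <= pos <= end


# Hypothetical annotations per position: prefix -> postfix values
document = {
    0: {"t": ["de"], "pos": ["LID"]},
    1: {"t": ["kat"], "pos": ["N"]},
    2: {"t": ["het"], "pos": ["LID"]},
}


def matches(expr):
    """Return the positions of the document that satisfy the expression."""
    return [pos for pos, anns in sorted(document.items()) if expr(pos, anns)]


articles = matches(or_(basic("t", "de"), basic("t", "het")))  # t="de" | t="het"
```

In this sample, `articles` contains positions 0 and 2.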

<a name="multi-position"></a>

## Multi-position

| Syntax                              | Description                                     | Example |
|-------------------------------------|-------------------------------------------------|---------|
| **\<** [multi-position-expression](#multi-position-expression) **/\>**  | Matches (single and) multi position tokens with condition defined by [multi-position-expression](#multi-position-expression)   | `<s/>` |
| **\<** [multi-position-expression](#multi-position-expression) **\>**  | Matches start of (single and) multi position tokens with condition defined by [multi-position-expression](#multi-position-expression)   | `<s>` |
| **\</** [multi-position-expression](#multi-position-expression) **\>**  | Matches end of (single and) multi position tokens with condition defined by [multi-position-expression](#multi-position-expression)   | `</s>` |
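The difference between the three forms can be sketched with sentences modelled as inclusive `(first, last)` position pairs. The reading used here, where the start form yields the first position and the end form yields the last position of each token, is an assumption; the data and function names are made up.

```python
# Hypothetical sentence tokens as inclusive (first, last) position pairs
sentences = [(0, 4), (5, 9)]


def full(spans):
    """<s/>: the whole span of each token."""
    return list(spans)


def start(spans):
    """<s>: only the first position of each token (assumed reading)."""
    return [(first, first) for first, _ in spans]


def end(spans):
    """</s>: only the last position of each token (assumed reading)."""
    return [(last, last) for _, last in spans]
```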


<a name="multi-position-expression"></a>

#### Multi Position Expression

| Expression  | Syntax                                            |
|-------------|---------------------------------------------------|
| prefix      | [prefix](#prefix)                                 |
| basic       | [prefix](#prefix) **= \"**[value](#value)**\"** |

<a name="sequence"></a>

## Sequence

| Syntax                                | Description                      | Example      |
|---------------------------------------|----------------------------------|--------------|
| [cql](#cql)  [cql](#cql)  [cql](#cql)... | A sequence of [cql](#cql)  | `[t="de"][pos="ADJ"]{2}[pos="N"]` |
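When queries are generated by a program, a sequence can be assembled by concatenating its parts. The helpers below are hypothetical and only format strings; values containing double quotes are rejected because this page does not describe how they should be escaped.

```python
def token(prefix, value):
    """Format a single position token such as [t="de"]."""
    if '"' in value:
        raise ValueError("values containing double quotes are not handled in this sketch")
    return f'[{prefix}="{value}"]'


def repeat(cql, minimum, maximum=None):
    """Append {minimum} or {minimum,maximum} to an expression."""
    if maximum is None:
        return f"{cql}{{{minimum}}}"
    return f"{cql}{{{minimum},{maximum}}}"


def sequence(*parts):
    """Concatenate CQL expressions into one sequence."""
    return "".join(parts)


query = sequence(token("t", "de"), repeat(token("pos", "ADJ"), 2), token("pos", "N"))
```

The resulting `query` equals the example in the table above.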