# Sketch Engine

To index [Sketch Engine](https://www.sketchengine.co.uk/word-sketch-index-format/) resources, use the *mtas.analysis.parser.MtasSketchParser*, which extends the *MtasBasicParser*. Full examples of configuration files are provided on [GitHub](https://github.com/textexploration/mtas/tree/master/conf/parser/mtas).
```xml
<!-- START CONFIGURATION MTAS PARSER -->
<parser name="mtas.analysis.parser.MtasSketchParser">
  ...
  <!-- START MAPPINGS -->
  <mappings>
    ...
  </mappings>
  <!-- END MAPPINGS -->
  ...
</parser>
<!-- END CONFIGURATION MTAS PARSER -->
```
The [configuration file](indexing_configuration.html#configuration) defining the [mapping](indexing_mapping.html) has some specific settings for the Sketch parser distinguishing several types of elements within the XML-based Sketch resource:

* [words](indexing_formats_sketch.html#word) : the basic tokenisation layer
* [wordAnnotations](indexing_formats_sketch.html#wordAnnotation) : annotations occurring within a word
* [groups](indexing_formats_sketch.html#group) : containing one or multiple words

All these elements are defined inside the *mappings* part of the configuration file. The use and meaning of each element type is illustrated by the examples below.

<a name="word"></a>**Words**

All rows in the Sketch resource that do not consist of a start or end tag are assumed to contain a set of tab-separated values. Such a row is a potential *word*, with each value a potential *wordAnnotation*. In the parser configuration, conditions can restrict which of these potential items are actually interpreted as a *word*:
```xml
<mapping type="word">
  <condition>
    <item type="ancestorGroupName" not="true" condition="field" />
  </condition>
</mapping>
```
The example above excludes potential words that are contained within a *field* tag.

<a name="wordAnnotation"></a>**Word annotations**

Each value in the set of tab-separated values of a word is a potential *wordAnnotation*. A mapping for such a *wordAnnotation* is defined by referring to the position of the value within the *word*, counting from zero.
```xml
<mapping type="wordAnnotation" name="0">
  <token type="string" offset="false" parent="false">
    <pre>
      <item type="string" value="t" />
    </pre>
    <post>
      <item type="text" />
    </post>
  </token>
</mapping>
```
The example above adds a token with prefix *t* based on the first *wordAnnotation* value of each *word*.
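
Other columns can be mapped in the same way by changing the position given in *name*. The following is a minimal sketch, assuming that the second value of each row holds a lemma; the prefix *lemma* is an illustrative choice for this example and is not prescribed by the parser.

```xml
<!-- Assumption: the second tab-separated value (position 1) holds the lemma -->
<mapping type="wordAnnotation" name="1">
  <token type="string" offset="false" parent="false">
    <pre>
      <!-- Illustrative prefix, chosen for this sketch -->
      <item type="string" value="lemma" />
    </pre>
    <post>
      <item type="text" />
    </post>
  </token>
</mapping>
```
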

<a name="group"></a>**Groups**

Rows consisting of a start or end tag in the Sketch resource define potential groups. A group must contain words, and a mapping for a group is configured by referring to the group's name.
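```xml
<mapping type="group" name="s">
  <token type="string" offset="false">
    <pre>
      <item type="name" />
    </pre>
    <post>
      <item type="attribute" name="class" />
    </post>
  </token>
</mapping>
```

The example above adds a token for each *s* group, with the group name as prefix and the value of its *class* attribute as postfix.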
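
To see how the three mapping types in this section interact, it can help to look at a small resource fragment. The fragment below is hypothetical: the *doc* tag, the content of the *field* group, the *class* value, and the column contents are assumptions made for illustration only. Columns are separated by a single tab character.

```xml
<doc>
<field>
Title	NN	title
</field>
<s class="declarative">
The	DT	the
cat	NN	cat
sleeps	VBZ	sleep
</s>
</doc>
```

With the mappings shown in this section, the row inside *field* would not be interpreted as a *word*, because of the condition on the *word* mapping. The three rows inside *s* would each be a *word* and contribute a *t* token for their first value (*The*, *cat*, *sleeps*). The *s* group would yield a token with prefix *s* and postfix *declarative*. The *doc* and *field* groups have no mapping here, so they would not produce tokens of their own.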