src/site/markdown/indexing_formats_sketch.md 2.85 KB
Matthijs Brouwer authored
# Sketch Engine

For indexing [Sketch Engine](https://www.sketchengine.co.uk/word-sketch-index-format/) resources, the *mtas.analysis.parser.MtasSketchParser*, which extends *MtasBasicParser*, is available. Full examples of configuration files are provided on [GitHub](https://github.com/textexploration/mtas/tree/master/conf/parser/mtas).

```xml
<!-- START CONFIGURATION MTAS PARSER -->
<parser name="mtas.analysis.parser.MtasSketchParser">
...
  <!-- START MAPPINGS -->
  <mappings>
  ...
  </mappings>
  <!-- END MAPPINGS -->
  ...
</parser>
<!-- END CONFIGURATION MTAS PARSER -->
```
The [configuration file](indexing_configuration.html#configuration) defining the [mapping](indexing_mapping.html) has some settings specific to the Sketch parser, which distinguishes several types of elements within the XML-based Sketch resource:

* [words](indexing_formats_sketch.html#word) : the basic tokenisation layer
* [wordAnnotations](indexing_formats_sketch.html#wordAnnotation) : annotations occurring within a word
* [groups](indexing_formats_sketch.html#group) : containing one or multiple words

All these elements are defined inside the *mappings* part of the configuration file. Their use and meaning are illustrated and explained by the examples below.


<a name="word"></a>**Words**

Every row in the Sketch resource that is not a start or end tag is assumed to be a set of tab-separated values. Such a row is a potential *word*, with each value an associated *wordAnnotation*. In the parser configuration, conditions can restrict which potential items in the Sketch resource are actually interpreted as a *word*:

```xml
<mapping type="word">
  <condition>
    <item type="ancestorGroupName" not="true" condition="field" />
  </condition>
</mapping>
```

The example above excludes potential words that are contained within a *field* tag.
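
The row handling described in this section can be illustrated with a small Python sketch. It is not the *MtasSketchParser* implementation: the sample resource, the function name `potential_words`, and the regular expression used to recognise tag rows are all assumptions made for this example. It only shows the idea of skipping rows whose enclosing groups include *field*.

```python
import re

# Hypothetical sample resource: tag rows and tab-separated value rows.
SAMPLE = "\n".join([
    '<doc id="1">',
    "<field>",
    "title\tNN\ttitle",
    "</field>",
    '<s class="main">',
    "The\tDT\tthe",
    "cats\tNNS\tcat",
    "</s>",
    "</doc>",
])

# Assumed pattern: a tag row consists of exactly one start or end tag.
TAG_ROW = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)([^<>]*)>$")


def potential_words(resource, excluded_ancestor="field"):
    """Yield the value lists of rows accepted as words.

    A row is skipped when one of its enclosing groups is named
    `excluded_ancestor`, which imitates the ancestorGroupName
    condition with not="true" in the mapping above.
    """
    open_groups = []  # names of the groups currently enclosing the row
    for row in resource.splitlines():
        match = TAG_ROW.match(row.strip())
        if match:
            closing, name, rest = match.groups()
            if rest.rstrip().endswith("/"):
                continue  # self-closing tag: encloses nothing
            if closing:
                if open_groups and open_groups[-1] == name:
                    open_groups.pop()
            else:
                open_groups.append(name)
            continue
        if not row.strip():
            continue  # ignore empty rows
        if excluded_ancestor in open_groups:
            continue  # condition not met: not interpreted as a word
        yield row.split("\t")


words = list(potential_words(SAMPLE))
```

With this sample, the row inside *field* is dropped and the two rows inside *s* remain as words, each carrying three values.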

<a name="wordAnnotation"></a>**Word annotations**

Each value in the set of tab-separated values of a word is a potential *wordAnnotation*. A mapping for such a *wordAnnotation* is defined by referring to the position of the value within the *word*, counting from 0.

```xml
<mapping type="wordAnnotation" name="0">
  <token type="string" offset="false" parent="false">
    <pre>
      <item type="string" value="t" />
    </pre>
    <post>
      <item type="text" />
    </post>
  </token>
</mapping>  
```

The example above will add a token based on the first *wordAnnotation* value from each *word*.
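
To make the positional reference concrete, the following Python sketch mimics what this mapping describes: the name `0` selects the first value of the row, the fixed string `t` becomes the prefix, and the text of the value becomes the postfix. The function name `annotation_token` and the tuple representation are assumptions for illustration; how Mtas stores prefix and postfix internally is not shown here.

```python
def annotation_token(row, position, prefix):
    """Return a (prefix, postfix) pair for one value of a word row.

    `position` plays the role of the mapping's name attribute ("0" is
    the first value); `prefix` plays the role of the fixed string item
    in <pre>. Returns None when the row has no value at that position.
    """
    values = row.split("\t")
    index = int(position)
    if index < 0 or index >= len(values):
        return None
    return (prefix, values[index])


row = "cats\tNNS\tcat"  # hypothetical word row with three values
first = annotation_token(row, "0", "t")
```

Here `first` pairs the prefix `t` with the first value of the row. A second mapping with another position and prefix would work the same way on the other values.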


<a name="group"></a>**Groups**

Rows containing start and end tags in the Sketch resource define potential groups. These groups must contain words, and mappings can be configured by referring to their name.

```xml
<mapping type="group" name="s">
  <token type="string" offset="false">
    <pre>
      <item type="name" />
    </pre>
    <post>
      <item type="attribute" name="class" />
    </post>
  </token>        
</mapping>
```
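
Read this way, the mapping above would add one token for each *s* group, with the group name as prefix and the value of its *class* attribute as postfix. The Python sketch below illustrates that reading under stated assumptions: attributes are written with double quotes, and the names `group_token`, `START_TAG` and `ATTRIBUTE` are hypothetical helpers, not part of Mtas.

```python
import re

# Assumed shape of a start tag row: a name followed by key="value" pairs.
START_TAG = re.compile(r'^<([A-Za-z_][\w.-]*)((?:\s+[\w.:-]+="[^"]*")*)\s*>$')
ATTRIBUTE = re.compile(r'([\w.:-]+)="([^"]*)"')


def group_token(tag_row, group_name, attribute_name):
    """Return (prefix, postfix) for a start tag row of the named group.

    The prefix is the group name (the <pre> item of type "name"); the
    postfix is the value of `attribute_name` (the <post> item of type
    "attribute"), or None when the attribute is absent. Rows that do
    not open the named group give None.
    """
    match = START_TAG.match(tag_row.strip())
    if match is None or match.group(1) != group_name:
        return None
    attributes = dict(ATTRIBUTE.findall(match.group(2)))
    return (group_name, attributes.get(attribute_name))


token = group_token('<s class="main" id="s1">', "s", "class")
```

How Mtas treats a group that lacks the referenced attribute is not covered by this sketch; returning `None` for the postfix is simply a placeholder.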