This document explains how to prepare morphological data, compile it into SFST format, and include the compiled transducer as part of a morphological analyser.

SFST (Stuttgart Finite State Transducer Tools) is a library and a set of utilities for handling finite-state transducers.
More info: http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html

Transducers give compact files, a lower memory load and efficient processing. Compiled transducers are recommended unless you are prototyping or working with small morphological data files (in such cases, consider using MapAnalyser).

1. Input file format.

Input files are suitable for both SFSTAnalyser (class=sfst in the config file) and MapAnalyser (class=map, map-case, hashmap or hashmap-case). The file format is simple: one entry per line, each line consisting of three fields delimited by whitespace (preferably a tab):

form	lemma	tags

The file MUST be encoded in UTF-8.
Form is the orthographic form (possibly inflected) as generated by the tokeniser or read from tokenised input. Lemma is the dictionary base form. Tags is a specification of one or more tags. Multiple tags are separated by the plus character (+). In addition, alternative values for one category may be separated by the dot character (.). When all possible values of a category (as defined by the tagset) are wanted, they may be abbreviated to an underscore (_). Examples (assuming the kipi tagset):

subst:sg.pl:nom.acc:f will be expanded to 4 tags (subst:sg:nom:f, subst:pl:nom:f, subst:sg:acc:f, subst:pl:acc:f).

The same four tags may be written as subst:_:nom.acc:f (as sg and pl are all the possible values of number, which is the first attribute for nouns).
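
The expansion rules described above can be illustrated with a short Python sketch. It is not taken from the distributed tools; expand_tags is a hypothetical helper that handles only the plus (+) and dot (.) notation. The underscore wildcard is deliberately left out, because resolving it requires the tagset definition.

```python
from itertools import product

def expand_tags(spec):
    """Expand a tag specification into a list of fully specified tags.

    Hypothetical helper: handles '+' (several tags) and '.' (alternative
    values of one category). The '_' wildcard is not handled here.
    """
    tags = []
    for alternative in spec.split("+"):
        # Each category becomes a list of its alternative values.
        fields = [field.split(".") for field in alternative.split(":")]
        # The Cartesian product gives every combination of values.
        tags.extend(":".join(combo) for combo in product(*fields))
    return tags

for tag in expand_tags("subst:sg.pl:nom.acc:f"):
    print(tag)
```

The loop prints the same four tags as in the example, although the order in which they are produced may differ from the order listed there.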

Note that this is the same format that tagset-tool reads with the -p switch.

For more examples, see the .txt files included in the data directory.
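
To check a data file against the three-field layout before converting it, something along these lines may help. This is only a sketch: parse_entry and read_entries are hypothetical names, the sample entry is made up for illustration, and skipping blank lines is an assumption rather than documented behaviour of the analysers.

```python
def parse_entry(line):
    """Split one line into (form, lemma, tags); return None for a blank line."""
    fields = line.split()  # splits on any run of whitespace, tabs included
    if not fields:
        return None  # assumption: blank lines are simply skipped
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return tuple(fields)

def read_entries(path):
    """Yield (form, lemma, tags) tuples from a UTF-8 encoded data file."""
    with open(path, encoding="utf-8") as handle:  # the file MUST be UTF-8
        for line in handle:
            entry = parse_entry(line)
            if entry is not None:
                yield entry

# Made-up sample entry, for illustration only.
print(parse_entry("koty\tkot\tsubst:pl:nom.acc:m2\n"))
```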

NOTE ON DUPLICATES: if the morphological data contain duplicated tags (specified explicitly or using wildcard representations that evaluate to duplicates), this will result in duplicated tags in the analyser output. To clean such duplicates, use the tabclean.py script.
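
As an illustration of how duplicates arise: the specification subst:sg:nom:f+subst:sg.pl:nom:f evaluates to subst:sg:nom:f twice, once from each side of the plus sign. The sketch below shows one way to drop such repeats while keeping the original order. It is not a description of what tabclean.py does internally, and unique_in_order is a hypothetical name.

```python
def unique_in_order(tags):
    """Return the tags with repeats removed, keeping first occurrences."""
    # dict preserves insertion order, so fromkeys keeps the first copy of each tag.
    return list(dict.fromkeys(tags))

# Expansion of subst:sg:nom:f+subst:sg.pl:nom:f, written out by hand.
expanded = ["subst:sg:nom:f", "subst:sg:nom:f", "subst:pl:nom:f"]
print(unique_in_order(expanded))
```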

NOTE ON CASE SENSITIVITY: if the morphological data are intended to be used in a case-insensitive manner, either prepare them that way (e.g. all forms in lower case) or use the tabclean.py script to convert the data (all the scripts are in the tools directory). Remember to set lower-case=true for the SFSTAnalyser in the config (if using MapAnalyser, you have to choose a class with the desired case sensitivity).
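
Preparing the data by hand could look roughly like the sketch below, which lower-cases only the form field of an entry. lowercase_form is a hypothetical helper, and leaving the lemma untouched is an assumption; tabclean.py may treat the fields differently.

```python
def lowercase_form(line):
    """Lower-case the form field of one entry; lemma and tags are kept as-is."""
    form, lemma, tags = line.split()
    # str.lower() is Unicode-aware, so non-ASCII letters are handled too.
    return "\t".join((form.lower(), lemma, tags))

# Made-up sample entry, for illustration only.
print(lowercase_form("Warszawa\tWarszawa\tsubst:sg:nom:f"))
```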

2. Converting to SFST input format.

NOTE: the rest of this section explains how to do the conversion manually. If you want it done quickly, use the tools/tab-to-sfst-now script.

Use the tab-to-sfst.py script (tools directory) to convert the file into SFST-readable format:

cd data
../tools/tab-to-sfst.py my_morpho.txt my_morpho.input

Now create a helper file, say my_morpho.helper, containing just this line:
"my_morpho.input"

Run the FST compiler (contained within the SFST distribution):
fst-compiler-utf8 -c my_morpho.helper my_morpho.fst

NOTE: the -c switch is obligatory. The command generates the compiled transducer, ready to be used by the morphological analyser.

NOTE: if you want your transducers to be installed along with other data when issuing "make install", use the .fst extension and place them in the data subdirectory.
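
The manual steps of this section can also be chained from a small Python wrapper, sketched below. The function names (write_helper, build_commands, compile_morpho) are hypothetical. The wrapper only repeats the commands given above and assumes that it is run from the data directory, that the tools directory is at ../tools and that fst-compiler-utf8 is on the PATH. It is not a description of what tab-to-sfst-now does.

```python
import subprocess
from pathlib import Path

def write_helper(stem, directory="."):
    """Write the one-line helper file that names the SFST input file."""
    helper = Path(directory) / f"{stem}.helper"
    helper.write_text(f'"{stem}.input"\n', encoding="utf-8")
    return helper

def build_commands(stem, tools_dir="../tools"):
    """Return the conversion and compilation commands as argument lists."""
    return [
        [f"{tools_dir}/tab-to-sfst.py", f"{stem}.txt", f"{stem}.input"],
        ["fst-compiler-utf8", "-c", f"{stem}.helper", f"{stem}.fst"],
    ]

def compile_morpho(stem, tools_dir="../tools"):
    """Convert <stem>.txt and compile it into <stem>.fst in the current directory."""
    convert, compile_fst = build_commands(stem, tools_dir)
    subprocess.run(convert, check=True)      # raises if the conversion fails
    write_helper(stem)
    subprocess.run(compile_fst, check=True)  # raises if the compiler fails
```

Calling compile_morpho("my_morpho") from the data directory would then run the same sequence as the commands listed in this section.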

3. Updating analyser config.

In the config file for your analyser, add a section defining the transducer, e.g.:

[ma:my_sfst_analyser]
	class=sfst
	tagset=kipi
	file=my_morpho.fst

If you want it to be case-insensitive, remember to add lower-case=true.

In the rules for specific token labels (as output by toki) or in the default section, add your transducer, e.g.:

[default]
	ma=morfeusz
	; HERE it comes
	ma=my_sfst_analyser
	; fallback to ign or what
	ma=unknown

Now you can test the analyser:

analyse -c your_config.ini