# Hydra
## On-line demo

https://constituency.nlp.ipipan.waw.pl
## Publication

Katarzyna Krasnowska-Kieraś and Marcin Woliński. Parsing headed constituencies. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, *Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)*, pages 12633–12643, Torino, Italy, 2024. ELRA and ICCL.
## Installation

```
pip install http://mozart.ipipan.waw.pl/~kkrasnowska/Hydra/hydra-0.6-py3-none-any.whl
```
Dockerfile (recommended):

```dockerfile
FROM tensorflow/tensorflow:2.10.0-gpu
RUN /usr/bin/python3 -m pip install --upgrade pip
RUN pip install http://mozart.ipipan.waw.pl/~kkrasnowska/Hydra/hydra-0.6-py3-none-any.whl
```
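To use the image, a build-and-run sketch along these lines should work (the `hydra` image tag and the `--gpus all` flag are illustrative choices, assuming the NVIDIA container toolkit is installed on the host):

```
docker build -t hydra .
docker run --gpus all -it hydra python3
```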
## Usage

Set a limit for GPU memory usage (recommended):
```python
>>> import tensorflow as tf
>>> GPU_MEM = 4*1024
>>> gpus = tf.config.list_physical_devices('GPU')
>>> if gpus:
...     try:
...         tf.config.set_logical_device_configuration(
...             gpus[0],
...             [tf.config.LogicalDeviceConfiguration(memory_limit=GPU_MEM)]
...         )
...         logical_gpus = tf.config.list_logical_devices('GPU')
...         print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
...     except RuntimeError as e:
...         print(e)
```
Import Hydra and load a model:

```python
>>> from hydra import Hydra
>>> hydra_parser = Hydra.load('path/to/the/model/')
```
Parse some sentences:

```python
>>> trees = hydra_parser.parse([
...     'Kot norweski jest bardzo inteligentny.',
...     'Przywiązuje się do właściciela i jego domu.'])
>>> trees[0].pretty_print()
[0] *ROOT[]
  [0] *S[]
    [0] NP[]
      [0] *NP[]
        [0] *N[]
          [0] *('Kot', 'kot', 'subst:sg:nom:m2')[]
      [0] AdjP[]
        [0] *Adj[]
          [0] *('norweski', 'norweski', 'adj:sg:nom:m2:pos')[]
    [0] *VP[]
      [0] *V[]
        [0] *('jest', 'być', 'fin:sg:ter:imperf')[]
    [0] AdjP[]
      [0] AdvP[]
        [0] *Adv[]
          [0] *('bardzo', 'bardzo', 'adv:pos')[]
      [0] *AdjP[]
        [0] *Adj[]
          [0] *('inteligentny', 'inteligentny', 'adj:sg:nom:m1:pos')[]
  [0] Punct[]
    [0] *('.', '.', 'interp')[]
```
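Heads are marked with `*`. The leaves of a returned tree can also be inspected programmatically; a minimal sketch, assuming only the `get_yield()` and `category` accessors used in the lemma-correction example further below:

```python
# Each leaf's category is its (orth, base, tag) triple.
for orth, base, tag in (leaf.category for leaf in trees[0].get_yield()):
    print(f'{orth}\t{base}\t{tag}')
```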
For results in JSON format, use `return_jsons=True`:
```python
>>> hydra_parser.parse('Przywiązuje się do właściciela i jego domu.', return_jsons=True)[0]
{'tree': {'is_head': True, 'span': {'from': 0, 'to': 8}, 'attributes': {}, 'deprel': 'root', 'category': 'ROOT', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 7}, 'category': 'S', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'VP', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'V', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'orth': 'Przywiązuje', 'base': 'przywiązywać', 'tag': 'fin:sg:ter:imperf', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 1, 'to': 2}, 'attributes': {}, 'deprel': 'refl', 'category': 'Part', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'category': 'Refl', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'orth': 'się', 'base': 'się', 'tag': 'part', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 2, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'PrepNP', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'category': 'Prep', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'orth': 'do', 'base': 'do', 'tag': 'prep:gen', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 3, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 3, 'to': 4}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'orth': 'właściciela', 'base': 'właściciel', 'tag': 'subst:sg:gen:m1', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 4, 'to': 5}, 'category': 'Conj', 'children': [{'is_head': True, 'span': {'from': 4, 'to': 5}, 'orth': 'i', 'base': 'i', 'tag': 'conj', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 5, 'to': 7}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 5, 'to': 6}, 'attributes': {}, 'deprel': 'adjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'orth': 'jego', 'base': 'on', 'tag': 'ppron3:sg:gen:m1:ter:akc:npraep', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'orth': 'domu', 'base': 'dom', 'tag': 'subst:sg:gen:m3', 'features': {'nps': False}}]}]}]}]}]}]}, {'is_head': False, 'span': {'from': 7, 'to': 8}, 'attributes': {}, 'deprel': 'punct', 'category': 'Punct', 'children': [{'is_head': True, 'span': {'from': 7, 'to': 8}, 'orth': '.', 'base': '.', 'tag': 'interp', 'features': {'nps': True}}]}]}, 'metadata': {}}
```
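Each JSON node carries its category (or, for leaves, the token's `orth`, `base` and `tag`), its span, and a head flag; non-head children additionally carry a dependency relation. A minimal sketch of walking such a dict, written against the structure shown above (`print_json_tree` is not part of Hydra's API):

```python
def print_json_tree(node, depth=0):
    """Recursively print a Hydra JSON tree, marking head children with '*'."""
    head_mark = '*' if node['is_head'] else ''
    # Non-terminal nodes have a 'category'; leaves have an 'orth' form.
    label = node.get('category', node.get('orth', '?'))
    print('  ' * depth + head_mark + label)
    for child in node.get('children', []):
        print_json_tree(child, depth + 1)

result = hydra_parser.parse('Przywiązuje się do właściciela i jego domu.',
                            return_jsons=True)[0]
print_json_tree(result['tree'])
```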
To perform additional lemma correction with Morfeusz (for Polish only), use `correct_lemmata=True` (recommended; especially useful for rare inflectional paradigms):
```python
>>> def get_lemmata(tree):
...     return [leaf.category[1] for leaf in tree.get_yield()]
...
>>> get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.')[0])
['młynarz', 'mełnąć', 'zboże', 'na', 'mąka', '.']
>>> get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.', correct_lemmata=True)[0])
['młynarz', 'mleć', 'zboże', 'na', 'mąka', '.']
```
To process already-tokenized text, pass each sentence with space-separated tokens and use `is_tokenized=True`:
```python
>>> hydra_parser.parse('Miałem kota.')[0].pretty_print()
[0] *ROOT[]
  [0] *S[]
    [0] *VP[]
      [0] *V[]
        [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
        [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
    [0] NP[]
      [0] *N[]
        [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
  [0] Punct[]
    [0] *('.', '.', 'interp')[]
```
```python
>>> hydra_parser.parse(['Miał em kota .'], is_tokenized=True)[0].pretty_print()
[0] *ROOT[]
  [0] *S[]
    [0] *VP[]
      [0] *V[]
        [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
        [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
    [0] NP[]
      [0] *N[]
        [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
  [0] Punct[]
    [0] *('.', '.', 'interp')[]
```
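If your tokens come from an external tokenizer as a list, joining them with single spaces produces the expected input:

```python
# Hypothetical pre-tokenized input; with is_tokenized=True the parser
# splits on spaces, so the original segmentation is preserved.
tokens = ['Miał', 'em', 'kota', '.']
tree = hydra_parser.parse([' '.join(tokens)], is_tokenized=True)[0]
```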
## Training

Load a dataset (e.g. downloaded from here):
```python
>>> from datasets import load_dataset
>>> dataset = load_dataset('path/to/dataset')
```
Train a Hydra model (requires `dataset` to contain `'train'` and `'validation'` parts):

```python
>>> from hydra import Trainer
>>> trainer = Trainer(
...     'allegro/herbert-large-cased',
...     dataset=dataset,
...     segmentation=True,
...     lemmatisation=True,
...     tagging=True,
...     dependency=True,
...     spines=True
... )
>>> hydra_parser = trainer.train(
...     epochs=50,
...     patience=3,
...     log_dir='...',
...     model_dir='...',
... )
```
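The trained parser is returned by `train()`; presumably the directory passed as `model_dir` can also be reloaded later with `Hydra.load()`, as in the Usage section above (an assumption based on the load call shown there, not documented behaviour):

```python
# Assumption: the directory written via model_dir is a loadable model directory.
hydra_parser = Hydra.load('path/to/model_dir/')
```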
Any combination of `segmentation`, `lemmatisation`, `tagging`, `dependency` and `spines` can be excluded from the model by setting it to `False`. If `dependency` and/or `spines` are set to `False` at training time, the model will not produce trees. Pass `return_labels=True` at prediction time to receive results as labels for individual tokens:
```python
>>> no_trees = Hydra.load('path/to/other/model/')
>>> no_trees.parse('Ala ma kota.')
RuntimeError: This model can't parse and won't return trees/jsons, use return_labels=True.
>>> hydra_parser.parse('Ala ma kota.', return_labels=True)
[(['Ala', 'ma', 'kota', '.'], {'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp'], 'lemmas': ['Ala', 'mieć', 'kot', '.']})]
```
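The per-token labels can then be consumed directly, e.g. to print a simple tab-separated analysis:

```python
# Unpack the (tokens, labels) pair shown above.
tokens, labels = hydra_parser.parse('Ala ma kota.', return_labels=True)[0]
for token, lemma, tag in zip(tokens, labels['lemmas'], labels['tags']):
    print(f'{token}\t{lemma}\t{tag}')
```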
To train particular components of the model, your dataset must contain the following columns:

| component | columns |
|---|---|
| segmentation | (no columns required) |
| lemmatisation | `lemmas`: `datasets.Sequence` of `datasets.Value("string")` |
| tagging | `tags`: `datasets.Sequence` of `datasets.Value("string")` |
| dependency | `heads`: `datasets.Sequence` of `datasets.Value("int16")` (first token = `0`, root = `None`), `deprels`: `datasets.Sequence` of `datasets.features.ClassLabel` |
| spines | `nonterminals`: `[{'cat': datasets.Value("string"), 'children': [datasets.Value("int16")]}]` |

A `tokens` column is always required for training; a toy example of assembling such a dataset is sketched below.
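A minimal sketch with the `datasets` library (hypothetical toy data covering only the `lemmatisation` and `tagging` columns; a real treebank would also provide `heads`, `deprels` and `nonterminals` as specified above):

```python
import datasets

# One toy sentence, 'Ala ma kota .', with per-token lemmas and tags.
toy = datasets.Dataset.from_dict({
    'tokens': [['Ala', 'ma', 'kota', '.']],
    'lemmas': [['Ala', 'mieć', 'kot', '.']],
    'tags': [['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp']],
})
dataset = datasets.DatasetDict({'train': toy, 'validation': toy})
```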
## Evaluation

To evaluate the parser (evaluation metrics will be calculated individually on every part of `dataset` passed to `Trainer()` other than `'train'` and `'validation'`):
```python
>>> evaluation_results = trainer.evaluate(hydra_parser)
```
## Acknowledgements

Work supported by the national grant POIR.04.02.00-00-D006/20-00 (Digital Research Infrastructure for the Arts and Humanities DARIAH-PL).