Katarzyna Krasnowska-Kieraś and Marcin Woliński. Parsing headed constituencies. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12633–12643, Torino, Italy, 2024. ELRA and ICCL.

pip install

Dockerfile (recommended):

FROM tensorflow/tensorflow:2.10.0-gpu

RUN /usr/bin/python3 -m pip install --upgrade pip

RUN pip install


Set a limit for GPU memory usage (recommended):

>>> import tensorflow as tf

>>> GPU_MEM = 4*1024  # memory limit in MB (here: 4 GB)

>>> gpus = tf.config.list_physical_devices('GPU')
>>> if gpus:
...     try:
...         tf.config.set_logical_device_configuration(
...             gpus[0],
...             [tf.config.LogicalDeviceConfiguration(memory_limit=GPU_MEM)]
...         )
...         logical_gpus = tf.config.list_logical_devices('GPU')
...         print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
...     except RuntimeError as e:
...         print(e)

Import Hydra and load a model:

>>> from hydra import Hydra

>>> hydra_parser = Hydra.load('path/to/the/model/')

Parse some sentences:

>>> trees = hydra_parser.parse([
...    'Kot norweski jest bardzo inteligentny.',
...    'Przywiązuje się do właściciela i jego domu.'])
>>> trees[0].pretty_print()
[0] *ROOT[]
    [0] *S[]
        [0] NP[]
            [0] *NP[]
                [0] *N[]
                    [0] *('Kot', 'kot', 'subst:sg:nom:m2')[]
            [0] AdjP[]
                [0] *Adj[]
                    [0] *('norweski', 'norweski', 'adj:sg:nom:m2:pos')[]
        [0] *VP[]
            [0] *V[]
                [0] *('jest', 'być', 'fin:sg:ter:imperf')[]
        [0] AdjP[]
            [0] AdvP[]
                [0] *Adv[]
                    [0] *('bardzo', 'bardzo', 'adv:pos')[]
            [0] *AdjP[]
                [0] *Adj[]
                    [0] *('inteligentny', 'inteligentny', 'adj:sg:nom:m1:pos')[]
    [0] Punct[]
        [0] *('.', '.', 'interp')[]

For results in JSON format, use return_jsons=True:

>>> hydra_parser.parse('Przywiązuje się do właściciela i jego domu.', return_jsons=True)[0]
{'tree': {'is_head': True, 'span': {'from': 0, 'to': 8}, 'attributes': {}, 'deprel': 'root', 'category': 'ROOT', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 7}, 'category': 'S', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'VP', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'V', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'orth': 'Przywiązuje', 'base': 'przywiązywać', 'tag': 'fin:sg:ter:imperf', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 1, 'to': 2}, 'attributes': {}, 'deprel': 'refl', 'category': 'Part', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'category': 'Refl', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'orth': 'się', 'base': 'się', 'tag': 'part', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 2, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'PrepNP', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'category': 'Prep', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'orth': 'do', 'base': 'do', 'tag': 'prep:gen', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 3, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 3, 'to': 4}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'orth': 'właściciela', 'base': 'właściciel', 'tag': 'subst:sg:gen:m1', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 4, 'to': 5}, 'category': 'Conj', 'children': [{'is_head': True, 'span': {'from': 4, 'to': 5}, 'orth': 'i', 'base': 'i', 'tag': 'conj', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 5, 'to': 7}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 5, 
'to': 6}, 'attributes': {}, 'deprel': 'adjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'orth': 'jego', 'base': 'on', 'tag': 'ppron3:sg:gen:m1:ter:akc:npraep', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'orth': 'domu', 'base': 'dom', 'tag': 'subst:sg:gen:m3', 'features': {'nps': False}}]}]}]}]}]}]}, {'is_head': False, 'span': {'from': 7, 'to': 8}, 'attributes': {}, 'deprel': 'punct', 'category': 'Punct', 'children': [{'is_head': True, 'span': {'from': 7, 'to': 8}, 'orth': '.', 'base': '.', 'tag': 'interp', 'features': {'nps': True}}]}]}, 'metadata': {}}
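
The JSON tree can be walked with plain Python. The sketch below assumes only the layout visible in the output above (terminals carry 'orth', 'base' and 'tag'; every nonterminal has exactly one child with 'is_head': True). The helper names iter_leaves and lexical_head are hypothetical and not part of Hydra, and the sample is the 'jego domu' fragment copied from that output.

```python
def iter_leaves(node):
    """Yield terminal nodes (those carrying 'orth') from left to right."""
    if 'orth' in node:
        yield node
        return
    for child in node.get('children', []):
        yield from iter_leaves(child)


def lexical_head(node):
    """Follow the is_head children down to the terminal heading `node`.

    Assumes each nonterminal has a child marked is_head, as in the output above.
    """
    while 'orth' not in node:
        node = next(child for child in node['children'] if child['is_head'])
    return node


# The NP 'jego domu', copied from the JSON output above.
sample = {
    'is_head': False, 'span': {'from': 5, 'to': 7}, 'attributes': {},
    'deprel': 'conjunct', 'category': 'NP', 'children': [
        {'is_head': False, 'span': {'from': 5, 'to': 6}, 'attributes': {},
         'deprel': 'adjunct', 'category': 'NP', 'children': [
             {'is_head': True, 'span': {'from': 5, 'to': 6}, 'category': 'N',
              'children': [
                  {'is_head': True, 'span': {'from': 5, 'to': 6},
                   'orth': 'jego', 'base': 'on',
                   'tag': 'ppron3:sg:gen:m1:ter:akc:npraep',
                   'features': {'nps': False}}]}]},
        {'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'NP',
         'children': [
             {'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'N',
              'children': [
                  {'is_head': True, 'span': {'from': 6, 'to': 7},
                   'orth': 'domu', 'base': 'dom', 'tag': 'subst:sg:gen:m3',
                   'features': {'nps': False}}]}]}]}

orths = [leaf['orth'] for leaf in iter_leaves(sample)]
head = lexical_head(sample)
print(orths, head['orth'])
```

With a real result, the same helpers would be applied to the 'tree' value of each returned dictionary.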

To additionally correct lemmata with Morfeusz (Polish only), use correct_lemmata=True (recommended; especially useful for rare inflectional paradigms):

>>> def get_lemmata(tree):
...     return [leaf.category[1] for leaf in tree.get_yield()]
>>> get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.')[0])
['młynarz', 'mełnąć', 'zboże', 'na', 'mąka', '.']
>>> get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.', correct_lemmata=True)[0])
['młynarz', 'mleć', 'zboże', 'na', 'mąka', '.']

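To process already-tokenized text, pass each sentence as a string of space-separated tokens and use is_tokenized=True. The first call below uses Hydra's own segmentation, the second one supplies the tokens directly:

>>> hydra_parser.parse('Miałem kota.')[0].pretty_print()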
[0] *ROOT[]
    [0] *S[]
        [0] *VP[]
            [0] *V[]
                [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
                [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
        [0] NP[]
            [0] *N[]
                [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
    [0] Punct[]
        [0] *('.', '.', 'interp')[]
>>> hydra_parser.parse(['Miał em kota .'], is_tokenized=True)[0].pretty_print()
[0] *ROOT[]
    [0] *S[]
        [0] *VP[]
            [0] *V[]
                [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
                [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
        [0] NP[]
            [0] *N[]
                [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
    [0] Punct[]
        [0] *('.', '.', 'interp')[]
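
If the tokens come from another tool as a list, they have to be joined into the space-separated form first. A minimal sketch, assuming that a token must not itself contain whitespace (otherwise it would be split again); to_pretokenized is a hypothetical helper, not part of Hydra:

```python
def to_pretokenized(tokens):
    """Join a list of tokens into one space-separated sentence string."""
    for token in tokens:
        # A token with inner whitespace would be read back as several tokens.
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"invalid token: {token!r}")
    return ' '.join(tokens)


sentence = to_pretokenized(['Miał', 'em', 'kota', '.'])
print(sentence)
```

The resulting string can then be passed in a list to parse() together with is_tokenized=True, as in the call above.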


Load a dataset (e.g. downloaded from here):

>>> from datasets import load_dataset

>>> dataset = load_dataset('path/to/dataset')

Train a Hydra model (requires the dataset to contain 'train' and 'validation' parts):

>>> from hydra import Trainer
>>> trainer = Trainer(
...     'allegro/herbert-large-cased',
...     dataset=dataset,
...     segmentation=True,
...     lemmatisation=True,
...     tagging=True,
...     dependency=True,
...     spines=True
... )
>>> hydra_parser = trainer.train(
...     epochs=50,
...     patience=3,
...     log_dir='...',
...     model_dir='...',
... )

Any combination of segmentation, lemmatisation, tagging, dependency and spines can be excluded from the model by setting the corresponding argument to False. If dependency and/or spines are set to False at training time, the model will not produce trees. In that case, pass return_labels=True at prediction time to receive the results as labels for individual tokens:

>>> no_trees = Hydra.load('path/to/other/model/')
>>> no_trees.parse('Ala ma kota.')
... RuntimeError: This model cant parse and wont return trees/jsons, use return_labels=True.
>>> no_trees.parse('Ala ma kota.', return_labels=True)
... [(['Ala', 'ma', 'kota', '.'], {'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp'], 'lemmas': ['Ala', 'mieć', 'kot', '.']})] 
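
The label output is a list with one (tokens, labels) pair per sentence. The sketch below regroups such a pair into one record per token; it assumes that every label list has exactly one entry per token, as in the output above, and that the available keys depend on which components the model was trained with. to_rows is a hypothetical helper, not part of Hydra.

```python
# Value copied from the example output above.
labelled = [(['Ala', 'ma', 'kota', '.'],
             {'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf',
                       'subst:sg:acc:m2', 'interp'],
              'lemmas': ['Ala', 'mieć', 'kot', '.']})]


def to_rows(sentence_result):
    """Turn one (tokens, labels) pair into a list of per-token dicts."""
    tokens, labels = sentence_result
    rows = []
    for i, token in enumerate(tokens):
        row = {'token': token}
        for name, values in labels.items():
            row[name] = values[i]
        rows.append(row)
    return rows


rows = to_rows(labelled[0])
for row in rows:
    print(row)
```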

To train particular components of the model, your dataset must contain the following columns:

component      columns
segmentation   (no columns required)
lemmatisation  lemmas: datasets.Sequence of datasets.Value("string")
tagging        tags: datasets.Sequence of datasets.Value("string")
dependency     heads: datasets.Sequence of datasets.Value("int16") (first token = 0, root = None); deprels: datasets.Sequence of datasets.features.ClassLabel
spines         nonterminals: [{'cat': datasets.Value("string"), 'children': [datasets.Value("int16")]}]

A tokens column is always required for training.
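
Before training, it can help to sanity-check individual examples against these requirements. The following is a rough sketch in plain Python, not part of Hydra: check_row is a hypothetical helper, and it reads "first token = 0, root = None" as meaning that heads holds 0-based token indices with None for the root. The heads values in the sample row are an illustrative annotation, not taken from a real dataset.

```python
def check_row(row):
    """Return a list of problems found in one training example."""
    if 'tokens' not in row:
        return ["tokens: column is always required"]
    problems = []
    n = len(row['tokens'])
    # Per-token columns must be as long as the token list.
    for column in ('lemmas', 'tags', 'heads', 'deprels'):
        if column in row and len(row[column]) != n:
            problems.append(
                f"{column}: expected {n} values, got {len(row[column])}")
    # Heads are 0-based token indices; None marks the root (assumption).
    for i, head in enumerate(row.get('heads', [])):
        if head is not None and not 0 <= head < n:
            problems.append(f"heads[{i}] = {head} is outside 0..{n - 1}")
    # Each nonterminal needs a category and a list of children.
    for i, nonterminal in enumerate(row.get('nonterminals', [])):
        missing = {'cat', 'children'} - set(nonterminal)
        if missing:
            problems.append(f"nonterminals[{i}]: missing {sorted(missing)}")
    return problems


row = {
    'tokens': ['Ala', 'ma', 'kota', '.'],
    'lemmas': ['Ala', 'mieć', 'kot', '.'],
    'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp'],
    'heads': [1, None, 1, 1],  # illustrative: 'ma' is the root
}
print(check_row(row))
print(check_row(dict(row, heads=[1, None, 7])))
```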


To evaluate the parser (evaluation metrics will be calculated individually on every part of the dataset passed to Trainer() other than 'train' and 'validation'):

>>> evaluation_results = trainer.evaluate(hydra_parser)


Work supported by POIR.04.02.00-00-D006/20-00 national grant (Digital Research Infrastructure for the Arts and Humanities DARIAH-PL).