Parser

On-line demo

https://constituency.nlp.ipipan.waw.pl

Publication

Katarzyna Krasnowska-Kieraś and Marcin Woliński. Parsing headed constituencies. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12633–12643, Torino, Italy, 2024. ELRA and ICCL.

Datasets used in the paper

Installation

⚠️ The wheel is dependency-heavy and installs specific versions of large libraries such as TensorFlow and Transformers. Using a virtual environment is highly recommended.

pip install http://mozart.ipipan.waw.pl/~kkrasnowska/Hydra/hydra-0.7-py3-none-any.whl

Models

Usage

Set the TF_USE_LEGACY_KERAS environment variable (there will be errors otherwise):

import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"
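The safest place to set the variable is before TensorFlow (or anything that imports it) is loaded for the first time. A minimal sketch of a guard for this, using only the standard library; the helper name `enable_legacy_keras` is hypothetical and not part of Hydra:

```python
import os
import sys
import warnings


def enable_legacy_keras() -> bool:
    """Set TF_USE_LEGACY_KERAS=1 and report whether this happened early enough.

    Returns True if TensorFlow had not been imported yet, False otherwise.
    (Hypothetical helper, not part of the Hydra package.)
    """
    already_imported = "tensorflow" in sys.modules
    if already_imported:
        warnings.warn(
            "TensorFlow is already imported; TF_USE_LEGACY_KERAS may be ignored."
        )
    os.environ["TF_USE_LEGACY_KERAS"] = "1"
    return not already_imported


# Call this at the very top of a script, before `import tensorflow` or `import hydra`.
enable_legacy_keras()
```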

If using GPU, set a limit for GPU memory usage (highly recommended):

import tensorflow as tf

GPU_MEM = 4*1024

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=GPU_MEM)]
        )
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

Import Hydra and load a model:

from hydra import Hydra

hydra_parser = Hydra.load('path/to/the/model/')

Parse some sentences (use of force_root_label and root_label parameters is highly recommended):

trees = hydra_parser.parse([
   'Kot norweski jest bardzo inteligentny.',
   'Przywiązuje się do właściciela i jego domu.'],
   return_trees=True, force_root_label=True, root_label='ROOT')
trees[0]['tree'].pretty_print()

# expect something similar to (actual outputs may differ depending on model):
#[0] *ROOT[]
#    [0] *S[]
#        [0] NP[]
#            [0] *NP[]
#                [0] *N[]
#                    [0] *('Kot', 'kot', 'subst:sg:nom:m2')[]
#            [0] AdjP[]
#                [0] *Adj[]
#                    [0] *('norweski', 'norweski', 'adj:sg:nom:m2:pos')[]
#        [0] *VP[]
#            [0] *V[]
#                [0] *('jest', 'być', 'fin:sg:ter:imperf')[]
#        [0] AdjP[]
#            [0] AdvP[]
#                [0] *Adv[]
#                    [0] *('bardzo', 'bardzo', 'adv:pos')[]
#            [0] *AdjP[]
#                [0] *Adj[]
#                    [0] *('inteligentny', 'inteligentny', 'adj:sg:nom:m1:pos')[]
#    [0] Punct[]
#        [0] *('.', '.', 'interp')[]

For results in JSON format, use return_jsons=True:

hydra_parser.parse('Przywiązuje się do właściciela i jego domu.', return_jsons=True)[0]

#{'tree': {'is_head': True, 'span': {'from': 0, 'to': 8}, 'attributes': {}, 'deprel': 'root', 'category': 'ROOT', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 7}, 'category': 'S', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'VP', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'V', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'orth': 'Przywiązuje', 'base': 'przywiązywać', 'tag': 'fin:sg:ter:imperf', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 1, 'to': 2}, 'attributes': {}, 'deprel': 'refl', 'category': 'Part', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'category': 'Refl', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'orth': 'się', 'base': 'się', 'tag': 'part', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 2, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'PrepNP', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'category': 'Prep', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'orth': 'do', 'base': 'do', 'tag': 'prep:gen', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 3, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 3, 'to': 4}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'orth': 'właściciela', 'base': 'właściciel', 'tag': 'subst:sg:gen:m1', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 4, 'to': 5}, 'category': 'Conj', 'children': [{'is_head': True, 'span': {'from': 4, 'to': 5}, 'orth': 'i', 'base': 'i', 'tag': 'conj', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 5, 'to': 7}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 5, 'to': 6}, 'attributes': {}, 'deprel': 'adjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'orth': 'jego', 'base': 'on', 'tag': 'ppron3:sg:gen:m1:ter:akc:npraep', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'orth': 'domu', 'base': 'dom', 'tag': 'subst:sg:gen:m3', 'features': {'nps': False}}]}]}]}]}]}]}, {'is_head': False, 'span': {'from': 7, 'to': 8}, 'attributes': {}, 'deprel': 'punct', 'category': 'Punct', 'children': [{'is_head': True, 'span': {'from': 7, 'to': 8}, 'orth': '.', 'base': '.', 'tag': 'interp', 'features': {'nps': True}}]}]}, 'metadata': {}}
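The nested dictionaries can be walked with plain Python. The sketch below assumes only the keys visible in the output above (`children`, `is_head`, `orth`, `base`, `tag`); the helper names `iter_leaves` and `lexical_head` are hypothetical, and `sample` is a hand-trimmed fragment of that output rather than real parser output:

```python
def iter_leaves(node):
    """Yield (orth, base, tag) for every token under `node`, left to right."""
    if "orth" in node:  # token nodes carry 'orth', 'base' and 'tag'
        yield node["orth"], node["base"], node["tag"]
        return
    for child in node.get("children", []):
        yield from iter_leaves(child)


def lexical_head(node):
    """Follow children marked is_head=True down to a token and return its orth."""
    while "orth" not in node:
        heads = [c for c in node.get("children", []) if c.get("is_head")]
        if not heads:
            return None  # no head child marked; give up
        node = heads[0]
    return node["orth"]


# Hand-trimmed fragment ('do właściciela') in the shape shown above.
sample = {
    "tree": {
        "is_head": False, "span": {"from": 2, "to": 4}, "category": "PrepNP",
        "children": [
            {"is_head": True, "span": {"from": 2, "to": 3}, "category": "Prep",
             "children": [{"is_head": True, "span": {"from": 2, "to": 3},
                           "orth": "do", "base": "do", "tag": "prep:gen"}]},
            {"is_head": False, "span": {"from": 3, "to": 4}, "category": "NP",
             "children": [{"is_head": True, "span": {"from": 3, "to": 4},
                           "category": "N",
                           "children": [{"is_head": True,
                                         "span": {"from": 3, "to": 4},
                                         "orth": "właściciela",
                                         "base": "właściciel",
                                         "tag": "subst:sg:gen:m1"}]}]},
        ],
    },
    "metadata": {},
}

leaves = list(iter_leaves(sample["tree"]))
head = lexical_head(sample["tree"])
```

With real output, the same functions would be called on `result['tree']`, where `result` is one element of the list returned with return_jsons=True.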

To additionally correct lemmata with Morfeusz (Polish only), use correct_lemmata=True. This is recommended and especially useful for rare inflectional paradigms:

def get_lemmata(tree):
    return [leaf.category[1] for leaf in tree.get_yield()]

get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.', return_trees=True)[0]['tree'])
#['młynarz', 'mełnąć', 'zboże', 'na', 'mąka', '.']
get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.', return_trees=True, correct_lemmata=True)[0]['tree'])
#['młynarz', 'mleć', 'zboże', 'na', 'mąka', '.']
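To see what the correction changed, the two lemma lists can be compared position by position. A small sketch using the lists shown above; `changed_lemmata` is a hypothetical helper, not part of Hydra:

```python
def changed_lemmata(before, after):
    """Return (index, old, new) for every position where the lemma differs."""
    if len(before) != len(after):
        raise ValueError("lemma lists differ in length")
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


plain = ['młynarz', 'mełnąć', 'zboże', 'na', 'mąka', '.']
corrected = ['młynarz', 'mleć', 'zboże', 'na', 'mąka', '.']

changes = changed_lemmata(plain, corrected)
print(changes)  # [(1, 'mełnąć', 'mleć')]
```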

To process already-tokenized text, pass each sentence as a string of space-separated tokens and set is_tokenized=True:

hydra_parser.parse('Miałem kota.', return_trees=True)[0]['tree'].pretty_print()

#[0] *ROOT[]
#    [0] *S[]
#        [0] *VP[]
#            [0] *V[]
#                [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
#                [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
#        [0] NP[]
#            [0] *N[]
#                [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
#    [0] Punct[]
#        [0] *('.', '.', 'interp')[]

hydra_parser.parse(['Miał em kota .'], is_tokenized=True, return_trees=True)[0]['tree'].pretty_print()

#[0] *ROOT[]
#    [0] *S[]
#        [0] *VP[]
#            [0] *V[]
#                [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
#                [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
#        [0] NP[]
#            [0] *N[]
#                [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
#    [0] Punct[]
#        [0] *('.', '.', 'interp')[]
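If the tokens come from another tool as a list, they need to be joined into the space-separated form first. A minimal sketch, assuming that tokens must not themselves contain whitespace (which would make the split ambiguous); `join_tokens` is a hypothetical helper:

```python
def join_tokens(tokens):
    """Join pre-tokenized input into one space-separated string."""
    for token in tokens:
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(
                f"token must be non-empty and contain no whitespace: {token!r}"
            )
    return " ".join(tokens)


sentence = join_tokens(["Miał", "em", "kota", "."])
print(sentence)  # Miał em kota .
```

The resulting string can then be passed in a list together with is_tokenized=True, as in the call above.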

Training

Load a dataset (e.g. downloaded from here):

from datasets import load_dataset

dataset = load_dataset('path/to/dataset')

Train a Hydra model (requires dataset to contain 'train' and 'validation' parts):

from hydra import Trainer
trainer = Trainer(
    'allegro/herbert-large-cased',
    dataset=dataset,
    segmentation=True,
    lemmatisation=True,
    tagging=True,
    dependency=True,
    spines=True
)
hydra_parser = trainer.train(
    epochs=50,
    patience=3,
    log_dir='...',
    model_dir='...',
)
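Before constructing the Trainer, it may be worth checking that the required parts are present. The sketch below relies only on the dataset being dict-like (as the object returned by load_dataset for a multi-split dataset is); a plain dict stands in for it here, and `missing_splits` is a hypothetical helper:

```python
REQUIRED_SPLITS = ("train", "validation")


def missing_splits(dataset):
    """Return the required split names that are absent from a dict-like dataset."""
    return [name for name in REQUIRED_SPLITS if name not in dataset]


# A plain dict stands in for the loaded dataset in this illustration.
toy_dataset = {"train": [], "test": []}

missing = missing_splits(toy_dataset)
if missing:
    print("Dataset is missing required parts:", ", ".join(missing))
```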

Most combinations of the segmentation, lemmatisation, tagging, dependency and hybrid components can be excluded from the model by setting the corresponding parameters to False. If dependency or hybrid is set to False at training time, the model will not produce trees. In that case, pass return_labels=True at prediction time to receive the results as labels for individual tokens:

no_trees = Hydra.load('path/to/other/model/')
no_trees.parse('Ala ma kota.', return_jsons=True)
# This model can’t perform hybrid parsing and won’t return jsons, setting `return_jsons` to False.
no_trees.parse('Ala ma kota.', return_labels=True)[0]['labels']
# [(['Ala', 'ma', 'kota', '.'], {'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp'], 'lemmas': ['Ala', 'mieć', 'kot', '.']})] 
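The per-token labels can be regrouped into one record per token. This sketch assumes a (tokens, labels) pair shaped like the one printed in the comment above; the exact return structure may differ between models and versions, and `rows_from_labels` is a hypothetical helper:

```python
def rows_from_labels(tokens, labels):
    """Turn parallel label lists into one dict per token."""
    keys = sorted(labels)
    for key in keys:
        if len(labels[key]) != len(tokens):
            raise ValueError(f"label list {key!r} does not match token count")
    return [
        {"token": token, **{key: labels[key][i] for key in keys}}
        for i, token in enumerate(tokens)
    ]


# Values copied from the example output above.
tokens = ['Ala', 'ma', 'kota', '.']
labels = {
    'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp'],
    'lemmas': ['Ala', 'mieć', 'kot', '.'],
}

rows = rows_from_labels(tokens, labels)
```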

To train particular components of the model, your dataset must contain the following columns:

| component | columns |
| --- | --- |
| segmentation | (no columns required) |
| lemmatisation | lemmas: datasets.Sequence of datasets.Value("string") |
| tagging | tags: datasets.Sequence of datasets.Value("string") |
| dependency | heads: datasets.Sequence of datasets.Value("int16") (first token = 0, root = None); deprels: datasets.Sequence of datasets.features.ClassLabel |
| hybrid | nonterminals: [{'cat': datasets.Value("string"), 'children': [datasets.Value("int16")]}] |

The tokens column is always required for training.
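The column requirements listed above can be encoded as a small lookup to check a dataset before training. This is a sketch under the assumption that the component names match the ones used in that list; `required_columns` and `missing_columns` are hypothetical helpers:

```python
# Columns needed per component, as listed above; 'tokens' is always required.
COMPONENT_COLUMNS = {
    "segmentation": [],
    "lemmatisation": ["lemmas"],
    "tagging": ["tags"],
    "dependency": ["heads", "deprels"],
    "hybrid": ["nonterminals"],
}


def required_columns(components):
    """Return the columns needed to train the given components, in order."""
    columns = ["tokens"]
    for component in components:
        for column in COMPONENT_COLUMNS[component]:
            if column not in columns:
                columns.append(column)
    return columns


def missing_columns(available, components):
    """Return required columns that are not among the available column names."""
    return [c for c in required_columns(components) if c not in available]


needed = required_columns(["tagging", "dependency"])
missing = missing_columns(["tokens", "tags"], ["tagging", "lemmatisation"])
```

With a loaded dataset, the available names would typically come from the column names of its 'train' part.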

Evaluation

To evaluate the parser, call trainer.evaluate(). Evaluation metrics are calculated separately for every part of the dataset passed to Trainer() other than 'train' and 'validation':

evaluation_results = trainer.evaluate(hydra_parser)
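To preview which parts would be evaluated under the rule described above, the names can be filtered directly. A sketch with a plain dict standing in for the dataset; `evaluation_parts` is a hypothetical helper, and the structure of evaluation_results itself is not assumed here:

```python
def evaluation_parts(dataset):
    """Return names of dataset parts other than 'train' and 'validation'."""
    return [name for name in dataset if name not in ("train", "validation")]


# A plain dict stands in for the dataset passed to Trainer().
toy_dataset = {"train": [], "validation": [], "test": [], "test_hard": []}

parts = evaluation_parts(toy_dataset)
print(parts)  # ['test', 'test_hard']
```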

Acknowledgements

This work was supported by the national grant POIR.04.02.00-00-D006/20-00 (Digital Research Infrastructure for the Arts and Humanities DARIAH-PL).