
On-line demo

https://constituency.nlp.ipipan.waw.pl

Publication

Katarzyna Krasnowska-Kieraś and Marcin Woliński. Parsing headed constituencies. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12633–12643, Torino, Italy, 2024. ELRA and ICCL.

Datasets used in the paper

Installation

pip install http://mozart.ipipan.waw.pl/~kkrasnowska/Hydra/hydra-0.6-py3-none-any.whl

Dockerfile (recommended):

FROM tensorflow/tensorflow:2.10.0-gpu

RUN /usr/bin/python3 -m pip install --upgrade pip

RUN pip install http://mozart.ipipan.waw.pl/~kkrasnowska/Hydra/hydra-0.6-py3-none-any.whl

Usage

Set a limit for GPU memory usage (recommended):

>>> import tensorflow as tf

>>> GPU_MEM = 4*1024

>>> gpus = tf.config.list_physical_devices('GPU')
>>> if gpus:
...     try:
...         tf.config.set_logical_device_configuration(
...             gpus[0],
...             [tf.config.LogicalDeviceConfiguration(memory_limit=GPU_MEM)]
...         )
...         logical_gpus = tf.config.list_logical_devices('GPU')
...         print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
...     except RuntimeError as e:
...         print(e)

Import Hydra and load a model:

>>> from hydra import Hydra

>>> hydra_parser = Hydra.load('path/to/the/model/')

Parse some sentences:

>>> trees = hydra_parser.parse([
...    'Kot norweski jest bardzo inteligentny.',
...    'Przywiązuje się do właściciela i jego domu.'])
>>> trees[0].pretty_print()
[0] *ROOT[]
    [0] *S[]
        [0] NP[]
            [0] *NP[]
                [0] *N[]
                    [0] *('Kot', 'kot', 'subst:sg:nom:m2')[]
            [0] AdjP[]
                [0] *Adj[]
                    [0] *('norweski', 'norweski', 'adj:sg:nom:m2:pos')[]
        [0] *VP[]
            [0] *V[]
                [0] *('jest', 'być', 'fin:sg:ter:imperf')[]
        [0] AdjP[]
            [0] AdvP[]
                [0] *Adv[]
                    [0] *('bardzo', 'bardzo', 'adv:pos')[]
            [0] *AdjP[]
                [0] *Adj[]
                    [0] *('inteligentny', 'inteligentny', 'adj:sg:nom:m1:pos')[]
    [0] Punct[]
        [0] *('.', '.', 'interp')[]

For results in JSON format, use return_jsons=True:

>>> hydra_parser.parse('Przywiązuje się do właściciela i jego domu.', return_jsons=True)[0]
{'tree': {'is_head': True, 'span': {'from': 0, 'to': 8}, 'attributes': {}, 'deprel': 'root', 'category': 'ROOT', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 7}, 'category': 'S', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'VP', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'V', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'orth': 'Przywiązuje', 'base': 'przywiązywać', 'tag': 'fin:sg:ter:imperf', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 1, 'to': 2}, 'attributes': {}, 'deprel': 'refl', 'category': 'Part', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'category': 'Refl', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'orth': 'się', 'base': 'się', 'tag': 'part', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 2, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'PrepNP', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'category': 'Prep', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'orth': 'do', 'base': 'do', 'tag': 'prep:gen', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 3, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 3, 'to': 4}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'orth': 'właściciela', 'base': 'właściciel', 'tag': 'subst:sg:gen:m1', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 4, 'to': 5}, 'category': 'Conj', 'children': [{'is_head': True, 'span': {'from': 4, 'to': 5}, 'orth': 'i', 'base': 'i', 'tag': 'conj', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 5, 'to': 7}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 5, 
'to': 6}, 'attributes': {}, 'deprel': 'adjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'orth': 'jego', 'base': 'on', 'tag': 'ppron3:sg:gen:m1:ter:akc:npraep', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'orth': 'domu', 'base': 'dom', 'tag': 'subst:sg:gen:m3', 'features': {'nps': False}}]}]}]}]}]}]}, {'is_head': False, 'span': {'from': 7, 'to': 8}, 'attributes': {}, 'deprel': 'punct', 'category': 'Punct', 'children': [{'is_head': True, 'span': {'from': 7, 'to': 8}, 'orth': '.', 'base': '.', 'tag': 'interp', 'features': {'nps': True}}]}]}, 'metadata': {}}
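The JSON result is a nested dict: nonterminal nodes carry 'category' and 'children', while leaves carry 'orth', 'base' and 'tag'. The following is a minimal sketch of how such a dict could be flattened. The helpers iter_leaves and iter_constituents are hypothetical and not part of Hydra, and the sample is the reflexive subtree copied from the output above.

```python
def iter_leaves(node):
    """Yield (orth, base, tag) for every leaf, left to right."""
    if 'orth' in node:  # leaves carry 'orth'/'base'/'tag' instead of 'children'
        yield node['orth'], node['base'], node['tag']
        return
    for child in node.get('children', []):
        yield from iter_leaves(child)


def iter_constituents(node):
    """Yield (category, from, to) for every nonterminal node, top-down."""
    if 'children' in node:
        yield node['category'], node['span']['from'], node['span']['to']
        for child in node['children']:
            yield from iter_constituents(child)


# The 'się' subtree, copied from the JSON output shown above.
sample = {
    'is_head': False, 'span': {'from': 1, 'to': 2}, 'attributes': {},
    'deprel': 'refl', 'category': 'Part',
    'children': [{
        'is_head': True, 'span': {'from': 1, 'to': 2}, 'category': 'Refl',
        'children': [{
            'is_head': True, 'span': {'from': 1, 'to': 2},
            'orth': 'się', 'base': 'się', 'tag': 'part',
            'features': {'nps': False},
        }],
    }],
}

leaves = list(iter_leaves(sample))
constituents = list(iter_constituents(sample))
print(leaves)
print(constituents)
```

For a full parse, the same helpers would be applied to the 'tree' entry of a result returned with return_jsons=True.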

To additionally correct lemmata with Morfeusz (Polish only), use correct_lemmata=True. This is recommended and especially useful for rare inflectional paradigms:

>>> def get_lemmata(tree):
...     return [leaf.category[1] for leaf in tree.get_yield()]
... 
>>> get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.')[0])
['młynarz', 'mełnąć', 'zboże', 'na', 'mąka', '.']
>>> get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.', correct_lemmata=True)[0])
['młynarz', 'mleć', 'zboże', 'na', 'mąka', '.']

To process an already-tokenized text, pass the sentence with space-separated tokens and use is_tokenized=True:

>>> hydra_parser.parse('Miałem kota.')[0].pretty_print()
[0] *ROOT[]
    [0] *S[]
        [0] *VP[]
            [0] *V[]
                [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
                [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
        [0] NP[]
            [0] *N[]
                [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
    [0] Punct[]
        [0] *('.', '.', 'interp')[]
>>> hydra_parser.parse(['Miał em kota .'], is_tokenized=True)[0].pretty_print()
[0] *ROOT[]
    [0] *S[]
        [0] *VP[]
            [0] *V[]
                [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
                [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
        [0] NP[]
            [0] *N[]
                [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
    [0] Punct[]
        [0] *('.', '.', 'interp')[]
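When tokens come from an external tokenizer as lists, they need to be joined with single spaces before being passed with is_tokenized=True. A minimal sketch of that step follows. The helper to_pretokenized is hypothetical and not part of Hydra, and rejecting tokens that contain whitespace is an assumption about what space-separated input requires.

```python
def to_pretokenized(sentences):
    """Join each list of tokens into one space-separated string."""
    joined = []
    for tokens in sentences:
        for token in tokens:
            # An empty token or one with inner whitespace would change
            # the token count after splitting on spaces.
            if token.split() != [token]:
                raise ValueError(f'invalid token: {token!r}')
        joined.append(' '.join(tokens))
    return joined


batch = to_pretokenized([['Miał', 'em', 'kota', '.']])
print(batch)
```

The resulting list has the same form as the input passed with is_tokenized=True in the example above.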

Training

Load a dataset (e.g. downloaded from here):

>>> from datasets import load_dataset

>>> dataset = load_dataset('path/to/dataset')

Train a Hydra model (requires dataset to contain 'train' and 'validation' parts):

>>> from hydra import Trainer
>>> trainer = Trainer(
...     'allegro/herbert-large-cased',
...     dataset=dataset,
...     segmentation=True,
...     lemmatisation=True,
...     tagging=True,
...     dependency=True,
...     spines=True
... )
>>> hydra_parser = trainer.train(
...     epochs=50,
...     patience=3,
...     log_dir='...',
...     model_dir='...',
... )

Any combination of segmentation, lemmatisation, tagging, dependency and spines can be excluded from the model by setting the corresponding argument to False. If dependency and/or spines are set to False at training time, the model will not produce trees. In that case, pass return_labels=True at prediction time to receive results as labels for individual tokens:

>>> no_trees = Hydra.load('path/to/other/model/')
>>> no_trees.parse('Ala ma kota.')
... RuntimeError: This model cant parse and wont return trees/jsons, use return_labels=True.
>>> no_trees.parse('Ala ma kota.', return_labels=True)
... [(['Ala', 'ma', 'kota', '.'], {'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp'], 'lemmas': ['Ala', 'mieć', 'kot', '.']})] 
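Each result returned with return_labels=True pairs a token list with a dict of label lists. A minimal sketch of regrouping one such pair into per-token records is given here. The helper label_rows is hypothetical and not part of Hydra, and it assumes that every label list is aligned with the tokens, as in the output shown above.

```python
def label_rows(result):
    """Turn one (tokens, labels) pair into a list of per-token dicts."""
    tokens, labels = result
    rows = []
    for i, token in enumerate(tokens):
        row = {'token': token}
        for name, values in labels.items():
            row[name] = values[i]  # assumes one label per token
        rows.append(row)
    return rows


# The result shown above, copied verbatim.
results = [(['Ala', 'ma', 'kota', '.'],
            {'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf',
                      'subst:sg:acc:m2', 'interp'],
             'lemmas': ['Ala', 'mieć', 'kot', '.']})]

rows = label_rows(results[0])
for row in rows:
    print(row['token'], row['lemmas'], row['tags'], sep='\t')
```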

To train particular components of the model, your dataset must contain the following columns:

| component | columns |
| --- | --- |
| segmentation | (no columns required) |
| lemmatisation | lemmas : datasets.Sequence of datasets.Value("string") |
| tagging | tags : datasets.Sequence of datasets.Value("string") |
| dependency | heads : datasets.Sequence of datasets.Value("int16") (first token = 0, root = None); deprels : datasets.Sequence of datasets.features.ClassLabel |
| spines | nonterminals : [{'cat': datasets.Value("string"), 'children': [datasets.Value("int16")]}] |

A tokens column is always required for training.
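Before starting a training run, it can help to check that the dataset has the columns required by the enabled components. A minimal sketch of such a check follows. The mapping is transcribed from the requirements listed above, and missing_columns is a hypothetical helper, not part of Hydra.

```python
# Columns required per component, transcribed from the list above.
REQUIRED_COLUMNS = {
    'segmentation': [],
    'lemmatisation': ['lemmas'],
    'tagging': ['tags'],
    'dependency': ['heads', 'deprels'],
    'spines': ['nonterminals'],
}


def missing_columns(columns, components):
    """Return the required columns that are absent from `columns`."""
    needed = ['tokens']  # always required for training
    for component in components:
        needed.extend(REQUIRED_COLUMNS[component])
    return [name for name in needed if name not in columns]


missing = missing_columns(
    columns=['tokens', 'lemmas', 'tags'],
    components=['lemmatisation', 'tagging', 'dependency'],
)
print(missing)
```

With a dataset loaded through load_dataset, the column list of a split is available as its column_names attribute, for example dataset['train'].column_names.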

Evaluation

To evaluate the parser, call trainer.evaluate(). Metrics are calculated separately for every part of the dataset passed to Trainer() other than 'train' and 'validation':

>>> evaluation_results = trainer.evaluate(hydra_parser)

Acknowledgements

Work supported by POIR.04.02.00-00-D006/20-00 national grant (Digital Research Infrastructure for the Arts and Humanities DARIAH-PL).