## On-line demo

https://constituency.nlp.ipipan.waw.pl
## Publication

Katarzyna Krasnowska-Kieraś and Marcin Woliński. Parsing headed constituencies. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, *Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)*, pages 12633–12643, Torino, Italy, 2024. ELRA and ICCL.
## Installation

⚠️ The wheel is dependency-heavy and will install specific versions of large libraries such as TensorFlow, Transformers etc. Using a virtual environment is highly recommended.

```shell
pip install http://mozart.ipipan.waw.pl/~kkrasnowska/Hydra/hydra-0.7-py3-none-any.whl
```
## Models
- trained on contemporary data
- trained on 17th–19th century and contemporary data (requires the ‘historical’ dictionary for Morfeusz2)
## Usage

Set the `TF_USE_LEGACY_KERAS` environment variable before importing TensorFlow (there will be errors otherwise):

```python
import os

os.environ["TF_USE_LEGACY_KERAS"] = "1"
```
If using a GPU, set a limit for GPU memory usage (highly recommended):

```python
import tensorflow as tf

GPU_MEM = 4 * 1024  # memory limit in MB

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=GPU_MEM)]
        )
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # virtual devices must be configured before GPUs are initialised
        print(e)
```
Import Hydra and load a model:

```python
from hydra import Hydra

hydra_parser = Hydra.load('path/to/the/model/')
```
Parse some sentences (using the `force_root_label` and `root_label` parameters is highly recommended):

```python
trees = hydra_parser.parse([
    'Kot norweski jest bardzo inteligentny.',
    'Przywiązuje się do właściciela i jego domu.'],
    return_trees=True, force_root_label=True, root_label='ROOT')
trees[0]['tree'].pretty_print()
# expect something similar to (actual outputs may differ depending on the model):
#[0] *ROOT[]
# [0] *S[]
#  [0] NP[]
#   [0] *NP[]
#    [0] *N[]
#     [0] *('Kot', 'kot', 'subst:sg:nom:m2')[]
#   [0] AdjP[]
#    [0] *Adj[]
#     [0] *('norweski', 'norweski', 'adj:sg:nom:m2:pos')[]
#  [0] *VP[]
#   [0] *V[]
#    [0] *('jest', 'być', 'fin:sg:ter:imperf')[]
#   [0] AdjP[]
#    [0] AdvP[]
#     [0] *Adv[]
#      [0] *('bardzo', 'bardzo', 'adv:pos')[]
#    [0] *AdjP[]
#     [0] *Adj[]
#      [0] *('inteligentny', 'inteligentny', 'adj:sg:nom:m1:pos')[]
# [0] Punct[]
#  [0] *('.', '.', 'interp')[]
```
For results in JSON format, use `return_jsons=True`:

```python
hydra_parser.parse('Przywiązuje się do właściciela i jego domu.', return_jsons=True)[0]
#{'tree': {'is_head': True, 'span': {'from': 0, 'to': 8}, 'attributes': {}, 'deprel': 'root', 'category': 'ROOT', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 7}, 'category': 'S', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'VP', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'category': 'V', 'children': [{'is_head': True, 'span': {'from': 0, 'to': 1}, 'orth': 'Przywiązuje', 'base': 'przywiązywać', 'tag': 'fin:sg:ter:imperf', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 1, 'to': 2}, 'attributes': {}, 'deprel': 'refl', 'category': 'Part', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'category': 'Refl', 'children': [{'is_head': True, 'span': {'from': 1, 'to': 2}, 'orth': 'się', 'base': 'się', 'tag': 'part', 'features': {'nps': False}}]}]}, {'is_head': False, 'span': {'from': 2, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'PrepNP', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'category': 'Prep', 'children': [{'is_head': True, 'span': {'from': 2, 'to': 3}, 'orth': 'do', 'base': 'do', 'tag': 'prep:gen', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 3, 'to': 7}, 'attributes': {}, 'deprel': 'comp', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 3, 'to': 4}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 3, 'to': 4}, 'orth': 'właściciela', 'base': 'właściciel', 'tag': 'subst:sg:gen:m1', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 4, 'to': 5}, 'category': 'Conj', 'children': [{'is_head': True, 'span': {'from': 4, 'to': 5}, 'orth': 'i', 'base': 'i', 'tag': 'conj', 'features': {'nps': False}}]}, {'is_head': False, 'span': {'from': 5, 'to': 7}, 'attributes': {}, 'deprel': 'conjunct', 'category': 'NP', 'children': [{'is_head': False, 'span': {'from': 5, 'to': 6}, 'attributes': {}, 'deprel': 'adjunct', 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 5, 'to': 6}, 'orth': 'jego', 'base': 'on', 'tag': 'ppron3:sg:gen:m1:ter:akc:npraep', 'features': {'nps': False}}]}]}, {'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'NP', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'category': 'N', 'children': [{'is_head': True, 'span': {'from': 6, 'to': 7}, 'orth': 'domu', 'base': 'dom', 'tag': 'subst:sg:gen:m3', 'features': {'nps': False}}]}]}]}]}]}]}, {'is_head': False, 'span': {'from': 7, 'to': 8}, 'attributes': {}, 'deprel': 'punct', 'category': 'Punct', 'children': [{'is_head': True, 'span': {'from': 7, 'to': 8}, 'orth': '.', 'base': '.', 'tag': 'interp', 'features': {'nps': True}}]}]}, 'metadata': {}}
```
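The JSON output is a plain nested dict, so it can be processed with a few lines of Python. As an illustrative sketch, here is a recursive traversal collecting `(orth, tag)` pairs from the leaves; the `tree` dict below is a hand-trimmed fragment of the output above, and in practice you would pass the `'tree'` value of a parse result instead:

```python
def leaves(node):
    """Recursively yield terminal nodes (those carrying an 'orth' key)."""
    if 'orth' in node:
        yield node
    for child in node.get('children', []):
        yield from leaves(child)

# Hand-trimmed fragment of the JSON output shown above.
tree = {
    'category': 'ROOT',
    'children': [
        {'category': 'V',
         'children': [{'orth': 'Przywiązuje', 'base': 'przywiązywać',
                       'tag': 'fin:sg:ter:imperf'}]},
        {'category': 'Punct',
         'children': [{'orth': '.', 'base': '.', 'tag': 'interp'}]},
    ],
}

print([(leaf['orth'], leaf['tag']) for leaf in leaves(tree)])
# [('Przywiązuje', 'fin:sg:ter:imperf'), ('.', 'interp')]
```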
To perform additional lemmata correction with Morfeusz (for Polish only), use `correct_lemmata=True` (recommended, especially useful in case of rare inflectional paradigms):

```python
def get_lemmata(tree):
    return [leaf.category[1] for leaf in tree.get_yield()]

get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.', return_trees=True)[0]['tree'])
# ['młynarz', 'mełnąć', 'zboże', 'na', 'mąka', '.']

get_lemmata(hydra_parser.parse('Młynarz mełł zboże na mąkę.', return_trees=True, correct_lemmata=True)[0]['tree'])
# ['młynarz', 'mleć', 'zboże', 'na', 'mąka', '.']
```
To process already-tokenized text, pass each sentence as a string of space-separated tokens and use `is_tokenized=True`:

```python
hydra_parser.parse('Miałem kota.', return_trees=True)[0]['tree'].pretty_print()
#[0] *ROOT[]
# [0] *S[]
#  [0] *VP[]
#   [0] *V[]
#    [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
#    [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
#  [0] NP[]
#   [0] *N[]
#    [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
# [0] Punct[]
#  [0] *('.', '.', 'interp')[]

hydra_parser.parse(['Miał em kota .'], is_tokenized=True, return_trees=True)[0]['tree'].pretty_print()
#[0] *ROOT[]
# [0] *S[]
#  [0] *VP[]
#   [0] *V[]
#    [0] *('Miał', 'mieć', 'praet:sg:m1:imperf')[]
#    [0] ('em', 'być', 'aglt:sg:pri:imperf:wok')[]
#  [0] NP[]
#   [0] *N[]
#    [0] *('kota', 'kot', 'subst:sg:acc:m2')[]
# [0] Punct[]
#  [0] *('.', '.', 'interp')[]
```
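If your tokens come from an upstream tokenizer as Python lists, the expected input strings can be produced by joining each list with spaces (a trivial sketch; the sentences are invented for the example):

```python
# Tokens as produced by some upstream tokenizer (example data).
sentences = [['Miał', 'em', 'kota', '.'], ['Ala', 'ma', 'kota', '.']]

# Pre-tokenized input is one space-separated string per sentence.
joined = [' '.join(tokens) for tokens in sentences]
print(joined)
# ['Miał em kota .', 'Ala ma kota .']

# then: trees = hydra_parser.parse(joined, is_tokenized=True, return_trees=True)
```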
## Training

Load a dataset (e.g. downloaded from here):

```python
from datasets import load_dataset

dataset = load_dataset('path/to/dataset')
```

Train a Hydra model (the dataset must contain 'train' and 'validation' parts):

```python
from hydra import Trainer

trainer = Trainer(
    'allegro/herbert-large-cased',
    dataset=dataset,
    segmentation=True,
    lemmatisation=True,
    tagging=True,
    dependency=True,
    spines=True
)
hydra_parser = trainer.train(
    epochs=50,
    patience=3,
    log_dir='...',
    model_dir='...',
)
```
Most combinations of the segmentation, lemmatisation, tagging, dependency and hybrid components can be excluded from the model by setting the corresponding parameters to False. If dependency or hybrid is set to False at training time, the model will not produce trees. Pass `return_labels=True` at prediction time to receive results as labels for individual tokens:

```python
no_trees = Hydra.load('path/to/other/model/')
no_trees.parse('Ala ma kota.', return_jsons=True)
# This model can’t perform hybrid parsing and won’t return jsons, setting `return_jsons` to False.

no_trees.parse('Ala ma kota.', return_labels=True)[0]['labels']
# [(['Ala', 'ma', 'kota', '.'], {'tags': ['subst:sg:nom:f', 'fin:sg:ter:imperf', 'subst:sg:acc:m2', 'interp'], 'lemmas': ['Ala', 'mieć', 'kot', '.']})]
```
To train particular components of the model, your dataset must contain the following columns:

| component | columns |
|---|---|
| segmentation | (no columns required) |
| lemmatisation | `lemmas`: `datasets.Sequence` of `datasets.Value("string")` |
| tagging | `tags`: `datasets.Sequence` of `datasets.Value("string")` |
| dependency | `heads`: `datasets.Sequence` of `datasets.Value("int16")` (first token = 0, root = None), `deprels`: `datasets.Sequence` of `datasets.features.ClassLabel` |
| hybrid | `nonterminals`: `[{'cat': datasets.Value("string"), 'children': [datasets.Value("int16")]}]` |

The `tokens` column is always required for training.
## Evaluation

To evaluate the parser (evaluation metrics will be calculated individually on every part of the dataset passed to `Trainer()` other than 'train' and 'validation'):

```python
evaluation_results = trainer.evaluate(hydra_parser)
```
## Acknowledgements

Work supported by the national grant POIR.04.02.00-00-D006/20-00 (Digital Research Infrastructure for the Arts and Humanities DARIAH-PL).