COMBO

COMBO is a jointly trained neural tagger, lemmatizer, and dependency parser implemented in Python 3 using the Keras framework. It took part in the CoNLL 2018 Universal Dependencies shared task and ranked 3rd/4th in the official evaluation.

Paper

COMBO is described in the paper Semi-Supervised Neural System for Tagging, Parsing and Lematization (Rybak and Wróblewska, 2018); the full reference is given in the Citation section.

Usage

Training your own model:

python main.py --mode autotrain --train train_data.conllu --valid valid_data.conllu --embed external_embedding.txt --model model_name.pkl --force_trees

Making predictions:

python main.py --mode predict --test test_data.conllu --pred output_path.conllu --model model_name.pkl
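
The two commands above can also be assembled from Python, which is convenient when training or evaluating many treebanks in a loop. The following is a minimal sketch, not part of COMBO itself: the helper names `build_train_command` and `build_predict_command` are hypothetical, and treating `--force_trees` as an optional switch is an assumption based only on how it appears in the training command. Only the flags shown in this README are used.

```python
import shlex


def build_train_command(train, valid, embed, model, force_trees=True, python="python"):
    """Assemble the documented `--mode autotrain` call as an argument list.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    cmd = [
        python, "main.py",
        "--mode", "autotrain",
        "--train", train,
        "--valid", valid,
        "--embed", embed,
        "--model", model,
    ]
    # Assumption: --force_trees is a value-less switch, as in the README command.
    if force_trees:
        cmd.append("--force_trees")
    return cmd


def build_predict_command(test, pred, model, python="python"):
    """Assemble the documented `--mode predict` call as an argument list."""
    return [
        python, "main.py",
        "--mode", "predict",
        "--test", test,
        "--pred", pred,
        "--model", model,
    ]


if __name__ == "__main__":
    train_cmd = build_train_command(
        "train_data.conllu", "valid_data.conllu",
        "external_embedding.txt", "model_name.pkl",
    )
    predict_cmd = build_predict_command(
        "test_data.conllu", "output_path.conllu", "model_name.pkl",
    )
    # Print shell-ready strings; nothing is executed here.
    print(shlex.join(train_cmd))
    print(shlex.join(predict_cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` from the repository root so that `main.py` is found.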

Trained models

Models trained on Universal Dependencies (UD) treebanks:

| Language | Treebank | LAS | MLAS | BLEX | Model size |
|-|-|-|-|-|-|
| Afrikaans | af_afribooms | 84.72 | 72.91 | 74.98 | 377 MB |
| Ancient Greek | grc_perseus | 74.20 | 53.30 | 54.29 | 101 MB |
| Ancient Greek | grc_proiel | 76.45 | 59.95 | 67.47 | 101 MB |
| Arabic | ar_padt | 71.95 | 62.75 | 64.38 | 737 MB |
| Armenian | hy_armtdp | 28.15 | 5.02 | 11.25 | 738 MB |
| Basque | eu_bdt | 83.12 | 68.82 | 77.96 | 737 MB |
| Bulgarian | bg_btb | 89.36 | 81.10 | 79.98 | 738 MB |
| Buryat | bxr_bdt | 15.16 | 1.09 | 1.92 | 90 MB |
| Catalan | ca_ancora | 90.54 | 83.11 | 85.20 | 737 MB |
| Chinese | zh_gsd | 63.92 | 53.48 | 57.84 | 744 MB |
| Croatian | hr_set | 86.32 | 71.12 | 79.74 | 737 MB |
| Czech | cs_cac | 90.72 | 83.27 | 86.69 | 740 MB |
| Czech | cs_fictree | 91.83 | 84.23 | 87.81 | 740 MB |
| Czech | cs_pdt | 90.34 | 84.04 | 86.96 | 740 MB |
| Danish | da_ddt | 83.43 | 74.22 | 77.58 | 737 MB |
| Dutch | nl_alpino | 87.15 | 74.93 | 77.06 | 737 MB |
| Dutch | nl_lassysmall | 84.27 | 72.65 | 75.44 | 737 MB |
| English | en_ewt | 82.31 | 73.33 | 76.52 | 737 MB |
| English | en_gum | 82.82 | 73.24 | 73.57 | 737 MB |
| English | en_lines | 80.33 | 72.25 | 74.01 | 737 MB |
| Estonian | et_edt | 83.46 | 75.79 | 72.07 | 738 MB |
| Finnish | fi_ftb | 86.89 | 78.42 | 81.06 | 739 MB |
| Finnish | fi_tdt | 85.93 | 78.65 | 72.39 | 739 MB |
| French | fr_gsd | 85.42 | 77.08 | 79.72 | 738 MB |
| French | fr_sequoia | 88.99 | 81.48 | 84.67 | 738 MB |
| French | fr_spoken | 74.31 | 63.43 | 65.34 | 738 MB |
| Galician | gl_ctg | 81.17 | 68.15 | 73.60 | 736 MB |
| Galician | gl_treegal | 73.21 | 52.88 | 62.86 | 736 MB |
| German | de_gsd | 77.43 | 54.28 | 68.59 | 738 MB |
| Gothic | got_proiel | 65.87 | 50.81 | 59.30 | 48 MB |
| Greek | el_gdt | 88.49 | 76.15 | 78.57 | 738 MB |
| Hebrew | he_htb | 63.69 | 50.26 | 53.58 | 737 MB |
| Hindi | hi_hdtb | 91.43 | 76.23 | 86.29 | 593 MB |
| Hungarian | hu_szeged | 79.47 | 66.09 | 72.51 | 737 MB |
| Indonesian | id_gsd | 78.40 | 67.30 | 75.10 | 737 MB |
| Irish | ga_idt | 69.24 | 37.31 | 47.32 | 206 MB |
| Italian | it_isdt | 91.03 | 83.18 | 84.76 | 737 MB |
| Italian | it_postwita | 73.99 | 61.14 | 62.98 | 737 MB |
| Japanese | ja_gsd | 73.69 | 57.82 | 60.62 | 743 MB |
| Kazakh | kk_ktb | 22.38 | 4.40 | 7.86 | 738 MB |
| Korean | ko_gsd | 80.66 | 74.49 | 66.13 | 741 MB |
| Korean | ko_kaist | 84.88 | 76.92 | 72.40 | 743 MB |
| Kurmanji | kmr_mg | 21.95 | 2.26 | 5.01 | 45 MB |
| Latin | la_ittb | 85.54 | 79.84 | 83.51 | 526 MB |
| Latin | la_perseus | 68.07 | 49.77 | 52.75 | 526 MB |
| Latin | la_proiel | 70.08 | 56.82 | 64.94 | 526 MB |
| Latvian | lv_lvtb | 80.71 | 66.22 | 71.80 | 637 MB |
| North Sámi | sme_giella | 57.16 | 39.66 | 45.03 | 47 MB |
| Norwegian | no_bokmaal | 89.33 | 79.51 | 84.68 | 737 MB |
| Norwegian | no_nynorsk | 88.36 | 79.32 | 82.89 | 737 MB |
| Norwegian | no_nynorsklia | 68.26 | 57.51 | 60.98 | 737 MB |
| Old Church Slavonic | cu_proiel | 71.14 | 56.52 | 66.04 | 48 MB |
| Old French | fro_srcmf | 84.81 | 76.75 | 81.20 | 52 MB |
| Persian | fa_seraji | 86.14 | 80.30 | 76.29 | 737 MB |
| Polish | pl_lfg | 94.62 | 86.44 | 89.31 | 737 MB |
| Polish | pl_sz | 91.38 | 80.45 | 85.59 | 737 MB |
| Polish | poleval2018 | 86.11 | 76.18 | 79.86 | 115 MB |
| Portuguese | pt_bosque | 87.57 | 74.31 | 80.31 | 737 MB |
| Romanian | ro_rrt | 85.31 | 76.84 | 79.54 | 737 MB |
| Russian | ru_syntagrus | 91.10 | 85.37 | 87.16 | 741 MB |
| Russian | ru_taiga | 74.24 | 61.59 | 64.36 | 741 MB |
| Serbian | sr_set | 87.27 | 73.79 | 79.92 | 738 MB |
| Slovak | sk_snk | 83.76 | 63.97 | 75.34 | 54 MB |
| Slovenian | sl_ssj | 85.72 | 75.07 | 81.11 | 737 MB |
| Slovenian | sl_sst | 58.12 | 45.93 | 50.94 | 737 MB |
| Spanish | es_ancora | 89.68 | 82.60 | 84.51 | 737 MB |
| Swedish | sv_lines | 81.97 | 66.26 | 77.01 | 737 MB |
| Swedish | sv_talbanken | 85.89 | 77.68 | 80.74 | 737 MB |
| Turkish | tr_imst | 63.54 | 52.51 | 58.89 | 737 MB |
| Ukrainian | uk_iu | 84.71 | 69.88 | 77.97 | 738 MB |
| Upper Sorbian | hsb_ufal | 21.30 | 1.45 | 4.53 | 139 MB |
| Urdu | ur_udtb | 81.53 | 55.70 | 72.49 | 485 MB |
| Uyghur | ug_udt | 63.10 | 40.71 | 52.76 | 165 MB |
| Vietnamese | vi_vtb | 42.53 | 35.11 | 38.47 | 736 MB |
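
For readers unfamiliar with the LAS column, the sketch below illustrates the basic idea of a labeled attachment score: the percentage of tokens whose predicted head and dependency label both match the gold annotation. It is a simplified illustration, not the official shared-task scorer. It assumes gold and predicted analyses share the same tokenization, whereas the official evaluation also handles tokenization mismatches, and MLAS and BLEX impose further conditions that are not modeled here. The function name `attachment_scores` and the toy sentence are hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent for one or more sentences.

    gold, pred: equal-length lists of (head_index, deprel) pairs, one per token.
    Assumes identical tokenization on both sides (a simplification).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tokens")
    if not gold:
        raise ValueError("at least one token is required")

    total = len(gold)
    # UAS: only the head has to match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both head and dependency label have to match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


if __name__ == "__main__":
    # Toy four-token sentence (made-up data, not taken from any treebank).
    gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
    pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
    uas, las = attachment_scores(gold, pred)
    print(f"UAS = {uas:.2f}, LAS = {las:.2f}")  # UAS = 75.00, LAS = 50.00
```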

License

CC BY-NC-SA 4.0

Citation

@InProceedings{rybak-wrblewska:2018:K18-2,
  author    = {Rybak, Piotr  and  Wr{\'{o}}blewska, Alina},
  title     = {Semi-Supervised Neural System for Tagging, Parsing and Lematization},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {45--54},
  url       = {http://www.aclweb.org/anthology/K18-2004}
}