COMBO

COMBO is a jointly trained neural tagger, lemmatizer, and dependency parser implemented in Python 3 using the Keras framework. It took part in the CoNLL 2018 Universal Dependencies shared task and ranked 3rd/4th in the official evaluation.


Paper

COMBO is described in the paper Semi-Supervised Neural System for Tagging, Parsing and Lematization (http://www.aclweb.org/anthology/K18-2004).


Usage

Training your own model:
python main.py --mode autotrain --train train_data.conllu --valid valid_data.conllu --embed external_embedding.txt --model model_name.pkl --force_trees

Making predictions:
python main.py --mode predict --test test_data.conllu --pred output_path.conllu --model model_name.pkl
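
When training and prediction are driven from another Python script, the two commands above can be assembled programmatically. The following is a minimal sketch, not part of COMBO itself: the helper names `train_command`, `predict_command`, and `run` are hypothetical, it assumes the script is started from the directory containing `main.py`, and it uses only the flags shown above.

```python
import subprocess
import sys


def train_command(train, valid, embed, model):
    """Build the argument list for the training command shown above."""
    return [
        sys.executable, "main.py",
        "--mode", "autotrain",
        "--train", train,
        "--valid", valid,
        "--embed", embed,
        "--model", model,
        "--force_trees",  # passed exactly as in the README command
    ]


def predict_command(test, pred, model):
    """Build the argument list for the prediction command shown above."""
    return [
        sys.executable, "main.py",
        "--mode", "predict",
        "--test", test,
        "--pred", pred,
        "--model", model,
    ]


def run(command):
    """Run a command and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)
```

For example, `run(predict_command("test_data.conllu", "output_path.conllu", "model_name.pkl"))` would invoke the same prediction command as above. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.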


Trained models

Models trained on the UD dataset:

| Language | Treebank | LAS | MLAS | BLEX | Model size |
|-|-|-|-|-|-|
| Afrikaans | af_afribooms | 84.72 | 72.91 | 74.98 | 377 MB |
| Ancient Greek | grc_perseus | 74.20 | 53.30 | 54.29 | 101 MB |
| Ancient Greek | grc_proiel | 76.45 | 59.95 | 67.47 | 101 MB |
| Arabic | ar_padt | 71.95 | 62.75 | 64.38 | 737 MB |
| Armenian | hy_armtdp | 28.15 | 5.02 | 11.25 | 738 MB |
| Basque | eu_bdt | 83.12 | 68.82 | 77.96 | 737 MB |
| Bulgarian | bg_btb | 89.36 | 81.10 | 79.98 | 738 MB |
| Buryat | bxr_bdt | 15.16 | 1.09 | 1.92 | 90 MB |
| Catalan | ca_ancora | 90.54 | 83.11 | 85.20 | 737 MB |
| Chinese | zh_gsd | 63.92 | 53.48 | 57.84 | 744 MB |
| Croatian | hr_set | 86.32 | 71.12 | 79.74 | 737 MB |
| Czech | cs_cac | 90.72 | 83.27 | 86.69 | 740 MB |
| Czech | cs_fictree | 91.83 | 84.23 | 87.81 | 740 MB |
| Czech | cs_pdt | 90.34 | 84.04 | 86.96 | 740 MB |
| Danish | da_ddt | 83.43 | 74.22 | 77.58 | 737 MB |
| Dutch | nl_alpino | 87.15 | 74.93 | 77.06 | 737 MB |
| Dutch | nl_lassysmall | 84.27 | 72.65 | 75.44 | 737 MB |
| English | en_ewt | 82.31 | 73.33 | 76.52 | 737 MB |
| English | en_gum | 82.82 | 73.24 | 73.57 | 737 MB |
| English | en_lines | 80.33 | 72.25 | 74.01 | 737 MB |
| Estonian | et_edt | 83.46 | 75.79 | 72.07 | 738 MB |
| Finnish | fi_ftb | 86.89 | 78.42 | 81.06 | 739 MB |
| Finnish | fi_tdt | 85.93 | 78.65 | 72.39 | 739 MB |
| French | fr_gsd | 85.42 | 77.08 | 79.72 | 738 MB |
| French | fr_sequoia | 88.99 | 81.48 | 84.67 | 738 MB |
| French | fr_spoken | 74.31 | 63.43 | 65.34 | 738 MB |
| Galician | gl_ctg | 81.17 | 68.15 | 73.60 | 736 MB |
| Galician | gl_treegal | 73.21 | 52.88 | 62.86 | 736 MB |
| German | de_gsd | 77.43 | 54.28 | 68.59 | 738 MB |
| Gothic | got_proiel | 65.87 | 50.81 | 59.30 | 48 MB |
| Greek | el_gdt | 88.49 | 76.15 | 78.57 | 738 MB |
| Hebrew | he_htb | 63.69 | 50.26 | 53.58 | 737 MB |
| Hindi | hi_hdtb | 91.43 | 76.23 | 86.29 | 593 MB |
| Hungarian | hu_szeged | 79.47 | 66.09 | 72.51 | 737 MB |
| Indonesian | id_gsd | 78.40 | 67.30 | 75.10 | 737 MB |
| Irish | ga_idt | 69.24 | 37.31 | 47.32 | 206 MB |
| Italian | it_isdt | 91.03 | 83.18 | 84.76 | 737 MB |
| Italian | it_postwita | 73.99 | 61.14 | 62.98 | 737 MB |
| Japanese | ja_gsd | 73.69 | 57.82 | 60.62 | 743 MB |
| Kazakh | kk_ktb | 22.38 | 4.40 | 7.86 | 738 MB |
| Korean | ko_gsd | 80.66 | 74.49 | 66.13 | 741 MB |
| Korean | ko_kaist | 84.88 | 76.92 | 72.40 | 743 MB |
| Kurmanji | kmr_mg | 21.95 | 2.26 | 5.01 | 45 MB |
| Latin | la_ittb | 85.54 | 79.84 | 83.51 | 526 MB |
| Latin | la_perseus | 68.07 | 49.77 | 52.75 | 526 MB |
| Latin | la_proiel | 70.08 | 56.82 | 64.94 | 526 MB |
| Latvian | lv_lvtb | 80.71 | 66.22 | 71.80 | 637 MB |
| North Sámi | sme_giella | 57.16 | 39.66 | 45.03 | 47 MB |
| Norwegian | no_bokmaal | 89.33 | 79.51 | 84.68 | 737 MB |
| Norwegian | no_nynorsk | 88.36 | 79.32 | 82.89 | 737 MB |
| Norwegian | no_nynorsklia | 68.26 | 57.51 | 60.98 | 737 MB |
| Old Church Slavonic | cu_proiel | 71.14 | 56.52 | 66.04 | 48 MB |
| Old French | fro_srcmf | 84.81 | 76.75 | 81.20 | 52 MB |
| Persian | fa_seraji | 86.14 | 80.30 | 76.29 | 737 MB |
| Polish | pl_lfg | 94.62 | 86.44 | 89.31 | 737 MB |
| Polish | pl_sz | 91.38 | 80.45 | 85.59 | 737 MB |
| Polish | poleval2018 | 86.11 | 76.18 | 79.86 | 115 MB |
| Portuguese | pt_bosque | 87.57 | 74.31 | 80.31 | 737 MB |
| Romanian | ro_rrt | 85.31 | 76.84 | 79.54 | 737 MB |
| Russian | ru_syntagrus | 91.10 | 85.37 | 87.16 | 741 MB |
| Russian | ru_taiga | 74.24 | 61.59 | 64.36 | 741 MB |
| Serbian | sr_set | 87.27 | 73.79 | 79.92 | 738 MB |
| Slovak | sk_snk | 83.76 | 63.97 | 75.34 | 54 MB |
| Slovenian | sl_ssj | 85.72 | 75.07 | 81.11 | 737 MB |
| Slovenian | sl_sst | 58.12 | 45.93 | 50.94 | 737 MB |
| Spanish | es_ancora | 89.68 | 82.60 | 84.51 | 737 MB |
| Swedish | sv_lines | 81.97 | 66.26 | 77.01 | 737 MB |
| Swedish | sv_talbanken | 85.89 | 77.68 | 80.74 | 737 MB |
| Turkish | tr_imst | 63.54 | 52.51 | 58.89 | 737 MB |
| Ukrainian | uk_iu | 84.71 | 69.88 | 77.97 | 738 MB |
| Upper Sorbian | hsb_ufal | 21.30 | 1.45 | 4.53 | 139 MB |
| Urdu | ur_udtb | 81.53 | 55.70 | 72.49 | 485 MB |
| Uyghur | ug_udt | 63.10 | 40.71 | 52.76 | 165 MB |
| Vietnamese | vi_vtb | 42.53 | 35.11 | 38.47 | 736 MB |
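
For readers unfamiliar with the LAS column: the labeled attachment score counts the words whose predicted head and dependency relation both match the gold annotation. The sketch below is a simplified illustration, not the official shared task evaluation script. It assumes the gold and predicted CoNLL-U texts have identical tokenization (the official evaluation aligns the two and reports an F1 score), and the function names `read_arcs` and `las` are hypothetical.

```python
def read_arcs(conllu_text):
    """Return (HEAD, DEPREL) pairs for the syntactic words in a CoNLL-U string."""
    arcs = []
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "3.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Column 7 is HEAD and column 8 is DEPREL; relation subtypes
        # such as "nmod:poss" are reduced to the universal part "nmod".
        arcs.append((cols[6], cols[7].split(":")[0]))
    return arcs


def las(gold_text, pred_text):
    """Percentage of words with both the correct head and the correct relation."""
    gold = read_arcs(gold_text)
    pred = read_arcs(pred_text)
    if len(gold) != len(pred):
        raise ValueError("this sketch assumes identical tokenization")
    if not gold:
        raise ValueError("no words found in the gold data")
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)
```

With a two-word gold sentence and a prediction that attaches one of the two words to the wrong head, `las` returns 50.0.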


License

CC BY-NC-SA 4.0


Citation
@InProceedings{rybak-wrblewska:2018:K18-2,
  author    = {Rybak, Piotr  and  Wr{\'{o}}blewska, Alina},
  title     = {Semi-Supervised Neural System for Tagging, Parsing and Lematization},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {45--54},
  url       = {http://www.aclweb.org/anthology/K18-2004}
}