Summary

The Polish PDB-UD treebank is automatically converted from the Polish Dependency Bank 2.0 (PDB 2.0). Both treebanks, PDB 2.0 and PDB-UD, were created at the Institute of Computer Science, Polish Academy of Sciences in Warsaw (Poland). As of release 2.4, PDB-UD is included in the Universal Dependencies collection (UD Polish PDB), where it replaced the first UD Polish treebank, UD Polish SZ (UD releases 1.1-2.3).

NKJP1M-UD is the part of the National Corpus of Polish (NKJP) with manual morpho-syntactic annotations and partially manual dependency trees, automatically converted into the Universal Dependencies format.

PDB-UD

The treebank consists of 22,152 sentences (350,036 tokens) from the Polish National Corpus, Europarl, DGT-Translation Memory, OPUS, the Pelcra Parallel Corpus, CDSCorpus and literature.

The PDB trees (i.e. morphological, syntactic and semantic annotations) were automatically converted to the PDB-UD trees. The conversion procedure is rule-based and it is partly based on the conversion of the UD PL-SZ trees. The following dependency labels are used in PDB-UD: acl, acl:relcl, advcl, advcl:cmpr, advcl:relcl, advmod, advmod:arg, advmod:emph, advmod:neg, amod, amod:flat, appos, aux, aux:clitic, aux:cnd, aux:imp, aux:pass, case, cc, cc:preconj, ccomp, ccomp:cleft, ccomp:obj, conj, cop, csubj, csubj:pass, dep, det, det:numgov, det:nummod, det:poss, discourse:intj, expl:pv, fixed, flat, flat:foreign, iobj, list, mark, nmod, nmod:arg, nmod:flat, nmod:poss, nmod:pred, nsubj, nsubj:pass, nummod, nummod:flat, nummod:gov, obj, obl, obl:agent, obl:arg, obl:cmpr, obl:orphan, orphan, parataxis:insert, parataxis:obj, punct, root, vocative, xcomp, xcomp:cleft, xcomp:pred, and xcomp:subj.
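The label inventory above can be checked against a treebank file. The following is a minimal sketch, assuming the data is in the standard 10-column CoNLL-U format, where the eighth column (DEPREL) holds the dependency label. The function name `deprel_counts` and the two-sentence `SAMPLE` are hypothetical and are not taken from PDB-UD.

```python
from collections import Counter


def deprel_counts(conllu_text):
    """Count the dependency labels (DEPREL column) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        counts[cols[7]] += 1
    return counts


def row(idx, form, head, deprel):
    # Build a 10-column CoNLL-U line; columns not needed here stay "_".
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


# Hypothetical illustration data, not an excerpt from the treebank.
SAMPLE = "\n".join([
    "# text = Jan śpi.",
    row(1, "Jan", 2, "nsubj"),
    row(2, "śpi", 0, "root"),
    row(3, ".", 2, "punct"),
    "",
    "# text = Maria śpi.",
    row(1, "Maria", 2, "nsubj"),
    row(2, "śpi", 0, "root"),
    row(3, ".", 2, "punct"),
    "",
])

counts = deprel_counts(SAMPLE)
print(sorted(counts.items()))
```

Applied to the released files, the keys of the resulting counter should be a subset of the labels listed above.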

Enhanced PDB-UD graphs

The standard UD-like trees are enhanced with edges encoding the shared dependents and the shared governors of coordinated conjuncts (9,141 trees contain enhanced edges).
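In CoNLL-U, enhanced edges are stored in the ninth column (DEPS) as `head:deprel` pairs separated by `|`. The sketch below shows one way to count trees that contain enhanced edges. It assumes that a tree counts as enhanced when some token's DEPS differs from its basic `HEAD:DEPREL` pair or when the sentence contains an empty node; the source does not state the exact criterion behind the figure above. The function names and the `SAMPLE` data are hypothetical.

```python
def has_enhanced_edges(sentence_lines):
    """Return True if any token's DEPS goes beyond its basic HEAD:DEPREL edge."""
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, head, deprel, deps = cols[0], cols[6], cols[7], cols[8]
        if "-" in tok_id:
            continue  # multiword token range, carries no dependency edges
        if "." in tok_id:
            return True  # empty nodes exist only in the enhanced graph
        if deps == "_":
            continue  # no enhanced annotation on this token
        if set(deps.split("|")) != {f"{head}:{deprel}"}:
            return True
    return False


def count_enhanced(conllu_text):
    """Count sentences (blank-line separated blocks) with enhanced edges."""
    blocks = [b.splitlines() for b in conllu_text.strip().split("\n\n") if b.strip()]
    return sum(has_enhanced_edges(b) for b in blocks)


def row(idx, form, head, deprel, deps):
    # Build a 10-column CoNLL-U line; columns not needed here stay "_".
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, deps, "_"])


# Hypothetical illustration data: in the first sentence the second conjunct
# "Maria" shares the governor of "Jan", which adds the edge 4:nsubj.
SAMPLE = "\n".join([
    "# text = Jan i Maria śpią",
    row(1, "Jan", 4, "nsubj", "4:nsubj"),
    row(2, "i", 3, "cc", "3:cc"),
    row(3, "Maria", 1, "conj", "1:conj|4:nsubj"),
    row(4, "śpią", 0, "root", "0:root"),
    "",
    "# text = Jan śpi",
    row(1, "Jan", 2, "nsubj", "2:nsubj"),
    row(2, "śpi", 0, "root", "0:root"),
    "",
])

print(count_enhanced(SAMPLE))  # prints 1
```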

Data Split

PDB-UD is divided into three data sets: training (17,722 trees), development (2,215 trees) and test (2,215 trees). Trees are assigned to the data sets at random, while the proportion of data from the individual sources is maintained. There is one constraint on the dividing procedure: if a sentence occurs in the test, dev or train subset of the UD Polish LFG treebank, it is assigned to the test, dev or train set of the Polish PDB-UD treebank, respectively.
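The splitting procedure described above can be sketched as a per-source random split with pre-assigned sentences. This is only an illustration under stated assumptions, not the script used to build the treebank. The 80/10/10 ratios are approximated from the set sizes given above. The function `split_treebank`, the `(sent_id, source, text)` records and the `fixed` mapping, which stands in for the UD Polish LFG assignments, are all hypothetical.

```python
import random
from collections import defaultdict


def split_treebank(sentences, fixed, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split sentences into train/dev/test.

    sentences: list of (sent_id, source, text) tuples.
    fixed: dict mapping sentence text to "train", "dev" or "test" for
           sentences whose subset is already determined elsewhere.
    """
    rng = random.Random(seed)
    parts = {"train": [], "dev": [], "test": []}
    by_source = defaultdict(list)
    for sent_id, source, text in sentences:
        if text in fixed:
            parts[fixed[text]].append(sent_id)  # constraint: keep the given subset
        else:
            by_source[source].append(sent_id)
    # Shuffle and cut each source separately so that every subset keeps
    # roughly the same proportion of data from each source.
    for source in sorted(by_source):
        ids = by_source[source]
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_dev = round(len(ids) * ratios[1])
        parts["train"] += ids[:n_train]
        parts["dev"] += ids[n_train:n_train + n_dev]
        parts["test"] += ids[n_train + n_dev:]
    return parts


# Hypothetical illustration data: two sources with ten sentences each.
sentences = [(f"s{i}", "A" if i < 10 else "B", f"text {i}") for i in range(20)]
fixed = {"text 0": "test"}  # pretend this sentence is in another treebank's test set

parts = split_treebank(sentences, fixed)
print({name: len(ids) for name, ids in parts.items()})
```

Splitting each source on its own keeps the source proportions stable even for small sources, at the cost of subset sizes that are only approximately 80/10/10 after rounding.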

PUD-PL treebank

The Polish Parallel Universal Dependencies treebank (PUD-PL) consists of 1,000 Polish sentences (18,389 tokens) in the same order as in the parallel treebanks in other languages. Morpho-syntactic annotations were automatically predicted and then manually corrected. 459 PUD-PL trees contain enhanced edges.

NKJP1M-UD

The manually annotated 1-million-word subcorpus of the NKJP is automatically dependency-parsed, and the resulting trees (including the manual morpho-syntactic annotations) are automatically converted into UD trees. Some automatically generated trees are replaced with their manually annotated equivalents from PDB-UD. The resource is described in Wróblewska (2020).

Licensing

The Polish PDB-UD treebank is dual-licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) and the GNU General Public License (GPL v.3).

The Polish PUD treebank is distributed under CC BY-SA 4.0.

NKJP1M-UD is available under GNU GPL v.3.

Acknowledgments

We would like to thank all contributors to the original Polish Dependency Bank. The development of PDB-UD was funded by the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure.

References

Alina Wróblewska (2018) Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format. In Marie-Catherine de Marneffe, Teresa Lynn, and Sebastian Schuster, editors, Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 173–182. Association for Computational Linguistics.
Alina Wróblewska (2020) Towards the Conversion of National Corpus of Polish to Universal Dependencies. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pages 5308–5315, Marseille, France. European Language Resources Association.

GitLab

Alina Wróblewska / PDBUD