PDBUD

Summary

The PDB-UD treebank is derived from the Polish Dependency Bank 2.0 (PDB 2.0), created at the Institute of Computer Science, Polish Academy of Sciences in Warsaw. The treebank is licensed under the terms of CC BY-NC-SA 4.0.

Introduction

The treebank consists of 22,208 sentences (351,406 tokens) from the Polish National Corpus, Europarl, DGT-Translation Memory, OPUS, the Pelcra Parallel Corpus, CDSCorpus and literature.

PDB-UD contains all 8K sentences of the Polish UD treebank and 14K other unique sentences.

The morphological, syntactic and semantic annotation of the PDB-UD treebank was created by converting the PDB data. The conversion procedure was designed and implemented by Alina Wróblewska and is partly based on the conversion of the UD PL-SZ trees. The following dependency labels are used in PDB-UD: acl, acl:relcl, advcl, advcl:relcl, advmod, advmod:arg, advmod:neg, amod, amod:flat, appos, aux, aux:clitic, aux:cnd, aux:imp, aux:pass, case, cc, cc:preconj, ccomp, ccomp:obj, conj, cop, csubj, det, det:numgov, det:nummod, det:poss, discourse:emo, discourse:intj, expl:pv, fixed, flat, iobj, list, mark, nmod, nmod:arg, nmod:flat, nmod:pred, nsubj, nsubj:pass, nummod, nummod:gov, obj, obl, obl:agent, obl:arg, obl:cmpr, orphan, parataxis, parataxis:insert, parataxis:obj, punct, root, vocative, xcomp, xcomp:pred, xcomp:subj.
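To check which of these labels actually occur in a given file, one can tally the DEPREL column (the eighth field of the CoNLL-U format). The following is a minimal sketch, not part of the PDB-UD conversion tooling; the function name count_deprels and the three-token sample sentence are hypothetical and were made up for illustration.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count the values of the DEPREL column (8th field) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not fields[0].isdigit():
            continue
        counts[fields[7]] += 1
    return counts


# Hypothetical sample sentence, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = Ona śpi.",
    "1\tOna\ton\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tśpi\tspać\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])

deprel_counts = count_deprels(SAMPLE)
for label, n in sorted(deprel_counts.items()):
    print(label, n)
```

Running the same function over the text of a real treebank file would give the label inventory of that file, which can then be compared with the list above.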

There are two versions of the PDB-UD trees:

  1. the standard UD-like trees with the enhanced edges encoding the shared dependents and the shared governors of the coordinated conjuncts (9,167 trees with enhanced edges),
  2. the enhanced graphs with the semantic roles of some dependents.
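In the CoNLL-U format, enhanced edges are stored in the DEPS column (the ninth field) as a list of head:relation pairs separated by "|". The sketch below shows one way to read that column and to pick out the edges that go beyond the basic tree, for example a subject shared by two coordinated verbs. It assumes the standard CoNLL-U encoding; the helper names parse_deps and extra_enhanced_edges and the example values are hypothetical, and the sketch makes no assumption about how the semantic roles of the second version are encoded.

```python
def parse_deps(deps_field):
    """Parse a DEPS value such as '2:nsubj|4:nsubj' into (head, relation) pairs."""
    if deps_field == "_":
        return []
    edges = []
    for item in deps_field.split("|"):
        # Split on the first colon only: the relation itself may contain
        # colons (e.g. "obl:arg"), and the head may be an empty node ("1.1").
        head, relation = item.split(":", 1)
        edges.append((head, relation))
    return edges


def extra_enhanced_edges(head, deprel, deps_field):
    """Return the enhanced edges that differ from the basic (HEAD, DEPREL) edge."""
    return [edge for edge in parse_deps(deps_field) if edge != (head, deprel)]


# Hypothetical token: the subject of verb 2, shared with the coordinated verb 4.
shared = extra_enhanced_edges("2", "nsubj", "2:nsubj|4:nsubj")
print(shared)
```

Heads are kept as strings on purpose, so that empty-node identifiers such as "1.1" survive unchanged.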

Data Split

PDB-UD is divided into three parts – training (17,770 trees), test (2,219 trees) and development (2,219 trees) data sets. Trees are assigned to the data sets generally at random, while maintaining the proportion of data from the individual sources. There is one constraint on the dividing procedure – the PL-SZ trees are not included in the test set. Since the sentences underlying the PL-SZ trees are generally shorter than the remaining sentences, the average number of tokens per sentence is significantly higher in the test set than in the other two sets.
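The splitting procedure described above can be sketched as a random split stratified by source, with the PL-SZ trees barred from the test set. This is only an illustration under stated assumptions and not the procedure actually used for PDB-UD: the function split_trees, the record fields "id", "source" and "pl_sz", the 10% fractions (chosen because 2,219 is roughly a tenth of 22,208) and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def split_trees(trees, test_frac=0.1, dev_frac=0.1, seed=0):
    """Split trees into train/dev/test per source, keeping PL-SZ trees out of test.

    Each tree is a dict with the (hypothetical) keys "id", "source" and "pl_sz".
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for tree in trees:
        by_source[tree["source"]].append(tree)

    split = {"train": [], "dev": [], "test": []}
    for source in sorted(by_source):
        group = by_source[source]
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        n_dev = round(len(group) * dev_frac)
        # Only trees outside PL-SZ are eligible for the test set.
        eligible = [t for t in group if not t["pl_sz"]]
        test = eligible[:n_test]
        test_ids = {t["id"] for t in test}
        rest = [t for t in group if t["id"] not in test_ids]
        split["test"] += test
        split["dev"] += rest[:n_dev]
        split["train"] += rest[n_dev:]
    return split


# Toy data: 60 trees from source "A", 40 from source "B"; every third is PL-SZ.
toy_trees = [
    {"id": i, "source": "A" if i < 60 else "B", "pl_sz": i % 3 == 0}
    for i in range(100)
]
demo_split = split_trees(toy_trees)
print({name: len(part) for name, part in demo_split.items()})
```

Because the PL-SZ trees can only land in the training and development sets, a split of this kind leaves the test set with longer sentences on average whenever the PL-SZ sentences are shorter than the rest, which matches the observation above.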

Polish version of PUD (Parallel Universal Dependencies treebank)

PUD-PL consists of 1,000 Polish sentences (18,389 tokens) in the same order as in the parallel treebanks in other languages. The morpho-syntactic annotations were automatically predicted and then manually corrected. 458 PUD-PL trees contain enhanced edges.

Acknowledgments

We would like to thank all of the contributors to the original Polish Dependency Bank. The development of PDB-UD was funded by the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure.

References

  • Alina Wróblewska (2018) Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format. In Proceedings of Universal Dependencies Workshop 2018 (UDW 2018).