PDBUD

Summary

The Polish PDB-UD treebank is an automatic conversion of the Polish Dependency Bank 2.0 (PDB 2.0). Both treebanks, PDB 2.0 and PDB-UD, were created at the Institute of Computer Science, Polish Academy of Sciences in Warsaw (Poland). As of release 2.4, PDB-UD is included in the Universal Dependencies collection (UD Polish PDB), where it replaces the earlier UD Polish SZ treebank (UD releases 1.1-2.3).

Introduction

The treebank consists of 22,152 sentences (350,036 tokens) drawn from the Polish National Corpus, Europarl, DGT-Translation Memory, OPUS, the Pelcra Parallel Corpus, CDSCorpus and literature.

The PDB trees (i.e. their morphological, syntactic and semantic annotations) were automatically converted to PDB-UD trees. The conversion procedure is rule-based and partly builds on the conversion used for the UD PL-SZ trees. The following dependency labels are used in PDB-UD: acl, acl:relcl, advcl, advcl:cmpr, advcl:relcl, advmod, advmod:arg, advmod:emph, advmod:neg, amod, amod:flat, appos, aux, aux:clitic, aux:cnd, aux:imp, aux:pass, case, cc, cc:preconj, ccomp, ccomp:cleft, ccomp:obj, conj, cop, csubj, csubj:pass, dep, det, det:numgov, det:nummod, det:poss, discourse:emo, discourse:intj, expl:pv, fixed, flat, flat:foreign, iobj, list, mark, nmod, nmod:arg, nmod:flat, nmod:poss, nmod:pred, nsubj, nsubj:pass, nummod, nummod:flat, nummod:gov, obj, obl, obl:agent, obl:arg, obl:cmpr, obl:orphan, orphan, parataxis:insert, parataxis:obj, punct, root, vocative, xcomp, xcomp:cleft, xcomp:pred, and xcomp:subj.
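
The label inventory above can be checked against a treebank file by counting the values of the DEPREL column. The following is a minimal sketch, assuming the data is in the standard 10-column, tab-separated CoNLL-U format; the function name count_deprels and the three-word sample sentence are illustrative and are not taken from PDB-UD.

```python
from collections import Counter


def count_deprels(conllu_text):
    """Count basic dependency labels (column 8, DEPREL) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank separator lines and sentence-level comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, deprel = cols[0], cols[7]
        # Multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "5.1")
        # carry no basic dependency label.
        if "-" in token_id or "." in token_id:
            continue
        counts[deprel] += 1
    return counts


# Hypothetical sample sentence, built here only to make the sketch runnable.
ROWS = [
    ("1", "Ala", "Ala", "PROPN", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "ma", "mieć", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "kota", "kot", "NOUN", "_", "_", "2", "obj", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = "# text = Ala ma kota.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

for label, n in sorted(count_deprels(SAMPLE).items()):
    print(label, n)
```

To run it on a real file, read the file as UTF-8 text and pass its contents to count_deprels; any label outside the list above would point to a version mismatch or a malformed line.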

There are two versions of the PDB-UD trees:

  1. standard UD-like trees with enhanced edges that encode the shared dependents and shared governors of coordinated conjuncts (9,141 trees contain enhanced edges),
  2. enhanced graphs that additionally carry the semantic roles of some dependents.
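
One way to see which trees contain enhanced edges is to compare the DEPS column of each token with its basic HEAD and DEPREL. The sketch below assumes standard CoNLL-U, where DEPS holds head:relation pairs separated by "|". The criterion it uses (any DEPS edge that differs from the basic edge) is an assumption and may not match how the figure of 9141 trees was obtained; the helper names and the two sample sentences are hypothetical.

```python
def read_sentences(conllu_text):
    """Yield each sentence as a list of 10-column token rows."""
    rows = []
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        if not line.strip():
            if rows:
                yield rows
                rows = []
            continue
        rows.append(line.split("\t"))
    if rows:
        yield rows


def has_extra_enhanced_edges(rows):
    """True if any token has a DEPS edge other than its basic HEAD:DEPREL."""
    for cols in rows:
        token_id, head, deprel, deps = cols[0], cols[6], cols[7], cols[8]
        if "-" in token_id or deps == "_":
            continue  # multiword-token range, or no enhanced annotation
        if "." in token_id:
            return True  # empty nodes exist only in the enhanced graph
        for edge in deps.split("|"):
            # Split on the first colon only: relations may contain colons.
            enh_head, enh_rel = edge.split(":", 1)
            if (enh_head, enh_rel) != (head, deprel):
                return True
    return False


def block(rows):
    return "\n".join("\t".join(r) for r in rows) + "\n\n"


# Hypothetical sentences: the first shares a subject between two conjuncts.
COORD = [
    ("1", "Ala", "Ala", "PROPN", "_", "_", "2", "nsubj", "2:nsubj|4:nsubj", "_"),
    ("2", "śpi", "spać", "VERB", "_", "_", "0", "root", "0:root", "_"),
    ("3", "i", "i", "CCONJ", "_", "_", "4", "cc", "4:cc", "_"),
    ("4", "chrapie", "chrapać", "VERB", "_", "_", "2", "conj", "2:conj", "_"),
    ("5", ".", ".", "PUNCT", "_", "_", "2", "punct", "2:punct", "_"),
]
PLAIN = [
    ("1", "Ala", "Ala", "PROPN", "_", "_", "2", "nsubj", "2:nsubj", "_"),
    ("2", "śpi", "spać", "VERB", "_", "_", "0", "root", "0:root", "_"),
    ("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "2:punct", "_"),
]
SAMPLE = block(COORD) + block(PLAIN)

flags = [has_extra_enhanced_edges(s) for s in read_sentences(SAMPLE)]
print(sum(flags), "of", len(flags), "sample trees contain extra enhanced edges")
```

Note that under this criterion an enhanced relation that merely refines the basic label (for example conj:and against conj) also counts as an extra edge.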

Data Split

PDB-UD is divided into three data sets: training (17,722 trees), test (2,215 trees) and development (2,215 trees). Trees are assigned to the data sets at random, while the proportion of data from the individual sources is preserved. There is one constraint on the procedure: if a sentence occurs in the test, dev or train subset of the UD Polish LFG treebank, it is assigned to the test, dev or train set of the Polish PDB-UD treebank, respectively.
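
The splitting procedure described above can be sketched as a per-source random split with pinned assignments. This is only an illustration under stated assumptions, not the script used to build PDB-UD: the function name split_by_source, the sentence identifiers, the source names and the 80/10/10 ratios (which approximate the counts given above) are all hypothetical. As a simplification, the ratios are applied to the unpinned sentences of each source only.

```python
import random
from collections import defaultdict


def split_by_source(sentences, pinned=None, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (sent_id, source) pairs into train/dev/test, source by source.

    pinned maps a sent_id to the subset it must land in, e.g. because the
    sentence already belongs to that subset of another treebank.
    """
    pinned = pinned or {}
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    by_source = defaultdict(list)
    for sent_id, source in sentences:
        by_source[source].append(sent_id)

    subsets = {"train": [], "dev": [], "test": []}
    for source in sorted(by_source):
        ids = by_source[source]
        # Pinned sentences go straight to their prescribed subset.
        for sent_id in ids:
            if sent_id in pinned:
                subsets[pinned[sent_id]].append(sent_id)
        # The remaining sentences of this source are split at random.
        free = [sent_id for sent_id in ids if sent_id not in pinned]
        rng.shuffle(free)
        n_dev = round(len(free) * ratios[1])
        n_test = round(len(free) * ratios[2])
        subsets["dev"].extend(free[:n_dev])
        subsets["test"].extend(free[n_dev:n_dev + n_test])
        subsets["train"].extend(free[n_dev + n_test:])
    return subsets


# Hypothetical input: 100 sentences from source "A", 50 from source "B".
SENTENCES = [(f"A-{i}", "A") for i in range(100)] + [(f"B-{i}", "B") for i in range(50)]
PINNED = {"A-0": "test"}

parts = split_by_source(SENTENCES, PINNED)
print({name: len(ids) for name, ids in parts.items()})
```

Splitting each source separately is what keeps the source proportions roughly equal across the three sets; a single global shuffle would only do so on average.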

Polish PUD treebank

The Polish Parallel Universal Dependencies treebank (PUD-PL) consists of 1,000 Polish sentences (18,389 tokens) in the same order as in the parallel treebanks for other languages. Morpho-syntactic annotations are automatically predicted and then manually corrected. 459 of the PUD-PL trees contain enhanced edges.

Licensing

The Polish PDB-UD treebank is dual-licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0) and the GNU General Public License (GPL v.3).

The Polish PUD treebank is distributed under CC BY-SA 4.0.

Acknowledgments

We would like to thank all of the contributors to the original Polish Dependency Bank. The development of PDB-UD was funded by the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure.

References

  • Alina Wróblewska (2018) Extended and Enhanced Polish Dependency Bank in Universal Dependencies Format. In Proceedings of Universal Dependencies Workshop 2018 (UDW 2018).