Commit 85167180bcbe0565a09269257456961365cf6ff3
1 parent: f7a92fb7
Typos corrected
Showing 1 changed file with 5 additions and 5 deletions
README.md
1 | 1 | # Summary |
2 | 2 | |
3 | -The Polish PDB-UD treebank is autmatically converted of the [Polish Dependency Bank 2.0 (PDB 2.0)](http://zil.ipipan.waw.pl/PDB). The both treebanks, PDB 2.0 and PDB-UD, were created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland). As of the release 2.4, PDB-UD is included in the [Universal Dependencies](https://universaldependencies.org) collection ([UD Polish PDB](https://universaldependencies.org/treebanks/pl_pdb/index.html)) and it substituted for the first UD Polish SZ treebank (UD releases 1.1-2.3). | |
3 | +The Polish PDB-UD treebank is automatically converted from the [Polish Dependency Bank 2.0 (PDB 2.0)](http://zil.ipipan.waw.pl/PDB). Both treebanks, PDB 2.0 and PDB-UD, were created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland). As of the release 2.4, PDB-UD is included in the [Universal Dependencies](https://universaldependencies.org) collection ([UD Polish PDB](https://universaldependencies.org/treebanks/pl_pdb/index.html)) and it substituted for the first UD Polish SZ treebank (UD releases 1.1-2.3). | |
4 | 4 | |
5 | 5 | NKJP1M-UD is a part of NKJP with the manual morpho-syntactic annotations and the partially manual dependency trees automatically converted into Universal Dependencies format. |
6 | 6 | |
7 | 7 | |
8 | 8 | # PDB-UD |
9 | 9 | |
10 | -The treebank consists of 22,152 sentences (350,036 tokens) from [Polish National Corpus](http://nkjp.pl), [Europarl](http://www.statmt.org/europarl), [DGT-Translation Memory](https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory), [OPUS](http://opus.nlpl.eu), [Pelcra Prallel Corpus](http://metashare.dfki.de/repository/browse/pelcra-polish-english-parallel-corpus-of-literary-works-cc-by/e99fa4c063f111e2bff4525400d761472dc239ffeb6f47bda0553af53ddd5ef0/), [CDSCorpus](http://zil.ipipan.waw.pl/Scwad/CDSCorpus) and literature. | |
10 | +The treebank consists of 22,152 sentences (350,036 tokens) from [Polish National Corpus](http://nkjp.pl), [Europarl](http://www.statmt.org/europarl), [DGT-Translation Memory](https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory), [OPUS](http://opus.nlpl.eu), [Pelcra Parallel Corpus](http://metashare.dfki.de/repository/browse/pelcra-polish-english-parallel-corpus-of-literary-works-cc-by/e99fa4c063f111e2bff4525400d761472dc239ffeb6f47bda0553af53ddd5ef0/), [CDSCorpus](http://zil.ipipan.waw.pl/Scwad/CDSCorpus) and literature. | |
11 | 11 | |
12 | 12 | The PDB trees (i.e. morphological, syntactic and semantic annotations) were automatically converted to the PDB-UD trees. The conversion procedure is rule-based and it is partly based on the conversion of the UD PL-SZ trees. The following dependency labels are used in PDB-UD: `acl`, `acl:relcl`, `advcl`, `advcl:cmpr`, `advcl:relcl`, `advmod`, `advmod:arg`, `advmod:emph`, `advmod:neg`, `amod`, `amod:flat`, `appos`, `aux`, `aux:clitic`, `aux:cnd`, `aux:imp`, `aux:pass`, `case`, `cc`, `cc:preconj`, `ccomp`, `ccomp:cleft`, `ccomp:obj`, `conj`, `cop`, `csubj`, `csubj:pass`, `dep`, `det`, `det:numgov`, `det:nummod`, `det:poss`, `discourse:intj`, `expl:pv`, `fixed`, `flat`, `flat:foreign`, `iobj`, `list`, `mark`, `nmod`, `nmod:arg`, `nmod:flat`, `nmod:poss`, `nmod:pred`, `nsubj`, `nsubj:pass`, `nummod`, `nummod:flat`, `nummod:gov`, `obj`, `obl`, `obl:agent`, `obl:arg`, `obl:cmpr`, `obl:orphan`, `orphan`, `parataxis:insert`, `parataxis:obj`, `punct`, `root`, `vocative`, `xcomp`, `xcomp:cleft`, `xcomp:pred`, and `xcomp:subj`. |
13 | 13 | |
14 | 14 | ## Enhanced PDB-UD graphs |
15 | -The standard UD-like trees are enhanced with the edges encoding the shared dependents and the shared governors of the coordinated conjuncts (9141 trees with the enhanced edges). | |
15 | +The standard UD-like trees are enhanced with the edges encoding the shared dependents and the shared governors of the coordinated conjuncts (9141 trees with enhanced edges). | |
16 | 16 | |
17 | 17 | ## Data Split |
18 | 18 | |
... | ... | @@ -20,7 +20,7 @@ PDB-UD is divided into three parts – training (17,722 trees), test (2215 trees |
20 | 20 | |
21 | 21 | # PUD-PL treebank |
22 | 22 | |
23 | -The Polish Parallel Universal Dependencies treebank (PUD-PL) consists of 1000 Polish sentences (18,389 tokens) in the same order as in parallel treebanks in other languages. Morpho-syntactic annotations are automatically predicted and then manually corrected. 459 of PUD-PL trees contain enhanced edges. | |
23 | +The Polish Parallel Universal Dependencies treebank (PUD-PL) consists of 1000 Polish sentences (18,389 tokens) in the same order as in parallel treebanks in other languages. Morpho-syntactic annotations are automatically predicted and then manually corrected. 459 PUD-PL trees contain enhanced edges. | |
24 | 24 | |
25 | 25 | # NKJP1M-UD |
26 | 26 | |
... | ... | @@ -36,7 +36,7 @@ NKJP1M-UD is available on [GNU GPL v.3](https://www.gnu.org/licenses/gpl-3.0.en. |
36 | 36 | |
37 | 37 | # Acknowledgments |
38 | 38 | |
39 | -We would like to thank all of the contributors to the original Polish Dependency Bank. The development of PDB-UD was founded by the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure. | |
39 | +We would like to thank all of the contributors to the original Polish Dependency Bank. The development of PDB-UD was founded by the Polish Ministry of Science and Higher Education as part of the investment in the CLARIN-PL research infrastructure. | |
40 | 40 | |
41 | 41 | ## References |
42 | 42 | |
... | ... |