Commit ab5dbe37545fba606c018bbc28a092cb46ad5358
1 parent
2100e842
Add README.
Showing 1 changed file with 36 additions and 0 deletions
README.md
0 → 100644
1 | +# The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego # | |
2 | + | |
3 | +The *[Polish Parliamentary Corpus (PPC)](http://clip.ipipan.waw.pl/PPC)* is a large collection of linguistically analysed documents from the proceedings of the *Polish Parliament*: the [Sejm](http://opis.sejm.gov.pl/en/) and the [Senate](http://www.senat.gov.pl/en/). It is based on the [Polish Sejm Corpus](http://clip.ipipan.waw.pl/PSC), co-funded by the [CESAR](http://clip.ipipan.waw.pl/CESAR) project, and is currently being updated by the [CLARIN-PL](http://clip.ipipan.waw.pl/CLARIN-PL-3) infrastructure. | |
4 | + | |
5 | +## Corpus data ## | |
6 | + | |
7 | +The corpus currently contains over 700M segments. Apart from the stenographic records of plenary and committee sittings, it also contains interpellations and questions. | |
8 | + | |
9 | +Corpus files are made available in the *XML TEI P5* format, compatible with the annotation used by the [National Corpus of Polish](http://nkjp.pl/index.php?page=0&lang=1). This repository contains the *unannotated TEI version* of the corpus. For the annotated version, please go to the [PPC homepage](http://clip.ipipan.waw.pl/PPC). | |
10 | + | |
11 | +## Searching the corpus ## | |
12 | + | |
13 | + * [using the search engine](https://kdp.nlp.ipipan.waw.pl/) | |
14 | + * [using the ngram viewer](http://ngram.kdp.nlp.ipipan.waw.pl/) | |
15 | + | |
16 | +## Licence ## | |
17 | + | |
18 | +The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence. | |
19 | + | |
20 | +## Publications ## | |
21 | + | |
22 | +[Maciej Ogrodniczuk and Bartłomiej Nitoń. *New developments in the Polish Parliamentary Corpus*. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, 2020. European Language Resources Association (ELRA).](https://www.aclweb.org/anthology/2020.parlaclarin-1.1.pdf) | |
23 | + | |
24 | + | |
25 | +[Maciej Ogrodniczuk. *Polish Parliamentary Corpus*. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).](http://lrec-conf.org/workshops/lrec2018/W2/pdf/11_W2.pdf) | |
26 | + | |
27 | + | |
28 | +[Maciej Ogrodniczuk. *The Polish Sejm Corpus*. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).](http://www.lrec-conf.org/proceedings/lrec2012/pdf/653_Paper.pdf) | |
29 | + | |
30 | +Please see also [the slides](https://www.clarin.eu/sites/default/files/2-ogrodniczuk.pdf) from the [CLARIN-PLUS Workshop "Working with Parliamentary Records"](https://www.clarin.eu/event/2017/clarin-plus-workshop-working-parliamentary-records), Sofia, 27–29 March 2017. | |
31 | + | |
32 | +## Contact ## | |
33 | + | |
34 | +[Maciej Ogrodniczuk](http://zil.ipipan.waw.pl/MaciejOgrodniczuk), Institute of Computer Science, Polish Academy of Sciences | |
35 | + | |
36 | +For more information, please go to the [PPC homepage](http://clip.ipipan.waw.pl/PPC). | |