<?xml version="1.0" encoding="UTF-8"?>
<teiHeader xml:id="NKJP_header" xmlns="http://www.tei-c.org/ns/1.0" xmlns:nkjp="http://www.nkjp.pl/ns/1.0" xml:lang="en" type="corpus">
<fileDesc>
<titleStmt>
<title xml:lang="pl">Narodowy Korpus Języka Polskiego -- próbka 1 milion słów</title>
<title xml:lang="en">National Corpus of Polish -- the 1 million word sample</title>
<funder xml:lang="pl" xml:id="mnisw">Ministerstwo Nauki i Szkolnictwa Wyższego (Polska)</funder>
<funder xml:lang="en">Ministry of Science and Higher Education (Poland)</funder>
<respStmt>
<name xml:id="adamp">Adam Przepiórkowski</name>
<resp>head of the team at the Institute of Computer Science at the Polish Academy of Sciences</resp>
<resp>project coordinator</resp>
</respStmt>
<respStmt>
<name xml:id="rlg">Rafał L. Górski</name>
<resp>head of the team at the Institute of Polish Language at the Polish Academy of Sciences</resp>
</respStmt>
<respStmt>
<name xml:id="blt">Barbara Lewandowska-Tomaszczyk</name>
<resp>head of the team at the University of Łódź</resp>
</respStmt>
<respStmt>
<name xml:id="marekl">Marek Łaziński</name>
<resp>head of the team at the Polish Scientific Publishers PWN</resp>
</respStmt>
<respStmt>
<name xml:id="bansp">Piotr Bański</name>
<resp>design of the XML schemata</resp>
</respStmt>
<respStmt>
<name xml:id="iwill">Izabela Will</name>
<resp xml:lang="en">project administration in 2008</resp>
<resp xml:lang="en">proofreading samples</resp>
</respStmt>
<respStmt>
<name xml:id="beataw">Beata Wójtowicz</name>
<resp xml:lang="en">project administration in 2009-2010</resp>
<resp xml:lang="en">proofreading samples</resp>
</respStmt>
<respStmt>
<name xml:id="ldegorski">Łukasz Degórski</name>
<resp xml:lang="en">general technical responsibilities in the project (Warsaw)</resp>
<resp xml:lang="en">sampling for the 1 million word subcorpus</resp>
<resp xml:lang="en">proofreading samples</resp>
</respStmt>
<respStmt>
<name xml:id="pezik">Piotr Pęzik</name>
<resp xml:lang="en">general technical responsibilities in the project (Łódź)</resp>
</respStmt>
<respStmt>
<name xml:id="jbilinska">Joanna Bilińska</name>
<resp xml:lang="en">proofreading samples</resp>
</respStmt>
<respStmt>
<name xml:id="sebastianz">Sebastian Żurowski</name>
<resp xml:lang="en">proofreading samples</resp>
</respStmt>
</titleStmt>
<editionStmt>
<edition>initial version, for the University of Barcelona</edition>
</editionStmt>
<publicationStmt>
<pubPlace>Warsaw, Poland</pubPlace>
<address>
<addrLine xml:lang="pl">Instytut Podstaw Informatyki PAN</addrLine>
<addrLine xml:lang="pl">ul. Ordona 21</addrLine>
<addrLine xml:lang="pl">01-237 Warszawa</addrLine>
<addrLine>Poland</addrLine>
<addrLine>tel. (+48 22) 8362841, fax (+48 22) 8376564</addrLine>
<addrLine><email n="coordinator">adamp@ipipan.waw.pl</email></addrLine>
<addrLine><ref target="http://nkjp.pl/" n="www">http://nkjp.pl/</ref></addrLine>
</address>
<publisher>Institute of Computer Science, Polish Academy of Sciences</publisher>
<distributor>NKJP Consortium</distributor>
<availability>
<p>Once officially published, this 1-million-word subcorpus of the National Corpus of Polish will be publicly available free of charge.</p>
<p>This version is made available to the University of Barcelona, Department of General Linguistics, Barcelona, Gran Via de les Corts Catalanes 585, on the basis of the licence agreement, which should be consulted for details.</p>
</availability>
<date when="2009-06-05">First batch of texts (130K words) ready.</date>
</publicationStmt>
<sourceDesc>
<p>The origin of texts in NKJP may be:
<list type="bulleted">
<item>the IPI PAN Corpus</item>
<item>the PELCRA Corpus</item>
<item>the PWN Corpus</item>
<item>texts collected by IJP PAN, PELCRA and PWN specifically for NKJP.</item>
</list>
</p>
<p>See sourceDesc/bibl/note[@text_origin] in the individual header.xml files.</p>
</sourceDesc>
</fileDesc>
<encodingDesc>
<projectDesc>
<p>A linguistic corpus is a collection of texts in which one can find the typical uses of a word or phrase, as well as its meaning and grammatical function. Nowadays, without access to a language corpus, it has become impossible to do linguistic research, to write dictionaries, grammars and language teaching books, or to create search engines sensitive to Polish inflection, machine translation engines and other advanced language technology software. Language corpora have become an essential tool for linguists, but they are also helpful for software engineers, scholars of literature and culture, historians, librarians and other specialists in the arts and computer science.</p>
<p>There already exist national corpora compiled by the <ref target="http://www.natcorp.ox.ac.uk">British</ref>, <ref target="http://www.ids-mannheim.de/kl/projekte/korpora/">Germans</ref>, <ref target="http://ucnk.ff.cuni.cz/english/index.html">Czechs</ref> and <ref target="http://www.ruscorpora.ru/en/index.html">Russians</ref>. Polish people also need an extensive, well-balanced language corpus – a language source which can be accessed online.</p>
<p>The National Corpus of Polish is a shared initiative of four institutions: <ref target="http://www.ipipan.waw.pl/">Institute of Computer Science</ref> at the Polish Academy of Sciences (coordinator), <ref target="http://www.ijp-pan.krakow.pl">Institute of Polish Language</ref> at the Polish Academy of Sciences, <ref target="http://www.pwn.pl">Polish Scientific Publishers PWN</ref>, and the Department of Computational
and Corpus Linguistics at the <ref target="http://www.uni.lodz.pl/">University of Łódź</ref>. It has been registered as a research-development project of <ref target="http://www.nauka.gov.pl">the Ministry of Science and Higher Education</ref>.</p>
<p>These four institutions have started cooperation to build a reference corpus of the Polish language containing hundreds of millions of words. The corpus that will appear soon on this site will be searchable by means of advanced tools that analyse Polish inflection and Polish sentence structure.</p>
<p>The list of sources for the corpora contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. For a corpus to be reliable, it must not only contain a large number of words but also include texts that are diverse in subject and genre. The conversations ought to represent both male and female speakers, in various age groups, coming from various regions of Poland.</p>
</projectDesc>
<samplingDecl>
<p>For each text type, the words in all texts of this type are counted and the percentage of each text to be sampled is determined (so that the target subcorpus has the assumed proportions). Newspaper articles etc. are grouped into aggregates (a sample of a single article would be too short to be sensible).</p>
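<!-- Editorial sketch (not part of the original sampling scripts): one way
the percentage of each text to be sampled could be derived from per-type
word counts. The function name, the word counts and the cap at 100% are
assumptions made for illustration; only the 16% and 50% targets are taken
from the taxonomy comments in this header.

```python
def sampling_fractions(words_by_type, target_shares, subcorpus_size):
    """Return, per text type, the fraction of every text to be sampled."""
    fractions = {}
    for text_type, total_words in words_by_type.items():
        # Words of this type wanted in the target subcorpus.
        wanted_words = target_shares[text_type] * subcorpus_size
        # Assumption: never ask for more than the whole text.
        fractions[text_type] = min(1.0, wanted_words / total_words)
    return fractions


# Hypothetical word counts for two text types.
words_by_type = {"typ_lit": 8_000_000, "typ_publ": 50_000_000}
target_shares = {"typ_lit": 0.16, "typ_publ": 0.50}
fractions = sampling_fractions(words_by_type, target_shares, 1_000_000)
```

With these made-up counts, about 2% of every fiction text and about 1% of
every journalistic text would be sampled. -->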
<p>For each text (or aggregate), each paragraph's length in words is counted and sensible paragraphs are marked. For paragraphs shorter than 25 words the sensibility condition is "begins like a sentence and ends like a sentence". For longer paragraphs, it is just "begins like a sentence". The exact regular expressions to be matched are</p>
<p>/^\s*[\(\"]?(-\s|\s)?([0-9A-Z]|Ą|Ć|Ę|Ł|Ó|Ń|Ś|Ź|Ż).*[\.!?\x{2026}\x{2025}:]+[\"\)]*\s*$</p>
<p>and</p>
<p>/^\s*[\(\"]?(-\s|\s)?([0-9A-Z]|Ą|Ć|Ę|Ł|Ó|Ń|Ś|Ź|Ż).*$.</p>
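<!-- Editorial sketch: the two expressions above translated from Perl
syntax to Python's re module, combined with the 25-word threshold. Counting
words by splitting on whitespace and the name is_sensible are assumptions;
the original scripts may have counted words differently.

```python
import re

# Optional bracket or quote, optional dash, then a digit, an ASCII capital
# or one of the Polish capitals listed in the expressions above.
_START = r'^\s*[("]?(-\s|\s)?([0-9A-Z]|Ą|Ć|Ę|Ł|Ó|Ń|Ś|Ź|Ż)'
# Perl's \x{2026} and \x{2025} are written \u2026 and \u2025 in Python.
_END = r'[.!?\u2026\u2025:]+[")]*\s*$'

BEGINS_AND_ENDS_LIKE_SENTENCE = re.compile(_START + r'.*' + _END)
BEGINS_LIKE_SENTENCE = re.compile(_START + r'.*$')

SHORT_PARAGRAPH_LIMIT = 25  # shorter paragraphs must also end like a sentence


def is_sensible(paragraph):
    """Apply the stricter expression to paragraphs shorter than 25 words."""
    n_words = len(paragraph.split())  # assumption: whitespace-separated words
    if n_words < SHORT_PARAGRAPH_LIMIT:
        pattern = BEGINS_AND_ENDS_LIKE_SENTENCE
    else:
        pattern = BEGINS_LIKE_SENTENCE
    return pattern.match(paragraph) is not None
```

For example, a short paragraph such as "Ala ma kota." passes, while the same
text without the final full stop, or starting with a lowercase letter, does
not. -->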
<p>Then comes the loop:</p>
<p>* Randomly select a paragraph</p>
<p>* Look at all sensible sequences of paragraphs, 40 to 70 words in length, containing this paragraph. A sensible sequence contains only sensible, unused paragraphs, has an average paragraph length greater than 5, and consists of paragraphs from the same source (important in aggregates: we don't step over source file boundaries).</p>
<p>* Choose the sequence with the length closest to 55 words and add it to the sample.</p>
<p>* Mark all paragraphs in the chosen sequence as used.</p>
<p>...until we have enough words in all the samples from the given file.</p>
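<!-- Editorial sketch of the loop described above, under stated assumptions:
paragraphs are (source, text) pairs, words are whitespace-separated, the
average paragraph length is measured in words, and a paragraph for which no
sensible sequence exists is never drawn again (this guarantees termination
and is not described in the original text). The names best_sequence and
sample_file are hypothetical.

```python
import random


def best_sequence(idx, lengths, sources, usable,
                  min_len=40, max_len=70, ideal_len=55, min_avg=5):
    """Return (start, end) of the best sequence containing idx, or None."""
    best = None
    start = idx
    # Extend the window to the left while paragraphs stay usable and
    # belong to the same source as the selected paragraph.
    while start >= 0 and usable[start] and sources[start] == sources[idx]:
        total = sum(lengths[start:idx])  # words to the left of idx
        end = idx
        # Extend the window to the right under the same conditions.
        while (end < len(lengths) and usable[end]
               and sources[end] == sources[idx]):
            total += lengths[end]
            if total > max_len:
                break
            count = end - start + 1
            if total >= min_len and total / count > min_avg:
                distance = abs(total - ideal_len)
                if best is None or distance < best[0]:
                    best = (distance, start, end)
            end += 1
        start -= 1
    return None if best is None else (best[1], best[2])


def sample_file(paragraphs, is_sensible, target_words, rng=None):
    """Draw sequences until target_words is reached or nothing is left."""
    if rng is None:
        rng = random.Random()
    sources = [source for source, _ in paragraphs]
    lengths = [len(text.split()) for _, text in paragraphs]
    # A paragraph is usable while it is sensible and not yet used.
    usable = [is_sensible(text) for _, text in paragraphs]
    candidates = [i for i, ok in enumerate(usable) if ok]
    samples, sampled_words = [], 0
    while candidates and sampled_words < target_words:
        # Randomly select a paragraph that has not been tried yet.
        idx = candidates.pop(rng.randrange(len(candidates)))
        found = best_sequence(idx, lengths, sources, usable)
        if found is None:
            continue
        start, end = found
        for i in range(start, end + 1):
            usable[i] = False  # mark the chosen paragraphs as used
        samples.append((start, end))
        sampled_words += sum(lengths[start:end + 1])
    return samples
```

A call such as sample_file(paragraphs, is_sensible, 500) returns a list of
(start, end) index pairs; when several sequences are equally close to 55
words, this sketch keeps the first one it finds. -->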
<p>The algorithm is deliberately simple and does not attempt to anticipate and deal with all possible problems. Inevitably, in many real-life cases it requires manual intervention anyway. Some manually altered samples may be shorter than 40 or longer than 70 words, but only a few fall outside the 25-100 word interval.</p>
</samplingDecl>
<editorialDecl>
<p>For privacy reasons, the names of some people mentioned in the transcribed conversations have been changed. Uppercase characters are only used in popular proper nouns.</p>
</editorialDecl>
<tagsDecl>
<namespace name="http://www.tei-c.org/ns/1.0">
<tagUsage gi="residence">Used to denote the place where the speaker has lived the longest. (Used only for spoken data.)</tagUsage>
</namespace>
<namespace name="http://www.nkjp.pl/ns/1.0">
<tagUsage gi="topic">The topic of a conversation (i.e., used only for spoken data).</tagUsage>
</namespace>
</tagsDecl>
<!-- <refsDecl> -->
<!-- AP: Here, conventions for referencing (esp., for @n and @xml:id) -->
<!-- are explained, e.g., xml:id="IPIPAN_093982391", etc. -->
<!-- AP: In case of the 1M corpus, this should contain info about the -->
<!-- connection between the samples and the paths to full files, etc. -->
<!-- </refsDecl> -->
<classDecl>
<taxonomy xml:id="taxonomy-NKJP-type">
<category xml:id="typ_lit"> <!-- target: 16% of the corpus -->
<desc xml:lang="pl">literatura piękna</desc>
<desc xml:lang="en">fiction</desc>
</category>
<category xml:id="typ_fakt"> <!-- target: 5.5% of the corpus -->
<desc xml:lang="pl">literatura faktu</desc>
<desc xml:lang="en">non-fiction novel</desc>
</category>
<category xml:id="typ_publ"> <!-- target: 50% of the corpus; see the comment below -->
<desc xml:lang="pl">publicystyka i wiadomości prasowe</desc>
<desc xml:lang="en">journalism</desc>
</category>
<category xml:id="typ_nd"> <!-- target: 2% of the corpus -->
<desc xml:lang="pl">naukowo-dydaktyczny</desc>
<desc xml:lang="en">academic writing</desc>
</category>
<category xml:id="typ_inf-por"> <!-- target: 5.5% of the corpus -->
<desc xml:lang="pl">informacyjno-poradnikowy</desc>
<desc xml:lang="en">informative and instructive writing</desc>
</category>
<category xml:id="typ_nklas"> <!-- target: no more than 1% of the corpus -->
<desc xml:lang="pl">książka niebeletrystyczna niesklasyfikowana</desc>
<desc xml:lang="en">unclassified non-fiction book</desc>
</category>
<category xml:id="typ_inne_pisane"> <!-- target: 3% of the corpus-->
<desc xml:lang="pl">inne teksty pisane</desc>
<desc xml:lang="en">miscellaneous (written)</desc>
<category xml:id="typ_urzed"> <!-- target: this category is included in "typ_inne_pisane" -->
<desc xml:lang="pl">urzędowo-kancelaryjny</desc>
<desc xml:lang="en">legal and official</desc>
</category>
</category>
<category xml:id="typ_internet"> <!-- target: 7% of the corpus -->
<desc xml:lang="pl">Internet</desc>
<desc xml:lang="en">Internet</desc>
</category>
<!-- the target of the following categories is 10% of the corpus in total -->
<category xml:id="typ_konwers"> <!-- target: 1% of the corpus -->
<desc xml:lang="pl">konwersacyjne</desc>
<desc xml:lang="en">conversational</desc>
</category>
<category xml:id="typ_media">
<desc xml:lang="pl">mówione medialne</desc>
<desc xml:lang="en">spoken from the media</desc>
</category>
<category xml:id="typ_qmow">
<desc xml:lang="pl">quasi-mówione</desc>
<desc xml:lang="en">quasi-spoken</desc>
</category>
</taxonomy>
</classDecl>
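<!-- Editorial sketch: how a consumer of this header might read category
identifiers and their labels from a taxonomy with Python's standard
xml.etree.ElementTree. The embedded snippet repeats two categories from the
taxonomy above; the function name category_labels is an assumption.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = "{" + XML_NS + "}id"
XML_LANG = "{" + XML_NS + "}lang"

SNIPPET = """
<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="taxonomy-NKJP-type">
  <category xml:id="typ_inne_pisane">
    <desc xml:lang="pl">inne teksty pisane</desc>
    <desc xml:lang="en">miscellaneous (written)</desc>
    <category xml:id="typ_urzed">
      <desc xml:lang="pl">urzędowo-kancelaryjny</desc>
      <desc xml:lang="en">legal and official</desc>
    </category>
  </category>
</taxonomy>
"""


def category_labels(taxonomy, lang="en"):
    """Map each category's xml:id to its description in the given language."""
    labels = {}
    # iter() also visits nested categories such as typ_urzed.
    for category in taxonomy.iter("{" + TEI + "}category"):
        # findall() with a plain tag looks at direct children only, so a
        # nested category's desc is not attributed to its parent.
        for desc in category.findall("{" + TEI + "}desc"):
            if desc.get(XML_LANG) == lang:
                labels[category.get(XML_ID)] = desc.text
    return labels


labels = category_labels(ET.fromstring(SNIPPET))
```

Passing lang="pl" instead would return the Polish descriptions. -->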
<!-- RLG: The genre "publicystyka i teksty prasowe" (Eng.:
"journalism") consists of 51% texts taken from dailies, 47% texts
taken from magazines and 2% texts taken from journalistic books.
A journalistic text in a daily is defined as a text which is
labelled in the header by a combination of 1) <catRef
scheme="#taxonomy-NKJP-type" target="#typ_prasa"> and 2) <catRef
scheme="#taxonomy-NKJP-channel" target="#kanal_prasa_dziennik">.
A journalistic text in a magazine is defined in the header by a
combination of 1) <catRef scheme="#taxonomy-NKJP-type"
target="#typ_prasa"> and 2) <catRef scheme="#taxonomy-NKJP-channel"
target="#kanal_prasa"> or one of its subcategories, with the
exception of <catRef scheme="#taxonomy-NKJP-channel"
target="#kanal_prasa_dziennik">.
A journalistic book is defined by a combination of 1) <catRef
scheme="#taxonomy-NKJP-type" target="#typ_prasa"> and 2) <catRef
scheme="#taxonomy-NKJP-channel" target="#kanal_ksiazka">.
In total, texts taken from dailies, magazines and books make up
25.5%, 23.5% and 1% of the entire corpus, respectively. This
selection of text channels is made in order to ensure the
representativeness of the corpus. -->
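<!-- Editorial sketch: the arithmetic behind the corpus-wide shares quoted
in the comment above, assuming they are the within-genre shares multiplied
by the 50% target for journalism. The variable names are illustrative.

```python
JOURNALISM_TARGET = 0.50  # target share of journalism in the whole corpus
WITHIN_JOURNALISM = {"dailies": 0.51, "magazines": 0.47, "books": 0.02}

# Share of the entire corpus contributed by each channel.
corpus_shares = {
    channel: JOURNALISM_TARGET * share
    for channel, share in WITHIN_JOURNALISM.items()
}

for channel, share in corpus_shares.items():
    print(f"{channel}: {share:.1%} of the corpus")
```

The three products agree with the figures given for dailies, magazines and
journalistic books. -->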
<classDecl>
<taxonomy xml:id="taxonomy-NKJP-channel">
<category xml:id="kanal_prasa">
<desc xml:lang="pl">prasa</desc>
<desc xml:lang="en">press</desc>
<category xml:id="kanal_prasa_dziennik">
<desc xml:lang="pl">dziennik</desc>
<desc xml:lang="en">daily</desc>
</category>
<category xml:id="kanal_prasa_tygodnik">
<desc xml:lang="pl">tygodnik</desc>
<desc xml:lang="en">weekly</desc>
</category>
<category xml:id="kanal_prasa_miesiecznik">
<desc xml:lang="pl">miesięcznik</desc>
<desc xml:lang="en">monthly</desc>
</category>
<category xml:id="kanal_prasa_inne"> <!-- e.g., bi-weekly or occasional -->
<desc xml:lang="pl">inne prasowe</desc>
<desc xml:lang="en">other press</desc>
</category>
</category>
<category xml:id="kanal_ksiazka">
<desc xml:lang="pl">książka</desc>
<desc xml:lang="en">book</desc>
</category>
<category xml:id="kanal_internet">
<desc xml:lang="pl">Internet</desc>
<desc xml:lang="en">internet</desc>
</category>
<category xml:id="kanal_mowiony">
<desc xml:lang="pl">mówiony</desc>
<desc xml:lang="en">spoken</desc>
</category>
<category xml:id="kanal_ulotka">
<desc xml:lang="pl">ulotki, ogłoszenia, reklamy</desc>
<desc xml:lang="en">leaflets, announcements, ads</desc>
</category>
</taxonomy>
</classDecl>
<classDecl>
<taxonomy xml:id="ukd">
<bibl>
<title xml:lang="pl">Uniwersalna Klasyfikacja Dziesiętna</title>
<title xml:lang="en">Universal Decimal Classification</title>
<edition>UDC-P058</edition>
</bibl>
</taxonomy>
</classDecl>
<classDecl>
<taxonomy xml:id="bn">
<bibl>
<title xml:lang="pl">Klasyfikacja Biblioteki Narodowej</title>
<title xml:lang="en">Polish National Library Classification</title>
<edition>Słownik języka haseł przedmiotowych Biblioteki Narodowej. Wyd. 5 popr. i rozsz., stan na dzień 31 grudnia 2004 roku.</edition>
</bibl>
</taxonomy>
</classDecl>
<nkjp:fsLib>
<fLib n="tools">
<f xml:id="an8003" name="tool">
<string>Anotatornia NKJP on port 8003</string>
</f>
<f xml:id="an8004" name="tool">
<string>Anotatornia NKJP on port 8004</string>
</f>
</fLib>
</nkjp:fsLib>
</encodingDesc>
<revisionDesc>
<change who="#adamp" when="2009-06-05" xml:lang="en">First version of the header created.</change>
<change who="#adamp" when="2009-08-23">Changed <gi>catDesc</gi> to <gi>desc</gi>, as only one <gi>catDesc</gi> is allowed within <gi>category</gi>, but multiple <gi>desc</gi> elements may occur there. Without this change this apparently wouldn't be a TEI document (not even TEI Extension). This looks like a bug in TEI Guidelines, so it has been <ref target="https://sourceforge.net/tracker/index.php?func=detail&amp;aid=2843046&amp;group_id=106328&amp;atid=644062">reported</ref> in the TEI Bug Tracker.</change>
<change who="#adamp" when="2010-06-09" xml:lang="en">Added <gi>nkjp:fsLib</gi> at the end of <gi>encodingDesc</gi> (actually, copied from the general NKJP corpus header). For this to work, had to declare the nkjp namespace in <gi>teiHeader</gi>.</change>
</revisionDesc>
</teiHeader>