\documentclass[runningheads,a4paper]{llncs}
\setcounter{tocdepth}{3}
\usepackage[OT4]{fontenc}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
%\usepackage[polish]{babel}
\usepackage{url}
\newcommand{\comment}[2]{\noindent{\textbf{\sffamily(\marginpar{\sffamily\footnotesize #1}#2)}}}
\newcommand{\kg}[1]{\comment{KG}{#1}}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}
\begin{document}
\mainmatter
\title{Bartek Manual}
\subtitle{\today}
\author{Mateusz Kopeć}
\institute{Institute of Computer Science, Polish Academy of Sciences \\ \url{m.kopec@ipipan.waw.pl}}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{About}
The current version of the program automatically clusters mentions into coreference clusters using a machine-learned method. Bartek is based on the work described in \cite{kop:ogro:12:lrec} and \cite{nito:13:ltc}. It comes with default models trained on the full Polish Coreference Corpus\footnote{\url{http://zil.ipipan.waw.pl/PolishCoreferenceCorpus}}. It also contains compiled resources extracted from Polish Wikipedia\footnote{\url{http://pl.wikipedia.org}} and plWordnet\footnote{\url{http://plwordnet.pwr.wroc.pl/wordnet/}}.
\textbf{Homepage:} \url{http://zil.ipipan.waw.pl/Bartek} \\
\textbf{Contact person:} Mateusz Kopeć [mateusz.kopec@ipipan.waw.pl] \\
\textbf{Author:} Mateusz Kopeć \\
\textbf{License:} CC BY v.3
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Requirements}
Java Runtime Environment (JRE) 1.8 or newer.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Input data format}
Input texts must be in the TEI format used in the National Corpus of Polish (TEI NKJP, see \cite{ban:prz:10} or \cite{prz:etal:11:ed} for reference). This means that each text must contain at least the following layers:
\begin{itemize}
\item \texttt{text\_structure.xml} -- containing the text structure,
\item \texttt{ann\_segmentation.xml} -- with segmentation,
\item \texttt{ann\_morphosyntax.xml} -- with morphosyntactic information,
\item \texttt{ann\_mentions.xml} -- with mentions to cluster (this layer is not part of the National Corpus of Polish; see its description below).
\end{itemize}
Additional layers may or may not be present:
\begin{itemize}
\item \texttt{ann\_groups.xml} -- with syntactic groups,
\item \texttt{ann\_words.xml} -- with syntactic words,
\item \texttt{ann\_named.xml} -- with named entities.
\end{itemize}
All files can be gzipped if necessary.
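
The layer requirements above can be checked before a text is passed to Bartek. The following Python sketch is only an illustration and is not part of Bartek: the function name \texttt{missing\_layers} is hypothetical, and the sketch assumes that a gzipped layer keeps its name with \texttt{.gz} appended (e.g. \texttt{ann\_mentions.xml.gz}), which this manual does not specify.

\begin{verbatim}
from pathlib import Path

# Layers that every input text must contain (see the list above).
REQUIRED = (
    "text_structure.xml",
    "ann_segmentation.xml",
    "ann_morphosyntax.xml",
    "ann_mentions.xml",
)

def missing_layers(text_dir):
    """Return the required layers absent from text_dir.

    Assumption: a gzipped layer is stored as <name>.gz.
    """
    d = Path(text_dir)
    return [
        name for name in REQUIRED
        if not (d / name).is_file()
        and not (d / (name + ".gz")).is_file()
    ]
\end{verbatim}

An empty result means that all four obligatory layers were found; the optional layers are deliberately not checked.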
\subsection{Format of ann\_mentions.xml}
This file contains mentions (represented by \texttt{<seg>} tags), each of which is simply a set of pointers to segments of the morphosyntax layer. The structure of the text is preserved: mentions are grouped into sentences and paragraphs corresponding to those in the morphosyntax layer.
In the example in Figure~\ref{mentions}, each mention is preceded by a comment with its orthographic form; this comment is not obligatory. The \texttt{<ptr>} elements point to the tokens that form the mention. The feature \texttt{<f>} named \texttt{semh} indicates which token of the mention is its semantic head.
Zero subjects are distinguished from other mentions by having an additional feature \texttt{<f name="zero" fVal="true" />}.
\begin{figure}[h]
\centering
\begin{verbatim}
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<TEI>
<text>
<body>
<p xml:id="mentions_p-1" corresp="morph_1-p">
<s xml:id="mentions_p-1.1-s" corresp="morph_1.1-s">
<!-- Europejskiego Króla Kurkowego -->
<seg xml:id="mention_6">
<fs type="mention">
<f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.24-seg"/>
</fs>
<ptr target="ann_morphosyntax.xml#morph_1.1.23-seg"/>
<ptr target="ann_morphosyntax.xml#morph_1.1.24-seg"/>
<ptr target="ann_morphosyntax.xml#morph_1.1.25-seg"/>
</seg>
...
</s>
<s xml:id="mentions_p-1.2-s" corresp="morph_1.2-s">
<!-- był -->
<seg xml:id="mention_11">
<fs type="mention">
<f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.4-seg"/>
<f name="zero" fVal="true" />
</fs>
<ptr target="ann_morphosyntax.xml#morph_1.1.4-seg"/>
</seg>
...
</s>
</p>
<p xml:id="mentions_p-2" corresp="morph_2-p">
...
</p>
...
</body>
</text>
</TEI>
</teiCorpus>
\end{verbatim}
\caption{Example \texttt{ann\_mentions.xml} file}
\label{mentions}
\end{figure}
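
As an illustration of the structure described above, the following Python sketch reads mentions with the standard \texttt{xml.etree.ElementTree} module. It is not part of Bartek; the names \texttt{read\_mentions} and \texttt{SAMPLE} are hypothetical, and the sample is a shortened copy of the example in Figure~\ref{mentions}.

\begin{verbatim}
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened copy of the example file, without the elided parts.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<TEI><text><body>
<p xml:id="mentions_p-1" corresp="morph_1-p">
<s xml:id="mentions_p-1.1-s" corresp="morph_1.1-s">
<seg xml:id="mention_6">
<fs type="mention">
<f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.24-seg"/>
</fs>
<ptr target="ann_morphosyntax.xml#morph_1.1.23-seg"/>
<ptr target="ann_morphosyntax.xml#morph_1.1.24-seg"/>
<ptr target="ann_morphosyntax.xml#morph_1.1.25-seg"/>
</seg>
<seg xml:id="mention_11">
<fs type="mention">
<f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.4-seg"/>
<f name="zero" fVal="true" />
</fs>
<ptr target="ann_morphosyntax.xml#morph_1.1.4-seg"/>
</seg>
</s>
</p>
</body></text></TEI>
</teiCorpus>"""

def read_mentions(xml_text):
    """Return one dict per <seg>: id, tokens, head, zero flag."""
    mentions = []
    for seg in ET.fromstring(xml_text).iter(TEI + "seg"):
        # Collect the features of this mention by name.
        feats = {f.get("name"): f.get("fVal")
                 for f in seg.iter(TEI + "f")}
        mentions.append({
            "id": seg.get(XML_ID),
            "tokens": [p.get("target")
                       for p in seg.findall(TEI + "ptr")],
            "head": feats.get("semh"),
            "zero": feats.get("zero") == "true",
        })
    return mentions
\end{verbatim}

A mention without the \texttt{zero} feature is treated here as an ordinary mention, in line with the description above.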
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Output data format}\label{output}
\textbf{Bartek} builds on TEI NKJP format, adding a new layer:
\begin{itemize}
\item \texttt{ann\_coreference.xml}
\end{itemize}
This layer stores the information about groups of mentions. Each group is supposed to contain only mentions referring to the same entity, i.e. they should be coreferent.
\subsection{Format of ann\_coreference.xml}
This file stores information about coreference clusters. Each cluster is represented by a \texttt{<seg>} tag and contains pointers to its elements, i.e. mentions in the \texttt{ann\_mentions.xml} file. The comment with the orthographic forms of the cluster elements before each \texttt{<seg>} tag is not obligatory. The value \texttt{ident} of the \texttt{type} feature means identity (currently the only type \textbf{Bartek} produces). The value of the \texttt{dominant} feature is the orthographic form of the mention chosen as the best representative of the cluster.
This file does not contain paragraphs or sentences, because clusters can span across them. The only \texttt{<p>} tag is artificial and is present to satisfy the requirements of the TEI format. An example file is shown in Figure~\ref{coref}.
\begin{figure}[h]
\centering
\begin{verbatim}
<?xml version="1.0" ?>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<TEI>
<text>
<body>
<p>
<!-- udział; udział; udziale -->
<seg xml:id="coreference_0">
<fs type="coreference">
<f name="type" fVal="ident"/>
<f name="dominant" fVal="udział"/>
</fs>
<ptr target="mention_1"/>
<ptr target="mention_8"/>
<ptr target="mention_21"/>
</seg>
...
<seg ...
</seg>
</p>
</body>
</text>
</TEI>
</teiCorpus>
\end{verbatim}
\caption{Example \texttt{ann\_coreference.xml} file}
\label{coref}
\end{figure}
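
The output layer can be read back in a similar way. The following Python sketch is an illustration only and is not part of Bartek; the names \texttt{read\_clusters} and \texttt{SAMPLE} are hypothetical, and the sample is a shortened copy of the example in Figure~\ref{coref}.

\begin{verbatim}
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened copy of the example file, without the elided parts.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<TEI><text><body><p>
<seg xml:id="coreference_0">
<fs type="coreference">
<f name="type" fVal="ident"/>
<f name="dominant" fVal="udział"/>
</fs>
<ptr target="mention_1"/>
<ptr target="mention_8"/>
<ptr target="mention_21"/>
</seg>
</p></body></text></TEI>
</teiCorpus>"""

def read_clusters(xml_text):
    """Map each cluster id to its type, dominant and mentions."""
    clusters = {}
    for seg in ET.fromstring(xml_text).iter(TEI + "seg"):
        feats = {f.get("name"): f.get("fVal")
                 for f in seg.iter(TEI + "f")}
        clusters[seg.get(XML_ID)] = {
            "type": feats.get("type"),
            "dominant": feats.get("dominant"),
            "mentions": [p.get("target")
                         for p in seg.findall(TEI + "ptr")],
        }
    return clusters
\end{verbatim}

The returned mention identifiers are the \texttt{xml:id} values of the \texttt{<seg>} elements in \texttt{ann\_mentions.xml}, so the two layers can be joined on them.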
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Usage}
The standalone jar requires no installation. To run it, execute:\\
\texttt{java -jar bartek-1.3-jar-with-dependencies.jar <dir with input texts> <dir for output texts>}\\
All texts found recursively in \texttt{<dir with input texts>} are annotated with a coreference layer and saved in \texttt{<dir for output texts>}.
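
When Bartek is run as one step of a larger pipeline, the command above can be wrapped in a script. The Python sketch below is an illustration only: the helper names \texttt{bartek\_command} and \texttt{run\_bartek} are hypothetical, and it assumes that \texttt{java} is available on the system path. It passes only the two arguments documented above.

\begin{verbatim}
import subprocess

def bartek_command(jar, input_dir, output_dir):
    """Build the argument list for the documented command line."""
    return ["java", "-jar", str(jar),
            str(input_dir), str(output_dir)]

def run_bartek(jar, input_dir, output_dir):
    """Run Bartek; raises CalledProcessError on a non-zero exit."""
    subprocess.run(bartek_command(jar, input_dir, output_dir),
                   check=True)
\end{verbatim}

Passing the arguments as a list rather than a single string avoids shell quoting problems with directory names that contain spaces.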
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{references}
\end{document}