\documentclass[runningheads,a4paper]{llncs}

\setcounter{tocdepth}{3}
\usepackage[OT4]{fontenc}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
%\usepackage[polish]{babel}

\usepackage{url}

\newcommand{\comment}[2]{\noindent{\textbf{\sffamily(\marginpar{\sffamily\footnotesize #1}#2)}}}
\newcommand{\kg}[1]{\comment{KG}{#1}}


\setlength{\parindent}{0pt}
\setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}

\begin{document}

\mainmatter

\title{Bartek Manual}
\subtitle{\today}

\author{Mateusz Kopeć}

\institute{Institute of Computer Science, Polish Academy of Sciences \\ \url{m.kopec@ipipan.waw.pl}}

\maketitle


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{About}

The current version of the program automatically clusters mentions into coreference clusters using a machine-learned method. Bartek is based on the work described in \cite{kop:ogro:12:lrec} and \cite{nito:13:ltc}. It comes with default models trained on the full Polish Coreference Corpus\footnote{\url{http://zil.ipipan.waw.pl/PolishCoreferenceCorpus}}. It also contains compiled resources extracted from Polish Wikipedia\footnote{\url{http://pl.wikipedia.org}} and plWordnet\footnote{\url{http://plwordnet.pwr.wroc.pl/wordnet/}}.

\textbf{Homepage:} \url{http://zil.ipipan.waw.pl/Bartek} \\
\textbf{Contact person:} Mateusz Kopeć [mateusz.kopec@ipipan.waw.pl] \\
\textbf{Author:} Mateusz Kopeć \\
\textbf{License:} CC BY v.3

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Requirements}
Java Runtime Environment (JRE) 1.8 or newer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Input data format}

Input texts must be in the TEI format used in the National Corpus of Polish (TEI NKJP; see \cite{ban:prz:10} or \cite{prz:etal:11:ed} for reference). This means they must contain at least the following layers:
\begin{itemize}
    \item \texttt{text\_structure.xml} -- containing the text structure,
    \item \texttt{ann\_segmentation.xml} -- with segmentation,
    \item \texttt{ann\_morphosyntax.xml} -- with morphosyntactic information,
    \item \texttt{ann\_mentions.xml} -- with mentions to cluster (this layer is not part of the National Corpus of Polish; see its description below).
\end{itemize}
Additional layers may or may not be present:
\begin{itemize}
    \item \texttt{ann\_groups.xml} -- with syntactic groups,
    \item \texttt{ann\_words.xml} -- with syntactic words,
    \item \texttt{ann\_named.xml} -- with named entities.
\end{itemize}

All files can be gzipped if necessary.

\subsection{Format of ann\_mentions.xml}
This file contains mentions (represented by \texttt{<seg>} tags), each of which is simply a set of pointers to segments of the morphosyntax layer. The structure of the text is preserved: mentions are grouped into sentences and paragraphs corresponding to those in the morphosyntax layer.

In the example in Figure~\ref{mentions}, each mention is preceded by a comment with its orthographical form, but this comment is not obligatory. The \texttt{<ptr>} elements point to the tokens that form the mention. The feature \texttt{<f>} with the name \texttt{semh} indicates which token of the mention is its semantic head.

Zero subjects are distinguished from other mentions by having an additional feature \texttt{<f name="zero" fVal="true" />}.

\begin{figure}[h]
\centering
\begin{verbatim}
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<TEI>
  <text>
    <body>
      <p xml:id="mentions_p-1" corresp="morph_1-p">
        <s xml:id="mentions_p-1.1-s" corresp="morph_1.1-s">
          <!-- Europejskiego Króla Kurkowego  -->
          <seg xml:id="mention_6">
            <fs type="mention">
              <f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.24-seg"/>
            </fs>
            <ptr target="ann_morphosyntax.xml#morph_1.1.23-seg"/>
            <ptr target="ann_morphosyntax.xml#morph_1.1.24-seg"/>
            <ptr target="ann_morphosyntax.xml#morph_1.1.25-seg"/>
          </seg>
          ...
        </s>
        <s xml:id="mentions_p-1.2-s" corresp="morph_1.2-s">
          <!-- był -->
          <seg xml:id="mention_11">
            <fs type="mention">
                <f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.4-seg"/>
                <f name="zero" fVal="true" />
            </fs>
            <ptr target="ann_morphosyntax.xml#morph_1.1.4-seg"/>
          </seg>
          ...
        </s>
      </p>
      <p xml:id="mentions_p-2" corresp="morph_2-p">
      ...
      </p>
      ...
    </body>
  </text>
</TEI>
</teiCorpus>
\end{verbatim}
\caption{Example \texttt{ann\_mentions.xml} file}
\label{mentions}
\end{figure}
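
The structure described above can be read with a few lines of code. The following is a minimal sketch in Python (standard library only) and is not part of Bartek itself. The function name \texttt{read\_mentions} and the dictionary keys are hypothetical, chosen only for illustration, and the sketch assumes an uncompressed file laid out as in Figure~\ref{mentions}.

\begin{verbatim}
import xml.etree.ElementTree as ET

# Namespaces used in the example file: TEI elements and xml:id.
TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_mentions(source):
    """Parse ann_mentions.xml (a path or a binary file object).

    Returns one dictionary per <seg> element, in document order.
    """
    root = ET.parse(source).getroot()
    mentions = []
    for seg in root.iter(TEI + "seg"):
        # Collect all features, e.g. "semh" and the optional "zero".
        feats = {f.get("name"): f.get("fVal")
                 for f in seg.iter(TEI + "f")}
        mentions.append({
            "id": seg.get(XML_ID),
            # Pointers to segments of the morphosyntax layer.
            "segments": [p.get("target")
                         for p in seg.findall(TEI + "ptr")],
            "head": feats.get("semh"),
            # Zero subjects carry <f name="zero" fVal="true"/>.
            "zero": feats.get("zero") == "true",
        })
    return mentions
\end{verbatim}

Because the sketch iterates over all \texttt{<seg>} elements regardless of nesting, it discards the sentence and paragraph grouping. A caller that needs the grouping would iterate over \texttt{<p>} and \texttt{<s>} elements first.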

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Output data format}\label{output}

\textbf{Bartek} builds on the TEI NKJP format, adding a new layer:
\begin{itemize}
    \item \texttt{ann\_coreference.xml}
\end{itemize}
This layer stores the information about groups of mentions. Each group is supposed to contain only mentions referring to the same entity, i.e. they should be coreferent.

\subsection{Format of ann\_coreference.xml}
This file stores information about coreference clusters. Each cluster is represented by a \texttt{<seg>} tag and contains pointers to its elements, i.e. mentions from the \texttt{ann\_mentions.xml} file. The comment with the orthographical forms of the cluster elements before each \texttt{<seg>} tag is not obligatory. The value \texttt{ident} of the \texttt{type} feature means identity (currently it is the only type \textbf{Bartek} produces). The value of the \texttt{dominant} feature is the orthographical form of the mention chosen as the best representative of the cluster.

This file does not contain paragraphs and sentences, because clusters can span across them. The only \texttt{<p>} tag is artificial and is present only to satisfy the requirements of the TEI format. An example file is shown in Figure~\ref{coref}.

\begin{figure}[h]
\centering
\begin{verbatim}
<?xml version="1.0" ?>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI>
    <text>
      <body>
        <p>
          <!--  udział; udział; udziale -->
          <seg xml:id="coreference_0">
            <fs type="coreference">
              <f name="type" fVal="ident"/>
              <f name="dominant" fVal="udział"/>
            </fs>
            <ptr target="mention_1"/>
            <ptr target="mention_8"/>
            <ptr target="mention_21"/>
          </seg>
          ...
          <seg ...
          </seg>
        </p>
      </body>
    </text>
  </TEI>
</teiCorpus>
\end{verbatim}
\caption{Example \texttt{ann\_coreference.xml} file}
\label{coref}
\end{figure}
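
As an illustration of how this layer might be consumed, the following minimal Python sketch (standard library only, not part of Bartek) builds a mapping from mention identifiers to cluster identifiers. The function name \texttt{mention\_to\_cluster} is hypothetical, and the sketch assumes an uncompressed file laid out as in Figure~\ref{coref}.

\begin{verbatim}
import xml.etree.ElementTree as ET

# Namespaces used in the example file: TEI elements and xml:id.
TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def mention_to_cluster(source):
    """Map each mention id to the id of the cluster containing it.

    `source` is a path or a binary file object with the contents
    of ann_coreference.xml.
    """
    root = ET.parse(source).getroot()
    mapping = {}
    for seg in root.iter(TEI + "seg"):
        cluster_id = seg.get(XML_ID)
        for ptr in seg.findall(TEI + "ptr"):
            # Keep only the part after '#', in case a target is
            # written as 'ann_mentions.xml#mention_1' (an assumption;
            # the example figure shows bare identifiers).
            mention_id = ptr.get("target").split("#")[-1]
            mapping[mention_id] = cluster_id
    return mapping
\end{verbatim}

Two mentions are then coreferent according to the output exactly when they map to the same cluster identifier.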

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Usage}

The standalone jar requires no installation. To run it, simply execute:\\

\texttt{java -jar bartek-1.3-jar-with-dependencies.jar <dir with input texts> <dir for output texts>}\\

All texts found recursively in \texttt{<dir with input texts>} are annotated with a coreference layer and saved in \texttt{<dir for output texts>}.
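
When Bartek is called from a larger processing pipeline, the command line above can be wrapped in a script. The following is a minimal Python sketch under stated assumptions: the helper names \texttt{bartek\_command} and \texttt{run\_bartek} are hypothetical, \texttt{java} is assumed to be on the system path, and the jar is assumed to be in the current working directory. Whether Bartek signals failures through its exit code is not documented here, so the error check below is only a precaution.

\begin{verbatim}
import subprocess

# Jar name taken from the command line shown in this section.
JAR = "bartek-1.3-jar-with-dependencies.jar"


def bartek_command(input_dir, output_dir, jar=JAR):
    """Build the argument list for the documented command line."""
    return ["java", "-jar", jar, str(input_dir), str(output_dir)]


def run_bartek(input_dir, output_dir, jar=JAR):
    """Run Bartek; raises CalledProcessError on a non-zero exit."""
    subprocess.run(bartek_command(input_dir, output_dir, jar),
                   check=True)
\end{verbatim}

Passing the arguments as a list rather than a single string avoids shell quoting problems when directory names contain spaces.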

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{references}

\end{document}