\documentclass[runningheads,a4paper]{llncs}

\setcounter{tocdepth}{3}
\usepackage[OT4]{fontenc}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
%\usepackage[polish]{babel}

\usepackage{url}

\newcommand{\comment}[2]{\noindent{\textbf{\sffamily(\marginpar{\sffamily\footnotesize #1}#2)}}}
\newcommand{\kg}[1]{\comment{KG}{#1}}


\setlength{\parindent}{0pt}
\setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}

\begin{document}

\mainmatter

\title{MentionDetector 1.2}
\subtitle{\today}

\author{Mateusz Kopeć}

\institute{Institute of Computer Science, Polish Academy of Sciences \\ \url{m.kopec@ipipan.waw.pl}}

\maketitle


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{About}

The current version of the program performs automatic mention detection, including the detection of zero subject mentions.

MentionDetector uses the information provided in its input to produce mentions for coreference resolution. It merges entities provided by named entity recognition tools, shallow parsers and taggers.

It also finds zero subjects in clauses and marks the verbs that have zero subjects as mentions, using the algorithm presented in \cite{kop:14:eacl:short}. The model for this algorithm was trained on the full Polish Coreference Corpus, version 0.92 (described in \cite{ogro:etal:13:ltc}). The training data contained 15875 positive and 37798 negative examples; 10-fold cross-validation yielded an accuracy of 86.14\% for the task of finding zero subjects. For the zero subject class of verbs, precision was 79.8\% and recall was 71.2\%.
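
For orientation, two further figures can be derived from the numbers above. Both are computed here from the rounded values, so they are approximate. A classifier that always predicts the negative class would reach an accuracy of
\[
\frac{37798}{15875 + 37798} \approx 70.4\%,
\]
and the harmonic mean of the reported precision and recall for the zero subject class is
\[
F_1 = \frac{2 \cdot 0.798 \cdot 0.712}{0.798 + 0.712} \approx 75.3\%.
\]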

\textbf{Homepage:} \url{http://zil.ipipan.waw.pl/MentionDetector} \\
\textbf{Contact person:} Mateusz Kopeć [mateusz.kopec@ipipan.waw.pl] \\
\textbf{Author:} Mateusz Kopeć \\
\textbf{License:} CC BY v.3


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Requirements}
Java Runtime Environment (JRE) 1.8 or newer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Input data format}

Input texts must be in the TEI format used in the National Corpus of Polish (TEI NKJP, see \cite{ban:prz:10} or \cite{prz:etal:11:ed} for reference). This means that they must contain at least the following layers:
\begin{itemize}
    \item \texttt{text\_structure.xml} -- containing the text structure,
    \item \texttt{ann\_segmentation.xml} -- with segmentation,
    \item \texttt{ann\_morphosyntax.xml} -- with morphosyntactic information.
\end{itemize}
The following additional layers are optional:
\begin{itemize}
    \item \texttt{ann\_groups.xml} -- with syntactic groups,
    \item \texttt{ann\_words.xml} -- with syntactic words,
    \item \texttt{ann\_named.xml} -- with named entities.
\end{itemize}
All files can be gzipped if necessary.

MentionDetector uses information from the morphosyntactic, syntactic word, syntactic group, and named entity annotations. Therefore, the more layers are present in the input, the more mentions will be found in the text.
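
The layer requirements above can be checked before running the tool. The following Python sketch is not part of MentionDetector. It assumes that each text is stored in its own directory and that a gzipped layer carries an extra \texttt{.gz} suffix (for example \texttt{ann\_named.xml.gz}); both are assumptions. The function names \texttt{find\_layer} and \texttt{check\_layers} are hypothetical.

\begin{verbatim}
from pathlib import Path

# Layer file names as listed in this section.
REQUIRED = ["text_structure.xml", "ann_segmentation.xml",
            "ann_morphosyntax.xml"]
OPTIONAL = ["ann_groups.xml", "ann_words.xml", "ann_named.xml"]

def find_layer(text_dir, name):
    """Return the path of a layer file, or None if absent.

    Assumption: a gzipped layer is named <name>.gz.
    """
    base = Path(text_dir)
    for candidate in (base / name, base / (name + ".gz")):
        if candidate.is_file():
            return candidate
    return None

def check_layers(text_dir):
    """Return (missing required, present optional) layers."""
    missing = [n for n in REQUIRED
               if find_layer(text_dir, n) is None]
    present = [n for n in OPTIONAL
               if find_layer(text_dir, n) is not None]
    return missing, present
\end{verbatim}

A text with a non-empty list of missing required layers does not satisfy the input format described here; the list of present optional layers indicates how much additional information is available for mention detection.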


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Output data format}\label{output}

\textbf{MentionDetector} builds on TEI NKJP format, adding a new layer:
\begin{itemize}
    \item \texttt{ann\_mentions.xml}
\end{itemize}
This layer stores information about mentions. Its structure is described below.

\subsection{Format of ann\_mentions.xml}
This file contains mentions (represented by \texttt{<seg>} tags), which are simply sets of pointers to morphosyntax layer segments. The structure of the text is also kept: mentions are grouped into sentences and paragraphs corresponding to the ones in the morphosyntax layer.

In the example in Figure~\ref{mentions}, each mention is preceded by a comment with its orthographic form; this comment is optional. All \texttt{<ptr>} elements target the tokens that form the mention. The feature \texttt{<f>} with the name \texttt{semh} indicates which token of the mention is its semantic head.

Zero subjects are distinguished from other mentions by having an additional feature \texttt{<f name="zero" fVal="true" />}.

\begin{figure}[h]
\centering
\begin{verbatim}
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
<TEI>
  <text>
    <body>
      <p xml:id="mentions_p-1" corresp="morph_1-p">
        <s xml:id="mentions_p-1.1-s" corresp="morph_1.1-s">
          <!-- Europejskiego Króla Kurkowego  -->
          <seg xml:id="mention_6">
            <fs type="mention">
              <f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.24-seg"/>
            </fs>
            <ptr target="ann_morphosyntax.xml#morph_1.1.23-seg"/>
            <ptr target="ann_morphosyntax.xml#morph_1.1.24-seg"/>
            <ptr target="ann_morphosyntax.xml#morph_1.1.25-seg"/>
          </seg>
          ...
        </s>
        <s xml:id="mentions_p-1.2-s" corresp="morph_1.2-s">
          <!-- był -->
          <seg xml:id="mention_11">
            <fs type="mention">
                <f name="semh" fVal="ann_morphosyntax.xml#morph_1.1.4-seg"/>
                <f name="zero" fVal="true" />
            </fs>
            <ptr target="ann_morphosyntax.xml#morph_1.1.4-seg"/>
          </seg>
          ...
        </s>
      </p>
      <p xml:id="mentions_p-2" corresp="morph_2-p">
      ...
      </p>
      ...
    </body>
  </text>
</TEI>
</teiCorpus>
\end{verbatim}
\caption{Example \texttt{ann\_mentions.xml} file}
\label{mentions}
\end{figure}
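
The format described in this subsection can be read with a standard XML parser. The following Python sketch is not part of MentionDetector. It relies only on the elements and features described above (\texttt{<seg>}, \texttt{<ptr>}, \texttt{semh}, \texttt{zero}) and on the TEI namespace declared in the example; the function name \texttt{read\_mentions} and the keys of the returned dictionaries are hypothetical.

\begin{verbatim}
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def read_mentions(source):
    """Yield one dict per <seg> mention.

    source is a file name or an open binary file object.
    """
    for seg in ET.parse(source).iter(TEI + "seg"):
        # Collect the <f name="..." fVal="..."/> features.
        feats = {}
        for f in seg.iter(TEI + "f"):
            feats[f.get("name")] = f.get("fVal")
        # Pointers to the tokens that form the mention.
        tokens = [p.get("target")
                  for p in seg.iter(TEI + "ptr")]
        yield {
            "id": seg.get(XML_ID),
            "tokens": tokens,
            "head": feats.get("semh"),
            "zero": feats.get("zero") == "true",
        }
\end{verbatim}

Each returned dictionary lists the targets of the mention's pointers, the target of its semantic head, and a flag that is true only for zero subjects. For a gzipped layer, a file object obtained from \texttt{gzip.open} can be passed instead of a file name.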

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Usage}

The standalone jar does not need any installation. To run it, simply execute:\\

\texttt{java -jar md-1.3-jar-with-dependencies.jar <dir with input texts> <dir for output texts>}\\

All texts found recursively in \texttt{<dir with input texts>} are annotated with the mentions layer and saved in \texttt{<dir for output texts>}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Custom zero subject detection model}
To use a custom zero subject detection model, pass its path as an additional argument:\\

\texttt{java -jar md-1.3-jar-with-dependencies.jar <dir with input texts> <dir for output texts> <model\_path>}

To create such a model, use the \texttt{pl.waw.ipipan.zil.core.md.detection.zero.Trainer} class.
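
Both invocations, with and without a model path, can be scripted. The following Python sketch is not part of MentionDetector; it only assembles the command lines shown in this manual and passes them to \texttt{java}. The function names \texttt{build\_command} and \texttt{run} are hypothetical, and it is an assumption that the jar is located in the current working directory and that the program signals failure with a non-zero exit status.

\begin{verbatim}
import subprocess

# Assumption: the jar is in the current working directory.
JAR = "md-1.3-jar-with-dependencies.jar"

def build_command(input_dir, output_dir,
                  model_path=None, jar=JAR):
    """Build the argument list for one MentionDetector run."""
    cmd = ["java", "-jar", jar, input_dir, output_dir]
    if model_path is not None:
        # Optional custom zero subject detection model.
        cmd.append(model_path)
    return cmd

def run(input_dir, output_dir, model_path=None):
    """Run the tool; raise if java exits with an error."""
    cmd = build_command(input_dir, output_dir, model_path)
    subprocess.run(cmd, check=True)
\end{verbatim}

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems when directory names contain spaces.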

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{references}

\end{document}