\documentclass[runningheads,a4paper]{llncs}

\setcounter{tocdepth}{3}
\usepackage[OT4]{fontenc}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
%\usepackage[polish]{babel}

\usepackage{url}

\newcommand{\comment}[2]{\noindent{\textbf{\sffamily(\marginpar{\sffamily\footnotesize #1}#2)}}}
\newcommand{\kg}[1]{\comment{KG}{#1}}


\setlength{\parindent}{0pt}
\setlength{\parskip}{1ex plus 0.5ex minus 0.2ex}

\begin{document}

\mainmatter

\title{Scoreference Manual}
\subtitle{\today}

\author{Mateusz Kopeć}

\institute{Institute of Computer Science, Polish Academy of Sciences \\ \url{m.kopec@ipipan.waw.pl}}

\maketitle


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{About}

The current version of the program performs automatic evaluation of mention detection and coreference resolution, given automatically and manually annotated versions of the same corpus. Mention detection is evaluated by precision and recall. Coreference resolution is scored with 5 metrics: MUC, B$^3$, CEAFE, CEAFM and BLANC. Details of the score calculation are given in Section~\ref{details}.

\textbf{Homepage:} \url{http://zil.ipipan.waw.pl/MentionDetector} \\
\textbf{Contact person:} Mateusz Kopeć [mateusz.kopec@ipipan.waw.pl] \\
\textbf{Author:} Mateusz Kopeć \\
\textbf{License:} CC~BY~3.0

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Requirements}
Java Runtime Environment (JRE) 1.7 or newer.
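
The installed Java version can be checked with the standard JRE command (not specific to \emph{Scoreference}):\\

\texttt{java -version}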

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Input data format}

Gold and system texts must be stored in the same format, either TEI or MMAX. Details about the TEI and MMAX formats used may be found in the Polish Coreference Corpus description\footnote{Available at \url{http://zil.ipipan.waw.pl/PolishCoreferenceCorpus}}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Output data format}

Results are printed to standard output.
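
As the tool writes to standard output, the results can be saved to a file with ordinary shell redirection, e.g.\ by appending \texttt{> results.txt} (an arbitrary file name) to the command described in the Usage section below.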

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Usage}

The standalone jar does not require installation. To run it, simply execute:\\

\texttt{java -jar scoreference-1.0-SNAPSHOT.one-jar.jar <dir with gold texts> <dir with system texts> <type>}\\

\texttt{<dir with gold texts>} is the directory with the gold standard (manually annotated) version of the corpus, \texttt{<dir with system texts>} is the directory with the same corpus annotated automatically with coreference, and \texttt{<type>} should be either ``mmax'' or ``tei'', indicating the format in which the corpora are stored.
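
For example, assuming the gold corpus is stored in TEI format in a directory \texttt{gold} and the system output in a directory \texttt{system} (both directory names are hypothetical), the call would be:\\

\texttt{java -jar scoreference-1.0-SNAPSHOT.one-jar.jar gold system tei}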

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Score details}\label{details}

Scoring performed by \emph{Scoreference} is mostly in line with the SemEval \cite{Marquez2012} approach: we have chosen to evaluate mention detection and coreference resolution separately, but we also allow for end-to-end system evaluation.

\subsection{Mention detection measures}
We evaluate mention detection using precision, recall and F-measure.
Unlike in SemEval, we have decided not to reward partial matches, but instead to present two alternative mention detection scores:
\begin{itemize}
	\item score of exact boundary matches (a match occurs when the automatic and the manual mention have exactly the same boundaries) (EXACT),
	\item score of head matches (system and manual mentions are reduced to their single head tokens and compared) (HEAD).
\end{itemize}
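
For both variants, precision, recall and F-measure follow the standard definitions. As a sketch (assuming the balanced $F_1$ variant of the F-measure), with $G$ the set of gold mentions and $S$ the set of system mentions after the EXACT or HEAD reduction:
\begin{displaymath}
P = \frac{|G \cap S|}{|S|}, \qquad R = \frac{|G \cap S|}{|G|}, \qquad F = \frac{2PR}{P+R}.
\end{displaymath}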

\subsection{Coreference resolution measures}
As there is still no consensus on a single best coreference resolution measure, our evaluation tool provides results for 5 widely known measures: MUC~\cite{muc}, B$^3$~\cite{b3}, mention- and entity-based CEAF~\cite{ceaf} (called CEAFM and CEAFE, respectively) and BLANC~\cite{blanc}.
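
As a brief reminder of the kind of calculation involved (see the cited papers for the full definitions), MUC, for example, is link-based: for key entities $K_1, \ldots, K_n$,
\begin{displaymath}
R_{\mathrm{MUC}} = \frac{\sum_{i} \bigl( |K_i| - |p(K_i)| \bigr)}{\sum_{i} \bigl( |K_i| - 1 \bigr)},
\end{displaymath}
where $p(K_i)$ is the set of blocks into which $K_i$ is split by the response partition; MUC precision is obtained by swapping the roles of the key and the response.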

As these measures assume that the system and gold mention sets are identical, we implemented two alternative strategies to enforce this for systems not using gold mentions:
\begin{itemize}
	\item consider only correct system mentions (i.e. the intersection between gold and system mentions) (INTERSECT),
	\item transform system and gold mentions as in \cite{Marquez2012}, following the procedure described below (TRANSFORM).
\end{itemize}
The TRANSFORM procedure for dealing with so-called ``twinless'' mentions (mentions not in the intersection of the system and gold mention sets) was presented in \cite{Marquez2012} and consists of the following steps:
\begin{enumerate}
	\item insert twinless true mentions into the response partition as singletons, 
	\item remove twinless system mentions that are resolved as singletons,
	\item insert twinless system mentions that are resolved as coreferent into the key partition (as singletons).
\end{enumerate}
This approach was also used in the CoNLL-2011 shared task \cite{Pradhan2011}.
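
As an illustration of the above steps (with hypothetical mentions $a$--$e$): suppose the key contains mentions $\{a, b, c\}$ partitioned as $\{a, b\}, \{c\}$, and the response contains mentions $\{b, c, d, e\}$ partitioned as $\{b, d\}, \{c\}, \{e\}$. Mention $a$ is a twinless true mention, so step 1 adds the singleton $\{a\}$ to the response. Mention $e$ is a twinless system mention resolved as a singleton, so step 2 removes it from the response. Mention $d$ is a twinless system mention resolved as coreferent, so step 3 adds the singleton $\{d\}$ to the key. After the transformation, both partitions cover the same mention set $\{a, b, c, d\}$: the key becomes $\{a, b\}, \{c\}, \{d\}$ and the response becomes $\{a\}, \{b, d\}, \{c\}$.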

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{references}

\end{document}