
Liner2.6

License: LGPL v3

Copyright (C) Wrocław University of Science and Technology (PWr), 2010-2018. All rights reserved.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Contributors

  • Michał Marcińczuk (2010–present),
  • Jan Kocoń (2014–present),
  • Michał Gawor (2019),
  • Adam Kaczmarek (2014–2015),
  • Michał Krautforst (2013–2015),
  • Dominik Piasecki (2013),
  • Maciej Janicki (2011)

Citing

System architecture and KPWr NER models

Marcińczuk, Michał; Kocoń, Jan; Oleksy, Marcin. Liner2 — a Generic Framework for Named Entity Recognition In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 86–91, Valencia, Spain, 4 April 2017. Association for Computational Linguistics

[PDF]

[Bibtex]

```
@InProceedings{W17-1413,
  author    = "Marci{\'{n}}czuk, Micha{\l} and Koco{\'{n}}, Jan and Oleksy, Marcin",
  title     = "Liner2 --- a Generic Framework for Named Entity Recognition",
  booktitle = "Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing",
  year      = "2017",
  publisher = "Association for Computational Linguistics",
  pages     = "86--91",
  location  = "Valencia, Spain",
  doi       = "10.18653/v1/W17-1413",
  url       = "http://aclweb.org/anthology/W17-1413"
}
```

NKJP NER model

Marcińczuk, Michał; Kocoń, Jan; Gawor, Michał. Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches Ogrodniczuk, Maciej; Kobyliński, Łukasz (Eds.): Proceedings of the PolEval 2018 Workshop, pp. 63-73, Institute of Computer Science, Polish Academy of Science, Warszawa, 2018.

[PDF]

[Bibtex]

```
@inproceedings{poldeepner2018,
  title     = "Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches",
  author    = "Marcińczuk, Michał and Kocoń, Jan and Gawor, Michał",
  year      = "2018",
  editor    = "Ogrodniczuk, Maciej and Kobyliński, Łukasz",
  booktitle = "Proceedings of the PolEval 2018 Workshop",
  location  = "Warsaw, Poland",
  pages     = "77--92",
  publisher = "Institute of Computer Science, Polish Academy of Science"
}
```

Service in Docker

Requirements

  • Docker
  • Docker Compose
  • Python3 (for demo script)

Setup

Build the Docker image:

docker-compose build

Run the service:

docker-compose up

Test the service:

python3 stuff/python/liner2rmq.py -t "Pani Ala Nowak mieszkw w Zielonej Górze"

Expected output:

[INFO] Temp route: route-ET7DWN
[INFO] Temp input file: /tmp/ez6s96sn
[INFO] Sent msg 'route-ET7DWN /tmp/ez6s96sn' to liner2-input
[INFO] Temp output file: b'/tmp/ez6s96sn-ner.xml'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1">
  <sentence id="s1">
   <tok>
    <orth>Pani</orth>
    <lex disamb="1"><base>pani</base><ctag>subst:sg:nom:f</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Ala</orth>
    <lex disamb="1"><base>Ala</base><ctag>subst:sg:nom:f</ctag></lex>
    <ann chan="persname" head="1">1</ann>
    <ann chan="persname_forename" head="1">1</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
    <prop key="persName:lemma">Ala Nowak</prop>
    <prop key="persname_forename:lemma">Ala</prop>
   </tok>
   <tok>
    <orth>Nowak</orth>
    <lex disamb="1"><base>Nowak</base><ctag>subst:sg:nom:m1</ctag></lex>
    <lex disamb="1"><base>nowak</base><ctag>subst:sg:nom:m1</ctag></lex>
    <ann chan="persname">1</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname" head="1">1</ann>
    <ann chan="placename_settlement">0</ann>
    <prop key="persname_surname:lemma">Nowak</prop>
   </tok>
   <tok>
    <orth>mieszkw</orth>
    <lex disamb="1"><base>mieszkw</base><ctag>subst:sg:nom:m1</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>w</orth>
    <lex disamb="1"><base>w</base><ctag>prep:loc:nwok</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Zielonej</orth>
    <lex disamb="1"><base>zielony</base><ctag>adj:sg:loc:f:pos</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement" head="1">1</ann>
    <prop key="placename_settlement:lemma">Zielonej G</prop>
   </tok>
   <tok>
    <orth>G</orth>
    <lex disamb="1"><base>G</base><ctag>brev:pun</ctag></lex>
    <lex disamb="1"><base>godzina</base><ctag>brev:pun</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">1</ann>
   </tok>
   <ns/>
   <tok>
    <orth>?</orth>
    <lex disamb="1"><base>?</base><ctag>interp</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <ns/>
   <tok>
    <orth>?</orth>
    <lex disamb="1"><base>?</base><ctag>interp</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <ns/>
   <tok>
    <orth>rze</orth>
    <lex disamb="1"><base>rze</base><ctag>subst:sg:nom:n</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
  </sentence>
 </chunk>
</chunkList>

Requirements

Compilation

  • Java 8
  • C++ compiler (gcc 3.0 or higher) for CRF++
  • set the JAVA_HOME variable: export JAVA_HOME=/usr/lib/jvm/default-java
  • install dh-autoreconf: sudo apt-get install dh-autoreconf

Runtime

Optional libraries:

Installation

Compile

If CRF++ is not installed yet, build and install it first:

cd g419-external-dependencies
tar -xvf CRF++-0.57.tar.gz
cd CRF++-0.57
./configure
make
sudo make install
sudo ldconfig

Then:

./gradlew jar

Runtime test

./liner2-cli

Output:

*-----------------------------------------------------------------------------------------------*
* A framework for multitask sequence labeling, including: named entities, temporal expressions. *
*                                                                                               *
* Authors: Michał Marcińczuk (2010–2016), Jan Kocoń (2014–2016), Adam Kaczmarek (2014–2015)     *
*    Past: Michał Krautforst (2013-2015), Dominik Piasecki (2013), Maciej Janicki (2011)        *
* Contact: michal.marcinczuk@pwr.wroc.pl                                                        *
*                                                                                               *
*          G4.19 Research Group, Wrocław University of Technology                               *
*-----------------------------------------------------------------------------------------------*


Use one of the following tools:
 - agreement           -- checks agreement (of annotations) between suplied documents
 - agreement2          -- compare sets of annotations for each pair of corpora. One set is
                          treated as a reference set and the other as a set to evaluate. It is a
                          refactored version of the agreement action.
 - annotations         -- generates an arff file with a list of annotations and their features
 - constituents-eval   -- evaluates normalizer against a specific set of documents (-i
                          batch:FORMAT, -i FORMAT)
 - convert             -- converts documents from one format to another and applies defined
                          converters
 - curve               -- brak opisu
 - eval                -- evaluates chunkers against a specific set of documents (-i
                          batch:FORMAT, -i FORMAT) #or perform cross validation (-i cv:{format})
 - eval-unique         -- evaluates chunkers against a specific set of documents (-i
                          batch:FORMAT, -i FORMAT) #or perform cross validation (-i
                          cv:{format}). The evaluation is performed on the sets#with unique
                          annotations, i.e. annotations with the same orth/base are treated as a
                          single annotation
 - inplace             -- process documents in place
 - interactive         -- processes text entered directly into the terminal
 - lemmatize           -- ToDo
 - normalizer-eval3    -- processes data with given model
 - normalizer-validate -- Read all annotation and their metadata and look for errors.
 - pipe                -- processes data with given model
 - search              -- earches for a phrases matching given pattern based on a set of token
                          features
 - selection           -- todo
 - stats               -- prints corpus statistics
 - train               -- trains chunkers

usage: ./liner2-cli [action] [options]

Pre-trained models

KPWr NER for Polish

The package contains three models for recognizing named entities according to the KPWr NE guidelines:

  • nam — named entity boundaries,
  • top9 — coarse-grained categories,
  • n82 — fine-grained categories.

Resources:

Download the package:

cd Liner2
wget -O liner25_model_ner_rev1.7z https://clarin-pl.eu/dspace/bitstream/handle/11321/263/liner25_model_ner_rev1.7z 

Unpack the package:

7z x liner25_model_ner_rev1.7z

Process a sample CCL file:

./liner2-cli pipe -i ccl -o tuples -f stuff/resources/sample-sentence.xml -m liner25_model_ner_rev1/config-top9.ini

Expected output:

(4,11,nam_liv,"Ala Nowak")
(20,28,nam_loc,"Warszawie")

PolEval 2018 Task 2: Named Entity Recognition

Mirror: https://www.dropbox.com/s/wem3fp685zleuq6/liner26_model_ner_nkjp.zip?dl=0

DSpace page: https://clarin-pl.eu/dspace/handle/11321/598 (temporarily off-line)

Direct link to the package: https://clarin-pl.eu/dspace/bitstream/handle/11321/598/liner26_model_ner_nkjp.zip (temporarily off-line)

Liner2 participated in PolEval 2018 Task 2 on named entity recognition. It took third place with the following scores:

| Metric  | F1 score |
|---------|----------|
| Final   | 0.810    |
| Exact   | 0.778    |
| Overlap | 0.818    |

Download the package with model:

cd Liner2
wget -O liner26_model_ner_nkjp.zip https://clarin-pl.eu/dspace/bitstream/handle/11321/598/liner26_model_ner_nkjp.zip 

Unpack the model:

unzip liner26_model_ner_nkjp.zip

Process a sample CCL file:

./liner2-cli pipe -i ccl -o tuples -f stuff/resources/sample-sentence.xml -m liner26_model_ner_nkjp/config-nkjp-poleval2018.ini

Expected output:

(4,6,null,persname_forename,"Ala","Ala")
(4,11,null,persName,"Ala Nowak","Ala Nowak")
(7,11,null,persname_surname,"Nowak","Nowak")
(20,28,null,placename_settlement,"Warszawie","Warszawie")
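The tuples output shown for both models is line-oriented, so it is easy to load into a script. The sketch below is only an illustration based on the sample lines in this README: the Annotation type and parse_tuple_line function are hypothetical names, and reading the last field of the six-field variant as a lemma is an assumption, not something the tool documents here.

```python
import csv
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    start: int            # first offset printed by Liner2
    end: int              # second offset printed by Liner2
    label: str            # e.g. nam_liv or persname_forename
    text: str             # annotated text
    lemma: Optional[str]  # assumed meaning of the sixth field; None if absent


def parse_tuple_line(line: str) -> Annotation:
    """Parse one line of `-o tuples` output into an Annotation."""
    body = line.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError(f"not a tuple line: {line!r}")
    # csv copes with the double-quoted fields, including commas inside them.
    fields = next(csv.reader([body[1:-1]]))
    if len(fields) == 4:
        start, end, label, text = fields
        return Annotation(int(start), int(end), label, text, None)
    if len(fields) == 6:
        # The third field is "null" in the samples above and is ignored here.
        start, end, _unused, label, text, lemma = fields
        return Annotation(int(start), int(end), label, text, lemma)
    raise ValueError(f"unexpected number of fields: {line!r}")


if __name__ == "__main__":
    print(parse_tuple_line('(4,11,nam_liv,"Ala Nowak")'))
    print(parse_tuple_line('(4,6,null,persname_forename,"Ala","Ala")'))
```

The function accepts both the four-field lines of the KPWr models and the six-field lines of the NKJP model, so the same loader can be reused when switching models.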

PolEval 2019 Task 1: Recognition and normalisation of temporal expressions

DSpace page: https://clarin-pl.eu/dspace/handle/11321/697

Download the package with model:

cd Liner2
wget -O timex_model_full.tar.gz https://clarin-pl.eu/dspace/handle/11321/697/timex_model_full.tar.gz

Unpack the model:

tar xvzf timex_model_full.tar.gz

Process a sample CCL file:

./liner2-cli pipe -m timex_model_full/timex_model_full/cfg.ini -f timex_model_full/test2.xml -i ccl -o tuples

Expected output:

(0,24,null,t3_date,"Ostatnia niedziela września","Ostatnia niedziela września")

Service mode (using RabbitMQ)

Introduction

Liner2 can run as a service that listens on a RabbitMQ queue (liner2-input) for incoming requests and submits the results to another queue (liner2-output). The input message, sent by the client, must have the following format:

ROUTE_KEY PATH

Where:

  • ROUTE_KEY — name of a route used to post the results to the output queue. The routing key is used by the client to receive the response for their request ignoring others,
  • PATH — an absolute path to the file to process.

For example:

client-001 /tmp/document.txt

The message sent by the service contains the path to a file holding the processing output.
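As an illustration of the ROUTE_KEY PATH format, the following minimal sketch builds and parses such a message. The function names are hypothetical, and the validation rules (no whitespace in the route key, splitting on the first space only) are assumptions; the service may treat edge cases such as paths containing spaces differently.

```python
import posixpath
from typing import Tuple


def format_request(route_key: str, path: str) -> str:
    """Build a 'ROUTE_KEY PATH' message for the liner2-input queue."""
    if not route_key or any(ch.isspace() for ch in route_key):
        raise ValueError("route key must be non-empty and contain no whitespace")
    if not posixpath.isabs(path):
        raise ValueError("path must be absolute")
    return f"{route_key} {path}"


def parse_request(message: str) -> Tuple[str, str]:
    """Split a 'ROUTE_KEY PATH' message into its two parts."""
    # Split on the first space only, so the remainder is kept as the path.
    route_key, separator, path = message.partition(" ")
    if not separator or not route_key or not path:
        raise ValueError(f"expected 'ROUTE_KEY PATH', got {message!r}")
    return route_key, path


if __name__ == "__main__":
    message = format_request("client-001", "/tmp/document.txt")
    print(message)
    print(parse_request(message))
```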

Running the service

./liner2-daemon rabbitmq -m liner26_model_ner_nkjp/config-nkjp-poleval2018.ini -i plain:wcrft

Expected output:

 INFO [Thread-1] (RabbitMqWorker.java:91) - Listing to RabbitMQ on channel liner2-input ...
Consumer amq.ctag-m6D9fIMI_Qsm61BH7HoxlA registered

It is possible to run more than one instance of ./liner2-daemon rabbitmq. However, all of them should use the same model and input format.

Testing

The stuff/python folder contains a Python script for testing communication with the service. The script takes the text to process, stores it in a temporary file, generates a routing key, sends both to the liner2-input queue, and listens on liner2-output. After receiving the response, it reads the output file, removes both temporary files, and prints the output.

python3 stuff/python/liner2rmq.py -t "Pani Ala Nowak mieszka w Zielonej Górze"

The output should be as follows:

[INFO] Temp route: route-1DVRP4
[INFO] Temp input file: /tmp/amu7_3at
[INFO] Sent msg 'route-1DVRP4 /tmp/amu7_3at' to liner2-input
[INFO] Temp output file: b'/tmp/amu7_3at-ner.xml'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1">
  <sentence id="s1">
   <tok>
    <orth>Pani</orth>
    <lex disamb="1"><base>pani</base><ctag>subst:sg:nom:f</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Ala</orth>
    <lex disamb="1"><base>Ala</base><ctag>subst:sg:nom:f</ctag></lex>
    <ann chan="persname" head="1">1</ann>
    <ann chan="persname_forename" head="1">1</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Nowak</orth>
    <lex disamb="1"><base>Nowak</base><ctag>subst:sg:nom:m1</ctag></lex>
    <lex disamb="1"><base>nowak</base><ctag>subst:sg:nom:m1</ctag></lex>
    <ann chan="persname">1</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname" head="1">1</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>mieszka</orth>
    <lex disamb="1"><base>mieszkać</base><ctag>fin:sg:ter:imperf</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>w</orth>
    <lex disamb="1"><base>w</base><ctag>prep:loc:nwok</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">0</ann>
   </tok>
   <tok>
    <orth>Zielonej</orth>
    <lex disamb="1"><base>zielony</base><ctag>adj:sg:loc:f:pos</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement" head="1">1</ann>
   </tok>
   <tok>
    <orth>Górze</orth>
    <lex disamb="1"><base>góra</base><ctag>subst:sg:loc:f</ctag></lex>
    <ann chan="persname">0</ann>
    <ann chan="persname_forename">0</ann>
    <ann chan="persname_surname">0</ann>
    <ann chan="placename_settlement">1</ann>
   </tok>
  </sentence>
 </chunk>
</chunkList>
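
The CCL output above marks each token with one ann element per channel: 0 means the token is outside any annotation, and a positive number identifies the annotation the token belongs to. A minimal sketch of collecting the annotated phrases with the Python standard library follows. The read_annotations name is hypothetical, and joining tokens with a single space is a simplification that ignores the ns (no space) markers.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def read_annotations(ccl_xml: bytes) -> Dict[str, List[str]]:
    """Return annotated phrases grouped by channel name."""
    root = ET.fromstring(ccl_xml)
    found: Dict[str, List[str]] = {}
    for sentence in root.iter("sentence"):
        # (channel, annotation number) -> token texts, in document order
        spans: Dict[Tuple[str, int], List[str]] = {}
        for tok in sentence.iter("tok"):
            orth = tok.findtext("orth", default="")
            for ann in tok.iter("ann"):
                number = int((ann.text or "0").strip())
                if number > 0:
                    spans.setdefault((ann.get("chan"), number), []).append(orth)
        for (channel, _number), words in spans.items():
            found.setdefault(channel, []).append(" ".join(words))
    return found


if __name__ == "__main__":
    import sys

    # Usage: python3 read_ccl.py /tmp/amu7_3at-ner.xml
    with open(sys.argv[1], "rb") as handle:
        for channel, phrases in read_annotations(handle.read()).items():
            print(channel, phrases)
```

Passing the file content as bytes lets the parser honor the encoding declared in the XML header.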

Logs on the server side:

 INFO [pool-1-thread-5] (RabbitMqWorker.java:99) - Received path: '/tmp/amu7_3at'
 INFO [pool-1-thread-5] (RabbitMqWorker.java:108) - Output saved to /tmp/amu7_3at
 INFO [pool-1-thread-5] (RabbitMqWorker.java:121) - Sent /tmp/amu7_3at-ner.xml to liner2-output:route-1DVRP4'
 INFO [pool-1-thread-5] (RabbitMqWorker.java:84) - Request processing done
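
The client-side steps described under Testing (temporary file, routing key, request message, cleanup) can be sketched as follows. This is not the code of liner2rmq.py: all function names are hypothetical, the route-XXXXXX key shape is copied from the log lines above, writing the input as UTF-8 is an assumption, and the RabbitMQ call is replaced by a publish callback so the sketch runs without a broker.

```python
import os
import random
import string
import tempfile
from typing import Callable, Tuple


def new_route_key(length: int = 6) -> str:
    """Generate a key shaped like the 'route-1DVRP4' seen in the logs."""
    alphabet = string.ascii_uppercase + string.digits
    return "route-" + "".join(random.choices(alphabet, k=length))


def submit_text(text: str, publish: Callable[[str], None]) -> Tuple[str, str]:
    """Store text in a temporary file and publish 'ROUTE_KEY PATH'."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    route_key = new_route_key()
    # In a real client, publish would send the message to liner2-input.
    publish(f"{route_key} {path}")
    return route_key, path


def collect_result(input_path: str, output_path: str) -> str:
    """Read the output file named in the response, then remove both files."""
    with open(output_path, encoding="utf-8") as handle:
        result = handle.read()
    for path in (input_path, output_path):
        os.remove(path)
    return result


if __name__ == "__main__":
    key, path = submit_text("Pani Ala Nowak mieszka w Zielonej Górze", print)
    os.remove(path)  # no service is involved in this demo, so just clean up
```

Keeping the transport behind a callback makes the file handling testable on its own; a real client would pass a function that publishes to the queue and then waits on liner2-output for the message routed with the generated key.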