COLLECTOR


Prerequisites

This project requires a Conda installation. For instructions on installing Miniconda, see https://conda.io/miniconda.html.


Creating a new environment

To install the project for the first time, create a new conda environment.

On Linux/MacOS machines, create a new collector environment with:
conda env create -n collector -f environment.yml
source activate collector
pip install -e .

On Windows machines the activation command is slightly different:
conda env create -n collector -f environment.yml
activate collector 
pip install -e .


Updating environment requirements

To apply changes in the requirements (as defined in the environment.yml file), update the environment:

Linux/MacOS:
conda env update -n collector -f environment.yml
source activate collector
pip install -e .

Windows:
conda env update -n collector -f environment.yml
activate collector
pip install -e .


Collector

Install the Java Runtime Environment (the commands below are for Debian/Ubuntu):


sudo add-apt-repository universe
sudo apt install openjdk-8-jre


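To install the Collector, first create a collector database. In the commands below, the bracketed [-p 5432] is an optional port argument; include it (without the brackets) only if PostgreSQL needs an explicit port: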


sudo -u postgres createdb collector -E UTF8 -T template0 -l pl_PL.utf8 [-p 5432]


Create a collector user:


sudo -u postgres createuser collector [-p 5432]


Access a PostgreSQL interactive terminal:


sudo -u postgres psql [-p 5432]


Create a password for the collector user (put the chosen password between the quotes):


postgres=# alter user collector with encrypted password '';


Grant the collector user rights to the collector database:


postgres=# grant all privileges on database collector to collector;


Grant the collector user rights for creating new databases (for testing purposes):


postgres=# alter user collector createdb;


Update the 'PASSWORD' key and, if needed, the 'PORT' key in the 'default' entry of DATABASES in the collector/settings.py file.
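
For orientation, the 'default' entry might look roughly like the sketch below. This is not the project's actual settings file: the ENGINE and HOST values are assumptions, the COLLECTOR_DB_PASSWORD environment variable is hypothetical, and only the database name and user come from the createdb and createuser steps above.

```python
import os

# Sketch of the DATABASES setting in collector/settings.py (assumed layout).
DATABASES = {
    "default": {
        # Assumption: the standard Django PostgreSQL backend is used.
        "ENGINE": "django.db.backends.postgresql",
        # Database and user created in the previous steps.
        "NAME": "collector",
        "USER": "collector",
        # Hypothetical environment variable; the real file may hold the
        # password as a plain string instead.
        "PASSWORD": os.environ.get("COLLECTOR_DB_PASSWORD", ""),
        # Assumption: PostgreSQL runs on the same machine.
        "HOST": "localhost",
        # Change this if PostgreSQL listens on a port other than 5432.
        "PORT": "5432",
    }
}
```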

Create database tables:


python ./collector/manage.py makemigrations
python ./collector/manage.py migrate


Create a superuser (if needed):


python ./collector/manage.py createsuperuser


Configure pipelines:


python ./collector/manage.py configure_[PROJECT_NAME]_pipelines


Download documents:


python ./collector/manage.py download_[PROJECT_NAME]_documents


Extract text from documents:


python ./collector/manage.py extract_[PROJECT_NAME]_documents


Write documents in the defined output format:


python ./collector/manage.py write_[PROJECT_NAME]_documents


where PROJECT_NAME is 'marcell', 'ppc' or 'nkjp'.
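
The four steps above can also be chained from a small script. The following is only a sketch: the helper names pipeline_commands and run_pipeline are hypothetical and not part of the project, and it assumes the script is started from the repository root with the collector environment active.

```python
import subprocess
import sys

# Project names and command patterns taken from the steps above.
PROJECTS = ("marcell", "ppc", "nkjp")
STEPS = (
    "configure_{}_pipelines",
    "download_{}_documents",
    "extract_{}_documents",
    "write_{}_documents",
)


def pipeline_commands(project, manage_py="./collector/manage.py"):
    """Return the four manage.py invocations for one project, in order."""
    if project not in PROJECTS:
        raise ValueError(f"unknown project: {project!r}")
    return [[sys.executable, manage_py, step.format(project)] for step in STEPS]


def run_pipeline(project):
    """Run each step in turn; check=True stops at the first failing step."""
    for command in pipeline_commands(project):
        subprocess.run(command, check=True)
```

Calling run_pipeline("ppc") would then configure, download, extract and write the 'ppc' documents in sequence.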

To use the annotation module, also install:


latest Morfeusz2 version (http://sgjp.pl/morfeusz/download/)
Concraft2 (https://github.com/kawu/concraft-pl)
Liner2 (https://github.com/CLARIN-PL/Liner2) with the PolEval 2018 model (https://clarin-pl.eu/dspace/bitstream/handle/11321/598/liner26_model_ner_nkjp.zip)
COMBO (https://github.com/360er0/COMBO) with the latest desirable model (http://zil.ipipan.waw.pl/PDB/PDBparser)


Then set the proper values for the programs, models and dictionaries (MORFEUSZ2_DICT_PATH, MORFEUSZ2_DICT_NAME, CONCRAFT2_PATH, CONCRAFT2_MODEL_PATH, LINER2_PATH, LINER2_MODEL_PATH, COMBO_PATH, COMBO_MODEL_PATH) in the collector/settings.py file.

You can also define the port assigned to the Concraft2 server (CONCRAFT2_PORT), the number of cores it uses (CONCRAFT2_CORES) and an additional dictionary supporting disambiguation (FREQ_1M_PATH).
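
Since a wrong path in these settings tends to surface only when annotation runs, it can help to check them up front. The sketch below is an assumption-laden illustration: every path value is a made-up placeholder, and missing_paths is a hypothetical helper, not something the project provides.

```python
from pathlib import Path

# Placeholder values only; replace each with the location on your machine.
# The setting names come from the list above, the paths are hypothetical.
ANNOTATION_PATHS = {
    "MORFEUSZ2_DICT_PATH": "/opt/morfeusz2/dict",
    "CONCRAFT2_PATH": "/opt/concraft-pl/concraft-pl",
    "CONCRAFT2_MODEL_PATH": "/opt/concraft-pl/model.gz",
    "LINER2_PATH": "/opt/liner2",
    "LINER2_MODEL_PATH": "/opt/liner2/models",
    "COMBO_PATH": "/opt/combo",
    "COMBO_MODEL_PATH": "/opt/combo/model",
}


def missing_paths(paths):
    """Return the sorted names of settings whose path does not exist on disk."""
    return sorted(name for name, value in paths.items() if not Path(value).exists())
```

For example, print(missing_paths(ANNOTATION_PATHS)) lists the settings that still point to a non-existent location.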

To use the 19th-century annotation pipeline, set XIX_MORFEUSZ2_DICT_PATH and XIX_MORFEUSZ2_DICT_NAME in the collector/settings.py file. The default values, which point to the "parlamentareusz" dictionary, are sufficient.

To index documents, install MTAS (https://github.com/mwasiluk/mtas) and Solr (https://lucene.apache.org/solr/), then provide the SOLR_URL and SOLR_TIMEOUT values in the collector/settings.py file. The name of the Solr core for a given project should be the same as the project name in the database.
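
As a rough illustration of the naming rule, the sketch below derives a core URL from a project name. It rests on assumptions: that SOLR_URL holds the base Solr address (8983 is Solr's default port), that SOLR_TIMEOUT is given in seconds, and that the hypothetical helper core_url mirrors how the Collector addresses a core. Check collector/settings.py for the format actually expected.

```python
# Assumed values; adjust to your Solr installation.
SOLR_URL = "http://localhost:8983/solr"
SOLR_TIMEOUT = 10  # assumed to be in seconds


def core_url(project_name, solr_url=SOLR_URL):
    """Build the URL of the Solr core named after the project."""
    return f"{solr_url.rstrip('/')}/{project_name}"
```

With these values, core_url("ppc") yields http://localhost:8983/solr/ppc, so the core for the 'ppc' project would have to be named ppc.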

If you want to use the web interface, set STATIC_ROOT in the collector/settings.py file and collect the static files:


python ./collector/manage.py collectstatic


To install the Language-agnostic BERT Sentence Embedding (LaBSE) model for Keywords2EuroVoc mappings, create a labse directory in the tools directory:
mkdir ./collector/tools/labse

Download labse_bert_model from Google Drive, unpack it and copy it into that directory:
https://drive.google.com/file/d/14Zaq8RE9NMyJb_9B-lkgFZQ9H1K-U-Nf


Remember to re-activate

When running the project, remember to activate the collector environment first (recent conda versions also accept conda activate collector):

On Linux/MacOS:
source activate collector

Windows:
activate collector
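
A script can guard against a forgotten activation by inspecting CONDA_DEFAULT_ENV, the variable conda sets to the name of the active environment. This is a minimal sketch; the helper names active_conda_env and require_env are hypothetical and not part of the project.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if none is active."""
    return environ.get("CONDA_DEFAULT_ENV")


def require_env(expected="collector", environ=os.environ):
    """Raise RuntimeError unless the expected conda environment is active."""
    active = active_conda_env(environ)
    if active != expected:
        raise RuntimeError(
            f"conda environment {expected!r} is not active (found {active!r})"
        )
    return active
```

Calling require_env() at the top of a script stops it early with a clear message when the collector environment is not active.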


Verify and test

TBA