NKJP | master | Wojciech Jaworski / ENIAM

Last commit: fb29c4c0 – bugfix w date-interval cd

Name
frequency
postProcessing
preProcessing
validation
NKJP.ml
NKJP.mli
NKJPxmlbasics.ml
README.txt
clean.sh
makefile
parse.sh
sentenceCompare.ml
test.ml
validate.sh

README.txt

This program parses the 1-million-word subcorpus of the National Corpus of Polish (NKJP).
The parser requires the corpus to be placed in a directory named fullCorpus, located in the parser's main directory.
Python 3 and the OCaml compiler must also be installed.
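
The requirements above can be checked before starting a long run. The following is only a sketch: the function names are hypothetical and not part of this repository, and it assumes that Python 3 is available as python3 and the OCaml compiler as ocamlopt (the makefile may use a different compiler command).

```shell
#!/bin/sh
# Sketch of a prerequisite check. All function names here are hypothetical
# helpers for illustration; none of them exist in the repository.

# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only if <parser_dir>/fullCorpus exists and is a directory.
has_corpus() {
    [ -d "$1/fullCorpus" ]
}

# Report problems for the parser directory given as $1 (default: current directory).
# Returns 0 if everything was found, 1 otherwise.
check_prereqs() {
    dir=${1:-.}
    status=0
    if ! has_corpus "$dir"; then
        echo "missing directory: $dir/fullCorpus"
        status=1
    fi
    # Assumption: the tools are installed under the names python3 and ocamlopt.
    for tool in $(missing_tools python3 ocamlopt); do
        echo "missing tool: $tool"
        status=1
    done
    return "$status"
}
```

With these definitions loaded, running check_prereqs . from the parser's main directory prints one line per missing requirement and returns a non-zero status if anything is missing.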

To perform the parsing, run
./parse.sh
WARNING: this may take a long time.

The preprocessing and postprocessing required by the parser can be performed separately using
cd preProcessing; ./preProcess.sh
and
cd postProcessing; ./postProcess.sh
respectively.
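
Note that the two commands above leave the shell inside the subdirectory after they finish. A possible way to avoid this is to run each step in a subshell. The sketch below assumes nothing about what the scripts do; run_in is a hypothetical helper, not part of this repository.

```shell
#!/bin/sh
# run_in DIR COMMAND [ARGS...]
# Hypothetical helper: run COMMAND inside DIR in a subshell, so that the
# caller's working directory is unchanged afterwards. If DIR cannot be
# entered, COMMAND is not run and the call fails.
run_in() {
    dir=$1
    shift
    ( cd "$dir" && "$@" )
}
```

With this helper, the two steps become run_in preProcessing ./preProcess.sh and run_in postProcessing ./postProcess.sh, both issued from the parser's main directory.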

The results of the parsing can be validated using
./validate.sh

The results of the parsing can be cleaned using
./clean.sh