Polish CDSCorpus is a data set that enables research on compositional distributional semantics. It was created at the Institute of Computer Science, Polish Academy of Sciences in Warsaw. The corpus is licensed under the terms of CC BY-NC-SA 4.0.


Polish CDSCorpus consists of 10K Polish sentence pairs that are human-annotated for semantic relatedness and entailment. The data set can be used to evaluate compositional distributional semantics models of Polish. It was presented at ACL 2017. Please refer to Wróblewska and Krasnowska-Kieraś (2017) for a detailed description of the resource.
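As an illustration of how such annotations might be used for evaluation, the sketch below scores model predictions against gold labels: Pearson correlation for relatedness and accuracy for entailment. Both metrics are assumptions chosen for illustration and are not prescribed by this description. The label strings and all numbers are made-up toy values, not corpus data, and the function names are hypothetical.

```python
import math


def pearson(gold, pred):
    """Pearson correlation between gold and predicted relatedness scores."""
    n = len(gold)
    if n != len(pred) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_g = sum(gold) / n
    mean_p = sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_g * var_p)


def accuracy(gold, pred):
    """Fraction of sentence pairs whose predicted entailment label matches gold."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two equally long, non-empty sequences")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy values for illustration only (not taken from the corpus).
gold_relatedness = [1.0, 2.0, 3.0, 4.0]
pred_relatedness = [1.5, 1.8, 3.2, 3.9]

# Label names are an assumption made for this sketch.
gold_entailment = ["ENTAILMENT", "NEUTRAL", "CONTRADICTION", "NEUTRAL"]
pred_entailment = ["ENTAILMENT", "NEUTRAL", "NEUTRAL", "NEUTRAL"]

print("Pearson r:", round(pearson(gold_relatedness, pred_relatedness), 3))
print("Accuracy:", accuracy(gold_entailment, pred_entailment))
```

In practice, the gold values would be read from the corpus files, whose format is described in the corpus documentation rather than here.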

Data split

CDSCorpus is divided into three parts: training, test and development data sets. These data sets were used in the experiments on evaluating sentence embeddings presented in Krasnowska-Kieraś and Wróblewska (2019).


The creation of the resource was supported by SONATA 8 grant no. 2014/15/D/HS2/03486 from the National Science Centre, Poland.


  • Alina Wróblewska and Katarzyna Krasnowska-Kieraś. Polish evaluation dataset for compositional distributional semantics models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 784-792, Vancouver, Canada, 2017. Association for Computational Linguistics.

  • Katarzyna Krasnowska-Kieraś and Alina Wróblewska. Empirical Linguistic Study of Sentence Embeddings. To appear in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019. Association for Computational Linguistics.