Paralela_simple_ObjNum_ENG_part.txt
Paralela_simple_ObjNum_POL_part.txt
Paralela_simple_Passive_ENG_part.txt
Paralela_simple_Passive_POL_part.txt
Paralela_simple_SentLen_ENG_part.txt
Paralela_simple_SentLen_POL_part.txt
Paralela_simple_SentType_ENG_part.txt
Paralela_simple_SentType_POL_part.txt
Paralela_simple_SubjNum_ENG_part.txt
Paralela_simple_SubjNum_POL_part.txt
Paralela_simple_Tense_ENG_part.txt
Paralela_simple_Tense_POL_part.txt
Paralela_simple_TopDeps_ENG_part.txt
Paralela_simple_TopDeps_POL_part.txt
Paralela_simple_TreeDepth_ENG_part.txt
Paralela_simple_TreeDepth_POL_part.txt
Paralela_simple_WC_ENG_part.txt
Paralela_simple_WC_POL_part.txt
README.md

Probing datasets for Polish and English

A probing task is a classification problem that focuses on a simple linguistic property of sentences. A classifier is trained to check whether that linguistic property is retained in the vector representations of sentences (i.e. sentence embeddings). The classifier can be trained on a probing dataset that contains pairs of sentences and their categories (i.e. values of the linguistic property).
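As a minimal sketch of the idea, the snippet below trains a nearest-centroid probe on toy two-dimensional vectors that stand in for sentence embeddings. The vectors, the labels `Sing` and `Plur`, and the choice of a nearest-centroid classifier are all assumptions made for illustration; an actual probing setup would use real embeddings and typically a stronger classifier.

```python
from collections import defaultdict


def train_centroids(vectors, labels):
    """Average the embedding vectors of each category."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vec):
    """Return the category whose centroid is closest to vec."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


def accuracy(centroids, vectors, labels):
    """Fraction of vectors whose predicted category matches the gold one."""
    correct = sum(predict(centroids, v) == y for v, y in zip(vectors, labels))
    return correct / len(labels)


# Toy "embeddings" (hypothetical values, not taken from the datasets).
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_y = ["Sing", "Sing", "Plur", "Plur"]
test_x = [[0.7, 0.3], [0.3, 0.8]]
test_y = ["Sing", "Plur"]

probe = train_centroids(train_x, train_y)
print(accuracy(probe, test_x, test_y))
```

High probe accuracy on held-out sentences suggests that the property can be recovered from the embeddings; accuracy near chance suggests that it cannot.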

The proposed probing datasets were automatically extracted from a corpus of dependency-parsed sentences. The extraction procedure was based on a set of rules compatible with the Universal Dependencies (UD) annotation schema and is thus applicable to any language with UD-style dependency treebanks. Each probing dataset consists of 90k examples (75k for training and 7.5k each for validation and testing). The details are presented in the article Empirical Linguistic Study of Sentence Embeddings, published in the proceedings of ACL 2019.
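The split sizes can be checked with a short script. This README does not describe the file layout, so the sketch below assumes a hypothetical three-column, tab-separated format (split code, category, sentence) with the hypothetical split codes `tr`, `va` and `te`; inspect an actual file and adjust the parsing before relying on it.

```python
from collections import Counter


def read_examples(lines):
    """Yield (split, label, sentence) triples from tab-separated lines.

    ASSUMPTION: three columns per line; the real files may differ.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        split, label, sentence = line.split("\t", 2)
        yield split, label, sentence


def split_sizes(lines):
    """Count how many examples belong to each split."""
    return Counter(split for split, _, _ in read_examples(lines))


# In-memory sample with hypothetical split codes and labels.
sample = [
    "tr\tPast\tHe liked her.\n",
    "va\tPres\tHe likes her.\n",
    "te\tPres\tDo you like him?\n",
]
print(split_sizes(sample))

# The sizes stated above add up to the full dataset size.
assert 75_000 + 7_500 + 7_500 == 90_000
```

For a real file, `split_sizes(open(path, encoding="utf-8"))` would be the equivalent call, with `path` pointing at one of the `Paralela_simple_*_part.txt` files.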

Probing tasks

WC (word content)

Classification of sentences containing exactly one of 750 pre-selected target words.
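A rough sketch of how such a label could be assigned is shown below. The tiny target set stands in for the 750 pre-selected words, which are not listed in this README, and counting token occurrences without lowercasing is an assumption about how "exactly one" is interpreted.

```python
def wc_label(tokens, target_words):
    """Return the single target word found in tokens, or None.

    ASSUMPTION: a sentence qualifies only if exactly one of its tokens
    is a target word; matching is case-sensitive.
    """
    hits = [tok for tok in tokens if tok in target_words]
    return hits[0] if len(hits) == 1 else None


# Hypothetical target words used only for this illustration.
targets = {"dog", "house"}

print(wc_label(["the", "dog", "barks"], targets))
print(wc_label(["the", "dog", "left", "the", "house"], targets))
```

Sentences for which the function returns `None` would be excluded from the task rather than assigned a category.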

SentLen (sentence length)

Classification of sentences by their length.

TreeDepth (dependency tree depth)

Classification of sentences based on the depth of the corresponding dependency trees.
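One way to compute such a depth from head indices is sketched below. It assumes CoNLL-U-style 1-based head indices with 0 marking the dependent of ROOT, a well-formed tree, and depth counted as the number of edges on the longest path from the top-most node to a leaf; the original extraction rules may count depth differently.

```python
def tree_depth(heads):
    """Return the depth of a dependency tree.

    heads[i] is the 1-based head index of token i + 1; 0 marks the
    token attached to ROOT. ASSUMPTION: the tree has no cycles.
    """
    def depth_of(token):
        steps = 0
        # Climb towards the top-most node, counting edges on the way.
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
        return steps

    return max(depth_of(token) for token in range(1, len(heads) + 1))


# "He likes her ." : every other token attaches to "likes".
print(tree_depth([2, 0, 2, 2]))
# "The old dog sleeps" : "The" and "old" attach to "dog".
print(tree_depth([3, 3, 4, 0]))
```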

TopDeps (top dependency schema)

Classification of sentences based on the multiset of the dependency types labelling the relations between the top-most node (the ROOT’s only dependent) and all its children, barring punct relations.
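The description above can be sketched as follows, assuming each token is given as a `(head, deprel)` pair with 1-based head indices and 0 for ROOT. The function names and the space-joined label format are hypothetical choices for this illustration.

```python
from collections import Counter


def top_deps(tokens):
    """Return the multiset of relations on the top-most node's children.

    tokens: list of (head, deprel) pairs; punct relations are skipped.
    """
    # The top-most node is the token whose head is ROOT (index 0).
    top = next(i for i, (head, _) in enumerate(tokens, start=1) if head == 0)
    return Counter(
        rel for head, rel in tokens if head == top and rel != "punct"
    )


def top_deps_label(tokens):
    """Serialise the multiset as a sorted, space-separated string."""
    return " ".join(sorted(top_deps(tokens).elements()))


# "He likes her ."
sentence = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
print(top_deps_label(sentence))
```

Sorting the relation names makes the label independent of word order, which matches the multiset wording in the task description.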

Passive (passive voice)

Classification of sentences as passive voice or active voice.
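UD marks passive constructions with relation subtypes such as `nsubj:pass` and `aux:pass`, so one plausible rule, sketched here as an assumption rather than as the rule actually used to build the datasets, is to look for a `:pass` relation among the children of the main predicate.

```python
def is_passive(tokens):
    """Guess whether the main clause is passive.

    tokens: list of (head, deprel) pairs with 1-based heads, 0 = ROOT.
    ASSUMPTION: a child of the main predicate carrying a ':pass'
    relation subtype signals passive voice.
    """
    top = next(i for i, (head, _) in enumerate(tokens, start=1) if head == 0)
    return any(head == top and rel.endswith(":pass") for head, rel in tokens)


# "The door was opened"
passive = [(2, "det"), (4, "nsubj:pass"), (4, "aux:pass"), (0, "root")]
# "He likes her"
active = [(2, "nsubj"), (0, "root"), (2, "obj")]

print(is_passive(passive), is_passive(active))
```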

Tense (grammatical tense)

Classification of sentences by the grammatical tense of their main predicates.

SubjNum (grammatical number of subjects)

Classification of sentences by the grammatical number of nominal subjects (marked with the UD label nsubj) of main predicates.

ObjNum (grammatical number of objects)

Classification of sentences by the grammatical number of direct objects (marked with the UD label obj) of main predicates.
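Both this task and SubjNum read the grammatical number of one dependent of the main predicate, so a single helper can illustrate them. The sketch assumes tokens are given as `(head, deprel, feats)` triples, where `feats` is a CoNLL-U FEATS string; the function names are hypothetical and the original extraction rules may handle more cases (e.g. several subjects or missing features).

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def dependent_number(tokens, deprel):
    """Return the Number feature of the main predicate's dependent.

    tokens: list of (head, deprel, feats) triples, 1-based heads,
    0 = ROOT. Returns None if no such dependent or feature exists.
    """
    top = next(
        i for i, (head, _, _) in enumerate(tokens, start=1) if head == 0
    )
    for head, rel, feats in tokens:
        if head == top and rel == deprel:
            return parse_feats(feats).get("Number")
    return None


# "He likes apples"
sentence = [
    (2, "nsubj", "Case=Nom|Number=Sing|Person=3|PronType=Prs"),
    (0, "root", "Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"),
    (2, "obj", "Number=Plur"),
]
print(dependent_number(sentence, "nsubj"))  # SubjNum
print(dependent_number(sentence, "obj"))    # ObjNum
```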

SentType (sentence type)

Classification of sentences by their types: inter for interrogative sentences (e.g. Do you like him?), imper for imperative sentences (e.g. Get out of here!), and other for declarative sentences (e.g. He likes her.) and exclamatory sentences (e.g. What a liar!).

Acknowledgments

The research presented in this paper was supported by SONATA 8 grant no. 2014/15/D/HS2/03486 from the National Science Centre Poland. The computing was performed at Poznań Supercomputing and Networking Center.

Licensing

TBA

References

Katarzyna Krasnowska-Kieraś and Alina Wróblewska (2019). Empirical Linguistic Study of Sentence Embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5729–5739, Florence, Italy, July 28-August 2, 2019.
