IT applications in the humanities: Estonian Interlanguage Corpus

Pille Eslon
Mart Laanpere

The Estonian Interlanguage Corpus, established at Tallinn University, is a collection of written texts in Estonian as a second language and as a foreign language. It comprises a number of sub-corpora, a user interface, a multi-level annotation and tagging system, a statistics module, an option for automatic parsing of texts, etc. By combining different characteristics of a text (e.g. genre, number of words or sentences), types of errors, and metadata on the language learner (e.g. first language, country of origin, gender, education, level of language proficiency), the user interface of the Estonian Interlanguage Corpus makes it possible to run multi-level queries.
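To illustrate what such a multi-level query amounts to, the following is a minimal Python sketch. It is not the corpus software or its interface: the record fields, the `query` function, and the four sample records are all hypothetical and were invented only for this illustration. The sketch filters texts by a combination of text characteristics, error type, and learner metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    # Hypothetical record layout; the real corpus schema may differ.
    genre: str
    n_words: int
    error_types: frozenset
    first_language: str
    proficiency: str


def query(texts, *, genre=None, min_words=0, error_type=None,
          first_language=None, proficiency=None):
    """Return the texts that satisfy every criterion that was supplied."""
    hits = []
    for t in texts:
        if genre is not None and t.genre != genre:
            continue
        if t.n_words < min_words:
            continue
        if error_type is not None and error_type not in t.error_types:
            continue
        if first_language is not None and t.first_language != first_language:
            continue
        if proficiency is not None and t.proficiency != proficiency:
            continue
        hits.append(t)
    return hits


# Invented sample records, for illustration only (not corpus data).
SAMPLE = [
    Text("essay", 250, frozenset({"case", "word order"}), "Russian", "B1"),
    Text("letter", 120, frozenset({"case"}), "Russian", "A2"),
    Text("essay", 310, frozenset({"spelling"}), "Finnish", "B2"),
    Text("essay", 90, frozenset({"case"}), "Russian", "B1"),
]

# Essays of at least 100 words, with a case error, written by L1 Russian learners.
hits = query(SAMPLE, genre="essay", min_words=100,
             error_type="case", first_language="Russian")
print(len(hits))
```

Each criterion narrows the result independently, so any subset of text characteristics, error types, and learner metadata can be combined in one query.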

As of October 2013, the corpus contains 11,720 texts with a total of 3,185,591 running words; the average text length is 272 running words.
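The reported average follows directly from the two totals. A minimal Python check, using only the figures stated above (the variable names are chosen for this illustration):

```python
# Totals as reported for October 2013.
TOTAL_TEXTS = 11_720
TOTAL_RUNNING_WORDS = 3_185_591

# Average text length in running words, rounded to the nearest whole word.
average_length = TOTAL_RUNNING_WORDS / TOTAL_TEXTS
print(round(average_length))  # 272
```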

Table: Sub-corpora of the Estonian Interlanguage Corpus

Sub-corpus	No. of texts	No. of running words	Average length of text (no. of words)
K2 main corpus	3,151	804,094	255
K2 national examinations	7,856	1,989,844	253
K2 open contests and olympiads	63	58,684	932
K2 academic writing in Estonian	13	14,716	1,132
K1 academic writing in Estonian *	4	3,339	835
K1 Russian (reference corpus)	370	209,885	567
K3 Russian (reference corpus)	273	101,566	372

* The sub-corpus was compiled at the Centre of Academic Language, Tallinn University (P. Nemvalts).

The Estonian Interlanguage Corpus can be used in (1) empirical and applied research (e.g. the acquisition of Estonian as L2, the language proficiency levels of the Council of Europe, usage patterns of the Estonian language, tendencies in language development); (2) training future language teachers and linguists (e.g. error analysis, frequency of words and forms, cluster analysis); (3) in-service training of practising language teachers (e.g. using corpora in language teaching, using corpus data to assess the validity of textbooks), etc.

Digital humanities in Estonia
