Pille Eslon
Mart Laanpere
The Estonian Interlanguage Corpus, established at Tallinn University, is a collection of written texts in Estonian as a second language and as a foreign language. It comprises a number of sub-corpora, a user interface, a multi-level annotation and tagging system, a statistics module, an option for automatic parsing of texts, etc. By combining different characteristics of a text (e.g. genre, number of words or sentences), types of errors, and metadata on the language learner (e.g. first language, country of origin, gender, education, level of language proficiency), the user interface of the Estonian Interlanguage Corpus allows multi-level queries to be carried out.
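To illustrate what a multi-level query of this kind involves, the following is a minimal Python sketch. It is not the corpus's actual interface: the record layout (`LearnerText`), the function `multi_level_query`, and the sample records are all hypothetical and chosen only to mirror the text characteristics, error types, and learner metadata listed above.

```python
from dataclasses import dataclass, field


@dataclass
class LearnerText:
    # Hypothetical record layout; the corpus's actual schema is not described here.
    genre: str
    word_count: int
    first_language: str
    proficiency: str
    error_types: set = field(default_factory=set)


def multi_level_query(texts, genre=None, first_language=None,
                      proficiency=None, error_type=None):
    """Return the texts that satisfy every criterion that was supplied."""
    hits = []
    for text in texts:
        if genre is not None and text.genre != genre:
            continue
        if first_language is not None and text.first_language != first_language:
            continue
        if proficiency is not None and text.proficiency != proficiency:
            continue
        if error_type is not None and error_type not in text.error_types:
            continue
        hits.append(text)
    return hits


# Invented sample records, for illustration only.
sample = [
    LearnerText("letter", 180, "Russian", "B1", {"case", "word order"}),
    LearnerText("essay", 420, "Russian", "B2", {"case"}),
    LearnerText("essay", 390, "Finnish", "B2", {"spelling"}),
]

# Combine a text characteristic, a learner attribute, and an error type.
hits = multi_level_query(sample, genre="essay",
                         first_language="Russian", error_type="case")
print([t.word_count for t in hits])
```

Criteria left as `None` are ignored, so the same function covers both single-level and combined queries.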
As of October 2013, the corpus contains 11,720 texts with a total of 3,185,591 running words; the average text length is 272 running words.
Table: Sub-corpora of the Estonian Interlanguage Corpus
Sub-corpus | No. of texts | No. of running words | Average length of text (no. of words) |
K2 main corpus | 3,151 | 804,094 | 255 |
K2 national examinations | 7,856 | 1,989,844 | 253 |
K2 open contests and olympiads | 63 | 58,684 | 932 |
K2 academic writing in Estonian | 13 | 14,716 | 1,132 |
K1 academic writing in Estonian * | 4 | 3,339 | 835 |
K1 Russian (reference corpus) | 370 | 209,885 | 567 |
K3 Russian (reference corpus) | 273 | 101,566 | 372 |
* The sub-corpus was compiled at the Centre of Academic Language, Tallinn University (P. Nemvalts).
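The average-length figures follow from dividing the number of running words by the number of texts. A minimal Python sketch, using only the counts reported above (the function and variable names are illustrative), recomputes them; the results should agree with the reported values to within a word of rounding.

```python
# (number of texts, running words, reported average) as given in the table.
sub_corpora = {
    "K2 main corpus": (3151, 804094, 255),
    "K2 national examinations": (7856, 1989844, 253),
    "K2 open contests and olympiads": (63, 58684, 932),
    "K2 academic writing in Estonian": (13, 14716, 1132),
    "K1 academic writing in Estonian": (4, 3339, 835),
    "K1 Russian (reference corpus)": (370, 209885, 567),
    "K3 Russian (reference corpus)": (273, 101566, 372),
}


def average_length(n_texts, n_words):
    """Mean text length in running words, rounded to the nearest whole word."""
    return round(n_words / n_texts)


for name, (n_texts, n_words, reported) in sub_corpora.items():
    computed = average_length(n_texts, n_words)
    print(f"{name}: computed {computed}, reported {reported}")

# Overall figures stated for the whole corpus as of October 2013.
print("Whole corpus:", average_length(11720, 3185591))
```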
The Estonian Interlanguage Corpus can be used in (1) empirical and applied research (e.g. acquisition of Estonian as an L2, the language proficiency levels of the Council of Europe, usage patterns of Estonian, tendencies in language development); (2) training future language teachers and linguists (e.g. error analysis, frequency of words and forms, cluster analysis); (3) in-service training of practising language teachers (e.g. using the corpora in language teaching, using corpus data to assess the validity of textbooks), etc.