Raul Sirel
Software Technology and Applications Competence Center
Data mining and data analytics have become increasingly topical in the age of big data. Although big data offers numerous possibilities, the analysis of large datasets is often limited because much of the interesting data is presented as free text. The problem can be addressed by utilising natural language processing (NLP), a hybrid research field of computer science and linguistics. Despite the intriguing prospects of big data and NLP, several practical issues exist:
- most existing NLP methodologies are language-specific and therefore not applicable to processing other languages,
- most current NLP research focuses on general language (e.g. newspaper articles), which makes it difficult to apply the resulting methods or models to sublanguages (e.g. tweets or clinical notes),
- when working with sublanguages, lexical resources (e.g. dictionaries or thesauri) built for general language often correspond poorly to actual language usage; put simply, they do not contain the terms and concepts characteristic of the sublanguage,
- existing NLP methods usually require large-scale resources in order to be applied to big data analysis (i.e. they do not scale).
The purpose of this research was thus to develop a flexible software solution for analysing free-text datasets that takes into account the linguistic and typological properties of the target (sub)language. As the proposed solution is based on extracting relevant concepts from domain corpora using statistical language models, it does not require existing lexical resources for analysing the texts. The developed software has been packaged as the Terminology EXtraction and Text Analytics (TEXTA) Toolkit, a web-based collection of tools for extracting relevant information from arbitrary free-text datasets.
The toolkit is domain independent and can therefore be used for analysing datasets in different (sub)languages. The robustness of the toolkit makes it highly scalable and suitable for processing datasets containing millions of documents.
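The general idea of extracting domain-characteristic concepts statistically can be illustrated with a minimal sketch. The snippet below ranks terms by how much more frequent they are in a domain corpus than in a general-language reference corpus; the toy corpora, the simplistic tokeniser, and the smoothed frequency-ratio score are illustrative assumptions only and do not represent the statistical language models actually used in the TEXTA Toolkit.

```python
# Illustrative sketch: rank terms that are over-represented in a domain corpus
# relative to a general-language reference corpus. All inputs and the scoring
# scheme are assumptions for demonstration, not the TEXTA Toolkit's method.
from collections import Counter
import math
import re

def tokenise(text):
    """Lowercase and split on non-letter characters (deliberately simplistic)."""
    return [t for t in re.split(r"[^a-zåäöõü]+", text.lower()) if t]

def domain_terms(domain_docs, reference_docs, top_n=10):
    """Score each term by the add-one smoothed log-ratio of its relative
    frequency in the domain corpus versus the reference corpus."""
    dom = Counter(t for doc in domain_docs for t in tokenise(doc))
    ref = Counter(t for doc in reference_docs for t in tokenise(doc))
    dom_total, ref_total = sum(dom.values()), sum(ref.values())
    scores = {
        term: math.log(((count + 1) / (dom_total + 1)) /
                       ((ref.get(term, 0) + 1) / (ref_total + 1)))
        for term, count in dom.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy corpora standing in for a sublanguage dataset and general language:
clinical_notes = ["patient reports chest pain and dyspnoea",
                  "chest x-ray shows no acute pathology"]
newspaper_text = ["the council approved the new budget",
                  "heavy rain is expected over the weekend"]
print(domain_terms(clinical_notes, newspaper_text, top_n=5))
```

Because such an approach relies only on the corpora themselves, it needs no pre-existing dictionary or thesaurus for the sublanguage, which is the property the toolkit exploits.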