While most learner language features had to be annotated manually, natural language processing (NLP) tools were used for automatic annotations such as tokenization, lemmatization, part-of-speech tagging, and segmentation into sentences or T-units.
Annotations in the full MERLIN corpus
The main annotations available for almost all learner texts (for detailed figures, please see MERLIN corpus) are target hypotheses (target hypotheses 1) and annotations of grammatical and orthographical learner language features (error annotation 1):
All annotation is based on human interpretation of what the person who produced the text might have had in mind. In a learner text collection (learner corpus), it is important to make this interpretation explicit so that annotations are easier to understand and reliability problems are avoided. Therefore, the MERLIN team formulated target hypotheses (TH), which are corrected versions of the learner texts. The team followed the rules developed for the FALKO corpus and adapted them to the project needs where necessary (cf. Reznicek/Lüdeling et al. 2012).
The "minimal target hypothesis" (TH1) is a minimally intervening version of the learner text that is orthographically and grammatically correct, but might contain deviations from what a native speaker would say on other levels (e.g., lexical). TH1s were written for the whole MERLIN corpus.
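To illustrate how a minimal target hypothesis relates to the learner text, the following Python sketch aligns a learner sentence with a corrected version token by token and lists the places where the two differ. It is only an illustration, not the MERLIN procedure: the target hypotheses in MERLIN were written by human annotators, the example sentence is invented rather than taken from the corpus, the whitespace tokenization is a simplification, and the function name `diff_tokens` is hypothetical.

```python
from difflib import SequenceMatcher


def diff_tokens(learner_tokens, th_tokens):
    """Return the spans where a learner text and a target hypothesis differ.

    Each result is a tuple (operation, learner_span, th_span), where the
    operation is 'replace', 'delete' or 'insert' as reported by difflib.
    """
    matcher = SequenceMatcher(a=learner_tokens, b=th_tokens, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, learner_tokens[i1:i2], th_tokens[j1:j2]))
    return differences


# Invented example sentence (NOT from the MERLIN corpus).
learner = "Ich habe gestern in Kino gegangen".split()
# Hypothetical minimal correction: grammar and orthography only.
th1 = "Ich bin gestern ins Kino gegangen".split()

for operation, learner_span, th_span in diff_tokens(learner, th1):
    print(operation, learner_span, "->", th_span)
```

A comparison of this kind only shows where the hypothesis intervenes. Deciding which learner language feature a difference represents remains a matter of human interpretation, as described above.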
Based on these target hypotheses, data were annotated with a wide range of language characteristics – the learner language features – originating from various sources (learn more here). These language features are described in detail in the annotation scheme. You can find a list of the features with some examples here. In the MERLIN corpus, learner language features from the fields of orthography and grammar are available for the whole database (error annotation 1).
Annotations in the core corpus
In the smaller, exploratory MERLIN core corpus, linguistic aspects regarding vocabulary, pragmatics, and sociolinguistic appropriateness are taken into consideration. The core corpus consists of two groups of texts which received either A2 or B2 ratings (for Italian: A2 and B1/B1+).
Almost all core corpus texts were enriched with an extended target hypothesis (TH2) that aims at creating a version of the original learner text that a native speaker would find acceptable. TH2 takes additional language dimensions into account, many of which concern context-dependent phenomena.
In addition, learner language features regarding vocabulary, sociolinguistics, pragmatics, and intelligibility are included in part of the core corpus annotations (error annotation 2). Very often, these phenomena are not errors. These language features are also described in detail in the annotation scheme.
In the future, it would be desirable to have these exploratory pilot annotations re-checked so that TH2 and EA2 annotations can then be extended to the whole MERLIN database. For now, we ask users to handle the core corpus annotations with caution.
You can find more details on the annotation layers and contents here.