MERLIN for research

1. Linking the MERLIN texts to the CEFR

1.1 Re-ratings

The MERLIN texts are the writing sections of CEFR-related, standardized, high-quality tests from telc (Frankfurt/Main, Italian and German tests) and ÚJOP (Prague, Czech tests). Both institutions are ALTE-audited. The tasks were in use until 2013 and are now freely available on the platform. However, to obtain explicit and direct information about the CEFR profiles of the written productions themselves (and not only of the tests as a whole), all texts were re-rated for MERLIN independently by two professional raters per language. The reliability of the re-ratings was examined with the help of Classical Test Theory and a Multi-Facet Rasch analysis. The latter is a probabilistic statistical procedure often used in language testing; it corrects for rating tendencies (e.g., leniency/harshness) and makes it possible to arrive at a fair average rating for each text. Intra-rater and inter-rater reliability were generally very high in MERLIN, with some exceptions for Italian. The whole re-rating process was therefore repeated for Italian, which resulted in a satisfactory rating quality. In MERLIN, the fair average is calculated based on a holistic scale (see 1.2 Rating instruments). If you compile your own corpus based on CEFR levels, the levels you select are also based on the fair average ratings (Define a subcorpus » Overall CEFR rating). If you are interested in more details regarding the quality of the ratings and the difficulty of the single rating criteria, please consult the technical report.


1.2 Rating instruments

Two rating instruments were used: An assessor-oriented version (Alderson 1991) of the holistic scale (page 2 of the MERLIN rating grid) for "General Linguistic Range" (Chapter 5, CEFR) was accompanied by an analytical rating grid (page 3 of the MERLIN rating grid) that is closely connected to Table 3 of the CEFR (CoE 2001). This table was of great importance in the process of scaling the CEFR descriptors (North 2005, 2000). The MERLIN version includes six rating criteria (vocabulary range | vocabulary control | grammatical accuracy | coherence & cohesion | orthography | sociolinguistic appropriateness). These criteria stem from scales in Chapter 5 of the CEFR, which specifies aspects of communicative L2 competence. For the construction of the grid, descriptors of these scales were modified in an assessor-oriented way. Plus-levels (A2+, B1+) were excluded as the CEFR does not specify descriptors for these levels for all rating criteria. The rating instruments were piloted before their implementation in the MERLIN project.


2. Preparing the data

2.1 Transcriptions

The hand-written original learner texts were transcribed in an XML-based editor (XMLmind©) at the testing institutions (telc and ÚJOP). The transcribers followed transcription guidelines (available only in German), and the reliability of the transcripts was checked, initially for a sample of 5% of the texts per CEFR level. As many transcription errors were detected, almost all texts eventually had to undergo a revision stage.
The transcription guidelines included tags (inline annotation) for basic textual features such as unreadable or ambiguous stretches of language, foreign language words, emoticons, images, paragraphs, words copied from the rubrics, or greeting formulae. The anonymization (names, places) was part of the transcription process and was carried out according to the guidelines.

2.2 Tools & formats

Once the transcriptions were available, all data was converted to PAULA, a standoff XML format designed as an exchange format for linguistic annotation. Further manual annotations were carried out with two tools: MMAX2 and the Falko Excel Add-in. MMAX2 is a text annotation tool that allows multi-layered annotation. It was used for the annotation of learner language features (see 2.3.1). The Falko Add-in was used for annotating both target hypotheses 1 and 2 (for more details on the annotation of target hypotheses with the Falko Add-in, see the Falko-Handbuch). Automatic annotation made use of the UIMA framework. UIMA allows a modular integration of a wide range of NLP tools such as part-of-speech taggers and parsers. For the advanced search functions, the open-source, browser-based search and visualization architecture ANNIS is used in the MERLIN interface (see explanations on search output in ANNIS).
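
To illustrate what "standoff" means in this context, the following minimal Python sketch keeps the primary text untouched and stores every annotation as a separate record that points at character offsets. The class, field, and layer names are hypothetical and do not reproduce the actual PAULA schema; the sketch only shows the general idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A standoff annotation: it points into the text instead of being embedded in it."""
    start: int   # inclusive character offset into the primary text
    end: int     # exclusive character offset into the primary text
    layer: str   # hypothetical layer name, e.g. "tok" or "pos"
    value: str   # the annotation value on that layer


def covered_text(text: str, span: Span) -> str:
    """Return the stretch of primary text that a span refers to."""
    return text[span.start:span.end]


def layer(annotations, name):
    """Select all annotations that belong to one layer."""
    return [a for a in annotations if a.layer == name]


# Invented example text; the primary data is never modified.
text = "Ich habe ein Buch"
annotations = [
    Span(0, 3, "tok", "Ich"),
    Span(4, 8, "tok", "habe"),
    Span(0, 3, "pos", "PPER"),
    Span(4, 8, "pos", "VAFIN"),
]

for span in layer(annotations, "pos"):
    print(covered_text(text, span), span.value)
```

Because each layer lives apart from the text, further layers (for example target hypotheses or error tags) can be added without touching the layers that already exist.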



2.3 Annotations

A short introduction to the structure of the MERLIN annotations is provided here. This section gives more detailed information on the single annotation layers that are available for the whole corpus and for the smaller core corpus, as well as on quality control aspects.

2.3.1 Manual annotations available for the whole corpus



Minimal target hypotheses / target hypotheses 1 (TH1)

All annotation is necessarily based on a human interpretation of what the person who produced the text might have had in mind. It is important to make this interpretation explicit so that MERLIN users can understand the annotations better. Therefore, the MERLIN corpus contains rule-based target hypotheses that suggest a corrected version of the learner texts.
In the main phase of annotation, an orthographically and grammatically correct version of the learner text was created (target hypotheses 1, TH1) for the whole corpus. Annotators were allowed to make as few interventions as possible. In this table, you find a simple example (for a definition of the tiers, please refer to the explanations of the search output):

The following example by the same learner shows that in TH1, errors from other linguistic areas were ignored. There are content and technical reasons for this.

While the orthographical (capitalization error, word boundary error, missing hyphen) and grammatical (missing article) errors are corrected in the TH1 (termed ‘ZH1’ here), the lexically erroneous form *Reisespass (instead of “Reisepass”) was not substituted by another lexeme. Phenomena like this are annotated in the MERLIN core corpus (for definitions of the errors see MERLIN annotation scheme).
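
The relationship between a learner text and its TH1 can be made visible with a token-level comparison. The sketch below uses Python's standard `difflib` module; the two token sequences are invented for illustration (they are not taken from the corpus), and the function name `th1_differences` is hypothetical. Note that *Reisespass is left unchanged, in line with the rule that TH1 does not correct lexical errors.

```python
from difflib import SequenceMatcher


def th1_differences(learner_tokens, th1_tokens):
    """List the token-level operations that turn the learner text into TH1."""
    matcher = SequenceMatcher(a=learner_tokens, b=th1_tokens, autojunk=False)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only the stretches where TH1 deviates
            diffs.append((op, learner_tokens[i1:i2], th1_tokens[j1:j2]))
    return diffs


# Invented example: a capitalization error and a missing article are
# corrected, while the lexical form "Reisespass" stays as the learner wrote it.
learner = "ich habe Reisespass verloren".split()
th1 = "Ich habe den Reisespass verloren".split()

diffs = th1_differences(learner, th1)
for op, source, target in diffs:
    print(op, source, target)
```

In this sketch, a `replace` corresponds to a changed element and an `insert` in TH1 corresponds to an element the learner omitted.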

The team followed the target hypotheses rules developed for the Falko corpus and adapted them to the project needs where necessary (cf. Reznicek/Lüdeling et al. 2012; see annotation structure guidelines and Documentation of annotation decisions). In some cases, annotators agreed upon annotation rules on a very fine-grained level. For example, it was decided that in German, the final double <ss> instead of standard German spelling <ß> was not changed in texts in which it might be possible that the learner consistently used the Swiss spelling, which does not use the <ß>. For single decisions that you might be interested in, please consult the Documentation of annotation decisions.

TH1 were compiled for the whole MERLIN corpus. They were written in Excel with the help of the Falko Add-in. The TH1 procedure was piloted before the actual annotation took place.

If you want to display the TH1 on the MERLIN platform, go to Advanced search. To get explanations about the output you get there, read more here. You can also display TH1 for whole texts in the search results of Define a subcorpus.



Useful links & downloads with regard to TH1:
MERLIN annotation manual
Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 2.01. HU Berlin (Falko guidelines)
Documentation of annotation decisions


Manual annotation of grammatical and orthographical learner language features – error annotation 1 (EA1)

Building on the target hypotheses 1, all MERLIN texts were annotated with grammatical and orthographical language features from various sources (error annotation 1 – EA1). You can find a complete list of the features (“tags”) with examples here, while the annotation scheme gives you full access to the definitions of each learner language feature and additional examples.

The MERLIN annotation tags for EA1 and EA2 were derived from …

  1. CEFR scales: some tags were chosen to support research about the empirical validity of the CEFR scales underlying the MERLIN analytical rating grid (chapter 5 of the CEFR, CoE 2001). They can help to check whether the predictions of selected CEFR descriptors correspond to learner behaviour, e.g.: intelligibility, use of idioms, content jumps (see 3.1 Validating CEFR scales with MERLIN).

  2. issues in current SLA research, e.g. grammatical aspects such as verb valency, word order, negation, or lexical aspects, e.g. the use of formulaic sequences (references)

  3. features reported to the MERLIN team by testers, teachers and teacher trainers in a questionnaire study and in expert interviews as being relevant for assessing language mastery at certain levels, e.g. the verbal aspect in Italian and Czech

  4. textbook and language test analyses revealed further recurrent topics some of which were included in the MERLIN annotation scheme, e.g. German modal verbs

  5. learner text analyses carried out in a random sample of MERLIN texts (5% per test level/language), e.g. use of articles and clitics


The annotation scheme specifies to which group(s) the single learner language features belong.

Furthermore, most error-related MERLIN tags (EA1 & EA2) incorporate the widely used ‘target language modification’ dimension (cf. Díaz-Negrillo/Fernández-Domínguez 2006). This dimension specifies the type of error: an element might have been omitted, changed, added, repositioned, merged with, or split from another element. You can find details about this in the annotation scheme.

You can search for the annotated learner language features in the Advanced search, or you can extract lists of features relevant for a specific linguistic field or a specific CEFR level under Statistics.



Further links:
advanced search output explanation
annotation scheme
list with learner language features and examples


2.3.2 Manual annotations in the MERLIN core corpus


The structure of the MERLIN core corpus

For a small pilot sample (the MERLIN core corpus), more linguistic dimensions are taken into consideration in addition to grammar and orthography. The MERLIN core corpus consists of texts that received fair averages of either A2 or B2. Thus, two groups of learners with clearly distinct levels of proficiency can be compared. It is important to note that the ratings the learners received do not necessarily correspond to the CEFR level of the test they decided to take. You can distinguish between these dimensions under Define a subcorpus (“CEFR level of test” and “Overall CEFR rating”).

Many learners outperformed the targeted CEFR levels, while the performances of others were rated lower than the learners would have expected. An extreme case is Italian, where only two texts actually received a B2 level, while many more students took B2 tests. Here, the MERLIN core corpus incorporates the 100 texts that were placed highest on the Rasch logit scale (technical report).
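
The selection rule described for Italian (taking the texts placed highest on the logit scale) can be sketched in a few lines of Python. The record layout, the field names `id` and `logit`, and the values are hypothetical; they merely stand in for whatever measures the Rasch analysis produced.

```python
def select_core_texts(texts, n):
    """Return the n texts with the highest logit value, highest first."""
    return sorted(texts, key=lambda t: t["logit"], reverse=True)[:n]


# Invented values for illustration only.
sample = [
    {"id": "t1", "logit": 0.4},
    {"id": "t2", "logit": 2.1},
    {"id": "t3", "logit": -0.7},
    {"id": "t4", "logit": 1.3},
]

top_two = select_core_texts(sample, 2)
print([t["id"] for t in top_two])
```

With `n=100` and the full set of Italian texts, the same function would mirror the selection described above, under the assumption that one logit value per text is available.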


Core corpus: extended target hypotheses / target hypotheses 2 (TH2)  

Target hypotheses 2 aim at creating an acceptable version of the learner text. This process involves more subjectivity and makes reliable decisions more difficult, which is why it was separated from the level of target hypotheses 1, as in the Falko project, with which MERLIN cooperated closely. The aim of TH2 is to capture the acceptability of the learner text (not, as for TH1, its correctness). TH2 are therefore an extension of TH1. To this end, the learner text was still modified only minimally, while its reconstruction comes close to what a native speaker utterance would look like. This reconstruction concerns semantic and lexical aspects, pragmatics, and sociolinguistics. Unlike in TH1, phenomena that span several sentences and that are determined by the context are modified, too.

You can search for the TH2 in the Simple search and in the Advanced search.


Core corpus: annotations of sociolinguistic, pragmatic, lexical, and other learner language features  (error annotation 2, EA2)

For a part of the MERLIN core corpus, many tags from various linguistic perspectives were added to the grammatical and orthographical learner language features annotated in the main stage of the project. These tags stem from the same sources as the EA1 annotations (see 2.3.1).

You can find detailed information about the single tags which include, for example, the speech act REQUEST, the use of language with an inappropriate level of formality, the use of structures that pertain to spoken language variants, or reference problems in the annotation scheme. You can get an overview of the annotated features and find examples in this table.

Again, the MERLIN tags incorporate the widely used ‘target language modification’ dimension (cf. Díaz-Negrillo/Fernández-Domínguez 2006) which yields information about the type of the learner language feature (an element might have been omitted, changed, added, repositioned, merged with, or split from another element).

You can find these learner language features in the Advanced search. You can compile a list of these features for a particular linguistic area or a specific CEFR level under Statistics.


2.3.3 Quality control aspects of the annotation process

It was important to make sure that the annotations in the MERLIN corpus are as consistent as possible, even if a certain degree of subjectivity is unavoidable. To this end, the MERLIN project took a number of measures:

First of all, all instruments (TH1 & TH2 rules, annotation scheme for EA1 and EA2) were piloted before their implementation. This made it possible to detect potentially problematic aspects and to correct them before the annotations started.

Secondly, all annotations are based on guidelines (annotation manual, Falko-Handbuch).  The guidelines were enriched by fine-grained decisions on single aspects of annotation (documentation of annotation decisions).

A third measure to control the quality of annotations is their documentation. Many decisions had to be taken about which tag to apply to what phenomenon, and consistency among the three project languages had to be taken care of. The most important discussions among the annotators are documented in the documentation of annotation decisions. In the annotation scheme, the ‘related tags’ sections mirror some of the extensive discussion processes. 

Last but not least, the reliability of the annotations was also checked more formally. This was done for 5% of the texts on each test level for target hypotheses (1 & 2) and error annotation (1 & 2). Different methods were applied:

In a qualitative approach, half of the files were annotated independently by the coders and then discussed jointly in order to reach a consensus. As the annotation was done level by level, this took place before the annotation of each level started. The texts served as a reference throughout the annotation process.

The second half of the files checked for reliability was annotated by all coders without their knowledge. This quantitative, double-blind procedure makes it possible to check intra-coder reliability (the consistency of one and the same annotator) and inter-coder reliability (the degree of agreement between different annotators).
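
As an illustration of how inter-coder agreement can be quantified, the following Python sketch computes Cohen's kappa for two coders who labelled the same items. This is one common agreement statistic, chosen here as an assumption; the text above does not state which measure was used in MERLIN. The labels in the example are invented placeholders, not MERLIN tags.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which both coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:  # both coders used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented placeholder labels ("O" and "G") for four annotated items.
coder_1 = ["O", "O", "G", "G"]
coder_2 = ["O", "O", "G", "O"]
kappa = cohens_kappa(coder_1, coder_2)
print(kappa)
```

The same function can serve for intra-coder reliability if the two label lists come from one annotator who annotated the same files at two different times.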

Although EA2 annotations underwent these quality control measures as well, they have an exploratory, pilot character. It would therefore be desirable to have these annotations completely re-checked before extending them to the complete MERLIN database. In the meantime, users are asked to analyse EA2 annotations with caution.


Consistency and interference of annotation layers

From a technical perspective, it was complex to integrate and harmonize the different annotation formats in MERLIN without losing information or creating imprecisions.
At the same time, on a content level, contradictions between the different annotation levels (TH1-EA1-TH2-EA2) were to be avoided.
TH1 and EA1 are closely connected. If there is a change of the learner text on TH1, there ought to be a tag on EA1 that makes the learner language feature explicit in detail. There are single exceptions to this rule which are documented in the documentation of annotation decisions.
Also, all EA2 annotations are reflected in TH2. The opposite, however, is not necessarily true: There might be TH2 modifications that are needed to arrive at an acceptable version of the learner text and that are not part of the MERLIN annotation scheme. The MERLIN team might not have included a phenomenon if it was not considered relevant and/or feasible.
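
The rule that every TH1 change should be accompanied by an EA1 tag lends itself to an automatic check. The Python sketch below assumes, purely for illustration, three token-aligned tiers of equal length (learner text, TH1, EA1 tag or empty string); the tier layout, the function name, and the tag name `hypothetical_tag` are assumptions and do not reflect the actual MERLIN data format.

```python
def untagged_changes(learner, th1, ea1):
    """Return positions where TH1 differs from the learner text but EA1 is empty."""
    if not (len(learner) == len(th1) == len(ea1)):
        raise ValueError("the three tiers must be aligned token by token")
    return [
        i
        for i, (source, target, tag) in enumerate(zip(learner, th1, ea1))
        if source != target and not tag
    ]


# Invented, token-aligned example: position 0 is changed and tagged,
# position 2 is changed but carries no tag, so it is reported.
learner = ["ich", "habe", "buch"]
th1 = ["Ich", "habe", "Buch"]
ea1 = ["hypothetical_tag", "", ""]

missing = untagged_changes(learner, th1, ea1)
print(missing)
```

Positions reported by such a check would then have to be compared with the documented exceptions before being treated as inconsistencies.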

2.3.4 Automatic annotations in MERLIN

In MERLIN, a combination of automatic and manual annotation procedures was used in order to prepare learner texts for integration into the platform. We have applied existing automatic annotation tools developed for the target languages in order to expand the range of available linguistic annotation beyond what would have been possible with time-consuming and expensive manual annotation. However, it is important to keep in mind that automatic annotation is particularly challenging for learner language, since learner language often deviates considerably from the target language across all levels of linguistic analysis, from spelling to semantics.


The following tools were used for all three MERLIN languages:

Texts were tokenized using the tokenizer for Indo-European languages from LingPipe and the resulting tokenization was then corrected by hand.
Sentences were annotated with the OpenNLP sentence segmenter.
Repetitions were identified using the Saphre library on the basis of the automatic part-of-speech and lemma annotation described below.


Language-Specific Tools

MERLIN contains part-of-speech tags (tok_pos), lemmas (tok_lemma), and dependency parses (dependencies) for all three languages. Additional part-of-speech tags, lemmas, and morphological analyses from alternate tools are included where available. Details about the annotation tools and annotation schemes are provided for each language individually below.



Czech

Part-of-speech tags and lemmas (tok_pos and tok_lemma):

MorphoDiTa was used to annotate POS tags and lemmas according to the Prague Dependency Treebank guidelines. There are 12 basic POS tags (seen in the first character of each tag) and more than 4000 possible detailed morphosyntactic tags in the full tag set.
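
Since the basic POS category is encoded in the first character of each positional tag, it can be read off directly. The following Python sketch assumes the letter codes commonly documented for the Prague Dependency Treebank positional tag set; the mapping and the example tag should be checked against the official tag set description, and the function name `coarse_pos` is hypothetical.

```python
# Assumed letter codes for the first tag position (12 basic categories).
PDT_BASIC_POS = {
    "A": "adjective",
    "C": "numeral",
    "D": "adverb",
    "I": "interjection",
    "J": "conjunction",
    "N": "noun",
    "P": "pronoun",
    "V": "verb",
    "R": "preposition",
    "T": "particle",
    "X": "unknown",
    "Z": "punctuation",
}


def coarse_pos(tag: str) -> str:
    """Map a positional tag to its basic POS category via the first character."""
    if not tag:
        raise ValueError("empty tag")
    return PDT_BASIC_POS.get(tag[0], "unrecognized")


# Example positional tag with 15 positions (assumed to denote a noun form).
print(coarse_pos("NNFS1-----A----"))
```

Grouping tokens by this first character gives a coarse POS distribution without having to interpret the remaining positions of the tag.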

Dependency parses:

The joint tagger and parser from Bernd Bohnet et al. (2013) was trained on data from the Prague Dependency Treebank. The parser also provides basic POS tags (tok_pos_bohnet) and morphological analyses (tok_morph_bohnet).


German

Part-of-speech tags and lemmas (tok_pos and tok_lemma):

TreeTagger was used to annotate POS tags and lemmas using the Stuttgart-Tübingen tag set, which contains 54 tags.

Dependency parses:

The joint tagger and parser from Bernd Bohnet et al. (2013) was trained on a dependency conversion of the Tiger Treebank with additional data from the SMOR morphological analyzer.
Bernd Bohnet kindly provided a version of the German parsing model customized for the MERLIN data. The parser also provides basic POS tags (tok_pos_bohnet), lemmas (tok_lemma_bohnet), and morphological analyses (tok_morph_bohnet).

T-units (tunit and complextunit):

T-units and complex t-units were identified using the algorithms presented in Julia Hancke's 2013 master's thesis "Automatic Prediction of CEFR Proficiency Levels Based on Linguistic Features of Learner Language", which relies on automatic parses produced by the Stanford parser. The parses are not presented in the MERLIN corpus, but the POS tags from the Stanford parser, which uses the same German tag set as TreeTagger (STTS), are shown for reference in tok_pos_stanford.
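
Once T-unit boundaries are available, simple complexity measures such as the mean T-unit length can be derived from them. The Python sketch below assumes that each T-unit is given as a pair of token offsets (start inclusive, end exclusive); this representation and the function name are hypothetical and are not the format in which the tunit layer is stored.

```python
def mean_tunit_length(tunit_spans):
    """Average number of tokens per T-unit; 0.0 if there are no T-units."""
    lengths = [end - start for start, end in tunit_spans]
    if not lengths:
        return 0.0
    return sum(lengths) / len(lengths)


# Invented spans: three T-units of 5, 7, and 3 tokens.
spans = [(0, 5), (5, 12), (12, 15)]
print(mean_tunit_length(spans))
```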


Italian

Part-of-speech tags and lemmas (tok_pos and tok_lemma):

TreeTagger was used to annotate POS tags and lemmas. The POS tag set developed by Achim Stein contains 38 tags.

Dependency parses:

The joint tagger and parser from Bernd Bohnet et al. (2013) was trained with data from the Italian Stanford Dependency Treebank. Additional POS tags and morphological analysis provided by the parser are included as tok_pos_bohnet and tok_morph_bohnet.



3. Using MERLIN for research purposes

The main aim of MERLIN is not research-oriented: the platform was developed for practitioners who need empirical illustrations of rated CEFR levels for Czech, Italian, and German. Recently, an increasing number of initiatives (like SLATE) have started to collect authentic learner language rated according to CEFR levels. Some of them pertain to the Reference Level Descriptions (RLD) initiative, i.e. a specification of the CEFR levels for single languages (the most prominent example is the English Profile Project, other projects are ASK for Norwegian, Carlsen 2013, or the Profilo della lingua italiana, Spinelli/Parizzi 2010). The Council of Europe encourages the development of RLDs (CoE 2005, see CoE website for Reference Level Descriptions).
From corpora like these, features that characterize CEFR levels (sometimes called “criterial features”, Hawkins/Filipović 2012) can be extracted. This process helps to deepen the understanding of what CEFR-related ratings mean and to put the use of the CEFR on firmer, empirical grounds. MERLIN contributes to the empirically-based exploration of the CEFR for German, Italian, and Czech. It differs from most existing initiatives in that all data, including full texts, test tasks and annotations, are fully and freely available online.
Apart from this major practical aim, MERLIN is relevant for research purposes from various perspectives:


3.1 Validating CEFR scales with MERLIN

The Council of Europe effort of scaling the CEFR descriptors (CoE 2001; North 2000; Schneider/North 2000) has led to immense improvements in standardization and transparency in language learning, teaching, and testing. Important decisions about language learners' lives are taken with reference to the CEFR levels. In many ways, it seems as if the scales have acquired a life of their own; often, they are over-estimated, misunderstood and applied in ways for which they were not intended (North 2000). One crucial aspect that is still insufficiently understood is the empirical validity of the CEFR scales (Fulcher 2004; Hulstijn 2007): If scales are used to describe or rate learner language, they must reflect what learners actually do (Alderson 1991). In spite of this, to date there is almost no research that examines the power of the CEFR descriptors to capture the language learners actually produce (Wisniewski 2014). MERLIN makes it possible to analyze directly the relationship between selected CEFR descriptors (such as "circumlocutions" or "content jumps", which were operationalized and annotated; see MERLIN annotation scheme) and learner language without having to rely on ratings.

3.2 MERLIN and second language acquisition studies

Many studies from the area of second language acquisition (SLA) refer to proficiency levels when describing the development and the variation of learner language. However, in many cases the proficiency classification is not yet based on procedures that comply with the strict standards that need to be met from the perspective of research-based, high-quality language testing (see for example AERA/APA/NCME; ALTE 2001; Bachman/Palmer 1996; EALTA code of practice). There is a particular lack of strict testing procedures and easily accessible empirical data for languages other than English when it comes to CEFR-based proficiency classifications. Although MERLIN is small in size, its reliable relationship to the CEFR makes it a precious resource for future SLA studies. Also, it can be used for triangulating and validating data for many existing studies.

3.3 MERLIN to advance NLP of learner language

The MERLIN corpus provides valuable data for the development and evaluation of natural language processing tools for learner language (Meurers 2012). The corpus and its meta-information on learners and ratings readily support research on automatic native language identification, enabling such research to go beyond the current English learner focus. In a similar vein, the corpus has already been used for research on automatic proficiency classification for German (Hancke 2013). The MERLIN corpus also provides richly annotated learner data for the development and adaptation of NLP tools and applications that assist language learners in improving their vocabulary usage, coherence, spelling and grammatical accuracy.



