MERLIN Corpus | Resources for research and practice related to foreign language learning

MERLIN is an error-annotated written learner corpus for German, Italian and Czech. It was created within the MERLIN project (2012-2014). The texts in MERLIN were taken from standardized language tests and are related in a methodologically rigorous way to the Common European Framework of Reference for Languages (Council of Europe 2001, 2020). This platform makes all corpus texts available with their ratings. It shows possible usage scenarios, in teaching practice as well as in research, and describes the structure and design of the corpus and of the annotations. Users can search the corpus with the integrated web-based search engine ANNIS.

1 Download MERLIN texts and resources

You can download the whole corpus (2,286 texts) in the following file formats:

TXT-files including the target hypothesis and metadata such as age, gender, mother tongue, task, and rating
Transcription files in the EXMARaLDA format
Files in the PAULA and ANNIS formats

In addition, the following corpus-related overviews are available:

an overview of texts (IDs) and assigned metadata in *.xlsx
Tasks on which the target language tests (L2 test) are based
the complete documentation of the transcription, rating, and annotation process

All MERLIN data and resources are freely accessible under a Creative Commons licence (CC BY-SA 4.0). They are part of the CLARIN infrastructure (European Research Infrastructure for Language Resources and Technology). Download the whole corpus from the Eurac Research CLARIN Centre Repository.

2 Display and filter MERLIN texts

The MERLIN texts are TXT-files that you can open in a standard text editor. Descriptive file names help you easily filter the files by metadata. In addition, you can use the ANNIS search tool to sort texts and display them in the document browser.

Open texts with the file manager

After downloading, unpack / extract the texts with your native file manager, e. g. Windows File Explorer, and open them. Choose ↘ meta-ltext for learner texts (L2 texts) with metadata or ↘ meta_ltext_THs for L2 texts with target hypothesis.

Filter texts with the file manager

Use the search box of your native file manager, e. g. Windows File Explorer (you can find it to the right of the address bar), to filter the file list by the following features (metadata):

overall rating of the text, CEFR level, e. g. B1
task on which the L2 test is based, e. g. visit-letter
mother tongue (L1) of the learner, e. g. Russian

For example, to find all texts with the overall CEFR rating B1 written by learners with Russian as their mother tongue, enter B1 Russian.
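The same kind of filter can also be scripted. The following is a minimal sketch, assuming only what is stated above: that the metadata values (such as B1 or Russian) appear literally in the file names. The function name `filter_by_terms` and the folder name `meta_ltext` are hypothetical placeholders, and the matching is a plain case-insensitive substring test, which may behave differently from your file manager's search.

```python
from pathlib import Path


def filter_by_terms(folder, terms):
    """Return the .txt files in `folder` whose names contain every term.

    Matching is a case-insensitive substring test on the file name only;
    the file contents are not read.
    """
    wanted = [t.lower() for t in terms]
    return sorted(
        p for p in Path(folder).glob("*.txt")
        if all(t in p.name.lower() for t in wanted)
    )


if __name__ == "__main__":
    # "meta_ltext" is a placeholder; use the folder you extracted.
    for path in filter_by_terms("meta_ltext", ["B1", "Russian"]):
        print(path.name)
```

Because this is a substring test, a short term such as B1 would also match any longer file-name part that happens to contain it, so it is worth checking the result list against the task and metadata overview.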

The following L1s occur in the corpus: Arabic, Czech, English, Chinese, French, German, Hungarian, Italian, Polish, Portuguese, Russian, Slovak, Spanish, Turkish.

On MERLIN Corpus you will find an overview of all tasks including the abbreviations we used in the file names.

Open texts in ANNIS

Open the ANNIS search interface, go to Corpus List and select the corpus you want to display (i. e. the target language). Click on the ↘ document icon [1]. In the field to the right, the list view of all MERLIN texts of the chosen language opens up. Click on ↘ Full text [2] next to a text to open it and on "i" [3] to display the assigned metadata.

(Screenshot: corpus list)

Sort texts in ANNIS

Select a corpus (according to the target language) under ↘ Corpus List in the ANNIS search interface and click on the ↘ document icon. In the field to the right, a list view of all MERLIN texts of the chosen language opens up.

By clicking on ↘ _rating_fair_cefr you can quickly sort the texts according to the CEFR level (overall rating).

(Screenshot: full text)

If you start a search for learner language features directly in ANNIS, you can also filter texts by metadata such as the learner's L1, age or the assigned task. More on this in the next section.

3 Search the MERLIN corpus

You can search the MERLIN Corpus for lexical, grammatical and other features as well as for words, lemmas, or tagged parts of speech. By doing so, you will obtain examples of learner language (L2) in context. To provide the search functionality, the MERLIN platform uses the visualization and search architecture of ANNIS, which can display multi-layer annotations such as those of the MERLIN corpus.

Example searches

DE ↘ Realisations of forms of the word 'Gruß' in L2 texts
DE ↘ Orthographical errors related to the word 'grüßen'
DE ↘ Examples of use for the word 'fahren' in complex predicates (e. g. after modal verbs)
DE ↘ Grammatical errors related to all forms of 'warten'
CZ ↘ Case errors with Czech nouns after the preposition 'na'
CZ ↘ Case errors in texts of German learners of Czech
CZ ↘ Use of the structure 'mít rád'
IT ↘ Mood errors in texts of learners of Italian

Using the metadata, you can restrict queries to a specific sub-corpus, for example:

DE ↘ Case errors in texts of learners at B2 level (CEFR fair rating)
CZ ↘ Aspect errors of learners with German L1 at B1 level (L1, CEFR fair rating)
IT ↘ Mood errors in texts of learners at B1 level (CEFR fair rating)
IT ↘ Use of communicative phraseologisms in application letters (task topic)

The ANNIS User Guide offers a thorough introduction to using the ANNIS interface. You can also refer to the ANNIS help section under ↘ Help/Tutorial. For explanations of the annotation layers, please go to Search and help.