Workshops | Learner Corpus Research Conference 2026

Workshop 1 (3×45 minutes)

Building and Using Learner Corpora in TEITOK (Maarten Janssen)

(Wednesday 16 September 2026 – 9.15–11.30)

TEITOK is an online corpus environment that has been developed with a partial focus on learner corpora from the start, since it was built in the context of the COPLE2 learner corpus for Portuguese. It has since been used for a variety of learner corpora, including CORTEGAL (Galician), CroLTeC (Croatian), CzeSL (Czech), EFFE-On (Portuguese), ESAM (Latvian/Lithuanian), LLC (Lithuanian), PEAPL2 (Portuguese), and PoLKo (Polish).

Learner corpora in TEITOK can contain both written and spoken data. Spoken data can be aligned with the audio, which makes it possible to listen directly to the sound fragments returned by a search, and written data can be aligned with a facsimile image of the original manuscript, which makes it possible to see immediately exactly what the learner wrote. Corpus documents in TEITOK are stored in the TEI/XML format, which allows detailed information about both written and spoken data to be stored, marking corrections, deletions, repetitions, truncations, and other phenomena that might occur in the data. Annotations can be added on top of the core transcription at various levels to mark interesting linguistic phenomena in the data. There is a multi-layered option to annotate "errors" (a contested term, used here for deviations from the native standard): it is possible not only to mark orthographic errors, but also to provide a "corrected" form, giving a detailed view of what exactly was written incorrectly. Morphological, syntactic, and lexical errors can be annotated independently, and all levels can in principle be additionally annotated with POS and lemma data. It is also possible to add explicit tags marking any kind of interesting phenomenon in the transcription, or to provide a fully native version of an entire sentence.

In addition to the detailed transcription-level annotations, corpus documents can be given detailed document-level metadata, such as the age and sex of the students, their native language, which other foreign languages they speak, their proficiency level in the language they are learning, when they started learning it, etc. For learner corpora in particular, such metadata are often crucial for contrastive analysis between speakers of different native languages, different age or proficiency groups, etc. All of this can be done using the query language that TEITOK provides, either to quickly obtain statistical data from the corpus or to pinpoint interesting phenomena to study in more detail in context.

In this workshop we will explore learner corpora in TEITOK from two perspectives: on the one hand, how the existing corpora in TEITOK can be used to do learner corpus research, and on the other hand, how to build new learner corpora in TEITOK.

Workshop 2 (2×60 minutes)

The Czech National Corpus and its corpus tools (Mgr. Michal Křen, Ph.D.)

(Wednesday 16 September 2026 – 12.30–14.30)

The talk will give an overview of the Czech National Corpus (CNC) research infrastructure and its services for international users. At the European level, the CNC is anchored within the CLARIN network, which will be briefly presented. Although the CNC is national in character and its services naturally concentrate on the Czech language, it also has much to offer researchers working on languages other than Czech. In particular, the newest release of InterCorp, a large multilingual parallel corpus, will be presented. It consists of a fiction core with manually checked alignments, supplemented by automatically processed text collections. InterCorp covers 61 languages, most of which are annotated using the Universal Dependencies scheme, which is comparable across languages and also includes syntax. The talk will show how this can be utilised through the KonText query interface, which is designed to work with various corpus types, including parallel corpora such as InterCorp. Finally, the talk will introduce other language-independent user applications developed at the CNC, namely KWords and Calc.

Workshop 3 (2×60 minutes)

Using AI in corpus research (doc. Jiří Milička, Ph.D.)

(Wednesday 16 September 2026 – 14.45–16.45)

This hands-on workshop will present AI as a tool for corpus linguistics and AI as a subject of corpus-linguistic inquiry.

In the first part, participants will work directly with Claude Code connected to a live multilingual corpus infrastructure, extracting linguistic patterns through natural-language interaction and testing just how far AI can go in turning raw corpus data into a publishable piece of research.

The second part turns the lens around. We examine two AI-generated corpora — AI Brown and AI Koditex — designed to be comparable with their human-written counterparts in size, genre, and structure. Using standard corpus methods, we ask: what does AI-generated language actually look like?

By the end, we hope to collapse the distinction between the two halves entirely: if AI can mine a corpus and populate one, what does that mean for how we do corpus linguistics?

No prior programming experience required. Curiosity about the future of the field is mandatory.