Challenges of creating dialect corpora: The case of Torlak
Maja Miličević Petrović
Speaker
Docent
University of Belgrade
Teodora Vuković
Speaker
PhD student
University of Zurich
4-У
2019-03-19
15:20 - 15:40
Keywords, summary
South Slavic languages, Torlak dialect, dialectology, dialect corpora, linguistic annotation.
Abstract
This paper describes the creation of a digital corpus of the Torlak dialect, spoken in south-eastern Serbia as well as in parts of western Bulgaria and northern Macedonia. Torlak is linguistically interesting because it belongs to the Balkan Sprachbund and shares numerous features with Macedonian and Bulgarian in addition to Serbian: it has postpositive articles, lacks a verbal infinitive, and uses a simplified nominal case system, to name a few characteristics. It is also appealing because it spreads across administrative and standard-language borders and because it is socially stratified, i.e. used mostly by older and less educated speakers. The central aim of our corpus creation is thus to enable the study of linguistic, areal-linguistic and sociolinguistic phenomena related to Torlak.
The corpus is built from recordings of field interviews conducted in the areas of Serbia and Bulgaria where Torlak is spoken, based on Plotnikova's (2009) questionnaire. Data were collected at over 100 locations, yielding over 450 hours of material; 50 hours (350,000 words) have already been transcribed, and further recordings are currently being transcribed. We will present the material and the adopted transcription rules, and outline the next steps in the corpus-building process (already piloted on a smaller sample): lemmatisation, morphosyntactic annotation, metadata coding, and the resulting search possibilities. Particular attention will be paid to the fact that dialect corpora are a cross between spoken and non-standard language corpora and therefore present a double set of challenges, especially for linguistic annotation.
References
Плотникова, A. (2009) Материалы для этнолингвистического изучения балканославянского ареала. Москва: Институт славяноведения РАН.