Tagging Historic Bulgarian texts: Experiments and Challenges
Петя Начева Осенова
Speaker
Professor
Sofia University St. Kliment Ohridski
Цветана Иванова Димитрова
Speaker
Senior Assistant
Institute for Bulgarian Language, Bulgarian Academy of Sciences
Keywords, summary
Bulgarian language, diachronic texts, tagging.
Abstract
The paper focuses on the automatic morphological analysis of a collection of Bulgarian text fragments excerpted from 17th-century texts (the so-called damaskins). Since, to our knowledge, no tagger for historic Bulgarian is currently freely available for the task, we employed a linguistic processing pipeline built for Modern Bulgarian: the BTB Processing Pipe, which includes a tokenizer for Bulgarian and a POS tagger trained on media texts and literature using the MATE Tool.
First, we ran the tool over the original texts, which produced a large number of errors, arising as early as the tokenization level and propagating to the POS level. We then normalized the texts so that: a) no diacritics were present; b) symbols that do not exist in the present-day alphabet were replaced with their ‘successors’ (such as о, ѡ, ѻ, ꙫ = о [о]; etc.); c) abbreviations of frequently written words and letters under titlo were expanded (бь = богь). We deleted all small ers (ь) in word endings, while within the word we replaced them with schwa (ъ).
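The normalization steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the order of the steps, and the sample word in the usage line are assumptions made for illustration. Only the о-variants and the abbreviation бь = богь come from the text; a real mapping would need many more entries.

```python
import re
import unicodedata

# Letter 'successors'. Only the o-variants are attested in the abstract;
# any further entries would be assumptions.
LETTER_SUCCESSORS = str.maketrans({"ѡ": "о", "ѻ": "о", "ꙫ": "о"})

# Abbreviations written under titlo. Only this one pair comes from the abstract.
ABBREVIATIONS = {"бь": "богь"}


def strip_diacritics(text: str) -> str:
    """Drop combining marks (accents, titlo) that remain after NFC composition.

    Composing first keeps letters of the modern alphabet such as й intact,
    even if the input stores them as a base letter plus a combining breve.
    """
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if unicodedata.category(ch) != "Mn")


def normalize_word(word: str) -> str:
    """Expand a known abbreviation, then apply the two rules for the small er."""
    word = ABBREVIATIONS.get(word, word)
    if word.endswith("ь"):
        word = word[:-1]            # delete the small er at the end of the word
    return word.replace("ь", "ъ")   # word-internal small er becomes ъ


def normalize(text: str) -> str:
    """Hypothetical end-to-end normalization of a lowercase text fragment."""
    text = strip_diacritics(text).translate(LETTER_SUCCESSORS)
    return re.sub(r"\w+", lambda m: normalize_word(m.group()), text)


print(normalize("ѡгьнь бь"))  # prints "огън бог"
```

The sketch handles lowercase input only. Abbreviations are expanded before the small ers are treated, so that the final ь of an expanded form such as богь is removed by the same rule as any other word-final ь.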
We performed a manual error analysis to see whether normalization improved the automatic analyses. The tagger uses some constraints, and its accuracy is 0.8494 when all analyses are considered.
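The abstract does not state how the accuracy figure was computed. As a rough illustration, the sketch below assumes plain token-level accuracy: the share of tokens whose predicted tag matches the manually checked tag. The function name and the toy tags are hypothetical and are not taken from the BTB tagset or from the authors' data.

```python
def token_accuracy(gold_tags: list[str], predicted_tags: list[str]) -> float:
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Toy example with made-up tags: three of four tokens are tagged correctly.
gold = ["N", "V", "P", "C"]
predicted = ["N", "V", "P", "A"]
print(token_accuracy(gold, predicted))  # prints 0.75
```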
As expected, errors occur mostly with proper nouns, imperatives, and case-inflected nouns and adjectives, while the functional elements of the closed POS classes are recognized very well: conjunctions, subjunctions, prepositions, invariant relativisers (що, щото, дето), and pronouns.