Sources of Linguistic Knowledge and Grammar Writing Facilities

For the construction of a grammar that parses compound verb forms, the BulTreeBank project provides a special-purpose corpus of one million word tokens, drawn from newspapers and stored as XML documents with TEI-conformant markup at the paragraph level. The texts are processed by a morphological analyzer and then manually disambiguated using the constraint system in CLaRK (Simov et al. 2002a). The electronic lexicon (Popov et al. 1998) used for morphosyntactic analysis contains entries for single words only, so its information about verb tense, mood, and voice is limited to what single verb forms express.

The encoded information covers three verb tenses (present, aorist, and imperfect), imperative forms for mood, and some special conditional forms of the auxiliary verb “sam” (‘to be’). Voice is represented by the passive participle ending in -n(a,o,i). The lexicon has no information, however, about the attachment of short personal and reflexive pronominals, or about negative and interrogative forms.

The grammar aims at the automatic identification of the simplex or complex verb forms that constitute the sentence predicate. Throughout this paper, the term “compound verb form” refers to a cluster of verbs consisting of more than one orthographic word. Such clusters combine a full-content verb with auxiliary verbs, short pronominals, or particles, in varying word orders.

The grammar operates within the CLaRK system environment (Simov et al. 2002a) and is a regular grammar, that is, one whose language is accepted by a finite-state automaton. Rules are applied in a cascade: the output of one set of rules becomes the input for the next set, and only the initial rules do not take the output of other rules as input. Grammar rules are of the type C -> R, where R is a regular expression and C is the category assigned to the pattern matched by R.
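The cascade of C -> R rules can be illustrated with a minimal Python sketch. It is not the CLaRK implementation: the tag names (NEG, AUX, PRO, PART, ADV), the categories CL and VERB, the two rules, and the helper functions are all hypothetical and chosen only for illustration. Each rule rewrites every span matched by R into a single item of category C, and the output of one stage is the input of the next. As a simplification, rules within a stage are applied one after another.

```python
import re
from typing import List, Tuple

Item = Tuple[str, List[str]]   # (category, words covered by the item)
Rule = Tuple[str, str]         # (C, R): category and regex over "<TAG>" symbols


def apply_rule(items: List[Item], category: str, pattern: str) -> List[Item]:
    """Replace every span of items matched by `pattern` with one item of `category`."""
    starts, ends, encoded = {}, {}, ""
    for i, (cat, _) in enumerate(items):
        starts[len(encoded)] = i          # char offset where item i begins
        encoded += f"<{cat}>"
        ends[len(encoded)] = i + 1        # char offset where item i ends
    out: List[Item] = []
    cursor = 0
    for m in re.finditer(pattern, encoded):
        if m.start() == m.end():
            continue                      # ignore empty matches
        i, j = starts[m.start()], ends[m.end()]
        out.extend(items[cursor:i])       # copy unmatched items unchanged
        words = [w for _, ws in items[i:j] for w in ws]
        out.append((category, words))     # matched span becomes category C
        cursor = j
    out.extend(items[cursor:])
    return out


def run_cascade(items: List[Item], stages: List[List[Rule]]) -> List[Item]:
    """Apply the stages in order; each stage consumes the previous stage's output."""
    for stage in stages:
        for category, pattern in stage:
            items = apply_rule(items, category, pattern)
    return items


# Hypothetical two-stage grammar (not the BulTreeBank rules).
STAGES: List[List[Rule]] = [
    [("CL", r"(?:<AUX>|<PRO>)+")],                # auxiliaries and short pronominals
    [("VERB", r"(?:<NEG>)?(?:<CL>)?<PART>")],     # optional negation and clitics + participle
]

# "ne sam go vidyal vchera", with hypothetical tags assigned by hand.
TOKENS: List[Item] = [
    ("NEG", ["ne"]), ("AUX", ["sam"]), ("PRO", ["go"]),
    ("PART", ["vidyal"]), ("ADV", ["vchera"]),
]

if __name__ == "__main__":
    print(run_cascade(TOKENS, STAGES))
```

In this sketch the second-stage rule refers to the category CL produced by the first stage, which is what makes the grammar a cascade rather than a flat rule set.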
