Difference between revisions of "Lemmatization"
Revision as of 10:22, 21 May 2010
General principles
The words in the corpus all occur with a corresponding lemma, in the form: (POSTAG word-lemma). The lemma is always all lowercase, even for proper names.
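As an illustration of the format described above, here is a minimal Python sketch that splits one token into its tag, word and lemma and checks that the lemma is lowercase. The function name `parse_token` is hypothetical, and the sketch assumes that the lemma itself never contains a hyphen, so the split is made at the last hyphen; the corpus documentation does not state this explicitly.

```python
import re

# Assumed token shape: "(POSTAG word-lemma)". The tag and the word-lemma pair
# are separated by whitespace; the lemma is taken to follow the last hyphen.
TOKEN_RE = re.compile(r"\((\S+)\s+(\S+)-([^\s()-]+)\)")


def parse_token(text):
    """Return (tag, word, lemma) for one "(POSTAG word-lemma)" token."""
    match = TOKEN_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a (POSTAG word-lemma) token: {text!r}")
    tag, word, lemma = match.groups()
    # The guidelines say the lemma is always lowercase, even for proper names.
    if lemma != lemma.lower():
        raise ValueError(f"lemma is not lowercase: {lemma!r}")
    return tag, word, lemma
```

For example, `parse_token("(NPR Moises-móses)")` yields the tag `NPR`, the word `Moises` and the lemma `móses`, while a token whose lemma contains an uppercase letter is rejected.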
In general, the lemma for the word is the dictionary citation form for that word. However, there are some systematic differences between our analysis and traditional Icelandic lexicography, which will be listed here or under Treatment of individual words.
For the Old(er) Icelandic texts in the corpus, some of the words have modernized lemmas, i.e., the lemma for the corresponding word in modern Icelandic is used rather than the citation form found in Old Icelandic / Old Norse dictionaries. This is done primarily when the word has a form that might be confusing to speakers of modern Icelandic.
The systematically modernized lemmas are below:
(ADV þ$)
(NEG $eygi-ekki)
(NEG eigi-ekki)
(NEG ei-ekki)
(Q ekki-ekkert)
(Q nekkvar-nokkur)
(P fyr-fyrir)
(P fyrr-fyrir)
(ADJ-A átta-áttundi), i.e. an old ordinal number which has the same form as the modern cardinal number
(eg-ég)
Proper names are systematically modernized, if possible:
(NPR Moises-móses) (NPR Herodes-heródes)
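The modernized lemmas listed above could be kept in a simple lookup table. The following Python sketch is only an illustration: the names `MODERNIZED` and `modernized_lemma` are hypothetical, and lowercasing the attested form before lookup is an assumption made here so that proper names match. The entries (ADV þ$) and (eg-ég) are left out because the list gives them without a lemma and without a tag, respectively.

```python
# Pairs copied from the lists above as (POS tag, attested form) -> lemma.
# "(ADV þ$)" and "(eg-ég)" are omitted: the source gives them incompletely.
MODERNIZED = {
    ("NEG", "$eygi"): "ekki",
    ("NEG", "eigi"): "ekki",
    ("NEG", "ei"): "ekki",
    ("Q", "ekki"): "ekkert",
    ("Q", "nekkvar"): "nokkur",
    ("P", "fyr"): "fyrir",
    ("P", "fyrr"): "fyrir",
    ("ADJ-A", "átta"): "áttundi",
    ("NPR", "moises"): "móses",
    ("NPR", "herodes"): "heródes",
}


def modernized_lemma(tag, form):
    """Return the modernized lemma for (tag, form), or None if none is listed."""
    # Assumption: forms are matched case-insensitively.
    return MODERNIZED.get((tag, form.lower()))
```

Keying on the tag as well as the form matters here, because the same form can take different lemmas: ekki is lemmatized as ekkert when tagged Q, but has no entry of its own under NEG.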
-st middle-verbs
The lemma of an -st verb ends in -st if the meaning is clearly different from that of the corresponding verb without -st, if there is no corresponding verb without -st, or if the syntax of the -st verb is different, notably if it is a DAT-NOM verb:
Different meaning:
andast 'die' != anda 'breathe'
gerast 'happen' or 'become (intentionally)' != gera 'do' (note that the -st form is tagged VB*, not DO*)
reiðast 'get angry' != reiða 'transport (on a horse)'
skjótast 'move quickly' != skjóta 'shoot'
villast 'get lost' != villa 'mislead'
kannast 'be familiar with' != kanna 'explore'
þykjast 'pretend' != þykja 'think'
No without-st-verb:
heppnast (*heppna)
iðrast (*iðra)
leiðast (*leiða)
Different case pattern:
sýnast (DAT-NOM) != sýna (NOM-ACC)
finnast (DAT-NOM) != finna (NOM-ACC)
óast (NOM subject) != óa (ACC subject)
fyllast (NOM subject) != fylla (ACC subject when one-argumental, NOM-ACC when monotransitive)
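The three criteria above can be sketched as a small decision function. This Python sketch is illustrative only: the function name `st_lemma` and the three sets are hypothetical and contain just the example verbs from this section. The fallback of dropping -st when none of the criteria applies is an assumption inferred from the rule, and it is only meant for infinitive forms.

```python
# Example verbs taken from the lists above; a real lexicon would be far larger.
DIFFERENT_MEANING = {"andast", "gerast", "reiðast", "skjótast",
                     "villast", "kannast", "þykjast"}
NO_BASE_VERB = {"heppnast", "iðrast", "leiðast"}
DIFFERENT_CASE = {"sýnast", "finnast", "óast", "fyllast"}

KEEPS_ST = DIFFERENT_MEANING | NO_BASE_VERB | DIFFERENT_CASE


def st_lemma(infinitive):
    """Return the lemma for an -st verb given in its infinitive form."""
    if not infinitive.endswith("st"):
        raise ValueError(f"not an -st verb: {infinitive!r}")
    if infinitive in KEEPS_ST:
        # One of the three criteria applies, so the lemma keeps -st.
        return infinitive
    # Assumption: otherwise the lemma is the verb without -st.
    return infinitive[:-2]
```

With these sets, `st_lemma("andast")` keeps the -st ending, whereas a verb found in none of the lists, such as kallast (used here purely as an illustration, not as a statement about how the corpus treats it), would be reduced to kalla.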
Individual words
HVORTVEGGJA, HVORTVEGGI, lemmatized as hvortveggja
Issues
The comparative of heilagur is usually helgari, not heilagri, in Íslensk hómilíubók
aldregi: -aldregi or -aldrei?
fullting: -fullting or -fulltingi?
líkamur, líkhamur 'body': -líkami or -líkamur/-líkhamur?
vor: -vor or -ég?
Jóan: -Jóan or -Jóhannes?
engi: -engi or -enginn (or even -einngi)?
sing. mánaður, pl. mánuður: -mánaður or -mánuður?
sétti: -sétti or -sjötti?
Marie (gen.): -marie or -María; or even LATIN?
VB ríta/VBN ritið: -ríta or -rita?
orðaslaug: what is this?