Lemmatization
General principles
The words in the corpus all occur with a corresponding lemma, in the form: (POSTAG word-lemma). The lemma is always all lowercase, even for proper names.
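As a minimal sketch of how such a leaf could be read, the following Python function splits a leaf string into tag, word and lemma. The name `parse_leaf`, the regular expression, and the choice to split at the last hyphen are assumptions made for illustration and are not part of the corpus tools. A leaf without a lemma, such as the first half of a split word, is returned with `None` as its lemma.

```python
import re

# Hypothetical helper, not part of the corpus tools.
# A leaf is assumed to look like "(POSTAG word-lemma)".
LEAF = re.compile(r"^\((\S+)\s+(\S+)\)$")

def parse_leaf(leaf):
    """Split a leaf like '(NPR Moises-móses)' into (tag, word, lemma)."""
    match = LEAF.match(leaf.strip())
    if match is None:
        raise ValueError(f"not a leaf: {leaf!r}")
    tag, token = match.groups()
    # Assumption: the lemma is whatever follows the last hyphen in the token.
    word, sep, lemma = token.rpartition("-")
    if not sep:
        # No hyphen, so no lemma is attached, e.g. "(ADV þ$)".
        return tag, token, None
    return tag, word, lemma

def lemma_is_lowercase(lemma):
    """Check the rule that lemmas are all lowercase, even for proper names."""
    return lemma == lemma.lower()
```

Because the split is made at the last hyphen, a hyphen inside the tag (as in ADJ-A) does not interfere with finding the lemma.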
In general, the lemma for the word is the dictionary citation form for that word. However, there are some systematic differences between our analysis and traditional Icelandic lexicography, which will be listed here or under Treatment of individual words.
For the Old(er) Icelandic texts in the corpus, some of the words have modernized lemmas, i.e., the lemma for the corresponding word in modern Icelandic is used rather than the citation form found in Old Icelandic / Old Norse dictionaries. This is done primarily when the word has a form that might be confusing to speakers of modern Icelandic.
The systematically modernized lemmas are below:
(ADV þ$) (NEG $eygi-ekki)
(NEG eigi-ekki)
(NEG ei-ekki)
(Q ekki-ekkert)
(Q nekkvar-nokkur) / (Q nekkver-nokkur)
(P fyr-fyrir)
(P fyrr-fyrir)
(P und-undir)
(P viður-við)
(P meður-með)
(ADJ-A átta-áttundi), i.e. an old ordinal number which has the same form as the modern cardinal number
(PRO eg-ég)
(PRO ér-þú)
(WADV hverninn-hvernig)
(ALSO einninn-einnig)
Proper names are systematically modernized, if possible:
(NPR Moises-móses) (NPR Herodes-heródes)
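The modernized lemmas can be thought of as a lookup table keyed on tag and word form. The sketch below holds only a subset of the pairs listed above. The dictionary layout and the name `modern_lemma` are assumptions for illustration, and the function returns `None` for anything not in the table rather than guessing a lemma.

```python
# Illustrative subset of the systematically modernized lemmas.
# Keys are (tag, lowercased word form); values are the modern lemma.
# The layout and the function name are assumptions for this sketch.
MODERNIZED = {
    ("NEG", "eigi"): "ekki",
    ("NEG", "ei"): "ekki",
    ("Q", "nekkvar"): "nokkur",
    ("Q", "nekkver"): "nokkur",
    ("P", "fyr"): "fyrir",
    ("P", "und"): "undir",
    ("P", "viður"): "við",
    ("P", "meður"): "með",
    ("PRO", "eg"): "ég",
    ("WADV", "hverninn"): "hvernig",
    ("NPR", "moises"): "móses",
    ("NPR", "herodes"): "heródes",
}

def modern_lemma(tag, word):
    """Return the modernized lemma for a listed form, or None if unlisted."""
    return MODERNIZED.get((tag, word.lower()))
```

Keying on the tag as well as the word keeps homographs with different tags apart, for example ekki as NEG and ekki as Q.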
-st middle-verbs
The lemma of an -st verb ends in -st if the meaning is clearly different from that of the corresponding verb without -st, if there is no corresponding verb without -st, or if the syntax of the -st verb is different, notably if it is a DAT-NOM verb:
Different meaning:
andast 'die' != anda 'breathe'
eignast != eigna
gerast 'happen' or 'become (intentionally)' != gera 'do' (note that the -st form is tagged VB*, not DO*)
reiðast 'get angry' != reiða 'transport (on a horse)'
skjótast 'move quickly' != skjóta 'shoot'
villast 'get lost' != villa 'mislead'
kannast 'be familiar with' != kanna 'explore'
látast 'pretend', 'die' != láta 'let'
þykjast 'pretend' != þykja 'think'
No without-st-verb:
aðhyllast (*aðhylla)
heppnast (*heppna)
iðrast (*iðra)
leiðast (*leiða)
Different case pattern:
sýnast (DAT-NOM) != sýna (NOM-ACC)
finnast (DAT-NOM) != finna (NOM-ACC)
óast (NOM subject) != óa (ACC subject)
fyllast (NOM subject) != fylla (ACC subject when it takes one argument, NOM-ACC when monotransitive)
venjast != venja
verjast (NOM-GEN) != verja (NOM-ACC)
setjast != setja
berast != bera
undrast (NOM-ACC) != undra (ACC-ACC)
minnast (NOM-GEN) != minna (NOM-ACC-PP)
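The three criteria for -st verbs can be summarized as a small decision function. This is only a sketch: the name `st_verb_lemma` and its parameters are hypothetical, the two flags stand for judgments the annotator has to make, and the fallback to the verb without -st when no criterion applies is an inference from the rule rather than something stated explicitly.

```python
def st_verb_lemma(st_form, base_form, *, different_meaning=False,
                  different_case_pattern=False):
    """Choose the lemma for an -st verb.

    st_form   -- citation form with -st, e.g. "andast"
    base_form -- corresponding verb without -st, or None if there is none
    The keyword flags are annotator judgments, not computed here.
    """
    if base_form is None:
        # No verb without -st exists, e.g. heppnast (*heppna).
        return st_form
    if different_meaning or different_case_pattern:
        # e.g. andast 'die' vs. anda 'breathe'; sýnast (DAT-NOM) vs. sýna.
        return st_form
    # Assumption: otherwise the verb is lemmatized without -st.
    return base_form
```

For instance, `st_verb_lemma("sýnast", "sýna", different_case_pattern=True)` keeps the -st lemma, as does `st_verb_lemma("heppnast", None)`.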
Pronouns
Gender only matters for personal pronouns.
PRO-N hann-hann
PRO-N hún-hún
PRO-A hana-hún
PRO-N það-það
PRO-D því-það
Number is not a dividing factor (so dual and plural forms are lemmatized as the singular).
vér, oss-ég
ér, þér-þú
þeir-hann
Determiners and quantifiers are divided only by nature: the default gender is masculine, the default case nominative, and the default number singular.
D-N sú-sá
D-N það-sá
Q-N engar-enginn
The possessive pronouns sinn, minn and þinn get their own lemmas (sinn,minn,þinn).
Plural possessive pronouns existed in Old Icelandic. They are forms of the personal pronouns, but they get their own lemmas if they inflect with the noun (as they did in Old Icelandic).
minn, mitt, mín-minn
vár, vor, ór, vort-vor
okkar, okkrum-okkar
þinn-þinn
yðvar, yðar, yðrum-yðar
ykkar, ykkrum-ykkar
If they do not inflect with the noun, they are just genitive forms of the personal pronouns and are lemmatized as such.
hans-hann
hennar-hún
þess-það
þeirra-hann/hún/það
okkar-ég
ykkar-þú
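The contrast between inflecting possessives and plain genitives can be sketched with two lookup tables, selected by whether the form inflects with the noun. The table names, the function `possessive_lemma` and the boolean parameter are assumptions for illustration. The form þeirra is left out because its lemma depends on gender, which this sketch does not model.

```python
# Possessives that inflect with the noun get their own lemma.
POSSESSIVE = {
    "minn": "minn", "mitt": "minn", "mín": "minn",
    "vár": "vor", "vor": "vor", "ór": "vor", "vort": "vor",
    "okkar": "okkar", "okkrum": "okkar",
    "þinn": "þinn",
    "yðvar": "yðar", "yðar": "yðar", "yðrum": "yðar",
    "ykkar": "ykkar", "ykkrum": "ykkar",
}

# Genitive forms that do not inflect with the noun are lemmatized
# as the personal pronoun.
GENITIVE = {
    "hans": "hann", "hennar": "hún", "þess": "það",
    "okkar": "ég", "ykkar": "þú",
}

def possessive_lemma(form, inflects_with_noun):
    """Return the lemma for a listed form, or None if it is not listed."""
    table = POSSESSIVE if inflects_with_noun else GENITIVE
    return table.get(form.lower())
```

The same surface form can thus receive two lemmas: `possessive_lemma("okkar", True)` gives okkar, while `possessive_lemma("okkar", False)` gives ég.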
The personal pronoun það is easily confused with the determiner það (lemmatized sá). If there is doubt, PRO is the default.
Individual words
HVORTVEGGJA, HVORTVEGGI: lemmatized as hvortveggja. When it is written as two words, TVEGGJA is lemmatized as tveggi.
Q manngi-manngi
Q engi-enginn
WPRO hvorgi-hvorgi
WPRO hvergi-hvergi
Q hvatki-hvergi
Issues
HÉÐAN Í FRÁ, ÞAR ÚT Í FRÁ
The comparative of heilagur is usually helgari, not heilagri, in Íslensk hómilíubók
aldregi: aldregi or aldrei?
fullting: -fullting or -fulltingi?
líkamur, líkhamur 'body': -líkami or -líkamur/-líkhamur?
vor: -vor or -ég?
Jóan: -Jóan or -Jóhannes?
engi: -engi or -enginn (or even -einngi)?
sing. mánaður, pl. mánuður: -mánaður or -mánuður?
sétti: -sétti or -sjötti?
Marie (gen.): -marie or -María; or even LATIN?
VB ríta/VBN ritið: -ríta or -rita?
orðaslaug: what is this?
allmáttkur: -allmáttkur or -allmáttugur?
ritka and séka: how to express the negative and the pronoun? In Firstgrammar2.psd. -HO
sömnuðu: -samna or -safna?