Difference between revisions of "Lemmatization"

From Icelandic Parsed Historical Corpus (IcePaHC)

Revision as of 20:32, 24 May 2010

General principles

The words in the corpus all occur with a corresponding lemma, in the form: (POSTAG word-lemma). The lemma is always all lowercase, even for proper names.
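As a rough illustration of this format, the sketch below splits a leaf such as (NEG eigi-ekki) into its tag, word, and lemma, and checks the lowercase convention. The names `Leaf`, `parse_leaf`, and `lemma_is_lowercase`, and the choice to split on the last hyphen, are assumptions made for this example and are not part of the corpus tools.

```python
import re
from typing import NamedTuple


class Leaf(NamedTuple):
    tag: str
    word: str
    lemma: str


# Hypothetical pattern: "(TAG word-lemma)" with one space after the tag.
LEAF_RE = re.compile(r"^\((\S+) (\S+)\)$")


def parse_leaf(text: str) -> Leaf:
    """Split a '(POSTAG word-lemma)' leaf into tag, word, and lemma."""
    match = LEAF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a leaf: {text!r}")
    tag, token = match.groups()
    # Assumption: the lemma is whatever follows the LAST hyphen.
    word, sep, lemma = token.rpartition("-")
    if not sep or not word or not lemma:
        raise ValueError(f"no word-lemma pair in: {text!r}")
    return Leaf(tag, word, lemma)


def lemma_is_lowercase(leaf: Leaf) -> bool:
    """Check the convention that the lemma is all lowercase."""
    return leaf.lemma == leaf.lemma.lower()
```

With this sketch, `parse_leaf("(NPR Moises-móses)")` yields the tag NPR, the word Moises, and the lemma móses, which passes the lowercase check even though the word itself is capitalized.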

In general, the lemma for the word is the dictionary citation form for that word. However, there are some systematic differences between our analysis and traditional Icelandic lexicography, which will be listed here or under Treatment of individual words.

For the Old(er) Icelandic texts in the corpus, some of the words have modernized lemmas, i.e., the lemma for the corresponding word in modern Icelandic is used rather than the citation form found in Old Icelandic / Old Norse dictionaries. This is done primarily when the word has a form that might be confusing to speakers of modern Icelandic.

The systematically modernized lemmas are below:

(ADV þ$) (NEG $eygi-ekki)

(NEG eigi-ekki)

(NEG ei-ekki)

(Q ekki-ekkert)

(Q nekkvar-nokkur)

(P fyr-fyrir)

(P fyrr-fyrir)

(ADJ-A átta-áttundi), i.e. an old ordinal number that has the same form as the modern cardinal number.

(eg-ég)

Proper names are systematically modernized, if possible:

(NPR Moises-móses)

(NPR Herodes-heródes)
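The modernized lemmas listed above amount to a small lookup table. A minimal sketch follows, assuming the table is keyed on the tag and word form exactly as they appear in the list; the names `MODERNIZED` and `modernized_lemma` are hypothetical. The entry (eg-ég) is left out because the list gives no tag for it.

```python
from typing import Optional

# Entries copied from the list above, keyed as (tag, word form).
MODERNIZED = {
    ("NEG", "$eygi"): "ekki",
    ("NEG", "eigi"): "ekki",
    ("NEG", "ei"): "ekki",
    ("Q", "ekki"): "ekkert",
    ("Q", "nekkvar"): "nokkur",
    ("P", "fyr"): "fyrir",
    ("P", "fyrr"): "fyrir",
    ("ADJ-A", "átta"): "áttundi",
    ("NPR", "Moises"): "móses",
    ("NPR", "Herodes"): "heródes",
}


def modernized_lemma(tag: str, word: str) -> Optional[str]:
    """Return the listed modernized lemma, or None if the pair is not listed."""
    return MODERNIZED.get((tag, word))
```

Keying on the tag as well as the word matters here: in the list, ekki is lemmatized as ekkert only under Q, so a lookup on the word form alone would be ambiguous.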

-st middle-verbs

The lemma of an -st verb ends in -st if the meaning is clearly different from that of the corresponding verb without -st, if there is no corresponding verb without -st, or if the syntax of the -st verb is different, notably if it is a DAT-NOM verb:

Different meaning:

andast 'die' != anda 'breathe'
gerast 'happen' or 'become (intentionally)' != gera 'do' (note that the -st form is tagged VB*, not DO*)
reiðast 'get angry' != reiða 'transport (on a horse)'
skjótast 'move quickly' != skjóta 'shoot'
villast 'get lost' != villa 'mislead'
kannast 'be familiar with' != kanna 'explore'
þykjast 'pretend' != þykja 'think'

No without-st-verb:

aðhyllast (*aðhylla)
heppnast (*heppna)
iðrast (*iðra)
leiðast (*leiða)

Different case pattern:

sýnast (DAT-NOM) != sýna (NOM-ACC)
finnast (DAT-NOM) != finna (NOM-ACC)
óast (NOM subject) != óa (ACC subject)
fyllast (NOM subject) != fylla (ACC subject when it takes a single argument, NOM-ACC when monotransitive)
venjast != venja
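The three criteria above can be sketched as a single decision function. This is only an illustration: the function name `st_verb_lemma` and its boolean parameters are hypothetical, the judgments themselves (meaning, existence of a base verb, syntax) still have to be made by the annotator, and stripping the final "st" to obtain the base infinitive is a simplifying assumption.

```python
def st_verb_lemma(
    st_infinitive: str,
    *,
    different_meaning: bool,
    has_base_verb: bool,
    different_syntax: bool,
) -> str:
    """Pick the lemma for an -st verb from the three criteria."""
    if not st_infinitive.endswith("st"):
        raise ValueError(f"not an -st verb: {st_infinitive!r}")
    # The lemma keeps -st if any one of the three criteria holds.
    keeps_st = different_meaning or not has_base_verb or different_syntax
    # Assumption: the base infinitive is the -st form minus the final "st".
    return st_infinitive if keeps_st else st_infinitive[:-2]
```

For example, andast keeps its -st because of the different meaning, heppnast because there is no verb without -st, and sýnast because of its DAT-NOM pattern. Only when none of the criteria applies does the function fall back to the form without -st.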

Individual words

HVORTVEGGJA, HVORTVEGGI, lemmatized as hvortveggja

Issues

The comparative of heilagur is usually helgari, not heilagri, in Íslensk hómilíubók

aldregi: aldregi or aldrei?

fullting: -fullting or -fulltingi?

líkamur, líkhamur 'body': -líkami or -líkamur/-líkhamur?

vor: -vor or -ég?

Jóan: -Jóan or -Jóhannes?

engi: -engi or -enginn (or even -einngi)?

sing. mánaður, pl. mánuður: -mánaður or -mánuður?

sétti: -sétti or -sjötti?

Marie (gen.): -marie or -María; or even LATIN?

VB ríta/VBN ritið: -ríta or -rita?

orðaslaug: what is this?

allmáttkur: allmáttkur or allmáttugur?