The corpus is syntactically parsed and annotated for full phrase structure, using an adaptation of the annotation scheme of the Penn parsed corpora of historical English and other corpora in that tradition (see links from website). It contains ca. 120,000 words from six centuries (12th, 13th, 16th, 17th, 18th and 19th). Please note that this is a small portion of the goal for the completed corpus: ca. 1 million words from the 12th to 19th centuries.
The corpus is distributed as raw UTF-8 data in labeled bracketing format and is therefore compatible with various existing programs, including CorpusSearch.
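Because labeled bracketing is a plain nested-parenthesis notation, trees in this style can also be read with a few lines of general-purpose code. The following Python sketch is a minimal illustration, not part of the corpus tooling: the function names are hypothetical, and the sample tree and its tag names are invented for the example rather than taken from the corpus or its annotation guidelines.

```python
import re


def tokenize(text):
    """Split bracketed text into "(", ")" and label/word tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        items = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume the closing ")"
        return items

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Yield (tag, word) pairs for terminal nodes of the form [tag, word]."""
    if len(tree) == 2 and all(isinstance(item, str) for item in tree):
        yield tree[0], tree[1]
        return
    for child in tree:
        if isinstance(child, list):
            yield from leaves(child)


# Invented sample; real corpus trees carry more detail than this.
SAMPLE = "(IP-MAT (NP-SBJ (PRO hann)) (VBDI kom) (ADVP (ADV heim)))"

tree = parse(tokenize(SAMPLE))
words = [word for _, word in leaves(tree)]
print(words)
```

For real queries over the corpus, a dedicated tool such as CorpusSearch remains the more appropriate choice; a sketch like this is mainly useful for quick inspection or for converting trees to another format.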
The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download
Further information on the annotation guidelines and project organization can be found on the project wiki:
www.linguist.is/icelandic_treebank/
We hope that this release will result in feedback that allows us to improve the resource for upcoming versions. Updates are released every three months; the upcoming version 0.3 will be released on January 1st, 2011. Between releases, development can be tracked in our open repository on GitHub (http://github.com/antonkarl/icecorpus), but we encourage the use of released versions so that results can be replicated.
Texts included in Version 0.2:
4585 words from The First Grammatical Treatise (entire text) (12th century)
8179 words from Íslensk hómilíubók (Icelandic book of homilies) (12th century)
3459 words from Egils saga (theta fragment) (13th century)
22719 words from Sturlunga saga (13th century)
20683 words from the New Testament's Gospel of John (1540)
16421 words from the New Testament's Acts (1540)
4521 words from Jón Indíafari's travelogue (1661)
22097 words from Jón Steingrímsson's biography (1791)
17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)
Total number of words: 120355
Joel Wallenberg
Anton Karl Ingason
Einar Freyr Sigurðsson
Eiríkur Rögnvaldsson
University of Iceland
The project is funded by the following grants:
Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".
U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".



