We are very pleased to announce that version 0.9 of the Icelandic Parsed Historical Corpus (IcePaHC) is now available for free download.
The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download
The corpus is a treebank of over 1 million words, annotated with a full phrase structure parse and hand-corrected, using an adaptation of the annotation scheme of the Penn Treebank and the Penn parsed corpora of historical English (http://www.ling.upenn.edu/hist-corpora/). Note that this release contains all of the text for version 1.0, but some minor corrections remain to be made.
The corpus contains:
- 1 002 361 words in total, consisting of samples of roughly 100 000 words from each century from the 12th to the beginning of the 21st century.
- A phrase structure parse, part-of-speech tags, and lemmata for every sentence.
- The entire parse, part-of-speech tagging, and lemmata have been *hand-corrected*.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the corpus for research and/or profit with appropriate citation.
The corpus is distributed as raw UTF-8 data in labeled bracketing format and it is therefore compatible with various existing programs, including CorpusSearch (http://corpussearch.sourceforge.net/).
A plain text version without markup and a set of info files containing philological information accompany the corpus download.
The entire corpus may be downloaded as a plain text version or, for ease of searching, with a platform-independent GUI or a Windows-compatible GUI.
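Because the data is plain UTF-8 text in labeled bracketing format, it can also be read with a few lines of custom code. The following is only a minimal sketch of such a reader, not part of the corpus distribution: the sample tree, its labels, and the function names `parse_bracketed` and `leaves` are hypothetical and chosen for illustration, and real corpus files may use conventions (for example additional wrapper brackets or ID nodes) that this sketch does not handle.

```python
import re

# A token is "(", ")", or any run of characters that is neither
# whitespace nor a bracket (this keeps non-ASCII letters intact).
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested lists: [label, child, ...]."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1  # consume "("
        node = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ")"
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return tree


def leaves(node):
    """Yield (tag, word) pairs for terminal nodes of the form [tag, word]."""
    if len(node) == 2 and all(isinstance(part, str) for part in node):
        yield (node[0], node[1])
        return
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)


# Hypothetical toy tree; the labels are illustrative, not quoted from the corpus.
SAMPLE = "(IP-MAT (NP-SBJ (PRO hún)) (VBD sá) (NP-OB1 (N hest)))"

tree = parse_bracketed(SAMPLE)
tagged_words = list(leaves(tree))
print(tagged_words)
```

For structural queries over the whole corpus, a dedicated tool such as CorpusSearch is likely the better choice; a reader like this is mainly useful for quick counts or for converting the trees into another representation.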
Further information on the annotation guidelines and project organization can be found on the project wiki:
www.linguist.is/icelandic_treebank/
Joel C. Wallenberg
Anton Karl Ingason
Einar Freyr Sigurðsson
Eiríkur Rögnvaldsson
University of Iceland
We were grateful to receive support for this project through the following grants:
Icelandic Research Fund (RANNÍS), grant nr. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".
U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".
University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant "Icelandic Diachronic Treebank" (Sögulegur íslenskur trjábanki).



