Anton Karl Ingason

IcePaHC 0.9. 1 million words of syntactically parsed (hand-corrected) Icelandic

We are very pleased to announce that version 0.9 of the Icelandic Parsed Historical Corpus (IcePaHC) is now available for free download.

The corpus can be downloaded from:
www.linguist.is/icelandic_treebank/Download

The corpus is a treebank of over 1 million words, annotated with a full phrase structure parse and hand-corrected, using an adaptation of the annotation scheme used by the Penn Treebank and the Penn parsed corpora of historical English (http://www.ling.upenn.edu/hist-corpora/). Note that this release contains all of the text for version 1.0, but some minor corrections remain to be finished.

The corpus contains:

- 1 002 361 words total, consisting of ~100 000-word samples from each century from the 12th to the beginning of the 21st century.
- Annotated with a phrase structure parse, part-of-speech-tagged, and lemmatized.
- The entire parse, POS tagging, and lemmata for every sentence have been *hand-corrected*.
- Text samples are balanced for genre within each century.
- LGPL license: You are free to copy, modify and redistribute the corpus for research and/or profit with appropriate citation.

The corpus is distributed as raw UTF-8 data in labeled bracketing format and it is therefore compatible with various existing programs, including CorpusSearch (http://corpussearch.sourceforge.net/).
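For readers who want to process the files without a dedicated tool, labeled bracketing can be read with a few lines of code. The following is a minimal sketch in Python, not part of the corpus distribution. The sample tree, its tags, and the function names `parse_bracketed` and `leaves` are invented for illustration, and the actual corpus files may wrap each tree in additional material (for example ID nodes) that this sketch does not handle.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other
# non-whitespace characters (a label or a word).
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_bracketed(text):
    """Parse one labeled-bracketing tree into nested Python lists."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume the closing bracket
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(node):
    """Yield (tag, word) pairs for every terminal node, left to right."""
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        yield (node[0], node[1])
        return
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)


# Hypothetical example tree, NOT taken from the corpus.
SAMPLE = "(IP-MAT (NP-SBJ (PRO-N hann)) (VBDI kom) (. .))"

if __name__ == "__main__":
    for tag, word in leaves(parse_bracketed(SAMPLE)):
        print(tag, word)
```

Nested lists keep the sketch dependency-free. For real queries over the corpus, a purpose-built tool such as CorpusSearch is likely the better choice.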

A plain text version without markup and a set of info files containing philological information accompany the corpus download.

The entire corpus may be downloaded as plain text, or bundled with either a platform-independent GUI or a Windows-compatible GUI for ease of searching.

Further information on the annotation guidelines and project organization can be found on the project wiki:
www.linguist.is/icelandic_treebank/


Joel C. Wallenberg
Anton Karl Ingason
Einar Freyr Sigurðsson
Eiríkur Rögnvaldsson
University of Iceland

We were grateful to receive support for this project through the following grants:

Icelandic Research Fund (RANNÍS), grant no. 090662011, "Viable Language Technology beyond English – Icelandic as a test case".

U.S. National Science Foundation (NSF) International Research Fellowship Program (IRFP), grant #OISE-0853114, "Evolution of Language Systems: a comparative study of grammatical change in Icelandic and English".

University of Iceland Research Fund (Rannsóknasjóður Háskóla Íslands), grant "Icelandic Diachronic Treebank" (Sögulegur íslenskur trjábanki).

Last Updated on Monday, 29 August 2011 14:04