Download

Revision as of 14:18, 6 January 2011 by Anton (Download Version 0.2, (LGPL))


Introduction

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that aims to construct a diachronic corpus with samples of written Icelandic from every period from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. In the historical texts, the spelling is modernized with respect to phonological change.

Download Version 0.3, (LGPL)

To get access to the current version of the Icelandic Parsed Historical Corpus (IcePaHC), download the following zip file, which contains the raw data of the corpus in labeled bracketing format. Since this is an early preview version, you can expect to find some uncorrected mistakes. Please let us know about them so they can be corrected before our next release.

Current release

Previous releases

The corpus, as well as the software developed as part of the IcePaHC project, is released under the LGPL license to ensure compatibility with other LGPL-licensed NLP tools, notably the IceNLP toolkit, which is used extensively in the development of the corpus.

The corpus is free as in beer and as in speech, and there is no registration wall. We recommend citing the latest released version when using the corpus for research, so that results can be replicated. The most up-to-date version and information on the current state of development are available in our version control repository on GitHub.

Getting started using the corpus

Because the corpus uses the labeled bracketing format, it is compatible with programs that read such annotation. We recommend the CorpusSearch program developed by Beth Randall at UPenn. If you have copied the corpus to the directory "/home/chomsky/icepahc" and saved the CorpusSearch jar file in "/home/chomsky/corpussearch", you can run a command like the following to search the corpus with a query stored in a text file named datsubj.q:

java -classpath /home/chomsky/corpussearch/CS_2.002.75.jar csearch/CorpusSearch datsubj.q /home/chomsky/icepahc/*.psd

Let us assume that datsubj.q is a query that picks out all dative subjects. The file could look like the following:

node: IP*

query: (IP* idoms NP-SBJ) AND (NP-SBJ idoms *-D)

If you run the command above with such a query file, CorpusSearch writes an output file called datsubj.out containing all sentences in the corpus that have dative subjects. See the CorpusSearch documentation and the annotation guidelines for the corpus to find out how to do more.
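To make the query logic concrete, here is a minimal Python sketch of what the datsubj.q query expresses: find every IP* node that immediately dominates an NP-SBJ node which in turn immediately dominates a node whose label ends in -D. It is not a replacement for CorpusSearch and does not reproduce its full matching rules (for example its handling of indices or traces). The function names and the sample tree are invented for illustration; the tree is a simplified example and is not taken from the corpus.

```python
import fnmatch


def parse(text):
    """Parse one labeled-bracketing tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        # The outermost wrapper of a tree may have no label of its own.
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        else:
            label = ""
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield the tree itself and every labeled node below it."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def has_dative_subject(tree):
    """True if tree is IP*, has an NP-SBJ child, and that child has a *-D child."""
    label, children = tree
    if not fnmatch.fnmatchcase(label, "IP*"):
        return False
    for child in children:
        if isinstance(child, tuple) and child[0] == "NP-SBJ":
            for grandchild in child[1]:
                if isinstance(grandchild, tuple) and fnmatch.fnmatchcase(grandchild[0], "*-D"):
                    return True
    return False


# Hypothetical, simplified tree (not from the corpus): "honum líkaði maturinn".
sample = "( (IP-MAT (NP-SBJ (PRO-D honum)) (VBDI líkaði) (NP-OB1 (N-N maturinn))) (ID EXAMPLE.1))"
matches = [t[0] for t in subtrees(parse(sample)) if has_dative_subject(t)]
print(matches)  # prints ['IP-MAT']
```

The sketch only handles a single tree held in a string; for real work on the .psd files, CorpusSearch remains the recommended tool.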

The command can be shortened, for example by creating an alias, but how to do this differs between operating systems. See the CorpusSearch getting-started documentation for more information.
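One alternative to a shell alias that behaves the same on different operating systems is a small wrapper script. The Python sketch below assembles the same java command as shown earlier and expands the *.psd pattern itself, so it does not rely on the shell to do so. The paths and the jar file name are the ones from the example above and are assumptions about your installation; the function names build_command and run_query are made up for this illustration.

```python
import glob
import os
import subprocess

# Paths from the example above; adjust them to your own installation.
JAR = "/home/chomsky/corpussearch/CS_2.002.75.jar"
CORPUS_DIR = "/home/chomsky/icepahc"


def build_command(query_file, jar=JAR, corpus_dir=CORPUS_DIR):
    """Return the CorpusSearch command line as a list of arguments."""
    # Expand *.psd here instead of relying on the shell to do it.
    psd_files = sorted(glob.glob(os.path.join(corpus_dir, "*.psd")))
    return ["java", "-classpath", jar, "csearch/CorpusSearch", query_file] + psd_files


def run_query(query_file):
    """Run CorpusSearch on the whole corpus and return its exit code."""
    return subprocess.call(build_command(query_file))
```

With this in place, calling run_query("datsubj.q") from Python runs the same search as the full command above.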

Texts included in Version 0.2

  • 4585 words from The First Grammatical Treatise (entire text) (12th century)
  • 8179 words from Íslensk hómilíubók (Icelandic book of homilies) (12th century)
  • 3459 words from Egils saga (theta fragment) (13th century)
  • 22719 words from Sturlunga saga (13th century)
  • 20683 words from the New Testament's Gospel of John (1540)
  • 16421 words from the New Testament's Acts (1540)
  • 4521 words from Jón Indíafari's travelogue (1661)
  • 22097 words from Jón Steingrímsson's biography (1791)
  • 17837 words from Piltur og stúlka (novel by Jón Thoroddsen) (1850)

Total number of words: 120355

Citation for the Version 0.2 release (of October 1st 2010)

Wallenberg, Joel, Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2010. 
Icelandic Parsed Historical Corpus (IcePaHC). 
Version 0.2. http://www.linguist.is/icelandic_treebank

Treebank team