Issues

From Icelandic Parsed Historical Corpus (IcePaHC)

Corrections

  • add missing case information to ADJ in piltur1
  • Remove possessive dollar signs from piltur1 since we don't use those anymore
  • Do NP-CMP thing
  • correct inconsistencies in "því að CP-ADV" and "af því að" CP-THT-PRN
  • sem here introduces CP-ADV, see SEM.
( (IP-MAT (IP-MAT-1 (BEDI var-vera)
		    (NP-SBJ (PRO-N hún-hún))
		    (ADVP (ADV því-því))
		    (HAN höfð-hafa)
		    (PP (P á-á)
			(NP (N-D baðstofugólfi-baðstofugólf))))
	  (CONJP (CONJ og-og)
		 (IP-MAT=1 (VAN gefin-gefa)
			   (NP-OB1 (N-A mjólk-mjólk))
			   (PP (P sem-sem)
			       (NP (N-D barni-barn)
				   (PP (P með-með)
				       (NP (N-D pípu-pípa)))))))
	  (. ,-,)))

Script

  •  ?VBN for VAN komið
  • When capital E, (PP (P ef-ef) (CP-ADV C 0) for CP-ADV C ef-ef (and more of the same, such as capital Þegar)
  • make nú-nú be ADVP-TMP by default
  • (ADJP (ADVR eins-eins)
  • Tag "neinn" as Q
  • project NP-POSs when needed, like for "minn"
  • (ALSO líka), attach to IP (no ADVP)
  • give "þó að" proper structure (not CP)
  • "sjálfur" is (almost) always NP-PRN
  • preserve case on (PRO ðu)
  • fix tag "VBI-MA2SP"
  • fix tag "D-PMG" for "hinna", should be "D-G" -BS/HO
  • fix tag "D-MSA" for "hinn", should be "D-A" -BS/HO
  • fix tag "D-NSN" for "hið", should be "D-N" -BS/HO
  • fix tag "D-FPN" for "hinar", should be "D-N" -BS/HO
  • fix tag "D-FSA" for "hina", should be "D-A" -BS/HO etc. for all forms of "hinn"
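The tag fixes for "hinn" listed above could be sketched as a single regular-expression rewrite. This is only an illustration, not the project's script: the function name `fix_hinn_tags` is hypothetical, and it assumes that the faulty tags are always "D-" plus three letters whose last letter is the case (N, A, D or G), and that leaves are written as (TAG form-lemma) with the lemma "hinn".

```python
import re

# Assumption: a faulty tag is "D-" + two feature letters + one case letter,
# and the leaf has the shape "(TAG form-hinn)".
_FAULTY_D = re.compile(r"\(D-[A-Z]{2}([NADG])(\s+\S+-hinn\))")


def fix_hinn_tags(tree_text: str) -> str:
    """Rewrite tags such as D-PMG or D-NSN to D-G or D-N for forms of "hinn"."""
    # Keep only the case letter (group 1) and the untouched word part (group 2).
    return _FAULTY_D.sub(r"(D-\1\2", tree_text)


if __name__ == "__main__":
    print(fix_hinn_tags("(NP (D-PMG hinna-hinn) (N-G manna-maður))"))
```

Leaves that already carry a two-part tag such as D-N do not match the pattern and are left unchanged.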

Sanity checks

  • Check if -SPE extension is missing in clauses dominated by other -SPE clauses (exception, -PRN)
  • (DONE in sanity checks) CP always doms IP-SUB and the other way around (neither can be missing)
  • Make sure that there is a trace where it must be (CP-QUE, CP-REL)
  • (DONE in sanity checks) One subject in IP-MAT/IP-SUB, not more, not less
  • (DONE in sanity checks) Subjects not dominated by other stuff than IPs, like no NP-VOC idoms NP-SBJ
  • Only use valid tags
  • (DONE in sanity checks) N is not sister of PRO that idoms -minn (needs NP-POS for minn)
  • check if "til að" (IP-INF) is IP-INF-PRP
  • check for case agreement (e.g. inside NPs and in conjunction structures)
  • check RP words
  • check that IP-IMP idoms an imperative verb (VBI ...)
  • check that sentence final punctuation tag is not ","
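Two of the checks above (IP-IMP must idom an imperative verb; the sentence-final punctuation tag must not be ",") could be sketched as follows. This is a minimal sketch, not the existing sanity-check code: the helper names are hypothetical, the reader assumes well-balanced trees with no parentheses inside words, the outer unlabelled wrapper "( ... )" is assumed to have been stripped already, and only tags starting with VBI are treated as imperative unless other prefixes are passed in.

```python
def parse(text):
    """Parse one labelled-bracket tree into nested lists: [label, child, ...].

    A leaf is [tag, "form-lemma"]. Assumes balanced brackets.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        out = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                out.append(node())
            else:
                out.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing bracket
        return out

    return node()


def label(n):
    return n[0]


def children(n):
    """Return the daughter nodes of n (not the word string of a leaf)."""
    return [c for c in n[1:] if isinstance(c, list)]


def imp_without_imperative(tree, imperative_prefixes=("VBI",)):
    """Return every IP-IMP node that does not immediately dominate an imperative verb."""
    bad = []
    stack = [tree]
    while stack:
        n = stack.pop()
        kids = children(n)
        if label(n).startswith("IP-IMP") and not any(
            label(k).startswith(imperative_prefixes) for k in kids
        ):
            bad.append(n)
        stack.extend(kids)
    return bad


def final_punct_is_comma(tree):
    """True if the last daughter of the root is tagged ","."""
    kids = children(tree)
    return bool(kids) and label(kids[-1]) == ","
```

Both functions only report suspicious trees; deciding how to correct them is left to the annotator.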

Semi-automatic checking

  • Pick out all typical subjunctive contexts and check verbs
  • check case assigned by verbs against a list of known verbs

Post-processing

  • Make sure that token final punctuation is always period
  • Move punctuation to highest level
  • Assign IDs to tokens
  • Do some checking that lemmas are consistent with final PoS-tag
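The "Assign IDs to tokens" step could look roughly like the sketch below. The ID layout used here, (ID NAME,.N) appended as the last daughter of the outer wrapper, is an assumption made for illustration, as are the function name `assign_ids` and the text name "demo"; the actual ID scheme should be taken from the corpus conventions.

```python
def assign_ids(trees, text_name):
    """Append a running (ID text_name,.N) node to each wrapped tree string.

    Each tree is expected to look like "( (IP-MAT ...))", i.e. to end with
    the closing bracket of the outer wrapper.
    """
    out = []
    for n, tree in enumerate(trees, start=1):
        body = tree.rstrip()
        if not body.endswith(")"):
            raise ValueError(f"tree {n} does not end with a closing bracket")
        # Drop the wrapper's closing bracket, add the ID node, close again.
        out.append(f"{body[:-1].rstrip()} (ID {text_name},.{n}))")
    return out


if __name__ == "__main__":
    for line in assign_ids(["( (IP-MAT (VBPI fer-fara) (. .-.)))"], "demo"):
        print(line)
```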

To discuss

  • ELLA 'else', have argument about ELSE tag
  • LENGI is tagged ADJ, but this is inconsistent with the rest of ADJs because it has no case. Can we do something about this?
  • The flat N modifier structure may need to be changed, it is kinda strange sometimes
  • the LÍTILL, MIKILL Q thing: what about dálítill?
	      (NP-VOC (NPR-N Sigga-sigga)
		      (NP-POS (PRO-N mín-minn))
		      (ADJ-N góð-góður))

... and when there are many Ds

Docs

  • Adjectives page, Comparatives in ADJP, fix NP-CMP and make page for that

Various stuff, incl. IceNLP

  • Make parentheses behave nicely in text
  • fix Tagset page (perhaps this means "delete page", but some of this info needs to be somewhere)
  • Make list of locative stuff, ADVP-LOC
  • Make list of temporal stuff, ADVP-TMP


Lemmatization

Lemmatization issues.