NIF - Brown corpus tutorial

This document gives a quick overview of important NIF concepts and how to describe language resources with them. As an example, we will look at an XML version of the Brown corpus. The final NIF Linked Data version can be found here.

The core classes of NIF important to this tutorial are:

Looking at our mirror of the Brown corpus, we can see there are 500 XML documents. Every document contains some metadata as well as a number of sentences with annotated words, including their part of speech. Consider this example, which is the first sentence of the first document (http://brown.nlp2rdf.org/corpus/a01.xml) of the corpus:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

URI syntax

The idea behind NIF is to allow NLP tools to exchange annotations about text in RDF. Hence, the main prerequisite is that text becomes referenceable by URIs, so that it can be used as a resource in RDF statements. In NIF, we distinguish between the document, the text contained in the document and possible substrings of this text. To create URIs addressing these strings, we require the document URI, a separator (like #) and the character indices (begin and end index) of the strings. The canonical URI scheme of NIF is based on RFC 5147. According to RFC 5147, the following URI can address the substring "Fulton" in our example sentence of the document http://brown.nlp2rdf.org/corpus/a01.xml:

http://brown.nlp2rdf.org/corpus/a01.xml#char=4,10
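
As a minimal sketch of how such a fragment can be computed, the following Python snippet locates a substring in the plain text and appends its begin and end index to the document URI. The function name char_uri is hypothetical and not part of NIF. It assumes zero-based offsets with an exclusive end index, which is how the example above counts.

```python
def char_uri(document_uri: str, text: str, substring: str) -> str:
    """Return an RFC 5147-style URI for the first occurrence of substring."""
    begin = text.find(substring)
    if begin < 0:
        raise ValueError(f"{substring!r} does not occur in the text")
    end = begin + len(substring)  # the end index is exclusive
    return f"{document_uri}#char={begin},{end}"


sentence = ("The Fulton County Grand Jury said Friday an investigation of "
            "Atlanta's recent primary election produced `` no evidence '' "
            "that any irregularities took place .")

fulton_uri = char_uri("http://brown.nlp2rdf.org/corpus/a01.xml", sentence, "Fulton")
print(fulton_uri)  # http://brown.nlp2rdf.org/corpus/a01.xml#char=4,10
```

Note that the sketch only finds the first occurrence. A real converter would track offsets while walking the tokens instead of searching for them.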

However, this URI does not work if we want to use NIF as a Linked Data standoff annotation format, because it links to the original document, not to our NIF annotations. Instead, we use it as a parameter for a web service whose output is our NIF annotations. These web service URIs should follow the NIF API Specification. In the case of our corpus, our web service is located here:

http://brown.nlp2rdf.org/linkeddata.php
and expects the parameters t, f and i, as they appear in the query string below. So the final URL for our example word "Fulton" is:
http://brown.nlp2rdf.org/linkeddata.php?t=url&f=xml&i=http://brown.nlp2rdf.org/corpus/a01.xml#char=4,10
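
The construction of this URL can be sketched in a few lines of Python. The helper name nif_service_url is hypothetical. The parameter values t=url and f=xml are simply copied from the URL above, and the document URI is appended without percent-encoding because the tutorial's URL is written that way.

```python
SERVICE = "http://brown.nlp2rdf.org/linkeddata.php"


def nif_service_url(document_uri: str, begin: int, end: int) -> str:
    """Combine the web service address, the document URI and the offsets."""
    # Plain concatenation mirrors the URL shown in the tutorial; a stricter
    # client might percent-encode the value of the i parameter instead.
    return f"{SERVICE}?t=url&f=xml&i={document_uri}#char={begin},{end}"


fulton_url = nif_service_url("http://brown.nlp2rdf.org/corpus/a01.xml", 4, 10)
print(fulton_url)
```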

Context

The Context represents the whole document. It serves as a reference point for all other substrings. It must have a nif:isString property, which contains the string content of the document, stripped of any markup. In our case it looks like this:

@base <http://brown.nlp2rdf.org/linkeddata.php?t=url&f=xml&i=> .
:http://brown.nlp2rdf.org/corpus/a01.xml#char=0,164
        a nif:String , nif:Context , nif:RFC5147String ;
        nif:isString """The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . [...]"""^^xsd:string ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "164"^^xsd:nonNegativeInteger ;
        nif:sourceUrl <http://icame.uib.no/brown/bcm.html> .

It is important to note that the nif:beginIndex of the Context is always 0, because the Context represents the whole document. The nif:endIndex is simply the length of the string.
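
The two index rules can be illustrated with a short Python sketch that describes a Context as a plain dictionary. The function name context_resource, the example.org URI and the sample text are assumptions made for illustration only. No RDF library is used.

```python
def context_resource(document_uri: str, text: str) -> dict:
    """Describe the whole document text as a NIF Context."""
    begin = 0        # a Context always starts at 0
    end = len(text)  # and ends at the length of its string
    return {
        "uri": f"{document_uri}#char={begin},{end}",
        "nif:isString": text,
        "nif:beginIndex": begin,
        "nif:endIndex": end,
    }


# Hypothetical two-sentence document, used only for illustration.
ctx = context_resource("http://example.org/doc.txt", "Hello world . Bye .")
print(ctx["uri"])  # http://example.org/doc.txt#char=0,19
```

Python's len counts Unicode code points. For non-ASCII text it is worth checking that this matches the counting unit expected by the tools consuming the data.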

Sentences, Words and other strings

Substrings of the Context can be anything from a single word to sentences and paragraphs. They link to the relevant Context resource via nif:referenceContext. Beginning and end indexes always refer to the string represented by the Context's nif:isString property.
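
Because the indices appear both in the URI and in the properties, they can drift apart. The following Python sketch checks one substring resource against its Context. The function name check_substring and the example.org URIs are hypothetical, and the check assumes the #char=begin,end fragment syntax used in this tutorial.

```python
import re

CHAR_FRAGMENT = re.compile(r"#char=(\d+),(\d+)$")


def check_substring(uri: str, anchor: str, begin: int, end: int,
                    context_text: str) -> list:
    """Return a list of problems; the list is empty if everything agrees."""
    problems = []
    match = CHAR_FRAGMENT.search(uri)
    if match is None:
        problems.append("URI has no #char=begin,end fragment")
    elif (int(match.group(1)), int(match.group(2))) != (begin, end):
        problems.append("URI fragment disagrees with beginIndex/endIndex")
    # The indices always refer to the Context's nif:isString value.
    if context_text[begin:end] != anchor:
        problems.append("anchorOf does not match the Context text")
    return problems


context_text = "The Fulton County Grand Jury said Friday"
ok = check_substring("http://example.org/doc#char=4,10", "Fulton", 4, 10, context_text)
bad = check_substring("http://example.org/doc#char=4,10", "Fulton", 5, 11, context_text)
print(ok)   # []
print(bad)  # two problems: fragment mismatch and anchor mismatch
```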

The first sentence of our document would be represented like this:

@base <http://brown.nlp2rdf.org/linkeddata.php?t=url&f=xml&i=> .
:http://brown.nlp2rdf.org/corpus/a01.xml#char=0,158
        a nif:String , nif:Sentence , nif:RFC5147String ;
        nif:anchorOf """The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place ."""^^xsd:string ;
        nif:referenceContext :http://brown.nlp2rdf.org/corpus/a01.xml#char=0,164 ;
        nif:nextSentence :http://brown.nlp2rdf.org/corpus/a01.xml#char=158,164 ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "158"^^xsd:nonNegativeInteger .

The nif:nextSentence property is the only optional one here. Nevertheless, using it is encouraged, because it provides an easy way to traverse the sentences in order.

The same is generally true for words:

@base <http://brown.nlp2rdf.org/linkeddata.php?t=url&f=xml&i=> .
:http://brown.nlp2rdf.org/corpus/a01.xml#char=0,3
        a nif:String , nif:Word , nif:RFC5147String ;
        nif:anchorOf "The"^^xsd:string ;
        nif:referenceContext :http://brown.nlp2rdf.org/corpus/a01.xml#char=0,164 ;
        nif:oliaLink brown:AT ;
        nif:nextWord :http://brown.nlp2rdf.org/corpus/a01.xml#char=4,10 ;
        nif:sentence :http://brown.nlp2rdf.org/corpus/a01.xml#char=0,158 ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "3"^^xsd:nonNegativeInteger .

A noteworthy difference is the nif:oliaLink. You can read about OLiA here. In general, it is a set of ontologies that map corpus- or tool-specific annotations to a reference model. We use it to provide additional interoperability between disparate tagsets. To include such links, visit the OLiA page and look for the annotation model matching your tagset. You can assign its URI to a namespace of your choice:

@prefix brown: <http://purl.org/olia/brown.owl#> .

and simply add the tags from your corpus or tool via nif:oliaLink, as in the example above.
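
The word resources in this section can be derived mechanically from the whitespace-tokenised text. The following Python sketch computes the offsets and chains the words in the spirit of nif:nextWord. The function name word_resources is hypothetical, and the records are plain dictionaries rather than RDF. The base URI passed in could just as well be the web service prefix.

```python
def word_resources(base_uri: str, sentence: str, sentence_begin: int = 0) -> list:
    """Split a whitespace-tokenised sentence into word records with offsets."""
    words = []
    cursor = 0
    for token in sentence.split():
        begin = sentence.index(token, cursor)  # search from the last token's end
        end = begin + len(token)
        cursor = end
        words.append({
            "uri": f"{base_uri}#char={sentence_begin + begin},{sentence_begin + end}",
            "nif:anchorOf": token,
            "nif:beginIndex": sentence_begin + begin,
            "nif:endIndex": sentence_begin + end,
        })
    # Link each word to its successor, mirroring nif:nextWord.
    for current, following in zip(words, words[1:]):
        current["nif:nextWord"] = following["uri"]
    return words


words = word_resources("http://brown.nlp2rdf.org/corpus/a01.xml",
                       "The Fulton County Grand Jury said Friday")
for word in words[:3]:
    print(word["uri"], word["nif:anchorOf"])
```

The sentence_begin argument shifts the offsets for sentences that do not start at position 0, so that all indices stay relative to the Context string.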

Named entities

Depending on how much information is available, named entities have to be annotated in different ways. In most cases, you will know the coarse-grained class the entity belongs to (that is, whether it is a Person, a Location, an Organization etc.). To annotate these, please refer to the NERD ontology. As with OLiA, it maps different named entity types to single resources, increasing interoperability. The relevant property for annotation is nif:taNerdCoreClassRef.

If, in addition, you have a direct link to the entity, for example to the respective DBpedia resource, you should include it as well. Please use itsrdf:taIdentRef from the Internationalization Tag Set (ITS) Version 2.0 for this purpose.
Although there are no named entities in the Brown corpus itself, we will add one for the sake of the example:

@base <http://brown.nlp2rdf.org/linkeddata.php?t=url&f=xml&i=> .
:http://brown.nlp2rdf.org/corpus/a01.xml#char=4,10
	a nif:String , nif:Word , nif:RFC5147String ;
	nif:anchorOf "Fulton"^^xsd:string ;
	nif:referenceContext :http://brown.nlp2rdf.org/corpus/a01.xml#char=0,164 ;
	nif:oliaLink brown:NP ;
	nif:previousWord :http://brown.nlp2rdf.org/corpus/a01.xml#char=0,3 ;
	nif:nextWord :http://brown.nlp2rdf.org/corpus/a01.xml#char=11,17 ;
	nif:sentence :http://brown.nlp2rdf.org/corpus/a01.xml#char=0,158 ;
	nif:beginIndex "4"^^xsd:nonNegativeInteger ;
	nif:endIndex "10"^^xsd:nonNegativeInteger ;
	nif:taNerdCoreClassRef nerd:Location ;
	itsrdf:taIdentRef <http://dbpedia.org/resource/Fulton_County,_Georgia> .