Annotation
From WebGenreWiki
Annotation standards
XML, but which??
Embedded annotation vs stand-off annotation
For some years, a strong argument has been that annotations should not be embedded in the texts they describe. Keeping them separate respects the authenticity of the documents. It also makes parallel annotations easier to handle.
If embedded markup describes documents that are not well-formed (from an XML point of view), the documents have to be wrapped in CDATA sections for XML parsers to accept them.
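The CDATA point can be illustrated with a minimal sketch, assuming Python's standard xml.etree.ElementTree parser. The element names 'annotated' and 'doc' and the HTML fragment are hypothetical, chosen only for the illustration.

```python
import xml.etree.ElementTree as ET

# A fragment that is not well-formed XML: <p> and <br> are never closed.
BROKEN_HTML = "<p>First line<br>second line"

def wrap_plain(fragment):
    # Embed the fragment directly as markup (hypothetical element names).
    return "<annotated genre='eshop'><doc>" + fragment + "</doc></annotated>"

def wrap_cdata(fragment):
    # Embed the fragment as character data inside a CDATA section.
    # A fragment that itself contains "]]>" would need extra handling.
    return ("<annotated genre='eshop'><doc><![CDATA["
            + fragment + "]]></doc></annotated>")

def parses(xml_text):
    # True if the text is accepted by the XML parser, False otherwise.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(parses(wrap_plain(BROKEN_HTML)))  # False
print(parses(wrap_cdata(BROKEN_HTML)))  # True
```

Inside the CDATA section the original document survives unchanged as text, but its own markup is no longer visible to the parser as structure.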
Stand-off annotation is a good solution. One of the most plausible ways to implement it is probably to use some RDF implementation, which also makes it possible to link the annotation to definitions of genres.
For instance, suppose we want to describe an e-shop document and have defined a scheme 'gc' for the labels.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:gc="put an identifier for the scheme definition here">
<rdf:Description rdf:about=" put an identifier for the document here">
<gc:genre>eshop</gc:genre>
</rdf:Description>
</rdf:RDF>
which presupposes a definition of gc that may include something like
<rdf:Property rdf:about="put an identifier for the genre label here">
<rdfs:label xml:lang="en">eshop</rdfs:label>
<rdfs:comment xml:lang="en"> description of eshop category </rdfs:comment>
</rdf:Property>
defining the eshop label.
All descriptions of a corpus may very well go into one large file. (I'm not very well versed in using RDF, so don't take this as necessarily being formally correct) //Mikael
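As a rough illustration of how such a stand-off description might be read back, here is a minimal sketch using Python's standard xml.etree.ElementTree. The scheme identifier http://example.org/gc# and the document URI are hypothetical stand-ins for the placeholders above. The sketch only handles this flat rdf:Description form. General RDF/XML would need a proper RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GC_NS = "http://example.org/gc#"  # hypothetical scheme identifier

# Same shape as the example above, with hypothetical identifiers filled in.
RDF_XML = f"""<rdf:RDF
    xmlns:rdf="{RDF_NS}"
    xmlns:gc="{GC_NS}">
  <rdf:Description rdf:about="http://example.org/docs/shop.html">
    <gc:genre>eshop</gc:genre>
  </rdf:Description>
</rdf:RDF>"""

def genre_pairs(rdf_xml):
    """Return (document identifier, genre label) pairs."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        doc = desc.get(f"{{{RDF_NS}}}about")
        for genre in desc.findall(f"{{{GC_NS}}}genre"):
            pairs.append((doc, genre.text))
    return pairs

print(genre_pairs(RDF_XML))
```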
If something simpler is wanted to start with, here is a sample annotation in XML with an embedded DTD. Transforming the annotated data, the class definitions and the DTD to other formats (such as RDF or XCES) is not a problem, provided the annotation is applied consistently. The DTD may be discussed here.
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE collection [
<!ELEMENT collection (class*,instance*)>
<!ELEMENT class (label,name,description)>
<!ELEMENT instance (id,title,assign*)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT description (function*,community*)>
<!ELEMENT function (#PCDATA)>
<!ELEMENT community (#PCDATA)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT assign (#PCDATA)>
]>
<collection>
<class>
<label>t1</label>
<name>PhD Thesis</name>
<description>
<function>Gaining a PhD Degree</function>
<function>Disseminating research</function>
<community>Computer Science</community>
</description>
</class>
<instance>
<id>/article_2367.pdf</id>
<title>Automated Genre Identification</title>
<assign>t1</assign>
</instance>
</collection>
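Since this simple format does not tie 'assign' values to the defined labels, consistency has to be checked separately. The following is a minimal sketch of such a check, assuming Python's standard xml.etree.ElementTree. The sample is a shortened copy of the annotation above, and the second assignment 't9' is a hypothetical undefined label added to trigger the check.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<collection>
  <class>
    <label>t1</label>
    <name>PhD Thesis</name>
    <description>
      <function>Gaining a PhD Degree</function>
    </description>
  </class>
  <instance>
    <id>/article_2367.pdf</id>
    <title>Automated Genre Identification</title>
    <assign>t1</assign>
    <assign>t9</assign>
  </instance>
</collection>"""

def undefined_assignments(xml_text):
    """Return (instance id, label) pairs whose label has no class definition."""
    root = ET.fromstring(xml_text)
    defined = {c.findtext("label") for c in root.findall("class")}
    problems = []
    for inst in root.findall("instance"):
        for assign in inst.findall("assign"):
            if assign.text not in defined:
                problems.append((inst.findtext("id"), assign.text))
    return problems

print(undefined_assignments(SAMPLE))  # [('/article_2367.pdf', 't9')]
```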
Andrea: I suggest the following changes (only in the example; you can imagine what the DTD would look like). We should make use of the id attribute which, if I remember correctly, guarantees uniqueness.
Mikael: True. The only reason for not doing that is that editing becomes a little less straightforward.
Andrea: Also, I wouldn't include the title of the instance. That's redundant, because we have it in the document itself.
Mikael: That element is there mostly for mnemonic purposes. Editing a stand-alone file with no reference to the document other than the identifier may be cumbersome.
Andrea: The community element: I really like it, but it's difficult/impossible to assign a community to all of the genres. Let's first focus on the genre classes!
Mikael: Partly agree. We may also tie community/discipline to the instances instead, or topic, if that is preferred.
Andrea: I'd add it to the instance then... and make it optional.
<collection>
<class id="t1">
<name>PhD Thesis</name>
<description>
<function>Gaining a PhD Degree</function>
<function>Disseminating research</function>
<definition>I think we should add a definition here, even though it's similar to function.</definition>
</description>
</class>
<instance id="uri of the document"> <!-- should we use relative paths here or the url? -->
<genre>t1</genre> <!-- this name is easier to understand -->
<genre>t2</genre> <!-- another genre -->
<url>maybe we might add this element, as we don't have this information anymore if we store the documents locally.</url>
</instance>
</collection>
Mikael: No objections, but it should be tried out. I'll do that.
Multiple annotation
If we have a document and it belongs to multiple genres, how can we annotate it? Can we design a header (in XML) where we encode the type of information we need for genre analysis? For example,
<mainGenre = eshop>, <otherGenre2=product list>, <purpose1=informational>, <purpose2=instructional> etc.
Mikael: As can be seen above, the element 'assign' may occur several times in an instance element. So, if an annotation should reflect an ambivalence, just repeat the element. However, as Marina's example seems to indicate, there may be a wish to express a hierarchy. This points to traditional problems of library classification. The recommendation is nearly always to be as specific as possible. If a 'product list' is considered a facet of 'eshop', then 'product list' is to be preferred, and if someone wants to use the annotation with a more coarse-grained approach, 'eshop' may be inferred from 'product list' as long as the relationship is recorded and as long as 'product list' is not a facet of another broader category. In the latter case we have to choose between precoordination and postcoordination.
Andrea: Should we add the hierarchical relationship to the XML/DTD above? We just need to add an <is_a> element to <class>.
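The inference of a broader class from a specific one could look roughly like this in Python. It is only a sketch under the assumption discussed above that each class has at most one broader class. The BROADER table and the label 'product_list' are hypothetical.

```python
# Hypothetical record of the hierarchy: specific class -> broader class.
# Assumes each class has at most one broader class.
BROADER = {"product_list": "eshop"}

def with_inferred(assigned, broader=BROADER):
    """Return the assigned labels plus every broader label they imply."""
    result = []
    for label in assigned:
        current = label
        seen = set()  # guards against accidental cycles in the table
        while current is not None and current not in seen:
            seen.add(current)
            if current not in result:
                result.append(current)
            current = broader.get(current)
    return result

print(with_inferred(["product_list"]))  # ['product_list', 'eshop']
print(with_inferred(["eshop"]))         # ['eshop']
```

If a class could have several broader classes, the table would have to map to a list, and the choice between precoordination and postcoordination would come back.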
Suggestion
Based on Andrea's comments, here is a DTD
Andrea: Thanks for the DTD. I still have some objections...
- use "sup" only for hyponymy, as the meronymy relation is not n:1 but n:m (one genre can be part of many genres), and I think it's not a fixed and well-defined relation. Maybe someone writes a poem some day containing statistics. Also, as each attribute can appear only once, it would be impossible to encode both relations.
Mikael: This is a matter of how we want the ontology to be specified. We may however do the same as for 'class_assignment' below.
- if we use class as an attribute of instance, it's no longer possible to assign multiple genres per file.
Mikael: True, I did forget that wish, since I work with partitioning, not non-disjoint classification. It's changed now.
<!-- Common elements -->
<!ELEMENT collection (classdefs,instance*)>
<!ELEMENT note (#PCDATA)>
<!-- Class definitions. I suggest the mnemonic 'class' in place of genre
to make the DTD more flexible, allowing for classes that are not considered
genres by everyone -->
<!ELEMENT classdefs (class*)>
<!ELEMENT class (name,description?)>
<!ATTLIST class id ID #REQUIRED>
<!ELEMENT relation EMPTY>
<!ATTLIST relation class IDREF #IMPLIED
type (sup | part) #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!-- I suggest letting all the elements in the description of
a class to be optional -->
<!ELEMENT description (definition?,function*,note?)>
<!ELEMENT function (#PCDATA)>
<!ELEMENT definition (#PCDATA)>
<!-- Annotations -->
<!ELEMENT instance (class_assignment+,title?,community?,topic*,note?,url?)>
<!ATTLIST instance id ID #REQUIRED>
<!ELEMENT class_assignment EMPTY>
<!ATTLIST class_assignment class IDREF #IMPLIED>
<!ELEMENT community (#PCDATA)>
<!ELEMENT topic (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT url (#PCDATA)>
Note how the use of class labels as attributes serves to constrain the annotation by allowing only labels that have been defined as the 'id' of a class. The same goes for 'sup'. Note, however, that identifiers must start with a letter (or an underscore), not a digit. This also has consequences for file naming conventions.
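The naming constraint can be checked before annotation starts. Below is a minimal sketch in Python that uses an ASCII-only approximation of the XML name rules (the full rules also allow many non-ASCII letters, and a colon, which is left out here). The helper names and the 'doc_' prefix are hypothetical.

```python
import re

# ASCII-only approximation of an XML name: a letter or underscore first,
# then letters, digits, periods, hyphens or underscores.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

def is_valid_id(value):
    """True if the value is usable as an ID under the approximation above."""
    return NAME_RE.match(value) is not None

def to_id(filename, prefix="doc_"):
    """Derive an ID from a file name (hypothetical convention)."""
    # Replace characters that are not allowed in a name, such as '/'.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", filename)
    # Add the prefix only when the result still does not start correctly.
    return cleaned if is_valid_id(cleaned) else prefix + cleaned

print(is_valid_id("article_0029736937.html"))  # True
print(is_valid_id("0029736937.html"))          # False
print(to_id("0029736937.html"))                # doc_0029736937.html
```

Different file names can map to the same identifier under this scheme, so the uniqueness of the result would still have to be checked.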
Here is a sample annotation:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE collection SYSTEM "coll.dtd" [
<!ENTITY classdefs SYSTEM "classdefs.xml">
]
>
<collection>
&classdefs;
<instance id="article_0029736937.html">
<class_assignment class="ab"/>
<title>
lower bounds for searching in streets and generalized streets
</title>
</instance>
<instance id="article_0125348145.html">
<class_assignment class="pr"/>
<title>press release
</title>
</instance>
</collection>
Note how the class definitions are supposed to live in a stand-alone file (or another kind of repository), which is pulled in by the entity reference placed just after the opening collection tag. (The drawback of this transclusion method is that the class definition file cannot be validated in real time in most editing environments, since it is not a valid XML file on its own.) Here are the contents of that file.
<classdefs>
<class id="ab">
<name>Abstract</name>
<description>
<function>Presenting contents of another document</function>
</description>
</class>
<class id="pr">
<name>Press Release</name>
</class>
</classdefs>
Here is a simple XSL stylesheet that can be used to extract the necessary data from the annotation. (Note that it only works well when there is exactly one class_assignment per instance.)
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:apply-templates select="//instance"/>
</xsl:template>
<xsl:template match="instance">
<xsl:value-of select="@id"/>
<xsl:text>,</xsl:text>
<xsl:value-of select="class_assignment/@class"/>
<xsl:text> </xsl:text>
</xsl:template>
</xsl:stylesheet>
which would generate
article_0029736937.html,ab article_0125348145.html,pr
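The single-assignment limitation noted above could be avoided by emitting one row per assignment. The following is a sketch in Python rather than XSL, using the standard xml.etree.ElementTree. The sample leaves out the DOCTYPE and the classdefs entity reference so that the string parses on its own, and the second assignment for the first article is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<collection>
  <instance id="article_0029736937.html">
    <class_assignment class="ab"/>
    <class_assignment class="pr"/>
  </instance>
  <instance id="article_0125348145.html">
    <class_assignment class="pr"/>
  </instance>
</collection>"""

def to_rows(xml_text):
    """Return one 'id,class' row for every class_assignment of every instance."""
    root = ET.fromstring(xml_text)
    rows = []
    for inst in root.iter("instance"):
        for assignment in inst.findall("class_assignment"):
            rows.append(f"{inst.get('id')},{assignment.get('class')}")
    return rows

for row in to_rows(SAMPLE):
    print(row)
```

An instance with several genres then simply appears on several rows, which keeps the output usable for non-disjoint classification.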
Editing tools
If an XML-based format is chosen, then there is the question of which editor to use for the actual annotation. In order to take advantage of XML, it should be a validating editor. I, myself, am using jEdit, since it is non-commercial and supports real-time validation and XSL transformations. However, it has some drawbacks, and if someone knows of something better it would be good to know...
