Document Collection

From WebGenreWiki

The aim of this page is to compile a list of all known genre corpora and to decide which of them we plan to process. The principles of corpus construction can also be discussed here. Feel free to add corpora or to comment on them. Please also add answers to the open questions on this page.

Principles For Corpus Construction

The following quotation summarizes the principles quite well: "We tried to gather a broad distribution of topics, authors, and websites for each class to avoid corpora biasing towards these features and to guarantee generalizability. Hardly more than two files in each class agree in any of these other features. That leads to a much greater effort than taking several examples from one website, but is necessary if the classifiers generated by these training files should be transferable to pages from other websites or subjects." (Stubbe, Ringlstetter: Recognizing Genres)
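
The "hardly more than two files per class from the same source" rule could be checked automatically once documents are collected. The following is only a minimal sketch: the record layout (dictionaries with "url" and "genre" keys), the function name overrepresented_sites and the threshold parameter max_per_site are assumptions made for illustration, not part of any agreed corpus format. It only looks at websites; authors and topics would need their own metadata.

```python
from collections import Counter
from urllib.parse import urlparse


def overrepresented_sites(docs, max_per_site=2):
    """Return {(genre, host): count} for every host that contributes
    more than max_per_site documents to a single genre class."""
    counts = Counter()
    for doc in docs:
        # Hypothetical record layout: each doc has a "url" and a "genre".
        host = urlparse(doc["url"]).netloc.lower()
        counts[(doc["genre"], host)] += 1
    return {key: n for key, n in counts.items() if n > max_per_site}


# Illustrative data only, not taken from any real corpus.
docs = [
    {"url": "http://example.org/a", "genre": "blog"},
    {"url": "http://example.org/b", "genre": "blog"},
    {"url": "http://example.org/c", "genre": "blog"},
    {"url": "http://example.com/faq", "genre": "faq"},
]
violations = overrepresented_sites(docs)
print(violations)  # {('blog', 'example.org'): 3}
```

An empty result would mean that no website supplies more than two documents to any one class.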


Specification of our Corpus

  • Number of Documents: at least 100 per class
  • Format: HyGraph and a collection of plain HTML documents (maybe annotated with XML; maybe with editing of the HTML format). Should we include images? That would be useful if someone wants to use the type, size or content of images for genre classification. How could we include images? What about sound, video, SWF, etc.?
  • Availability: The corpus should be made accessible for research purposes free of charge. What about the authors' copyrights? Do we have to ask them? How do we find out? /Mikael: This is not easy. Strictly speaking, republishing a document violates copyright law unless consent has been given explicitly. Of this there is no doubt. However, in some countries the notion of "fair use" may apply. /
  • Languages: English, German, Russian, Italian (really?)
  • How should we handle multiple labels per page (e.g. a page that is both a bulletin board and a code listing)?
  • Who will do the annotation? The draft says "The annotation will be carried out by as diverse a group of web users as possible so that real users (in contrast to the researchers themselves) construct this part of the resource; inter-coder reliability should be taken into account." - But how do we find these users?
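
One common way to take inter-coder reliability into account is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below is a suggestion, not a method the draft prescribes. It assumes exactly two coders and exactly one genre label per page, so it does not yet answer the multi-label question above; the function name cohen_kappa and the sample labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who each assigned one label per page."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of pages on which both coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the coders' label frequencies, summed
    # over all labels (labels unused by coder B contribute zero).
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used the same single label throughout
    return (observed - expected) / (1 - expected)


# Illustrative annotations only.
coder_1 = ["blog", "blog", "faq", "faq"]
coder_2 = ["blog", "faq", "faq", "faq"]
kappa = cohen_kappa(coder_1, coder_2)
print(kappa)  # 0.5
```

Here the coders agree on 3 of 4 pages (0.75), chance agreement is 0.5, so kappa is 0.5. With more than two coders, or with several labels per page, a different measure would be needed.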

Existing Corpora

Author        #Categories  #Files   Language  May we use it?
Stubbe        32           32 * 40  English   yes
Santini       ?            ?        English   ?
Eissen/Stein  ?            ?        English   ?
Boese         ?            ?        English   ?
Karlgren      ?            ?        English   ?
Mehler        ?            ?        German    ?
Braslavski    ?            ?        Russian   ?
Vidulin       20           1539     English   yes
Tavosanis     1            1        Italian   yes

Corpora to Include

To be discussed.