Genre Collection Repository

From WebGenreWiki


See also: Genre Benchmark Under Construction


Mappings between some of the collections are available on a separate page: Mapping between genres and macrogenres


English Genre Collections

DISCLAIMER: Copyright is held by the author/owner(s) of the web documents included in the genre collections below.

***The material available from this page is for research purposes ONLY***

SANTINIS

  • 7-web-genre collection built by Marina Santini. Download. The 7-web-genre collection was built following the criteria of 'annotation by objective sources' and 'consistent genre granularity'. These criteria are explained in Santini (2006).
   1. Personal Blogs (200 web pages)                    5. Listings (200 web pages)
   2. Eshop (200 web pages)                             6. Personal Home Page (200 web pages)
   3. FAQs (200 web pages)                              7. Search Pages (200 web pages)
   4. Online Newspaper Front Pages (200 web pages)


  • The Web Corpus built by Marina Santini. Download. The Web Corpus was created to approximate one possible composition of the web: the BBC corpus and the 7-web-genre collection represent the known part of the web, about 60% of the sample (1,480 web pages), while the SPIRIT collection represents the unknown part, about 40% of the sample (1,000 web pages). The composition and rationale of the Web Corpus are briefly explained in Santini (2007). The Web Corpus includes:
    • A small BBC corpus, namely four BBC web genres:
      • EDITORIALS (20 web pages), SHORT BIOGRAPHIES (20 web pages), DIY MINI-GUIDES (20 web pages) and FEATURE ARTICLES (20 web pages)
    • Seven novel web genres (i.e. the 7-web-genre collection described above).
    • The SPIRIT collection (described in Joho and Sanderson, 2004), which contains random and unclassified web pages.
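The 60%/40% split quoted for the Web Corpus can be checked from the page counts listed above. The following is a minimal sketch of that arithmetic; the variable names are illustrative and are not part of any tooling distributed with the corpus.

```python
# Page counts copied from the lists above; all names here are illustrative.
BBC_GENRES = {
    "EDITORIALS": 20,
    "SHORT BIOGRAPHIES": 20,
    "DIY MINI-GUIDES": 20,
    "FEATURE ARTICLES": 20,
}
SEVEN_WEB_GENRES = {
    "Personal Blogs": 200,
    "Eshop": 200,
    "FAQs": 200,
    "Online Newspaper Front Pages": 200,
    "Listings": 200,
    "Personal Home Page": 200,
    "Search Pages": 200,
}
SPIRIT_PAGES = 1000  # random, unclassified pages (the "unknown" part)

# The "known" part is the BBC corpus plus the 7-web-genre collection.
known_pages = sum(BBC_GENRES.values()) + sum(SEVEN_WEB_GENRES.values())
total_pages = known_pages + SPIRIT_PAGES
known_share = known_pages / total_pages
unknown_share = SPIRIT_PAGES / total_pages

print(f"known:   {known_pages} pages ({known_share:.1%})")
print(f"unknown: {SPIRIT_PAGES} pages ({unknown_share:.1%})")
```

The known part comes to 1,480 of 2,480 pages, which rounds to the "about 60%" stated in the description.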

KI-04

  • The KI-04 corpus (a.k.a. Meyer-zu-Eissen-web-page collection) built by Sven Meyer zu Eissen. Download. The KI-04 corpus was built following a palette of eight genres suggested by a user study on genre usefulness (Meyer zu Eissen and Stein, 2004). It includes 1,295 English web pages (HTML documents), but only 800 web pages (100 per genre) were used in the experiment described in Meyer zu Eissen and Stein (2004). The KI-04 corpus was collected using bookmarks from about five people. Some genres were extended to get a better balance. The corpus was sorted by three people, one of whom wrote a bachelor thesis (in German) on the corpus building process. One of the creators (S. Meyer zu Eissen) checked many of the pages, and most of the sorting complied with his understanding of the genre categories. The download date was January 26th, 2004.
  1. ARTICLE (127 web pages)           5. LINK COLLECTION (205 web pages) 
  2. DISCUSSION (127 web pages)        6. NON-PERSONAL HOME PAGE, formerly PORTRAYAL (NON-PRIV.) (163 web pages)
  3. DOWNLOAD (151 web pages)          7. PERSONAL HOME PAGE, formerly PORTRAYAL (PRIV.) (126 web pages)
  4. HELP (139 web pages)              8. SHOP (167 web pages)

Hierarchical Webgenre Collection

  i.  Journalism                             iv. Documentation
      1.  Commentary                             21. Law
      2.  Review                                 22. Official Report     
      3.  Portrait                               23. Protocol
      4.  Marginal Note                                            
      5.  Interview                           v. Dictionary
      6.  News                                   24. Person
      7.  Feature Story                          25. Catalog
      8.  Reportage                              26. Resources
                                                 27. Timeline
  ii. Literature         
      9.  Poem                                vi. Communication
      10. Prose                                   28. Mail, Talk
      11. Drama                                   29. Forum, Guestbook
                                                  30. Blog
  iii. Information                                31. Form
      12. Science Report 
      13. Explanation                         vii. Nothing
      14. Receipt                                  32. Nothing
      15. FAQ
      16. Lexicon, Word List
      17. Bilingual Dictionary
      18. Presentation
      19. Statistics
      20. Code

Multi-Labelled Genre Collection

  1.  Personal                                 11. Index
  2.  Informative                              12. Gateway
  3.  Journalistic                             13. Community
  4.  Commercial/promotional                   14. Content Delivery  
  5.  Shopping                                 15. User input
  6.  Official                                 16. Entertainment
  7.  Scientific                               17. Adult
  8.  Prose fiction                            18. Children's
  9.  Poetry                                   19. Blog
  10. FAQs                                     20. Error message

KRYS I Corpus

CMU World Wide Knowledge Base

  • CMU World Wide Knowledge Base (Web->KB) project (1997). 8,282 pages were manually classified into the following TOPICAL categories:
   student (1641)                       course (930) 
   faculty (1124)                       project (504)
   staff (137)                          other (3764)  
   department (182)
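As a quick consistency check, the per-category counts above can be summed and turned into class proportions. This is only a sketch over the figures quoted on this page; the dictionary name is hypothetical and does not come from the Web->KB distribution.

```python
# Category counts as listed above (hypothetical variable name).
WEBKB_COUNTS = {
    "student": 1641,
    "faculty": 1124,
    "staff": 137,
    "department": 182,
    "course": 930,
    "project": 504,
    "other": 3764,
}

total = sum(WEBKB_COUNTS.values())
shares = {label: count / total for label, count in WEBKB_COUNTS.items()}

# Print categories from largest to smallest with their share of the collection.
for label, count in sorted(WEBKB_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<12}{count:>6}  {shares[label]:.1%}")
print(f"{'total':<12}{total:>6}")
```

The counts add up to the 8,282 pages stated above, and "other" is by far the largest category, which matters when the collection is used to evaluate classifiers.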

TREC Tracks

Multilingual Genre Collections

Italian

English and Russian

  • I-EN-Sample and I-RU-Sample built by Serge Sharoff. Download. Each corpus consists of a manually validated sample of 250 web pages (one sample for English, one for Russian), together with classes predicted by SVM-based classifiers for 65,177 English pages (from the I-EN corpus) and 29,650 Russian pages (from the I-RU corpus). A further set of automatically classified English pages covers 1,202,039 pages of ukWaC.

The classification includes the following macrogenres:

  1. discussion - all texts expressing positions and discussing a state of affairs
     (journalism and academic articles, blogs, forums)
  2. information - catalogues, lists (mostly containing incomplete sentences), as well 
     as home pages and reference materials
  3. instruction - how-tos, FAQs, tutorials
  4. propaganda - adverts, shopping
  5. recreation - fiction and popular lore
  6. regulations - laws, small print, rules
  7. reporting - newswires and informative broadcasts, police reports
  8. unknown - pages designed not for reading, but for interaction, e.g., portals, 
     index pages, applications, videos

Each class corresponds to generalised aims of text production. When assigning a text to a class, think "What is the main purpose for creating this text?"
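For anyone annotating with this scheme, the eight macrogenres can be kept as a small lookup table so that labels are validated before they are counted. The sketch below is an assumption about how one might do this; `MACROGENRES` and `validate_label` are hypothetical names and are not part of the I-EN or I-RU releases.

```python
from collections import Counter

# The eight macrogenres and their glosses, as listed above.
MACROGENRES = {
    "discussion": "texts expressing positions and discussing a state of affairs",
    "information": "catalogues, lists, home pages and reference materials",
    "instruction": "how-tos, FAQs, tutorials",
    "propaganda": "adverts, shopping",
    "recreation": "fiction and popular lore",
    "regulations": "laws, small print, rules",
    "reporting": "newswires and informative broadcasts, police reports",
    "unknown": "pages designed for interaction rather than reading",
}


def validate_label(label: str) -> str:
    """Normalise a label and check that it is one of the eight macrogenres."""
    normalised = label.strip().lower()
    if normalised not in MACROGENRES:
        raise ValueError(f"not a macrogenre label: {label!r}")
    return normalised


# Usage: tally a handful of made-up annotations.
annotations = ["Discussion", "instruction ", "reporting", "discussion"]
tally = Counter(validate_label(a) for a in annotations)
print(tally.most_common())
```

Rejecting unrecognised labels early keeps typos from silently creating a ninth class in the tally.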
