Genre Collection Repository
From WebGenreWiki
See also: Genre Benchmark Under Construction
Mappings between some of the collections are available on a separate page:
Mapping between genres and macrogenres
English Genre Collections
DISCLAIMER: Copyright is held by the author/owner(s) of the web documents included in the genre collections below.
***The material available from this page is for research purposes ONLY***
SANTINIS
- 7-web-genre collection built by Marina Santini. Download. The 7-web-genre collection was built following the criteria of 'annotation by objective sources' and 'consistent genre granularity'. These criteria are explained in Santini (2006).
1. Personal Blogs (200 web pages)
2. Eshop (200 web pages)
3. FAQs (200 web pages)
4. Online Newspaper Front Pages (200 web pages)
5. Listings (200 web pages)
6. Personal Home Page (200 web pages)
7. Search Pages (200 web pages)
- The Web Corpus built by Marina Santini. Download. The Web Corpus was created to approximate one possible composition of the web: the BBC corpus and the 7-web-genre collection represent the known part of the web, i.e. about 60% of the sample (1,480 web pages); the SPIRIT collection represents the unknown part of the web and amounts to about 40% of the sample (1,000 web pages). The composition and rationale of the Web Corpus are briefly explained in Santini (2007). The Web Corpus includes:
- A small BBC corpus, namely four BBC web genres:
- EDITORIALS (20 web pages), SHORT BIOGRAPHIES (20 web pages), DIY MINI-GUIDES (20 web pages) and FEATURE ARTICLES (20 web pages)
- Seven novel web genres (i.e. the 7-web-genre collection described above).
- The SPIRIT collection (described in Joho and Sanderson, 2004), which contains random and unclassified web pages.
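The 60%/40% split quoted above can be checked directly from the page counts given on this page. The following is only a small arithmetic sketch; the constant and function names are made up for illustration and are not part of the corpus distribution.

```python
# Page counts taken from the descriptions on this page.
SEVEN_GENRE_PAGES = 7 * 200   # seven web genres, 200 pages each
BBC_PAGES = 4 * 20            # four BBC genres, 20 pages each
SPIRIT_PAGES = 1000           # random, unclassified SPIRIT pages


def composition():
    """Return the sizes and shares of the known and unknown parts."""
    known = SEVEN_GENRE_PAGES + BBC_PAGES
    total = known + SPIRIT_PAGES
    return {
        "known_pages": known,
        "unknown_pages": SPIRIT_PAGES,
        "total_pages": total,
        "known_share": known / total,
        "unknown_share": SPIRIT_PAGES / total,
    }


if __name__ == "__main__":
    for name, value in composition().items():
        print(f"{name}: {value}")
```

The known part comes out slightly below 60% and the SPIRIT part slightly above 40%, which matches the "about" in the description.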
KI-04
- The KI-04 corpus (a.k.a. Meyer-zu-Eissen-web-page collection) built by Sven Meyer zu Eissen. Download. The KI-04 corpus was built following a palette of eight genres suggested by a user study on genre usefulness (Meyer zu Eissen and Stein, 2004). It includes 1,205 English web pages (HTML documents), the sum of the per-genre counts listed below, but only 800 web pages (100 per genre) were used in the experiment described in Meyer zu Eissen and Stein (2004). The pages were collected from the bookmarks of about five people, and some genres were then extended to get a better balance. The corpus was sorted by three people, one of whom wrote a bachelor thesis (in German) on the corpus building process. One of the creators (S. Meyer zu Eissen) checked many of the pages, and most of the sorting complied with his understanding of the genre categories. The download date was January 26th, 2004.
1. ARTICLE (127 web pages)
2. DISCUSSION (127 web pages)
3. DOWNLOAD (151 web pages)
4. HELP (139 web pages)
5. LINK COLLECTION (205 web pages)
6. NON-PERSONAL HOME PAGE, formerly PORTRAYAL (NON-PRIV.) (163 web pages)
7. PERSONAL HOME PAGE, formerly PORTRAYAL (PRIV.) (126 web pages)
8. SHOP (167 web pages)
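A balanced subset of 100 pages per genre, like the one used in the experiment, can be drawn from the full corpus as sketched below. This page does not say how the original 800 pages were selected, so the random sampling, the seed, and the page identifiers here are assumptions made purely for illustration.

```python
import random

# Per-genre page counts as listed above.
KI04_COUNTS = {
    "ARTICLE": 127,
    "DISCUSSION": 127,
    "DOWNLOAD": 151,
    "HELP": 139,
    "LINK COLLECTION": 205,
    "NON-PERSONAL HOME PAGE": 163,
    "PERSONAL HOME PAGE": 126,
    "SHOP": 167,
}


def balanced_sample(pages_by_genre, per_genre=100, seed=0):
    """Draw the same number of pages from every genre, without replacement."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = {}
    for genre, pages in pages_by_genre.items():
        if len(pages) < per_genre:
            raise ValueError(f"genre {genre!r} has only {len(pages)} pages")
        sample[genre] = rng.sample(pages, per_genre)
    return sample


# Hypothetical page identifiers; the real corpus uses its own file names.
pages_by_genre = {
    genre: [f"{genre}/{i:03d}.html" for i in range(count)]
    for genre, count in KI04_COUNTS.items()
}
subset = balanced_sample(pages_by_genre)
```

Since the smallest genre has 126 pages, 100 pages per genre can always be drawn without replacement.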
Hierarchical Webgenre Collection
- Hierarchical Webgenre Collection, built by Andrea Stubbe. Download. It contains 32 genre classes with 40 files per class; the files are English HTML documents, UTF-8 encoded, collected in 2005/2006. This collection is described in Stubbe and Ringlstetter (2007) and Stubbe, Ringlstetter, and Schulz (2007).
i. Journalism
   1. Commentary
   2. Review
   3. Portrait
   4. Marginal Note
   5. Interview
   6. News
   7. Feature Story
   8. Reportage
ii. Literature
   9. Poem
   10. Prose
   11. Drama
iii. Information
   12. Science Report
   13. Explanation
   14. Receipt
   15. FAQ
   16. Lexicon, Word List
   17. Bilingual Dictionary
   18. Presentation
   19. Statistics
   20. Code
iv. Documentation
   21. Law
   22. Official Report
   23. Protocol
v. Dictionary
   24. Person
   25. Catalog
   26. Resources
   27. Timeline
vi. Communication
   28. Mail, Talk
   29. Forum, Guestbook
   30. Blog
   31. Form
vii. Nothing
   32. Nothing
Multi-Labelled Genre Collection
- 20-Multi-Labelled Genre Collection built by Mitja Luštrek and Andrej Bratko. Download. Description of the genres: [1]. Language: English. This collection is described in Vidulin, Luštrek, and Gams (2007).
1. Personal
2. Informative
3. Journalistic
4. Commercial/promotional
5. Shopping
6. Official
7. Scientific
8. Prose fiction
9. Poetry
10. FAQs
11. Index
12. Gateway
13. Community
14. Content Delivery
15. User input
16. Entertainment
17. Adult
18. Children's
19. Blog
20. Error message
KRYS I Corpus
- KRYS I Corpus. KRYS I announcement and website. The description of the corpus is here.
CMU World Wide Knowledge Base
- CMU World Wide Knowledge Base (Web->KB) project (1997). 8,282 pages were manually classified into the following TOPICAL categories:
student (1,641)
faculty (1,124)
staff (137)
department (182)
course (930)
project (504)
other (3,764)
TREC Tracks
- TREC Tracks: Web Search collection [2]
Multilingual Genre Collections
Italian
- Sample of 400 Italian blog posts built by Mirko Tavosanis. Download. This collection is described in Tavosanis (2007).
English and Russian
- I-EN-Sample and I-RU-Sample built by Serge Sharoff. Download. Each corpus consists of a manually validated sample of 250 web pages (one sample for English, one for Russian), together with classes predicted by SVM-based classifiers for 65,177 English pages (from the I-EN corpus) and 29,650 Russian pages (from the I-RU corpus). A further set of automatically classified English pages covers 1,202,039 pages of ukWaC.
The classification includes the following macrogenres:
1. discussion - all texts expressing positions and discussing a state of affairs
(journalism and academic articles, blogs, forums)
2. information - catalogues, lists (mostly containing incomplete sentences), as well
as home pages and reference materials
3. instruction - how-tos, FAQs, tutorials
4. propaganda - adverts, shopping
5. recreation - fiction and popular lore
6. regulations - laws, small print, rules
7. reporting - newswires and informative broadcasts, police reports
8. unknown - pages designed not for reading, but for interaction, e.g., portals,
index pages, applications, videos
Each class corresponds to a generalised aim of text production. When assigning a text to a class, think "What is the main purpose for creating this text?"
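As a rough illustration, the example genres named in the macrogenre list above can be turned into a simple lookup table. This is a minimal sketch, not the mapping published on the separate mapping page: the label spellings and the function name are assumptions, and only the examples mentioned above are covered.

```python
# Example genres for each macrogenre, copied from the list above.
MACROGENRE_EXAMPLES = {
    "discussion": ["journalism", "academic article", "blog", "forum"],
    "information": ["catalogue", "list", "home page", "reference material"],
    "instruction": ["how-to", "faq", "tutorial"],
    "propaganda": ["advert", "shopping"],
    "recreation": ["fiction", "popular lore"],
    "regulations": ["law", "small print", "rules"],
    "reporting": ["newswire", "informative broadcast", "police report"],
    "unknown": ["portal", "index page", "application", "video"],
}

# Inverted index: example genre label -> macrogenre.
GENRE_TO_MACROGENRE = {
    genre: macro
    for macro, genres in MACROGENRE_EXAMPLES.items()
    for genre in genres
}


def macrogenre_of(genre_label):
    """Return the macrogenre of a known example label, or None otherwise."""
    return GENRE_TO_MACROGENRE.get(genre_label.strip().lower())


if __name__ == "__main__":
    for label in ["FAQ", "Blog", "police report", "sonnet"]:
        print(label, "->", macrogenre_of(label))
```

Labels that are not among the examples return None rather than "unknown", because "unknown" is itself a class in this scheme (pages designed for interaction rather than reading).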
