Automatic classification

From WebGenreWiki


When you work with a large collection of webpages, you have to classify them by genre automatically, using features that are themselves extracted automatically.

Features

While topics are usually detected using keywords, genres are much more difficult to define in lexical terms. Features successfully used for detecting genres include:

  • function words (or simply the most frequent words)
  • punctuation
  • POS trigrams
  • trigrams of function words only (all other words are represented by their POS codes or as NON-FUNCTION)
  • more complex syntactic features (e.g. the number of that-clauses)
  • links to other pages with similar properties
  • page structure and HTML (e.g. length, headlines, lists, avg. line length)
  • specific wordlists (e.g. names, cities, keywords for programming languages)
  • specific POS (e.g. positive ADJ, female pronouns)
  • non-textual items: dates, ordinal numbers, numbers, images, emoticons
  • character n-grams
  • visual representation of a page (e.g., the number and relative position of columns)
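
As a rough illustration of the simpler items in this list, the sketch below computes relative frequencies for function words, punctuation marks and character trigrams. The word list, the punctuation set and the feature names are assumptions made for the example, not a recommended inventory; a real system would use a much longer function-word list and would strip HTML first.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "i", "you")
# Hypothetical punctuation inventory.
PUNCTUATION = ".,;:!?"


def extract_features(text, n=3):
    """Return a dict of relative frequencies for three feature families."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    total_tokens = max(len(tokens), 1)
    total_chars = max(len(text), 1)
    features = {}

    # Function words: share of all tokens.
    token_counts = Counter(tokens)
    for word in FUNCTION_WORDS:
        features["fw=" + word] = token_counts[word] / total_tokens

    # Punctuation: share of all characters.
    for mark in PUNCTUATION:
        features["punct=" + mark] = text.count(mark) / total_chars

    # Character n-grams (trigrams by default): share of all n-grams.
    ngrams = Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))
    total_ngrams = max(sum(ngrams.values()), 1)
    for gram, count in ngrams.items():
        features[f"char{n}={gram}"] = count / total_ngrams

    return features


features = extract_features("The cat and the dog.")
print(features["fw=the"])   # 2 of 5 tokens -> 0.4
```

Relative rather than absolute frequencies are used here so that long and short pages remain comparable; that is a design choice of the sketch, not something the list above prescribes.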

Open question: is there an overview of which features work and which do not?

ML methods

The standard method involves supervised machine learning from a set of webpages with known genres. ML methods frequently used for this task are Naive Bayes (NB) and Support Vector Machines (SVM).
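
To make the supervised setup concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the toy training pages and the genre labels are all invented for illustration; each page is assumed to be already reduced to a list of feature tokens.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesGenreClassifier:
    """Multinomial Naive Bayes with add-one smoothing over feature counts."""

    def fit(self, documents, genres):
        self.doc_counts = Counter(genres)            # pages per genre
        self.feature_counts = defaultdict(Counter)   # feature counts per genre
        self.vocabulary = set()
        for features, genre in zip(documents, genres):
            self.feature_counts[genre].update(features)
            self.vocabulary.update(features)
        self.total_docs = len(genres)
        return self

    def predict(self, features):
        best_genre, best_score = None, -math.inf
        vocab_size = len(self.vocabulary)
        for genre, n_docs in self.doc_counts.items():
            counts = self.feature_counts[genre]
            denominator = sum(counts.values()) + vocab_size
            # Log prior plus smoothed log likelihood of each known feature.
            score = math.log(n_docs / self.total_docs)
            for feature in features:
                if feature in self.vocabulary:  # unseen features are skipped
                    score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best_genre, best_score = genre, score
        return best_genre


# Toy training set (invented): pages as lists of feature tokens.
train_docs = [
    ["i", "you", "!", "?"],
    ["i", "my", "!", ":)"],
    ["the", "of", "that", ";"],
    ["the", "of", "which", ","],
]
train_genres = ["blog", "blog", "article", "article"]

classifier = NaiveBayesGenreClassifier().fit(train_docs, train_genres)
print(classifier.predict(["i", "!", "you"]))   # blog
```

An SVM would be trained on the same kind of labelled feature vectors; it is not shown here because it normally relies on an external library.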

Multiple classification

A document can belong to several genres at once, so the method should allow multiple labels per document. Standard classifiers such as NB and SVM assign exactly one label per document in their basic form, so they have to be extended to handle this.
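
One common workaround, offered here only as a sketch and not as the approach endorsed by this page, is binary relevance: train one yes/no classifier per genre and return every genre whose "yes" side wins. The example below uses a small binary Naive Bayes model for each genre; the function names, toy pages and genre labels are invented for illustration.

```python
import math
from collections import Counter


def train_binary_relevance(documents, label_sets):
    """Train one binary Naive Bayes model per genre (binary relevance)."""
    vocabulary = {feature for doc in documents for feature in doc}
    genres = {genre for labels in label_sets for genre in labels}
    models = {}
    for genre in genres:
        counts = {True: Counter(), False: Counter()}
        n_docs = {True: 0, False: 0}
        for doc, labels in zip(documents, label_sets):
            has_genre = genre in labels
            counts[has_genre].update(doc)
            n_docs[has_genre] += 1
        models[genre] = (counts, n_docs)
    return models, vocabulary


def predict_genres(models, vocabulary, doc):
    """Return every genre whose 'yes' model outscores its 'no' model."""
    predicted = set()
    for genre, (counts, n_docs) in models.items():
        scores = {}
        for side in (True, False):
            denominator = sum(counts[side].values()) + len(vocabulary)
            # Smoothed prior, so a side with no pages never yields log(0).
            score = math.log((n_docs[side] + 1) / (n_docs[True] + n_docs[False] + 2))
            for feature in doc:
                if feature in vocabulary:  # unseen features are skipped
                    score += math.log((counts[side][feature] + 1) / denominator)
            scores[side] = score
        if scores[True] > scores[False]:
            predicted.add(genre)
    return predicted


# Toy training set (invented): each page may carry several genre labels.
docs = [
    ["i", "you", "!"],
    ["price", "buy", "!"],
    ["i", "buy", "price", "!"],
    ["the", "of", "that"],
]
labels = [{"blog"}, {"shop"}, {"blog", "shop"}, {"article"}]

models, vocabulary = train_binary_relevance(docs, labels)
print(sorted(predict_genres(models, vocabulary, ["i", "price", "buy"])))  # ['blog', 'shop']
```

Because each genre is decided independently, a page can receive several labels or none at all; correlations between genres are ignored in this scheme.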
