purl.org/peter.turney

Keyphrase Extraction - Applications

Definition of Keyphrase Extraction
Many journals ask their authors to provide a list of key words for their articles. We call these keyphrases, rather than key words, because they are often phrases of two or more words, rather than single words. We define a keyphrase list as a short list of phrases (typically five to fifteen phrases) that capture the main topics discussed in a given document. We define automatic keyphrase extraction as the automatic selection of important, topical phrases from within the body of a document. Automatic keyphrase extraction is a special case of the more general task of automatic keyphrase generation, in which the generated phrases do not necessarily appear in the body of the given document.
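As a rough illustration of extraction in this sense, the sketch below picks candidate phrases of one to three words from the body of a document and ranks them by frequency. It is a naive baseline written for this page, not a description of any particular extraction system; the function name `extract_keyphrases`, the small stopword list, and the ranking rule are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "from", "in", "is",
             "of", "on", "or", "that", "the", "to", "we", "with"}

def extract_keyphrases(text, max_words=3, top_n=10):
    """Return up to top_n candidate phrases taken from text, most frequent first."""
    counts = Counter()
    # Split at punctuation so that no candidate phrase spans a clause boundary.
    for segment in re.split(r'[.,;:!?()\[\]"\n]+', text.lower()):
        words = re.findall(r"[a-z]+(?:-[a-z]+)*", segment)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip candidates that begin or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    # Rank by frequency, then prefer longer phrases, then alphabetically.
    ranked = sorted(counts.items(),
                    key=lambda item: (-item[1], -len(item[0].split()), item[0]))
    return [phrase for phrase, _ in ranked[:top_n]]
```

Because every candidate is a run of words taken from the document, this is extraction rather than generation: the function cannot return a phrase whose words do not occur together in the text.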
Keyphrases for Metadata
Many researchers believe that metadata is essential for addressing the problems of document management. Metadata is information about a document or set of documents. There are several standards for document metadata, including the Dublin Core Metadata Element Set (championed by the US Online Computer Library Center), the MARC (Machine-Readable Cataloging) format (maintained by the US Library of Congress), the GILS (Government Information Locator Service) standard (from the US Office of Social and Economic Data Analysis), and the CSDGM (Content Standard for Digital Geospatial Metadata) standard (from the US Federal Geographic Data Committee). All of these standards include a field for keyphrases, although each uses a different name for it.
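As one possible illustration, the sketch below renders a list of keyphrases as HTML `meta` tags using the Dublin Core Subject element, which is the Dublin Core element commonly used for keyphrases. The function name `dublin_core_subject_tags` is hypothetical, and the `DC.subject` naming follows the usual convention for embedding Dublin Core in HTML; other metadata standards would need a different output format.

```python
from html import escape

def dublin_core_subject_tags(keyphrases):
    """Render each keyphrase as an HTML meta tag for the Dublin Core Subject element."""
    return [
        # escape() protects characters such as & and " inside the attribute value.
        f'<meta name="DC.subject" content="{escape(phrase, quote=True)}">'
        for phrase in keyphrases
    ]
```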
Keyphrases for Highlighting
When we skim a document, we scan for keyphrases, to quickly determine the topic of the document. Highlighting is the practice of emphasizing keyphrases and key passages (e.g., sentences or paragraphs) by underlining the key text, using a special font, or marking the key text with a special colour. The purpose of highlighting is to facilitate skimming. Automatic keyphrase extraction can be used for highlighting and also to enable text-to-speech software to provide audio skimming capability.
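A minimal sketch of highlighting, assuming the keyphrases have already been extracted: each occurrence of a keyphrase in the text is wrapped in a pair of marker strings. The function name `highlight` and the default `**` markers are assumptions for the example; a real application might emit HTML tags or font changes instead.

```python
import re

def highlight(text, keyphrases, open_mark="**", close_mark="**"):
    """Wrap each occurrence of a keyphrase in text with marker strings."""
    # Longest phrases first, so a long phrase wins over a shorter one it contains.
    ordered = sorted((p for p in keyphrases if p), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Keep the original casing of the matched text inside the markers.
    return pattern.sub(lambda m: open_mark + m.group(0) + close_mark, text)
```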
Keyphrases for Indexing
An alphabetical list of keyphrases, taken from a collection of documents or from the parts of a single long document (for example, the chapters of a book), can serve as an index.
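The idea can be sketched as follows, assuming keyphrases have already been extracted for each section or document. The function name `build_index` and the input layout (a dictionary from section name to keyphrase list) are assumptions for the example.

```python
from collections import defaultdict

def build_index(keyphrases_by_section):
    """Return (keyphrase, sections) pairs in alphabetical order of keyphrase."""
    index = defaultdict(set)
    for section, phrases in keyphrases_by_section.items():
        for phrase in phrases:
            # Fold case so "Indexing" and "indexing" share one index entry.
            index[phrase.lower()].add(section)
    return [(phrase, sorted(index[phrase])) for phrase in sorted(index)]
```

For example, `build_index({"Chapter 1": ["metadata", "Indexing"], "Chapter 2": ["indexing"]})` lists "indexing" under both chapters and "metadata" under Chapter 1 only.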
Keyphrases for Interactive Query Refinement
Using a search engine is often an iterative process: the user enters a query, examines the resulting hit list, modifies the query, and tries again. Most search engines do not have any special features that support this iterative aspect of searching. One approach to interactive query refinement is to take the user's query, fetch the first round of documents, and extract keyphrases from them. The search engine then displays the first round of documents to the user, along with suggested refinements that combine the first query with the extracted keyphrases.
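The last step, combining the first query with extracted keyphrases, might look like the sketch below. The function name `suggest_refinements` is hypothetical, and the choice to append each keyphrase as a quoted phrase is an assumption about the search engine's query syntax.

```python
def suggest_refinements(query, extracted_keyphrases, max_suggestions=5):
    """Combine the original query with extracted keyphrases to form refined queries."""
    query_terms = set(query.lower().split())
    suggestions = []
    for phrase in extracted_keyphrases:
        # Skip keyphrases that add no new terms to the query.
        if set(phrase.lower().split()) <= query_terms:
            continue
        suggestion = f'{query} "{phrase}"'
        if suggestion not in suggestions:
            suggestions.append(suggestion)
        if len(suggestions) == max_suggestions:
            break
    return suggestions
```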
Keyphrases for Web Log Analysis
Web site managers often want to know what visitors to their site are seeking. Most web servers have log files that record information about visitors, including the Internet address of the client machine, the file that was requested by the client, and the date and time of the request. There are several commercial products that analyze these logs for web site managers. Typically these tools will give a summary of general traffic patterns and produce an ordered list of the most popular files on the web site. A web log analysis program can use keyphrases to provide a deeper view of traffic. Instead of producing an ordered list of the most popular files on the web site, a log analysis tool can produce a list of the most popular keyphrases on the site. This can give web site managers insight into which topics on their web site are most popular.
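A minimal sketch of such a tool, assuming the log lines follow the widely used Common Log Format and that keyphrases have already been extracted for each file on the site. The function name `popular_keyphrases` and the `keyphrases_by_file` mapping are hypothetical.

```python
import re
from collections import Counter

# Matches the requested path in a request field such as "GET /index.html HTTP/1.0".
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)')

def popular_keyphrases(log_lines, keyphrases_by_file, top_n=3):
    """Count one hit per keyphrase for every logged request to a file that has it."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # Skip lines that do not look like a request.
        # Files with no known keyphrases contribute nothing.
        for phrase in keyphrases_by_file.get(match.group(1), []):
            counts[phrase] += 1
    return counts.most_common(top_n)
```

A keyphrase shared by several files accumulates the requests for all of them, which is how a topic can rank highly even when no single file about it is among the most popular.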

Updated: February 3, 2007.