Information extraction: Difference between revisions
Revision as of 10:22, 24 January 2006
Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semistructured information from unstructured machine-readable documents.
A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information. Current approaches to IE use natural language processing techniques that focus on very restricted domains. For example, the Message Understanding Conference (MUC) is a competition-based conference series that has focused on the following domains:
- MUC-1 (1987), MUC-2 (1989): Naval operations messages.
- MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
- MUC-5 (1993): Joint ventures and microelectronics.
- MUC-6 (1995): News articles on management changes.
- MUC-7 (1998): Satellite launch reports.
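The "scan documents and populate a database" application described above can be sketched for a restricted domain such as management changes. This is a minimal illustration, not how any MUC system worked: the sentence pattern, the `appointment` table, and the sample sentences are all assumptions made for the example, and a single regular expression stands in for the natural language processing a real system would need.

```python
import re
import sqlite3

# Hypothetical pattern for sentences like "Jane Doe was named CEO of Acme Corp."
# A real system would use far richer linguistic analysis than one regex.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) was named "
    r"(?P<role>[A-Za-z ]+?) of "
    r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def extract_records(documents):
    """Return (person, role, organization) tuples found in the documents."""
    records = []
    for doc in documents:
        for match in PATTERN.finditer(doc):
            records.append(
                (match.group("person"), match.group("role"), match.group("org"))
            )
    return records


def populate_database(records):
    """Store the extracted tuples in an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE appointment (person TEXT, role TEXT, org TEXT)")
    conn.executemany("INSERT INTO appointment VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


# Invented sample documents, used only to exercise the sketch.
documents = [
    "The company said Jane Doe was named CEO of Acme Corp. Shares rose.",
    "The board met on Monday. John Smith was named chairman of Example Holdings.",
]

records = extract_records(documents)
conn = populate_database(records)
rows = conn.execute(
    "SELECT person, role, org FROM appointment ORDER BY person"
).fetchall()
```

The pattern is deliberately narrow: it only recognizes one phrasing of one event type, which is a small-scale picture of why approaches of this kind stay within very restricted domains.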
Typical subtasks of IE are:
- Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
- Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference.
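As a rough illustration of the temporal and numerical expressions mentioned under Named Entity Recognition, the sketch below tags a few such expressions with regular expressions. The labels `DATE`, `MONEY`, and `PERCENT`, the patterns, and the sample sentence are assumptions chosen for the example; recognizing names of people, organizations, and places generally requires more than patterns like these.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Illustrative patterns only; the labels and coverage are assumptions.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
]


def tag_entities(text):
    """Return (matched text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by character offset in the text
    return [(matched, label) for _, matched, label in found]


# Invented sample sentence, used only to exercise the sketch.
text = "On 24 January 2006 the firm paid $1,500 and profits rose 12.5%."
entities = tag_entities(text)
```

Here `entities` holds one pair per recognized expression, which is the kind of structured output that later steps such as coreference resolution or database population could consume.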