Talk:Automatic taxonomy construction
Topic notes
Following are reminders for topics to look into for potential coverage within the article.
- Approaches
- Pattern- or rule-based approach
- Clustering-based or distributional approach
- OntoLearn
- OntoLearn Reloaded
- NLP methods used in ATC
- ATC algorithms
- ATC pdfs
- ATC projects
Removing second Further Reading Item
I'm doing some editing on this article. The first thing I'm doing is cleaning up any obvious problems. The second Further Reading item doesn't go to a paper; it goes to a website for a conference on federated systems and data mining. I'm going to delete it and any other items that are no longer relevant. If they are dead links I'll try the Wayback Machine, but the URL for the reading I'm deleting is just for the site of the conference. I'm not going to document every additional item I delete; I just wanted to post something here in case people want to discuss deletions or anything about the article. One thing that jumps out at me is that the discussion of parents and children seems too colloquial and really misses the main point, which is logical subsumption: one class/set being a superset of another. I think I'll change that but will try to still say it in a way understandable to non-techies. If anyone has ideas or pointers to good articles, please let me know. --MadScientistX11 (talk) 19:23, 7 March 2017 (UTC)
- One minor correction: there was an actual URL for that item, but when I tried the Wayback Machine every archived version (there were only 3) went to a page not found and redirected. --MadScientistX11 (talk) 19:27, 7 March 2017 (UTC)
- MadScientistX11 I agree that the subset/superset discussion could be made clearer. Two suggestions there - one is to make the distinction between subsets and instances (tokens vs types) more explicit. Labrador is a subclass of dog, but Fido is an instance. In some schemes (most?), that means using a different relation. In others, I imagine, it could all be handled with one "is-a" relation, with the distinction made implicitly by metadata about the items involved (both labrador and Fido "are" dogs, but labrador additionally is labelled as a type whereas Fido is labelled as a token). For example, I'm not certain if Wikipedia really has a notion of subcategories as distinct from categories that happen to be 'in' other categories in the same way as individual articles are 'in' them. The description of linguistic hyponymy seems to go that way too; it doesn't say explicitly that Fido is a hyponym of mammal, as dog is, but that was the impression I came away with after reading it. I'm happy to have a go at this part if it would be helpful.
- The other suggestion is that the article mentions "taxonomies are often represented as is-a hierarchies" (later called an "is-a model"). Should it mention alternatives? What would the main alternatives be?
- One other, separate suggestion: for the paragraph that begins by explaining that taxonomy development is knowledge-intensive and potentially biased, if it's possible to find a reference, it would be good to note the degree to which human judgement is still required to tidy up the output of automated methods or to bootstrap a taxonomy, e.g. by putting in place the top-level abstract types, or by linking the subtrees that an automated process generated. That may be hard to find a reference for, though.
- Sorry, one more - I'm interested now! It might be good to break this into sections a little. Perhaps the last two paragraphs could be split out into one (or two?) sections about 'applications' (or 'comparison to manual techniques' ('advantages') and 'example applications'). Just a thought. Mortee (talk) 20:01, 10 March 2017 (UTC)
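To make the subclass-versus-instance point in this thread concrete, here is a minimal sketch of the second scheme described above: a single "is-a" relation, with the type/token distinction carried by metadata on each item. The names `IS_A`, `KIND`, `ancestors` and `relation` are hypothetical, and the data is just the Labrador/Fido example from the comment, not anything taken from the article or its references.

```python
# Hypothetical single "is-a" relation: child -> parent.
# Both "labrador" and "Fido" point at "dog" through the same relation.
IS_A = {
    "labrador": "dog",
    "Fido": "dog",
    "dog": "mammal",
}

# Metadata that distinguishes types (classes) from tokens (individuals).
KIND = {
    "mammal": "type",
    "dog": "type",
    "labrador": "type",
    "Fido": "token",
}


def ancestors(name):
    """Return every item reachable by following is-a links upward."""
    found = []
    while name in IS_A:
        name = IS_A[name]
        found.append(name)
    return found


def relation(child, parent):
    """Classify an is-a link using the metadata, or return None if no link exists."""
    if parent not in ancestors(child):
        return None
    return "instance-of" if KIND[child] == "token" else "subclass-of"


print(relation("labrador", "dog"))
print(relation("Fido", "mammal"))
```

In the alternative scheme, the two cases would be stored as two separate relations from the start, so no metadata lookup would be needed to tell them apart.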
ATC programs don't have to be (and typically aren't) agents
The current article states: "ATC programs are examples of software agents and intelligent agents, and may be autonomous as well (see autonomous agent)." Nothing that I saw in any of the existing references, nor in any of the references I've found since, talks about using agents for ATC. Of course any kind of complex problem can usually be amenable to an agent approach, but from what I've read so far it's not common. The ATC systems seem to be pretty straightforward batch algorithms. You feed them a bunch of documents and they generate a taxonomy. I think the systems that provide the corpus may sometimes be web crawlers, which are agents, but typically not the ATC itself. I'm going to change this but wanted to document it before I do in case anyone disagrees and wants to discuss it. --MadScientistX11 (talk) 04:17, 8 March 2017 (UTC)
Finished re-write of article
I just rewrote the article. It's still a stub article but I think it's now at least a fairly coherent stub with inline references. There is now significant overlap between the "Further Reading" section and the references. At a minimum I think we should delete anything in further reading that is used as a reference. The reference format gives the user more information than the link format used in further reading, and we risk pissing off users by having them click on links that are duplicates of references. Actually IMO we should just completely delete the Further Reading section. I tend to be very conservative with Further Reading refs. If it's useful enough to be in further reading it should be useful enough to be used as a reference. Also, in my experience people tend to inflate those sections with their own papers or friends' papers. I think they should be reserved for the (rare) case where there is a well-known work on the topic which for some reason isn't used as a reference. But I'll leave it as is for now and see what others think. --MadScientistX11 (talk) 21:12, 8 March 2017 (UTC)
- I cleaned up the Further Reading section and deleted any items that were already used as references, or in one case there was the same paper listed twice. Left the remaining ones. --MadScientistX11 (talk) 19:22, 9 March 2017 (UTC)
Version before rewrite (for comparison)
Automatic taxonomy construction (ATC) is the use of autonomous or semi-autonomous software programs to create hierarchical outlines or taxonomical classifications from a body of texts (corpus). It is a branch of natural language processing, which in turn is a branch of artificial intelligence. ATC programs are examples of software agents and intelligent agents, and may be autonomous as well (see autonomous agent).
Other names for ATC include taxonomy generation, taxonomy learning, taxonomy extraction, taxonomy building, and taxonomy induction. Any of these terms may be preceded by the word "automatic", as in automatic taxonomy induction. ATC is also referred to as semantic taxonomy induction.
A taxonomy is a tree structure and includes familial (parent-offspring, sibling, etc.) relationships built-in (like in a family tree). For example, physics is an offspring of physical science, which in turn is an offspring of science.
As mentioned above, the process is also called taxonomy induction. This is because, in order for a software program to construct a taxonomy from a corpus (for example, from Wikipedia, a web page, or the World Wide Web), it must induce which terms belong to the taxonomy and what the relationships between them are, such as by identifying hyponym-hypernym pairs, among other approaches. This is done using algorithms, including statistical algorithms. Note that deduction (deductive logic) is often also employed (e.g., if B is a sibling of A, then B has the same parent as A and gets placed under that parent in the taxonomy).
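As a rough illustration of the induction step described in the paragraph above, the following sketch extracts hyponym-hypernym pairs with one lexical pattern ("X such as A, B and C") and then applies the sibling deduction. It is an assumption-laden toy, not the method of any particular ATC system: the single pattern, the helper names (`extract_pairs`, `build_children`, `siblings`) and the sample sentence are all made up for illustration, and terms are not normalized (for example, plurals are kept as written).

```python
import re
from collections import defaultdict

# Separators allowed between listed hyponyms (longer alternatives first).
SEP = r"(?:, and |, or |, | and | or )"

# One illustrative pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
PATTERN = re.compile(rf"(\w+) such as ((?:\w+{SEP}?)+)")


def extract_pairs(text):
    """Return (hyponym, hypernym) pairs found by the single pattern above."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1)
        for hyponym in re.split(SEP, match.group(2)):
            if hyponym:  # skip empty strings left by a trailing separator
                pairs.append((hyponym, hypernym))
    return pairs


def build_children(pairs):
    """Group hyponyms under their hypernym: parent -> set of children."""
    children = defaultdict(set)
    for hyponym, hypernym in pairs:
        children[hypernym].add(hyponym)
    return children


def siblings(term, children):
    """Deduction step: terms sharing a parent with `term` are its siblings."""
    same_parent = {c for kids in children.values() if term in kids for c in kids}
    return same_parent - {term}


text = "He studied sciences such as physics, chemistry, and biology."
taxonomy = build_children(extract_pairs(text))
print(dict(taxonomy))
print(siblings("physics", taxonomy))
```

A single pattern like this will miss most relations and will pick up noise (for instance, determiners after "and"), which is one reason the paragraph also mentions statistical algorithms.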
A primary application of automatic taxonomy construction is in ontology learning, a central activity within ontology engineering. In computer science and artificial intelligence, an ontology is a conceptual model of a (subject) domain. A domain is a given subject area or specifically defined sphere of interest. An ontology of a domain includes the vocabulary of that domain and the relationships between those concepts or entities. The backbone of most ontologies is a taxonomy, and taxonomical structure may be used throughout an ontology.
As building taxonomies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.