Normalized Google distance: Difference between revisions
No edit summary |
|||
Line 15: | Line 15: | ||
==References== |
==References== |
||
*R.L. Cilibrasi and P.M.B. Vitanyi ( |
*R.L. Cilibrasi and P.M.B. Vitanyi (2004/2007). [http://arxiv.org/abs/cs.CV/0312044 ArXiv.org] [http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/trans/tk/&toc=comp/trans/tk/2007/03/k3toc.xml&DOI=10.1109/TKDE.2007.48 The Google similarity distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370–383.]. |
||
*R.L. Cilibrasi and P.M.B. Vitanyi ( |
*R.L. Cilibrasi and P.M.B. Vitanyi (2003/2005). [http://arxiv.org/abs/cs.CL/0412098 ArXiv.org] [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1412045&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1412045 Clustering by Compression, IEEE Trans. Information Theory, 51:4(2005), 1523 - 1545. ]. |
||
*[http://www.newscientist.com/article.ns?id=dn6924 Google's search for meaning] at Newscientist.com. |
*[http://www.newscientist.com/article.ns?id=dn6924 Google's search for meaning] at Newscientist.com. |
||
* R. Allen and Y. Wu, (2002) Generality of Texts [http://www.springerlink.com/content/h3ex0p8ky75e8d54/?MUD=MP] or [http://boballen.info/PAPERS/genconc.pdf], ICADL, Singapore, December, 2002, 111-116. R. Allen and Y. Wu (2005, submitted 2002). [http://onlinelibrary.wiley.com/doi/10.1002/asi.20202/abstract, Metrics for the Scope of a Collection] or [http://boballen.info/PAPERS/scope.pdf] JASIST, 55,(10), 2005, 1243-1249. |
* R. Allen and Y. Wu, (2002) Generality of Texts [http://www.springerlink.com/content/h3ex0p8ky75e8d54/?MUD=MP] or [http://boballen.info/PAPERS/genconc.pdf], ICADL, Singapore, December, 2002, 111-116. R. Allen and Y. Wu (2005, submitted 2002). [http://onlinelibrary.wiley.com/doi/10.1002/asi.20202/abstract, Metrics for the Scope of a Collection] or [http://boballen.info/PAPERS/scope.pdf] JASIST, 55,(10), 2005, 1243-1249. |
Revision as of 18:10, 27 November 2014
Google distance is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart.
Specifically, the normalized Google distance between two search terms x and y is
where M is the total number of web pages searched by Google; f(x) and f(y) are the number of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.
If the two search terms x and y never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero, or equivalent to the coefficient between x squared and y squared.
The normalized Google distance is derived from the earlier normalized compression distance (Cilibrasi & Vitanyi 2003). Allen and Yu (2002) proposed a related measure.
References
- R.L. Cilibrasi and P.M.B. Vitanyi (2004/2007). ArXiv.org The Google similarity distance, IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370–383..
- R.L. Cilibrasi and P.M.B. Vitanyi (2003/2005). ArXiv.org Clustering by Compression, IEEE Trans. Information Theory, 51:4(2005), 1523 - 1545. .
- Google's search for meaning at Newscientist.com.
- R. Allen and Y. Wu, (2002) Generality of Texts [1] or [2], ICADL, Singapore, December, 2002, 111-116. R. Allen and Y. Wu (2005, submitted 2002). Metrics for the Scope of a Collection or [3] JASIST, 55,(10), 2005, 1243-1249.
- J. Poland and Th. Zeugmann (2006), Clustering the Google Distance with Eigenvectors and Semidefinite Programming
- A. Gupta and T. Oates (2007), Using Ontologies and the Web to Learn Lexical Semantics (Includes comparison of NGD to other algorithms.)
- Wong, W., Liu, W. & Bennamoun, M. (2007) Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities. In: Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages 349–381. doi:10.1007/s10618-007-0073-y (the use of NGD for term clustering)