Jump to content

Normalized Google distance: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
GreenC bot (talk | contribs)
 
(22 intermediate revisions by 12 users not shown)
Line 1: Line 1:
The '''Normalized Google Distance''' is a [[semantic similarity]] [[similarity measure|measure]] derived from the number of hits returned by the [[Google search|Google search engine]] for a given [[Set (computer science)|set]] of [[keyword (internet search)|keyword]]s.<ref name="CV07">[http://arxiv.org/abs/cs.CL/0412098 The Google similarity distance on ArXiv.org] or [http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4072748&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4072748 R.L. Cilibrasi and P.M.B. Vitanyi, The Google similarity distance , IEEE Trans. Knowledge and Data Engineering, 19:3(2007), 370–383 or http://arxiv.org/abs/cs.CL/0412098]</ref> Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Normalized Google Distance, while words with dissimilar meanings tend to be farther apart.
The '''normalized Google distance''' ('''NGD''') is a [[semantic similarity]] [[similarity measure|measure]] derived from the number of hits returned by the [[Google Search|Google search engine]] for a given [[Set (computer science)|set]] of [[keyword (internet search)|keyword]]s.<ref name="CV07">{{cite journal|author= R.L. Cilibrasi |author2=P.M.B. Vitanyi|title=The Google similarity distance|journal=IEEE Trans. Knowledge and Data Engineering|volume=19|issue=3|date=2007|pages=370–383|arxiv=cs/0412098 |doi=10.1109/TKDE.2007.48|s2cid=59777 }}</ref> Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of normalized Google distance, while words with dissimilar meanings tend to be farther apart.


Specifically, the Normalized Google Distance (NGD) between two search terms ''x'' and ''y'' is
Specifically, the NGD between two search terms ''x'' and ''y'' is


:<math>
:<math>
Line 17: Line 17:
for "Shakespeare Macbeth" gave 20,800,000 hits.
for "Shakespeare Macbeth" gave 20,800,000 hits.
The number of pages indexed by Google was estimated by the number
The number of pages indexed by Google was estimated by the number
of hits of the search term "the," which was 25,270,000,000 hits. Assuming
of hits of the search term "the" which was 25,270,000,000 hits. Assuming
there are about 1,000 search terms on the average page this gives <math>N=25,270,000,000,000</math>.
there are about 1,000 search terms on the average page this gives <math>N=25,270,000,000,000</math>.
Hence
Hence
Line 25: Line 25:


==Introduction==
==Introduction==
The normalized Google distance is derived from the earlier [[normalized compression distance]].<ref name="CV04">{{cite journal|arxiv=cs.CV/0312044|url=http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/trans/tk/&toc=comp/trans/tk/2007/03/k3toc.xml&DOI=10.1109/TKDE.2007.48 |author= R.L. Cilibrasi |author2=P.M.B. Vitanyi|title= Clustering by Compression|journal=IEEE Trans. Inf. Theory|volume= 51|page=12|date=2005| doi=10.1109/TKDE.2007.48 | s2cid=59777 }}</ref><ref name="Li04">{{cite journal |author1=M. Li |author2=X. Chen |author3=X. Li |author4=B. Ma |author5=P.M.B. Vitanyi |title=The similarity metric|journal=IEEE Trans. Inf. Theory|volume=50|issue=12|date=December 2004|pages=3250–3264 |doi=10.1109/TIT.2004.838101 |publisher=[[IEEE]] |s2cid=221927 }}</ref>
The Normalized Google Distance is derived from the earlier [[Normalized Compression Distance]].
.<ref name="CV04">[http://arxiv.org/abs/cs.CV/0312044 Clustering by Compression on ArXiv.org] or [http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/trans/tk/&toc=comp/trans/tk/2007/03/k3toc.xml&DOI=10.1109/TKDE.2007.48 R.L. Cilibrasi and P.M.B. Vitanyi, Clustering by Compression, IEEE Trans. Information Theory, 51:12(2005).]</ref><ref name="Li04">{{cite web|url=http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1362909&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1362909 |title=M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi, The similarity metric, IEEE Trans. Inform. Th., 50:12(2004), 3250- 3264 |doi=10.1109/TIT.2004.838101 |publisher=Ieeexplore.ieee.org |date=2011-09-27 |accessdate=2012-11-03}}</ref>
Namely, objects can be given literally, like the literal four-letter genome of a mouse,
Namely, objects can be given literally, like the literal four-letter genome of a mouse,
or the literal text of [[Macbeth]] by [[Shakespeare]]. The similarity of these objects is given by the NCD. For
or the literal text of ''[[Macbeth]]'' by [[Shakespeare]]. The similarity of these objects is given by the NCD. For
simplicity we take it that all meaning of the object
simplicity we take it that all meaning of the object
is represented by the literal object itself. Objects can also be
is represented by the literal object itself. Objects can also be
given by name, like 'the four-letter genome of a mouse,'
given by name, like 'the four-letter genome of a mouse,'
or 'the text of [[Macbeth]] by [[Shakespeare]].' There are
or 'the text of ''[[Macbeth]]'' by [[Shakespeare]].' There are
also objects that cannot be given literally, but only by name,
also objects that cannot be given literally, but only by name,
and that acquire their meaning from their contexts in background common
and that acquire their meaning from their contexts in background common
knowledge in humankind, like 'home" or "red." The similarity between names for objects is
knowledge in humankind, like "home" or "red". The similarity between names for objects is
given by the NGD.
given by the NGD.


==Google Distribution and Google Code==
==Google distribution and Google code==
The probabilities of Google search terms, conceived as
The probabilities of Google search terms, conceived as
the frequencies of page counts returned by Google divided by
the frequencies of page counts returned by Google divided by
the number of pages indexed by Google (multplied by the average number
the number of pages indexed by Google (multiplied by the average number of search terms in those pages), approximate the actual relative frequencies of those search terms as actually used in society. Based on this premise, the relations represented by the normalized Google distance approximately capture
the assumed true semantic relations governing the search terms. In the NGD, the World Wide Web and Google are used. Other text corpora include [[Wikipedia]], the King James version of the [[Bible]] or the ''[[Oxford English Dictionary]]'' together with appropriate search engines.
of search terms in those pages),
approximate the actual relative frequencies of those search terms
as actually used in society. Based on this premise,
the relations represented by the
normalized Google distance
approximately capture
the assumed true semantic
relations governing the search terms. In the NGD the World Wide Web
and Google is used. Other text corpora
can be [[Wikipedia]], the King James version of the
[[Bible]] or the [[Oxford English Dictionary]] together with appropriate search engines.


==Properties==
==Properties==
The following properties are proved in:<ref name="CV07"/>
The following properties are proved in:<ref name="CV07"/>
*The NGD is roughly in between 0 and <math>\infty</math>. It can be slightly negative. For example, "red red" gives about 20% more hits of Google on the [[World Wide Web]] than "red." (Mid 2013 there were 4.260.000.000 hits for "red" and 5.500.000.000 hits for "red red". Presently, "red red" now returns far fewer results than "red".) If the <math>NGD(x,y) \geq 1</math> then we view x and y as very dissimilar.
*The NGD is roughly in between 0 and <math>\infty</math>. It can be slightly negative. For example, "red red" gives about 20% more hits of Google on the [[World Wide Web]] than "red". (Mid 2013 there were 4.260.000.000 hits for "red" and 5.500.000.000 hits for "red red". Presently, "red red" now returns far fewer results than "red".) If the <math>NGD(x,y) \geq 1</math> then we view x and y as very dissimilar.
*The NGD is not a [[metric (mathematics)|metric]]. In the beginning we have seen that the NGD is zero for x and y that are not equal provided x and y do always occur together on the same web page. From the NGD formula we see that it is [[symmetric]]. The [[triangle]] property is not satisfied by the NGD. However, these results are theoretic. It is hard to come up with practical examples of the [[World Wide Web]] using Google that violate the [[triangle]] property.
*The NGD is not a [[metric (mathematics)|metric]]. The NGD is zero for x and y that are not equal provided x and y do always occur together on the same web page. From the NGD formula we see that it is [[symmetric]]. The triangle property is not satisfied by the NGD. However, these results are theoretic. It is hard to come up with practical examples of the [[World Wide Web]] using Google that violate the triangle property.


==Applications==
==Applications==
Applications to colors versus numbers, [[primes]] versus non-primes and so are given in,<ref name="CV07"/> as well as a randomized massive experiment using [[WordNet]] categories. In the primes versus non-primes case and the WordNet experiment the NGD method is augmented with a [[support vector machine]] classifier. The experiments consist of 25 positive examples and 25 negative ones. The WordNet experiment consisted of 100 random WordNet categories. The NGD method had a success rate of 87.25%. The mean is 0.8725 while the standard deviation was 0.1169. These rates are about agreement with the WordNet categories which represent the knowledge of researchers with PhDs which entered them. It is rare to see agreement less than 75%.
Applications to colors versus numbers, [[primes]] versus non-primes and so are given in,<ref name="CV07"/>
as well as a randomized massive experiment using [[WordNet]] categories. In the primes versus non-primes case
and the [[WordNet]] experiment the NGD method is augmented with a [[Support Vector Machine]] classifier.
The experiments consist of 25 positive examples and 25 negative ones. The [[WordNet]] experiment consisted of 100 random [[WordNet]] categories. The NGD method had a success rate of 87.25%. That is the mean is
0.8725 while the standard deviation was 0.1169. These rates are about agreement with the [[WordNet]] categories which represent the knowledge of researchers with PhD's which entered them. It is rare to see agreement less than 75%.


==References==
==References==
{{reflist}}
{{reflist}}


==Related Literature==
==Further reading==
*{{cite journal|author1= R. Allen |author2=Y. Wu|name-list-style=amp|url=https://doi.org/10.1002/asi.20202 |title=Metrics for the Scope of a Collection|journal=JASIST|date=2005|volume=55|issue=10|pages=1243–1249|doi=10.1002/asi.20202}}
* M. Li and P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 2008, Third Edition
*{{cite book|author= M. Li |author2=P.M.B. Vitanyi|name-list-style=amp|title=An Introduction to Kolmogorov Complexity and Its Applications|publisher=Springer|date=2019|edition=Fourth}}
*[https://www.newscientist.com/article.ns?id=dn6924 Google's search for meaning] at Newscientist.com.
*{{cite web|url=https://www.newscientist.com/article.ns?id=dn6924|title= Google's search for meaning|date=January 28, 2005 |author=Duncan Graham-Rowe|website=New Scientist|archive-url=https://web.archive.org/web/20050205090757/https://www.newscientist.com/article.ns?id=dn6924|archive-date=February 5, 2005|url-status=dead}}
*J. Poland and Th. Zeugmann (2006), [http://www-alg.ist.hokudai.ac.jp/~thomas/publications/dag_c2c_pz.pdf Clustering the Google Distance with Eigenvectors and Semidefinite Programming]
*{{cite conference|author1=J. Poland |author2= Th. Zeugmann|date=2006|name-list-style=amp|url=http://www-alg.ist.hokudai.ac.jp/~thomas/publications/dag_c2c_pz.pdf|title=Clustering the Google Distance with Eigenvectors and Semidefinite Programming|conference=Knowledge Media Technologies, First International Core-to-Core Workshop|location=Dagstuhl, Germany|number=21|pages=61–69}}
*A. Gupta and T. Oates (2007), [http://dli.iiit.ac.in/ijcai/IJCAI-2007/PDF/IJCAI07-261.pdf Using Ontologies and the Web to Learn Lexical Semantics] (Includes comparison of NGD to other algorithms.)
*{{cite conference|author1=A. Gupta |author2= T. Oates|name-list-style=amp|date=2007|url=http://dli.iiit.ac.in/ijcai/IJCAI-2007/PDF/IJCAI07-261.pdf|title= Using Ontologies and the Web to Learn Lexical Semantics|conference=IJCAI'07: Proceedings of the 20th international joint conference on Artificial intelligence|pages=1618–1623|archive-url=https://web.archive.org/web/20090219063936/http://dli.iiit.ac.in/ijcai/IJCAI-2007/PDF/IJCAI07-261.pdf|archive-date=19 February 2009}} (Includes comparison of NGD to other algorithms.)
* Wong, W., Liu, W. & Bennamoun, M. (2007) Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities. In: Data Mining and Knowledge Discovery, Volume 15, Issue 3, Pages 349–381. {{doi|10.1007/s10618-007-0073-y}} (the use of NGD for term clustering)
* {{cite journal|author1=Wong, W.|author2=Liu, W.|author3= Bennamoun, M.|date=2007|name-list-style=amp|title=Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities|journal=Data Mining and Knowledge Discovery|volume=15|issue=3|pages=349–381|doi=10.1007/s10618-007-0073-y|s2cid=14924678 }} (the use of NGD for term clustering)


[[Category:Computational linguistics]]
[[Category:Computational linguistics]]

Latest revision as of 04:32, 31 July 2024

The normalized Google distance (NGD) is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords.[1] Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of normalized Google distance, while words with dissimilar meanings tend to be farther apart.

Specifically, the NGD between two search terms x and y is

where N is the total number of web pages searched by Google multiplied by the average number of singleton search terms occurring on pages; f(x) and f(y) are the number of hits for search terms x and y, respectively; and f(xy) is the number of web pages on which both x and y occur.

If the then x and y are viewed as alike as possible, but if then x and y are very different. If the two search terms x and y never occur together on the same web page, but do occur separately, the NGD between them is infinite. If both terms always occur together, their NGD is zero.

Example: On 9 April 2013, googling for "Shakespeare" gave 130,000,000 hits; googling for "Macbeth" gave 26,000,000 hits; and googling for "Shakespeare Macbeth" gave 20,800,000 hits. The number of pages indexed by Google was estimated by the number of hits of the search term "the" which was 25,270,000,000 hits. Assuming there are about 1,000 search terms on the average page this gives . Hence

.

"Shakespeare" and "Macbeth" are very much alike according to the relative semantics supplied by Google.

Introduction

[edit]

The normalized Google distance is derived from the earlier normalized compression distance.[2][3] Namely, objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of Macbeth by Shakespeare. The similarity of these objects is given by the NCD. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like 'the four-letter genome of a mouse,' or 'the text of Macbeth by Shakespeare.' There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like "home" or "red". The similarity between names for objects is given by the NGD.

Google distribution and Google code

[edit]

The probabilities of Google search terms, conceived as the frequencies of page counts returned by Google divided by the number of pages indexed by Google (multiplied by the average number of search terms in those pages), approximate the actual relative frequencies of those search terms as actually used in society. Based on this premise, the relations represented by the normalized Google distance approximately capture the assumed true semantic relations governing the search terms. In the NGD, the World Wide Web and Google are used. Other text corpora include Wikipedia, the King James version of the Bible or the Oxford English Dictionary together with appropriate search engines.

Properties

[edit]

The following properties are proved in:[1]

  • The NGD is roughly in between 0 and . It can be slightly negative. For example, "red red" gives about 20% more hits of Google on the World Wide Web than "red". (Mid 2013 there were 4.260.000.000 hits for "red" and 5.500.000.000 hits for "red red". Presently, "red red" now returns far fewer results than "red".) If the then we view x and y as very dissimilar.
  • The NGD is not a metric. The NGD is zero for x and y that are not equal provided x and y do always occur together on the same web page. From the NGD formula we see that it is symmetric. The triangle property is not satisfied by the NGD. However, these results are theoretic. It is hard to come up with practical examples of the World Wide Web using Google that violate the triangle property.

Applications

[edit]

Applications to colors versus numbers, primes versus non-primes and so are given in,[1] as well as a randomized massive experiment using WordNet categories. In the primes versus non-primes case and the WordNet experiment the NGD method is augmented with a support vector machine classifier. The experiments consist of 25 positive examples and 25 negative ones. The WordNet experiment consisted of 100 random WordNet categories. The NGD method had a success rate of 87.25%. The mean is 0.8725 while the standard deviation was 0.1169. These rates are about agreement with the WordNet categories which represent the knowledge of researchers with PhDs which entered them. It is rare to see agreement less than 75%.

References

[edit]
  1. ^ a b c R.L. Cilibrasi; P.M.B. Vitanyi (2007). "The Google similarity distance". IEEE Trans. Knowledge and Data Engineering. 19 (3): 370–383. arXiv:cs/0412098. doi:10.1109/TKDE.2007.48. S2CID 59777.
  2. ^ R.L. Cilibrasi; P.M.B. Vitanyi (2005). "Clustering by Compression". IEEE Trans. Inf. Theory. 51: 12. arXiv:cs.CV/0312044. doi:10.1109/TKDE.2007.48. S2CID 59777.
  3. ^ M. Li; X. Chen; X. Li; B. Ma; P.M.B. Vitanyi (December 2004). "The similarity metric". IEEE Trans. Inf. Theory. 50 (12). IEEE: 3250–3264. doi:10.1109/TIT.2004.838101. S2CID 221927.

Further reading

[edit]