Sheerer's Inequality

From formulasearchengine
Revision as of 05:41, 19 October 2011 by en>Fastily (Reverted edits by FSII (talk) to last version by Rich Farmbrough)

Google distance is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart.

Specifically, the normalized Google distance between two search terms x and y is

\[
\operatorname{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}
\]

where M is the total number of web pages searched by Google; f(x) and f(y) are the numbers of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.
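As a rough illustration of the arithmetic, suppose (purely hypothetical counts, not real search results) that M = 10^10, f(x) = 10^6, f(y) = 10^5 and f(x, y) = 10^4. The base of the logarithm cancels in the ratio, so base 10 is used here for convenience:

```latex
\[
\operatorname{NGD}(x,y)
  = \frac{\max\{6,\,5\} - 4}{10 - \min\{6,\,5\}}
  = \frac{6 - 4}{10 - 5}
  = 0.4
\]
```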

If the two search terms x and y never occur together on the same web page, but do occur separately, then f(x, y) = 0, its logarithm diverges to negative infinity, and the normalized Google distance between them is infinite. If both terms always occur together, then f(x) = f(y) = f(x, y), the numerator vanishes, and their NGD is zero.
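The formula and the two limiting cases can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `ngd` and its parameter names are illustrative assumptions, and the hit counts are taken as inputs supplied by the caller instead of being fetched from any search engine.

```python
import math


def ngd(fx: float, fy: float, fxy: float, m: float) -> float:
    """Normalized Google distance from raw hit counts (hypothetical helper).

    fx, fy -- hits for each term alone
    fxy    -- pages on which both terms occur
    m      -- total number of pages indexed
    """
    if fx <= 0 or fy <= 0:
        # The formula is only defined when each term occurs somewhere.
        raise ValueError("each term must have at least one hit")
    if m <= min(fx, fy):
        # The denominator would be zero or negative.
        raise ValueError("m must exceed the smaller of the two hit counts")
    if fxy <= 0:
        # Terms never co-occur: log f(x, y) diverges, so the distance is infinite.
        return math.inf

    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(m) - min(log_fx, log_fy)
    return numerator / denominator
```

With made-up counts, `ngd(1000, 1000, 1000, 10**6)` gives zero because the terms always co-occur, while `ngd(1000, 500, 0, 10**6)` returns `math.inf`.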

The normalized Google distance is derived from the earlier normalized compression distance (Cilibrasi & Vitanyi 2003). Allen and Yu (2002) proposed a related measure.

References

