Jeans equations

Revision as of 16:30, 20 April 2013 by 208.113.47.70 (talk)

Second-order co-occurrence pointwise mutual information (SOC-PMI) is a semantic similarity measure that uses pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus. PMI-IR used AltaVista's Advanced Search query syntax to calculate probabilities. The "NEAR" search operator of AltaVista is essential to the PMI-IR method. However, AltaVista no longer supports it, so from the implementation point of view the PMI-IR method cannot be used in the same form in new systems. From the algorithmic point of view, the advantage of SOC-PMI is that it can calculate the similarity between two words that do not co-occur frequently, because they co-occur with the same neighboring words. For example, the British National Corpus (BNC) has been used as a source of frequencies and contexts.

The method considers the words that are common to both lists and aggregates their PMI values (from the opposite list) to calculate the relative semantic similarity. We define the pointwise mutual information function only for those words having $f^b(t_i, w) > 0$,

$$f^{pmi}(t_i, w) = \log_2 \frac{f^b(t_i, w) \times m}{f^t(t_i)\, f^t(w)},$$

where $f^t(t_i)$ is the number of times the type $t_i$ appears in the entire corpus, $f^b(t_i, w)$ is the number of times word $t_i$ appears with word $w$ in a context window, and $m$ is the total number of tokens in the corpus. Now, for word $w$, we define a set of words, $X^w$, sorted in descending order by their PMI values with $w$, and take the top $\beta$ words having $f^{pmi}(t_i, w) > 0$.
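The PMI definition above can be sketched in a few lines of Python. This is only an illustration: the function names `cooccurrence_counts` and `f_pmi`, the symmetric fixed-size window, and the choice to return `None` where the function is undefined are all assumptions, since the text does not specify how the context window is formed.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Return (ft, fb): type frequencies and windowed co-occurrence counts.

    Assumption: the context window is symmetric, covering `window`
    tokens on each side of the current token.
    """
    ft = Counter(tokens)
    fb = Counter()
    for i, t in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                fb[(t, tokens[j])] += 1
    return ft, fb

def f_pmi(t, w, ft, fb, m):
    """log2(fb(t, w) * m / (ft(t) * ft(w))), defined only when fb(t, w) > 0."""
    b = fb.get((t, w), 0)
    if b == 0:
        return None  # the function is not defined for this pair
    return math.log2(b * m / (ft[t] * ft[w]))

tokens = "a b a b c".split()
ft, fb = cooccurrence_counts(tokens, window=1)
print(f_pmi("a", "b", ft, fb, len(tokens)))
```

Here `m` is passed as `len(tokens)`, matching the definition of $m$ as the total number of tokens in the corpus.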

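The set $X^w$ contains words $X_i^w$,

$$X^w = \{X_i^w\}, \quad \text{where } i = 1, 2, \ldots, \beta \text{ and}$$
$$f^{pmi}(X_1^w, w) \geq f^{pmi}(X_2^w, w) \geq \cdots \geq f^{pmi}(X_{\beta-1}^w, w) \geq f^{pmi}(X_\beta^w, w)$$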
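Building the sorted neighbor list amounts to filtering out non-positive PMI values, sorting in descending order, and truncating to the top β entries. A minimal sketch, assuming the PMI values with the target word have already been computed and stored in a dictionary (the name `top_beta_neighbors` and the input layout are hypothetical):

```python
def top_beta_neighbors(pmi_with_w, beta):
    """Return up to `beta` neighbor words with positive PMI, highest first.

    pmi_with_w: dict mapping a neighbor word t to its PMI value with w.
    """
    positive = [(t, v) for t, v in pmi_with_w.items() if v > 0]
    positive.sort(key=lambda pair: pair[1], reverse=True)
    return [t for t, _ in positive[:beta]]

print(top_beta_neighbors({"x": 2.0, "y": -1.0, "z": 3.5, "u": 0.5}, 2))
```

If fewer than β words have positive PMI, the returned list is simply shorter than β.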

A rule of thumb is used to choose the value of $\beta$. The $\beta$-PMI summation function of a word is defined with respect to another word. For word $w_1$ with respect to word $w_2$ it is:

$$f(w_1, w_2, \beta) = \sum_{i=1}^{\beta} \left( f^{pmi}(X_i^{w_1}, w_2) \right)^{\gamma}$$

where $f^{pmi}(X_i^{w_1}, w_2) > 0$. The function sums the positive PMI values of the words in the set $X^{w_2}$ that also appear in the set $X^{w_1}$. In other words, it aggregates the positive PMI values of all the semantically close words of $w_2$ that also occur in $w_1$'s list. The exponent $\gamma$ should have a value greater than 1. So, the $\beta$-PMI summation function for word $w_1$ with respect to word $w_2$ having $\beta = \beta_1$ and the $\beta$-PMI summation function for word $w_2$ with respect to word $w_1$ having $\beta = \beta_2$ are

$$f(w_1, w_2, \beta_1) = \sum_{i=1}^{\beta_1} \left( f^{pmi}(X_i^{w_1}, w_2) \right)^{\gamma}$$

and

$$f(w_2, w_1, \beta_2) = \sum_{i=1}^{\beta_2} \left( f^{pmi}(X_i^{w_2}, w_1) \right)^{\gamma}$$

respectively.

Finally, the semantic PMI similarity function between the two words, w1 and w2, is defined as

$$\mathrm{Sim}(w_1, w_2) = \frac{f(w_1, w_2, \beta_1)}{\beta_1} + \frac{f(w_2, w_1, \beta_2)}{\beta_2}.$$
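The two summations and the similarity function can be sketched together as follows. This is a reading of the definitions above, not a reference implementation: the function names, the dictionary representation of each neighbor list, and the default `gamma=2.0` are assumptions chosen for illustration (the text only requires γ to be greater than 1).

```python
def beta_pmi_sum(neighbors_a, neighbors_b, gamma=2.0):
    """Sum f_pmi(x, b) ** gamma over words x in a's list that are also in b's list.

    neighbors_a: dict word -> PMI with a (the top-beta list of a)
    neighbors_b: dict word -> PMI with b (the top-beta list of b)
    The PMI values are read from the opposite list, i.e. from neighbors_b.
    """
    return sum(
        neighbors_b[x] ** gamma
        for x in neighbors_a
        if x in neighbors_b and neighbors_b[x] > 0
    )

def soc_pmi_sim(neighbors_1, beta_1, neighbors_2, beta_2, gamma=2.0):
    """Sim(w1, w2) = f(w1, w2, beta_1) / beta_1 + f(w2, w1, beta_2) / beta_2."""
    return (
        beta_pmi_sum(neighbors_1, neighbors_2, gamma) / beta_1
        + beta_pmi_sum(neighbors_2, neighbors_1, gamma) / beta_2
    )

# Hypothetical neighbor lists for two target words.
n1 = {"x": 2.0, "y": 1.0, "z": 3.0}
n2 = {"x": 1.0, "y": 2.0, "q": 4.0}
print(soc_pmi_sim(n1, 5, n2, 2))
```

Only the shared words `x` and `y` contribute in this example; `z` and `q` each appear in one list only and are ignored.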

The semantic word similarity is normalized so that it provides a similarity score between 0 and 1 inclusive. The normalization algorithm takes as arguments the two words, $r_i$ and $s_j$, and a maximum value, $\lambda$, that can be returned by the semantic similarity function $\mathrm{Sim}()$, and returns the normalized similarity score. For example, the algorithm returns 0.986 for the words cemetery and graveyard with $\lambda = 20$ (for the SOC-PMI method).
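The text does not spell out the normalization step itself. One plausible sketch, assuming the raw score is divided by λ and clipped to the interval [0, 1], is shown below; both the function name and the clipping rule are assumptions rather than something the text confirms.

```python
def normalize_similarity(raw_sim, lam):
    """Map a raw Sim() value into [0, 1] using the maximum value lam.

    Assumption: scores at or above lam map to 1, others are scaled by 1/lam.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    return min(max(raw_sim / lam, 0.0), 1.0)

print(normalize_similarity(19.72, 20))
```

Under this assumption, a hypothetical raw score of 19.72 with λ = 20 would map to approximately 0.986, the value quoted for cemetery and graveyard.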
