Borel hierarchy: Difference between revisions

Revision as of 22:41, 28 July 2013

In natural language processing a w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.

The document, "a rose is a rose is a rose" can be tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (N-grams, here: 4-grams) is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } = { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

r (A, B) = \frac{| S (A) \cap S (B) |}{| S (A) \cup S (B) |}

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

References

(Manber 1993) Finding Similar Files in a Large File System. Does not yet use the term "shingling". Available as PDF
(Broder, Glassman, Manasse, and Zweig 1997) Syntactic Clustering of the Web. SRC Technical Note #1997-015. Available at HTML here

External links

An implementation of the shingling algorithm in C++

@@ Line 1: / Line 1: @@
-They contact me Emilia. Body building is one of the things I adore most. For a while she's been in South Dakota. My working day occupation is a librarian.<br><br>Also visit my blog post - std home test ([http://dore.gia.ncnu.edu.tw/88ipart/node/1326254 try this web-site])
+{{lowercase|title=w-shingling}}
+In [[natural language processing]] a '''w-shingling''' is a set of unique "shingles"—contiguous [[subsequence]]s of [[token (parser)|tokens]] in a [[document]]—that can be used to gauge the [[document classification|similarity of two documents]].  The w denotes the number of tokens in each shingle in the set.
+The document, "a rose is a rose is a rose" can be [[tokenization|tokenized]] as follows:
+:(a,rose,is,a,rose,is,a,rose)
+The set of all contiguous sequences of 4 tokens ([[N-gram]]s, here: 4-grams) is
+:{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } = { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
+== Resemblance ==
+For a given shingle size, the degree to which two documents ''A'' and ''B'' resemble each other can be expressed as the ratio of the magnitudes of their shinglings' [[intersection (set theory)|intersection]] and [[union (set theory)|union]], or
+:<math>r(A,B)={{|S(A)\cap S(B)|}\over {|S(A)\cup S(B)|}}</math>
+where |A| is the size of set A.  The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the [[Jaccard coefficient]] describing similarity and diversity of sample sets.
+== See also ==
+*[[Concept mining]] offers an alternative method for document similarity calculation with more computational complexity, but where the measure more closely models a human's perception of document similarity.
+*[[N-gram]]
+*[[k-mer]]
+*[[MinHash]]
+*[[Rolling hash]]
+*[[Rabin fingerprint]]
+*[[Vector space model]]
+*[[Bag-of-words model]]
+== References ==
+*(Manber 1993) ''Finding Similar Files in a Large File System''. Does not yet use the term "shingling". Available [http://webglimpse.net/pubs/TR93-33.pdf as PDF]
+*(Broder, Glassman, Manasse, and Zweig 1997) ''Syntactic [[Cluster analysis|Clustering]] of the Web''.  SRC Technical Note #1997-015.  Available at HTML [http://www.std.org/~msm/common/clustering.html here]
+==External links==
+*[http://codingplayground.blogspot.com/2008/06/shingling-and-text-clustering.html An implementation of the shingling algorithm in C++]
+[[Category:Natural language processing]]

Borel hierarchy: Difference between revisions

Revision as of 22:41, 28 July 2013

Contents

Resemblance

See also

References

External links

Navigation menu

Borel hierarchy: Difference between revisions

Revision as of 22:41, 28 July 2013

Resemblance

See also

References

External links

Navigation menu

Search