{{Probabilistic}}
 
A '''quotient filter''', introduced by Bender, Farach-Colton, Johnson, Kuszmaul, Medjedovic, Montes, Shetty, Spillane, and Zadok in 2011,<ref name="Bender">{{cite conference|last1=Bender|first1=Michael A.|last2=Farach-Colton|first2=Martin|last3=Johnson|first3=Rob|last4=Kuszmaul|first4=Bradley C.|last5=Medjedovic|first5=Dzejla|last6=Montes|first6=Pablo|last7=Shetty|first7=Pradeep|last8=Spillane|first8=Richard P.|last9=Zadok|first9=Erez|displayauthors=9|title=Don't thrash: how to cache your hash on flash|booktitle=Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems (HotStorage'11)|date=June 2011|url=http://static.usenix.org/events/hotstorage11/tech/final_files/Bender.pdf|accessdate=21 July 2012}}</ref><ref>{{cite journal|last1=Bender|first1=Michael A.|last2=Farach-Colton|first2=Martin|last3=Johnson|first3=Rob|last4=Kraner|first4=Russell|last5=Kuszmaul|first5=Bradley C.|last6=Medjedovic|first6=Dzejla|last7=Montes|first7=Pablo|last8=Shetty|first8=Pradeep|last9=Spillane|first9=Richard P.|last10=Zadok|first10=Erez|title=Don't thrash: how to cache your hash on flash|journal=The Proceedings of the VLDB Endowment (PVLDB)|volume=5|number=11|year=2012|pages=1627–1637|url=http://vldb.org/pvldb/vol5/p1627_michaelabender_vldb2012.pdf}}</ref> is a kind of [[approximate membership query]] (AMQ).
An AMQ is a space-efficient [[probabilistic]] [[data structure]] used to test whether an [[element (mathematics)|element]] is a member of a [[set (computer science)|set]]. A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; ''i.e.'', the test does not generate [[Type I and type II errors|false negative]]s. But with the latter result there is some probability, ε, of the test returning "in the set" when in fact the element is not present in the set (''i.e.'', a [[Type I and type II errors|false positive]]). There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and, optionally, "delete". The more elements that are added to the set, the larger the probability of false positives.
 
[[File:Bloom filter speed.svg|thumb|360px|An approximate membership query (AMQ) filter used to speed up answers in a key-value storage system. Key-value pairs are stored on a disk, which has slow access times. AMQ filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the AMQ filter than without it. Use of an AMQ filter for this purpose, however, does increase memory usage.]]
 
Each AMQ is associated with a more space-consuming set, such as a [[B-tree]], and its contents are reflective of the associated set. As elements&nbsp;– key/value pairs&nbsp;– are added to the set, their keys are also added to the AMQ.  However the AMQ stores only a few bits per key, whereas the set stores the entire key, which can be of arbitrary size; therefore, an AMQ can often be memory-resident while the associated set is stored in slower secondary storage. Thus this association can dramatically improve the performance of membership tests, because a test that results in "absent" can be resolved by the AMQ without necessitating any I/Os to access the set itself.
 
A quotient filter has the usual AMQ operations of insert and query.  In addition it can also be merged and re-sized without having to re-hash the original keys (thereby avoiding the need to access those keys from secondary storage).  This property benefits certain kinds of [[log-structured merge-tree]]s.
 
==Algorithm description==
The quotient filter is a ''compact'' [[hash table]].  Cleary<ref name=Cleary>{{cite journal|last=Cleary|first=J.G.|title=Compact hash tables using bidirectional linear probing|journal=IEEE T. Comput.|year=1984|volume=33|issue=9|pages=828–834|doi=10.1109/TC.1984.1676499}}</ref> defines a compact hash table as one in which the table entries contain only a portion of the key plus some additional meta-data bits.  These bits are used to deal with the case when distinct keys happen to hash to the same table entry. By contrast, other types of hash tables that deal with such collisions by linking to overflow areas are not compact, because the overhead due to linkage can exceed the storage used to store the key.<ref name="Cleary"/> In a quotient filter a [[hash function]] generates a ''p''-bit fingerprint. The ''r'' least significant bits are called the remainder, while the ''q'' = ''p'' - ''r'' most significant bits are called the quotient, hence the name ''quotienting'', coined by [[Donald Knuth|Knuth]].<ref>{{cite book|last=Knuth|first=Donald|title=The Art of Computer Programming:Searching and Sorting, volume 3|year=1973|publisher=Addison Wesley|location=Section 6.4, exercise 13}}</ref>
The hash table has 2<sup>q</sup> slots.
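The split of a fingerprint into quotient and remainder can be sketched in a few lines of Python. This is only an illustration: the function names <code>fingerprint</code> and <code>split</code> are hypothetical, and SHA-256 is an arbitrary stand-in for the hash function, which the description above does not specify.

```python
import hashlib

def fingerprint(key: bytes, p: int) -> int:
    """Return a p-bit fingerprint of key (SHA-256 is an assumed, arbitrary hash)."""
    digest = hashlib.sha256(key).digest()
    # Keep the p most significant bits of the 256-bit digest.
    return int.from_bytes(digest, "big") >> (256 - p)

def split(fp: int, r: int):
    """Split a fingerprint into (quotient, remainder).

    The r least significant bits are the remainder; the remaining
    high-order bits are the quotient, i.e. the canonical slot index.
    """
    return fp >> r, fp & ((1 << r) - 1)
```

For example, with ''r'' = 4 the 6-bit fingerprint <code>0b101101</code> splits into quotient <code>0b10</code> and remainder <code>0b1101</code>.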
 
For some key ''d'' which hashes to the fingerprint ''d<sub>H</sub>'', let its quotient be ''d<sub>Q</sub>'' and the remainder be ''d<sub>R</sub>''.
The quotient filter tries to store the remainder in slot ''d<sub>Q</sub>'', which is known as the ''canonical slot''.
However the canonical slot might already be occupied because multiple keys can hash to the same fingerprint—a ''hard collision''—or because even when the keys' fingerprints are distinct they can have the same quotient—a ''soft collision''. If the canonical slot is occupied then the remainder is stored in some slot to the right.
 
As described below, the insertion algorithm ensures that all fingerprints having the same quotient are stored in contiguous slots. Such a set of fingerprints is defined as a ''run''.<ref name="Bender"/>  Note that a run's first fingerprint might not occupy its canonical slot if the run has been forced right by some run to the left.
 
However a run whose first fingerprint occupies its canonical slot indicates the start of a ''cluster''.<ref name="Bender"/> The initial run and all subsequent runs comprise the cluster, which terminates at an unoccupied slot or the start of another cluster.
 
Each slot holds three additional meta-data bits, which are used to reconstruct the fingerprint stored in the slot.  They have the following function:
* '''is_occupied''' is set when a slot is the canonical slot for some key stored (somewhere) in the filter (but not necessarily in this slot).
* '''is_continuation''' is set when a slot is occupied but not by the first remainder in a run.
* '''is_shifted''' is set when the remainder in a slot is not in its canonical slot.
 
The various combinations have the following meaning:
{| class="wikitable"
|-
! is_occupied !! is_continuation !! is_shifted !! Meaning
|-
| 0 || 0 || 0 || Empty slot.
|-
| 0 || 0 || 1 || Slot is holding the start of a run that has been shifted from its canonical slot.
|-
| 0 || 1 || 0 || Not used.
|-
| 0 || 1 || 1 || Slot is holding the continuation of a run that has been shifted from its canonical slot.
|-
| 1 || 0 || 0 || Slot is holding the start of a run that is in its canonical slot.
|-
| 1 || 0 || 1 || Slot is holding the start of a run that has been shifted from its canonical slot. Also, the run for which this is the canonical slot exists but is shifted right.
|-
| 1 || 1 || 0 || Not used.
|-
| 1 || 1 || 1 || Slot is holding the continuation of a run that has been shifted from its canonical slot. Also, the run for which this is the canonical slot exists but is shifted right.
|}
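The combinations can also be read off mechanically. The following Python sketch is illustrative only (the function name <code>classify</code> and its return strings are hypothetical); it rejects the two unused combinations, which would describe a continuation that is not shifted.

```python
def classify(is_occupied: bool, is_continuation: bool, is_shifted: bool) -> str:
    """Describe what a slot holds, given its three meta-data bits."""
    if is_continuation and not is_shifted:
        # 0 1 0 and 1 1 0 are not used: a continuation is always shifted.
        raise ValueError("unused combination")
    if not (is_occupied or is_continuation or is_shifted):
        return "empty"
    role = "continuation" if is_continuation else "run start"
    place = "shifted" if is_shifted else "canonical"
    return f"{role}, {place}"
```

Note that ''is_occupied'' does not influence the description of the remainder held in the slot; it only records whether a run for this slot's quotient exists somewhere in the filter.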
 
===Lookup===
We can test if a quotient filter contains some key, d, as follows.<ref name="Bender"/>
 
We hash the key to produce its fingerprint, d<sub>H</sub>, which we then partition into its high-order q bits, d<sub>Q</sub>, which comprise its quotient, and its low-order r bits, d<sub>R</sub>, which comprise its remainder. Slot d<sub>Q</sub> is the key's canonical slot. That slot is empty if its three meta-data bits are false.  In that case the filter does not contain the key.
 
If the canonical slot is occupied then we must locate the quotient's run.  The set of slots that hold remainders belonging to the same quotient are stored contiguously and these comprise the quotient's run. The first slot in the run might be the canonical slot but it is also possible that the entire run has been shifted to the right by the encroachment from the left of another run.
 
To locate the quotient's run we must first locate the start of the cluster.  The cluster consists of a contiguous set of runs.  Starting with the quotient's canonical slot we can scan left to locate the start of the cluster, then scan right to locate the quotient's run.
 
We scan left from the canonical slot until we reach a slot whose ''is_shifted'' bit is clear; this slot is the start of the cluster.  During the left scan we keep a running count of the runs that must be located: each slot from the canonical slot back to the start of the cluster that has ''is_occupied'' '''set''' corresponds to one run in the cluster, so we increment the running count.  We then scan right from the start of the cluster.  Each slot having ''is_continuation'' '''clear''' indicates the start of another run, and thus the end of the previous run, so we decrement the running count.  When the running count reaches zero, we are scanning the quotient's run.  We compare the remainder in each slot in the run with d<sub>R</sub>.  If a match is found, we report that the key is (probably) in the filter; otherwise we report that the key is definitely not in the filter.
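The lookup procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the <code>Slot</code> tuple, the function <code>may_contain</code> and the numeric remainders are hypothetical, the table is assumed to contain at least one empty slot, and the sketch tests only the canonical slot's ''is_occupied'' bit, which also covers the case where that slot holds a shifted remainder belonging to another quotient.

```python
from collections import namedtuple

# One table slot: the three meta-data bits plus the stored remainder.
Slot = namedtuple("Slot", "is_occupied is_continuation is_shifted remainder")
EMPTY = Slot(False, False, False, None)

def may_contain(slots, d_q, d_r):
    """Return True if remainder d_r is stored in the run of quotient d_q."""
    m = len(slots)
    if not slots[d_q].is_occupied:
        return False                      # no run exists for this quotient
    # Scan left to the cluster start, counting runs up to and including ours.
    pos, runs = d_q, 0
    while True:
        if slots[pos].is_occupied:
            runs += 1
        if not slots[pos].is_shifted:
            break                         # cluster start reached
        pos = (pos - 1) % m
    # Scan right; every slot with is_continuation clear starts a run.
    while True:
        if not slots[pos].is_continuation:
            runs -= 1
            if runs == 0:
                break                     # pos is the start of our run
        pos = (pos + 1) % m
    # Compare the remainders in the run.
    while True:
        if slots[pos].remainder == d_r:
            return True
        pos = (pos + 1) % m
        if not slots[pos].is_continuation:
            return False

# A single cluster in slots 1-5 holding runs for quotients 1, 2 and 4
# (remainder values are made up for illustration).
table = [EMPTY] * 8
table[1] = Slot(True, False, False, 1)
table[2] = Slot(True, True, True, 2)
table[3] = Slot(False, True, True, 3)
table[4] = Slot(True, False, True, 5)
table[5] = Slot(False, False, True, 6)
```

With this table, <code>may_contain(table, 4, 6)</code> finds the run for quotient 4 in slot 5, while <code>may_contain(table, 3, 3)</code> fails immediately because slot 3 is not marked as occupied.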
 
===Lookup example===
[[File:Quotient Filter States.svg|thumb|540px|An example quotient filter showing in order the insertion of elements ''b'', ''e'', ''f'', ''c'', ''d'' and ''a'']]
Take, for example, looking up element ''e''. See state 3 in the figure.  We would compute ''hash(e)'', partition it into its remainder, e<sub>R</sub> and its quotient e<sub>Q</sub>, which is 4. Scanning left from slot 4 we encounter three ''is_occupied'' slots, at indexes 4, 2 and 1, indicating e<sub>Q</sub>'s run is the 3<sup>rd</sup> run in the cluster. The scan stops at slot 1, which we detect as the cluster's start because it is not empty and not shifted. Now we must scan right to the 3<sup>rd</sup> run. The start of a run is indicated by ''is_continuation'' being false. The 1<sup>st</sup> run is found at index 1, the 2<sup>nd</sup> at 4 and the 3<sup>rd</sup> at 5.  We compare the remainder held in each slot in the run that starts at index 5.  There is only one slot in that run but, in our example, its remainder equals e<sub>R</sub>, indicating that ''e'' is indeed a member of the filter, with probability 1 - ε.
 
===Insertion===
Insertion follows a path similar to lookup until we ascertain that the key is definitely not in the filter.<ref name="Bender"/> At that point we insert the remainder in a slot in the current run, a slot chosen to keep the run in sorted order. We shift forward the remainders in any slots in the cluster at or after the chosen slot and update the slot bits.
 
* Shifting a slot's remainder does not affect the slot's ''is_occupied'' bit because it pertains to the slot, not the remainder contained in the slot.
 
* If we insert a remainder at the start of an existing run, the previous remainder is shifted and becomes a continuation slot, so we set its ''is_continuation'' bit.
 
* We set the ''is_shifted'' bit of any remainder that we shift.
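The insertion rules can be sketched as a small Python class. This is an illustrative sketch only: the class <code>QuotientFilter</code>, its field names, and the choice to accept ready-made fingerprints instead of hashing keys are all assumptions made here, and the sketch always keeps one slot empty so that the scans terminate.

```python
class QuotientFilter:
    """Illustrative quotient filter storing fingerprints of q + r bits."""

    def __init__(self, q, r):
        self.r, self.m = r, 1 << q
        self.rem = [0] * self.m
        self.occ = [False] * self.m       # is_occupied
        self.cont = [False] * self.m      # is_continuation
        self.shifted = [False] * self.m   # is_shifted
        self.count = 0

    def _empty(self, i):
        return not (self.occ[i] or self.cont[i] or self.shifted[i])

    def _run_start(self, d_q):
        """Slot where the run for quotient d_q starts (or would start)."""
        pos, runs = d_q, 0
        while True:                       # scan left to the cluster start
            runs += self.occ[pos]
            if not self.shifted[pos]:
                break
            pos = (pos - 1) % self.m
        while True:                       # scan right, skipping earlier runs
            if not self.cont[pos]:
                runs -= 1
                if runs == 0:
                    return pos
            pos = (pos + 1) % self.m

    def contains(self, fp):
        d_q, d_r = fp >> self.r, fp & ((1 << self.r) - 1)
        if not self.occ[d_q]:
            return False
        pos = self._run_start(d_q)
        while True:
            if self.rem[pos] == d_r:
                return True
            pos = (pos + 1) % self.m
            if not self.cont[pos]:
                return False

    def insert(self, fp):
        d_q, d_r = fp >> self.r, fp & ((1 << self.r) - 1)
        if self.count >= self.m - 1:
            raise OverflowError("filter full (one slot is kept empty)")
        if self._empty(d_q):              # canonical slot is free
            self.rem[d_q], self.occ[d_q] = d_r, True
            self.count += 1
            return
        had_run = self.occ[d_q]
        self.occ[d_q] = True              # pertains to the slot, not the remainder
        start = pos = self._run_start(d_q)
        is_cont = False
        if had_run:                       # find the sorted position in the run
            while True:
                if self.rem[pos] == d_r:
                    return                # already present
                if self.rem[pos] > d_r:
                    break
                pos = (pos + 1) % self.m
                if not self.cont[pos]:
                    break
            if pos == start:
                self.cont[start] = True   # old first remainder becomes a continuation
            else:
                is_cont = True
        carry = (d_r, is_cont, pos != d_q)
        while True:                       # shift remainders right up to an empty slot
            was_empty = self._empty(pos)
            displaced = (self.rem[pos], self.cont[pos], True)  # shifted bit set
            self.rem[pos], self.cont[pos], self.shifted[pos] = carry
            if was_empty:
                break
            carry = displaced
            pos = (pos + 1) % self.m
        self.count += 1
```

The three bullet points above map directly onto the code: the shift loop never touches <code>occ</code>, the old first remainder of a run gets its continuation bit set before it is displaced, and every displaced remainder is written back with its shifted bit set.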
 
===Insertion example===
The figure shows a quotient filter proceeding through a series of states as elements are added.  In state 1 three elements have been added.  The slot each one occupies forms a one-slot run which is also a distinct cluster.
 
In state 2 elements ''c'' and ''d'' have been added.  Element ''c'' has a quotient of 1, the same as ''b''.  We assume b<sub>R</sub> < c<sub>R</sub> so c<sub>R</sub> is shifted into slot 2, and is marked as both a ''continuation'' and ''shifted''. Element ''d'' has a quotient of 2. Since its canonical slot is in use, it is shifted into slot 3, and is marked as ''shifted''.  In addition its canonical slot is marked as ''occupied''.  The runs for quotients 1 and 2 now comprise a cluster.
 
In state 3 element ''a'' has been added.  Its quotient is 1.  We assume a<sub>R</sub> < b<sub>R</sub> so the remainders in slots 1 through 4 must be shifted.  Slot 2 receives b<sub>R</sub> and is marked as a ''continuation'' and ''shifted''. Slot 5 receives e<sub>R</sub> and is marked as ''shifted''.  The runs for quotients 1, 2 and 4 now comprise a cluster, and the presence of those three runs in the cluster is indicated by having slots 1, 2 and 4 being marked as ''occupied''.
 
==Cost/performance==
 
===Cluster length===
 
Bender et al.<ref name="Bender"/> argue that clusters are small. This is important because lookups and inserts require locating the start and length of an entire cluster. If the hash function generates uniformly distributed fingerprints then the length of most runs is ''O''(1) and it is highly likely that ''all'' runs have length ''O''(log ''m'') where ''m'' is the number of slots in the table.<ref name="Bender"/>
 
===Probability of false positives===
Bender<ref name="Bender"/> calculates the probability of a false positive (i.e. when the hashes of two keys result in the same fingerprint) in terms of the hash table's remainder size and load factor. Recall that a ''p''-bit fingerprint is partitioned into a ''q''-bit quotient, which determines the table size of ''m'' = 2<sup>q</sup> slots, and an ''r''-bit remainder. The load factor <math>\alpha</math> is the ratio of the number of occupied slots ''n'' to the total number of slots ''m'': <math>\alpha = n/m</math>. Then, for a good hash function, the probability of a hard collision is approximately <math> 1-e^{-\alpha/2^r} \leq 2^{-r}</math>.
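As a quick numerical illustration, the estimate <math>1-e^{-\alpha/2^r}</math> and its upper bound <math>2^{-r}</math> can be evaluated in Python. The function name and the example parameters below are hypothetical.

```python
import math

def false_positive_rate(n: int, q: int, r: int) -> float:
    """Approximate false positive probability for n entries in 2**q slots
    with r-bit remainders, assuming a good hash function."""
    alpha = n / (1 << q)                  # load factor n/m
    # 1 - exp(-x) computed as -expm1(-x) for accuracy when x is small.
    return -math.expm1(-alpha / (1 << r))
```

For instance, a table of 2<sup>10</sup> slots that is 75% full with 8-bit remainders gives <code>false_positive_rate(768, 10, 8)</code>, a value below the bound 2<sup>−8</sup>; adding one bit to the remainder roughly halves the estimate.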
 
===Space/performance===
A quotient filter requires 10–25% more space than a comparable Bloom filter but is faster because each access requires evaluating only a single hash function.<ref name="Spillane"/>
 
==Application==
 
Quotient filters are AMQs and, as such, provide many of the same benefits as [[Bloom filter]]s. A large database, such as Webtable<ref>{{cite journal|last=Chang|first=Fay|coauthors=et al.|title=Bigtable: A Distributed Storage System for Structured Data|journal=OSDI '06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation|year=2006|pages=15–15|url=http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf|accessdate=21 July 2012}}</ref> may be composed of smaller sub-tables each of which has an associated filter.  Each query is distributed concurrently to all sub-tables.  If a sub-table does not contain the requested element, its filter can quickly complete the request without incurring any I/O.
 
Quotient filters offer two benefits in some applications.
 
#Two quotient filters can be efficiently merged without affecting their false positive rates.  This is not possible with Bloom filters.
#A few duplicates can be tolerated efficiently and can be deleted.
 
The space used by quotient filters is comparable to that of Bloom filters.  However quotient filters can be efficiently merged within memory without having to re-insert the original keys.
 
This is particularly important in some log-structured storage systems that use the [[log-structured merge-tree]] or LSM-tree.<ref>{{cite journal|last=O'Neil|first=Patrick|coauthors=et al.|title=The log-structured merge-tree (LSM-tree)|journal=Acta Informatica|year=1996|volume=33|issue=4|pages=351–385|doi=10.1007/s002360050048}}</ref> The LSM-tree is actually a collection of trees that is treated as a single key-value store.  One variation of the LSM-tree is the [[Sorted Array Merge Tree]] or SAMT.<ref name="Spillane">{{cite journal|last=Spillane|first=Richard|title=Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers|date=May 2012|url=http://www.fsl.cs.sunysb.edu/~rick/richard_spillane.pdf|accessdate=21 July 2012}}</ref>  In this variation, a SAMT's component trees are called [[Wanna-B-tree]]s. Each Wanna-''B''-tree has an associated quotient filter.  A query on the SAMT is directed only at those Wanna-''B''-trees whose quotient filters indicate that they may contain the key.
 
The storage system in its normal operation compacts the SAMT's Wanna-''B''-trees, merging smaller Wanna-''B''-trees into larger ones and merging their quotient filters.  An essential property of quotient filters is that they can be efficiently merged without having to re-insert the original keys.  Given that for large data sets the Wanna-''B''-trees may not be in memory, accessing them to retrieve the original keys would incur many I/Os.
 
By construction the values in a quotient filter are stored in sorted order.  Each run is associated with a specific quotient value, which provides the most significant portion of the fingerprint; the runs are stored in order of quotient, and each slot in a run provides the least significant portion of the fingerprint.
 
So, by working from left to right, one can reconstruct all the fingerprints and the resulting list of integers will be in sorted order. Merging two quotient filters is then a simple matter of converting each quotient filter into such a list, merging the two lists and using it to populate a new larger quotient filter. Similarly, we can halve or double the size of a quotient filter without rehashing the keys since the fingerprints can be recomputed using just the quotients and remainders.<ref name="Bender"/>
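The merge and resize steps can be sketched in Python on the reconstructed fingerprint lists. This is an illustration under assumptions: the function names are hypothetical, the step that walks a slot array to recover (quotient, remainder) pairs is omitted, and duplicates are simply retained in the merged list.

```python
import heapq

def rebuild(quotient: int, remainder: int, r: int) -> int:
    """Reconstruct a fingerprint from its quotient and r-bit remainder."""
    return (quotient << r) | remainder

def merge_fingerprints(a, b):
    """Merge two sorted fingerprint lists into one sorted list."""
    return list(heapq.merge(a, b))

def requotient(fp: int, p: int, new_q: int):
    """Re-split a p-bit fingerprint for a table of 2**new_q slots.

    Only the fingerprint is needed, never the original key.
    """
    new_r = p - new_q
    return fp >> new_r, fp & ((1 << new_r) - 1)
```

For example, the 6-bit fingerprint <code>0b101101</code> stored with ''q'' = 2 as (<code>0b10</code>, <code>0b1101</code>) becomes (<code>0b101</code>, <code>0b101</code>) when the table is doubled to ''q'' = 3: one bit moves from the remainder to the quotient.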
 
==See also==
*[[MinHash]]
*[[Bloom filter]]
 
==Notes==
{{reflist|30em}}
 
{{DEFAULTSORT:Quotient Filter}}
[[Category:Hashing]]
[[Category:Probabilistic data structures]]
