| {{Distinguish2|[[Bloom (shader effect)|Bloom shader effect]]}}
| {{Probabilistic}}
| |
| A '''Bloom filter''' is a space-efficient [[probabilistic]] [[data structure]], conceived by [[Burton Howard Bloom]] in 1970, that is used to test whether an [[element (mathematics)|element]] is a member of a [[set (computer science)|set]]. [[Type I and type II errors|False positive]] matches are possible, but [[Type I and type II errors|false negatives]] are not; i.e. a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
| |
| | |
| Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if "conventional" error-free hashing techniques were applied. He gave the example of a [[hyphenation algorithm]] for a dictionary of 500,000 words, of which 90% could be hyphenated by following simple rules but all the remaining 50,000 words required expensive disk access to retrieve their specific patterns. With unlimited core memory, an error-free hash could be used to eliminate all the unnecessary disk access. But if core memory was insufficient, a smaller hash area could be used to eliminate most of the unnecessary accesses. For example, a hash area only 15% of the error-free size would still eliminate 85% of the disk accesses ({{harvtxt |Bloom |1970}}).
| |
| | |
| More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set ({{harvtxt |Bonomi|Mitzenmacher|Panigrahy|Singh|2006}}).
| |
| | |
| ==Algorithm description==
| |
| [[File:Bloom filter.svg|thumb|360px|An example of a Bloom filter, representing the set {''x'', ''y'', ''z''}. The colored arrows show the positions in the bit array that each set element is mapped to. The element ''w'' is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3.]]
| |
| An '''empty Bloom filter''' is a [[bit array]] of ''m'' bits, all set to 0. There must also be ''k'' different [[hash function]]s defined, each of which [[map (mathematics)|maps]] or hashes some set element to one of the ''m'' array positions with a uniform random distribution.
| |
| | |
| To '''add''' an element, feed it to each of the ''k'' hash functions to get ''k'' array positions. Set the bits at all these positions to 1.
| |
| | |
| To '''query''' for an element (test whether it is in the set), feed it to each of the ''k'' hash functions to get ''k'' array positions. If any of the bits at these positions are 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a [[false positive]]. In a simple Bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.
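The two operations can be stated directly in code. The following is a minimal illustrative sketch in Python; the class name, the use of [[SHA-1]] with a per-function salt, and the list-of-integers bit array are arbitrary choices made for readability, not part of the definition above.

<syntaxhighlight lang="python">
import hashlib

class BloomFilter:
    """Minimal sketch of a Bloom filter with an m-bit array and k hash functions."""

    def __init__(self, m, k):
        self.m = m               # number of bits in the array
        self.k = k               # number of hash functions
        self.bits = [0] * m      # empty filter: all bits set to 0

    def _positions(self, item):
        # Derive k array positions by salting a single hash function k ways.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # Set the bits at all k positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # "Possibly in set" only if every one of the k positions holds a 1;
        # a single 0 means "definitely not in set".
        return all(self.bits[pos] for pos in self._positions(item))
</syntaxhighlight>

With <code>m=18</code> and <code>k=3</code> (the parameters of the figure above), querying an element that was never added returns <code>False</code> unless all three of its positions happen to have been set by other insertions, which is exactly the false-positive case described above.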
| |
| | |
| The requirement of designing ''k'' different independent hash functions can be prohibitive for large ''k''. For a good [[hash function]] with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass ''k'' different initial values (such as 0, 1, ..., ''k'' − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger ''m'' and/or ''k'', independence among the hash functions can be relaxed with negligible increase in false positive rate ({{harvtxt|Dillinger|Manolios|2004a}}, {{harvtxt|Kirsch|Mitzenmacher|2006}}). Specifically, {{harvtxt|Dillinger|Manolios|2004b}} show the effectiveness of deriving the ''k'' indices using [[enhanced double hashing]] or [[triple hashing]], variants of [[double hashing]] that are effectively simple random number generators seeded with the two or three hash values.
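As an illustration of the double-hashing idea (plain double hashing rather than the enhanced variant of {{harvtxt|Dillinger|Manolios|2004b}}), the ''k'' indices can be derived from just two base hash values; the choice of MD5 and SHA-1 as the two base hashes below is arbitrary.

<syntaxhighlight lang="python">
import hashlib

def double_hash_positions(item, k, m):
    """Derive k array indices from two base hashes: g_i(x) = h1(x) + i*h2(x) mod m."""
    data = str(item).encode()
    h1 = int(hashlib.md5(data).hexdigest(), 16)
    h2 = int(hashlib.sha1(data).hexdigest(), 16)
    return [(h1 + i * h2) % m for i in range(k)]
</syntaxhighlight>

Only two full hash computations are needed per element, regardless of ''k''.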
| | |
| Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. An element maps to ''k'' bits, and although setting any one of those ''k'' bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits for an element to be removed, clearing any of the bits would introduce the possibility for false negatives.
| |
| | |
| One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the "removed" filter.
| |
| | |
| It is often the case that all the keys are available but are expensive to enumerate (for example, requiring many disk reads). When the false positive rate gets too high, the filter can be regenerated; this should be a relatively rare event.
| |
| | |
| ==Space and time advantages==
| |
| [[File:Bloom filter speed.svg|thumb|360px|Bloom filter used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times. Bloom filter decisions are much faster. However some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without the Bloom filter. Use of a Bloom filter for this purpose, however, does increase memory usage.]]
| |
| While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets, such as [[self-balancing binary search tree]]s, [[trie]]s, [[hash table]]s, or simple [[Array data structure|arrays]] or [[linked list]]s of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings ([[trie]]s<!--Yes, tries, NOT trees--> are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of ''k'', in contrast, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.
| |
| | |
| However, if the number of potential values is small and many of them can be in the set, the Bloom filter is easily surpassed by the deterministic [[bit array]], which requires only one bit for each potential element. Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with ''k'' = 1.<ref>{{harvtxt|Mitzenmacher|Upfal|2005}}.</ref>
| |
| | |
| Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(''k''), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse [[hash table]]s can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its ''k'' lookups are independent and can be parallelized.
| |
| | |
| To understand its space efficiency, it is instructive to compare the general Bloom filter with its special case when ''k'' = 1. If ''k'' = 1, then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros. The [[information content]] of the array relative to its size is low. The generalized Bloom filter (''k'' greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (''k'' and ''m'') are chosen well, about half of the bits will be set, and these will be apparently random, minimizing redundancy and maximizing information content.
| |
| | |
| ==Probability of false positives==
| |
| [[File:Bloom filter fp probability.svg|thumb|360px|The false positive probability <math>p</math> as a function of the number of elements <math>n</math> in the filter and the filter size <math>m</math>. An optimal number of hash functions <math>k= (m/n) \ln 2</math> has been assumed.]]
| |
| | |
| Assume that a [[hash function]] selects each array position with equal probability. If ''m'' is the number of bits in the array, and ''k'' is the number of hash functions, then the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is then
| |
| | |
| :<math>1-\frac{1}{m}.</math>
| |
| | |
| The probability that it is not set to 1 by any of the hash functions is
| |
| | |
| :<math>\left(1-\frac{1}{m}\right)^k.</math>
| |
| | |
| If we have inserted ''n'' elements, the probability that a certain bit is still 0 is
| |
| | |
| :<math>\left(1-\frac{1}{m}\right)^{kn};</math>
| |
| | |
| the probability that it is 1 is therefore
| |
| | |
| :<math>1-\left(1-\frac{1}{m}\right)^{kn}.</math>
| |
| | |
| Now test membership of an element that is not in the set. Each of the ''k'' array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the [[algorithm]] to erroneously claim that the element is in the set, is often given as
| |
| | |
| :<math>\left(1-\left[1-\frac{1}{m}\right]^{kn}\right)^k \approx \left( 1-e^{-kn/m} \right)^k.</math>
| |
| | |
| This is not strictly correct as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation, we have that the probability of false positives decreases as ''m'' (the number of bits in the array) increases, and increases as ''n'' (the number of inserted elements) increases.
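For a quick numerical check of this approximation, the right-hand side can be evaluated directly (the helper below is only a convenience for plugging in values):

<syntaxhighlight lang="python">
from math import exp

def false_positive_rate(m, n, k):
    """Approximate false positive probability (1 - e^(-kn/m))^k."""
    return (1.0 - exp(-k * n / m)) ** k

# For example, m = 10n bits with k = 7 hash functions gives roughly 0.8%:
# false_positive_rate(10_000, 1_000, 7)  ->  approximately 0.0082
</syntaxhighlight>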
| |
| | |
| An alternative analysis arriving at the same approximation without the assumption of independence is given by Mitzenmacher and Upfal.<ref>{{harvtxt|Mitzenmacher|Upfal|2005}}, pp. 109–111, 308.</ref> After all ''n'' items have been added to the Bloom filter, let ''q'' be the fraction of the ''m'' bits that are set to 0. (That is, the number of bits still set to 0 is ''qm''.) Then, when testing membership of an element not in the set, for the array position given by any of the ''k'' hash functions, the probability that the bit is found set to 1 is <math>1-q</math>. So the probability that all ''k'' hash functions find their bit set to 1 is <math>(1 - q)^k</math>. Further, the expected value of ''q'' is the probability that a given array position is left untouched by each of the ''k'' hash functions for each of the ''n'' items, which is (as above)
| |
| : <math>E[q] = \left(1 - \frac{1}{m}\right)^{kn}</math>.
| |
| It is possible to prove, without the independence assumption, that ''q'' is very strongly concentrated around its expected value. In particular, from the [[Azuma–Hoeffding inequality]], they prove that<ref>{{harvtxt|Mitzenmacher|Upfal|2005}}, p. 308.</ref>
| |
| : <math> \Pr(\left|q - E[q]\right| \ge \frac{\lambda}{m}) \le 2\exp(-2\lambda^2/m) </math>
| |
| Because of this, we can say that the exact probability of false positives is
| |
| : <math> \sum_{t} \Pr(q = t) (1 - t)^k \approx (1 - E[q])^k = \left(1-\left[1-\frac{1}{m}\right]^{kn}\right)^k \approx \left( 1-e^{-kn/m} \right)^k</math>
| |
| as before.
| |
| | |
| ===Optimal number of hash functions===
| |
| For a given ''m'' and ''n'', the value of ''k'' (the number of hash functions) that minimizes the probability is
| |
| | |
| :<math>k = \frac{m}{n} \ln 2,</math>
| |
| | |
| which gives
| |
| | |
| :<math>2^{-k} \approx {0.6185}^{m/n}.</math>
| |
| | |
| The required number of bits ''m'', given ''n'' (the number of inserted elements) and a desired false positive probability ''p'' (and assuming the optimal value of ''k'' is used) can be computed by substituting the optimal value of ''k'' in the probability expression above:
| |
| :<math>p = \left( 1-e^{-(m/n\ln 2) n/m} \right)^{(m/n\ln 2)}</math>
| |
| which can be simplified to:
| |
| :<math>\ln p = -\frac{m}{n} \left(\ln 2\right)^2.</math>
| |
| This results in:
| |
| :<math>m=-\frac{n\ln p}{(\ln 2)^2}.</math>
| |
| | |
| This means that for a given false positive probability ''p'', the length of a Bloom filter ''m'' is proportional to the number of elements being filtered ''n''.<ref>{{harvtxt|Starobinski|Trachtenberg|Agarwal|2003}}.</ref> While the above formula is asymptotic (i.e. applicable as ''m'',''n'' → ∞), the agreement with finite values of ''m'',''n'' is also quite good; the false positive probability for a finite Bloom filter with ''m'' bits, ''n'' elements, and ''k'' hash functions is at most
| |
| | |
| :<math>\left( 1-e^{-k(n+0.5)/(m-1)} \right)^k.</math>
| |
| | |
| So we can use the asymptotic formula if we pay a penalty for at most half an extra element and at most one fewer bit.<ref>{{harvtxt|Goel|Gupta|2010}}.</ref>
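These formulas can be packaged into a small sizing helper. The sketch below simply applies the asymptotic expressions and rounds the results; it does not apply the finite-size correction just mentioned, and the rounding choices are arbitrary.

<syntaxhighlight lang="python">
from math import ceil, log

def bloom_parameters(n, p):
    """Return (m, k) for n expected elements and target false positive rate p,
    using m = -n ln p / (ln 2)^2 and k = (m / n) ln 2."""
    m = ceil(-n * log(p) / (log(2) ** 2))
    k = max(1, round((m / n) * log(2)))
    return m, k

# bloom_parameters(1_000_000, 0.01) -> (9585059, 7),
# i.e. about 9.6 bits per element and 7 hash functions for a 1% rate.
</syntaxhighlight>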
| |
| | |
| ==Approximating the number of items in a Bloom filter==
| |
| | |
| {{harvtxt|Swamidass|Baldi|2007}} showed that the number of items in a Bloom filter can be approximated with the following formula,
| |
| | |
| :<math> X^* = - \tfrac{ N \ln \left[ 1 - \tfrac{X}{N} \right] } { k} </math>
| |
| | |
| where <var><math>X^*</math></var> is an estimate of the number of items in the filter, <var>N</var> is the length of the filter, <var>k</var> is the number of hash functions per item, and <var>X</var> is the number of bits set to one.
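Under the same notation (and assuming <math>X < N</math>, since a completely full filter carries no count information), the estimate is a one-line computation:

<syntaxhighlight lang="python">
from math import log

def estimate_count(X, N, k):
    """Swamidass-Baldi estimate of the number of inserted items:
    X* = -(N / k) * ln(1 - X / N), where X is the number of bits set to one."""
    return -(N / k) * log(1.0 - X / N)
</syntaxhighlight>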
| |
| | |
| ==The union and intersection of sets==
| |
| | |
| Bloom filters are a way of compactly representing a set of items. It is common to try to compute the size of the intersection or union between two sets. Bloom filters can be used to approximate the size of the intersection and union of two sets. {{harvtxt|Swamidass|Baldi|2007}} showed that for two Bloom filters of length <math>N</math>, the counts of the sets they represent can be estimated, respectively, as
| |
| | |
| :<math> A^* = -N \ln \left[ 1 - A / N \right] / k</math>
| |
| | |
| and
| |
| | |
| :<math> B^* = -N \ln \left[ 1 - B / N \right]/k</math>.
| |
| | |
| The size of their union can be estimated as
| |
| | |
| :<math> A^*\cup B^* = -N \ln \left[ 1 - A \cup B / N \right]/k</math>,
| |
| | |
| where <math>A \cup B</math> is the number of bits set to one in either of the two Bloom filters. The intersection can then be estimated as
| |
| | |
| :<math> A^*\cap B^* = A^* + B^* - A^*\cup B^*</math>,
| |
| | |
| using the three formulas together.
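A sketch of this estimation, assuming both filters were built with the same length <math>N</math> and the same <math>k</math> hash functions, and representing each filter as a Python integer bit mask (an arbitrary representation chosen for brevity):

<syntaxhighlight lang="python">
from math import log

def estimate_count(bits_set, N, k):
    # Swamidass-Baldi count estimate from the number of set bits.
    return -(N / k) * log(1.0 - bits_set / N)

def estimate_union_and_intersection(bloom_a, bloom_b, N, k):
    """bloom_a and bloom_b are bit masks of two Bloom filters built with
    identical parameters; returns estimated (union size, intersection size)."""
    count_a = estimate_count(bin(bloom_a).count("1"), N, k)
    count_b = estimate_count(bin(bloom_b).count("1"), N, k)
    # Bits set in either filter correspond to the Bloom filter of the union.
    union = estimate_count(bin(bloom_a | bloom_b).count("1"), N, k)
    intersection = count_a + count_b - union
    return union, intersection
</syntaxhighlight>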
| |
| | |
| ==Interesting properties==
| |
| *Unlike a standard [[hash table]], a Bloom filter of a fixed size can represent a set with an arbitrarily large number of elements; adding an element never fails due to the data structure "filling up." However, the false positive rate increases steadily as elements are added until all bits in the filter are set to 1, at which point ''all'' queries yield a positive result.
| |
| | |
| *[[Union (set theory)|Union]] and [[intersection (set theory)|intersection]] of Bloom filters with the same size and set of hash functions can be implemented with [[bitwise operation|bitwise]] OR and AND operations, respectively. The union operation on Bloom filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The intersect operation satisfies a weaker property: the false positive probability in the resulting Bloom filter is at most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive probability in the Bloom filter created from scratch using the intersection of the two sets.
| |
| | |
| * Some kinds of [[superimposed code]] can be seen as a Bloom filter implemented with physical [[edge-notched card]]s. An example is [[Zatocoding]], invented by [[Calvin Mooers]] in 1947, in which the set of categories associated with a piece of information is represented by notches on a card, with a random pattern of four notches for each category.
| |
| | |
| ==Examples==
| |
| Google [[BigTable]] and [[Apache Cassandra]] use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.<ref>{{harv|Chang|Dean|Ghemawat|Hsieh|2006}}.</ref>
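The pattern these systems use is simple: consult the in-memory filter first, and go to disk only on a (possibly false) positive. The sketch below is schematic; <code>bloom_filter</code>, <code>read_from_disk</code> and <code>row_key</code> are placeholders rather than the API of any particular database.

<syntaxhighlight lang="python">
def lookup(row_key, bloom_filter, read_from_disk):
    if row_key not in bloom_filter:
        return None                     # definite negative: skip the disk entirely
    return read_from_disk(row_key)      # possible positive: the read weeds out false positives
</syntaxhighlight>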
| |
| | |
| The [[Google Chrome]] web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter and only upon a hit a full check of the URL is performed.<ref>http://blog.alexyakunin.com/2010/03/nice-bloom-filter-application.html</ref>
| |
| | |
| The [[Squid (software)|Squid]] [[World Wide Web|Web]] Proxy [[web cache|Cache]] uses Bloom filters for [http://wiki.squid-cache.org/SquidFaq/CacheDigests cache digests].<ref name="Wessels172">{{Citation| last=Wessels | first=Duane |date=January 2004 | chapter=10.7 Cache Digests | title=Squid: The Definitive Guide | edition=1st | publisher=O'Reilly Media | isbn=0-596-00162-2 | page=172 | quote=Cache Digests are based on a technique first published by Pei Cao, called Summary Cache. The fundamental idea is to use a Bloom filter to represent the cache contents.}}</ref>
| |
| | |
| [[Bitcoin]] uses Bloom filters to verify payments without running a full network node.<ref>[http://sourceforge.net/projects/bitcoin/files/Bitcoin/bitcoin-0.8.0/ Bitcoin 0.8.0]</ref><ref>[https://bitcoinfoundation.org/blog/?p=16 Core Development Status Report #1]</ref>
| |
| | |
| The [[Venti]] archival storage system uses Bloom filters to detect previously stored data.<ref>http://plan9.bell-labs.com/magic/man2html/8/venti</ref>
| |
| | |
| The [[SPIN model checker]] uses Bloom filters to track the reachable state space for large verification problems.<ref>http://spinroot.com/</ref>
| |
| | |
| The [[Cascading]] analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called Bloom join<ref>{{harvtxt|Mullin|1990}}</ref> in the database literature).<ref>http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/</ref>
| |
| | |
| ==Alternatives==
| |
| | |
| Classic Bloom filters use <math>1.44\log_2(1/\epsilon)</math> bits of space per inserted key, where <math>\epsilon</math> is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only <math>\log_2(1/\epsilon)</math> per key {{harv|Pagh|Pagh|Rao|2005}}. Hence Bloom filters use 44% more space than a hypothetical equivalent optimal data structure. The number of hash functions used to achieve a given false positive rate <math>\epsilon</math> is proportional to <math>\log_2(1/\epsilon)</math>, which is not optimal, as it has been proved that an optimal data structure would need only a constant number of hash functions independent of the false positive rate.
| |
| | |
| {{harvtxt|Stern|Dill|1996}} describe a probabilistic structure based on [[hash table]]s, [[hash compaction]], which {{harvtxt|Dillinger|Manolios|2004b}} identify as significantly more accurate than a Bloom filter when each is configured optimally. Dillinger and Manolios, however, point out that the reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is, therefore, attractive when the number of additions can be predicted accurately; however, despite being very fast in software, hash compaction is poorly suited for hardware because of worst-case linear access time.
| |
| | |
| {{harvtxt|Putze|Sanders|Singler|2007}} have studied some variants of Bloom filters that are either faster or use less space than classic Bloom filters. The basic idea of the fast variant is to locate the ''k'' hash values associated with each key into one or two blocks having the same size as the processor's memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory [[cache misses]]. The proposed variants, however, have the drawback of using about 32% more space than classic Bloom filters.
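The blocking idea can be sketched as follows: one hash value selects a single cache-line-sized block, and all ''k'' bits for a key are then confined to that block, so an insertion or query touches at most one cache line. This is only an illustration of the principle, not the exact construction of {{harvtxt|Putze|Sanders|Singler|2007}}; the block size and hash derivation are arbitrary.

<syntaxhighlight lang="python">
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line

class BlockedBloomFilter:
    def __init__(self, num_blocks, k):
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks      # each block is a 512-bit integer

    def _block_and_offsets(self, item):
        # Slice one wide hash: the first part picks the block,
        # the remaining parts pick k bit offsets inside that block.
        digest = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        block = digest % self.num_blocks
        digest //= self.num_blocks
        offsets = []
        for _ in range(self.k):
            offsets.append(digest % BLOCK_BITS)
            digest //= BLOCK_BITS
        return block, offsets

    def add(self, item):
        block, offsets = self._block_and_offsets(item)
        for off in offsets:
            self.blocks[block] |= 1 << off

    def __contains__(self, item):
        block, offsets = self._block_and_offsets(item)
        return all(self.blocks[block] & (1 << off) for off in offsets)
</syntaxhighlight>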
| |
| | |
| The space efficient variant relies on using a single hash function that generates for each key a value in the range <math>\left[0,n/\varepsilon\right]</math> where <math>\epsilon</math> is the requested false positive rate. The sequence of values is then sorted and compressed using [[Golomb coding]] (or some other compression technique) to occupy a space close to <math>n\log_2(1/\epsilon)</math> bits. To query the Bloom filter for a given key, it suffices to check whether its corresponding value is stored in the Bloom filter. Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this problem the sequence of values is divided into small blocks of equal size that are compressed separately. At query time only half a block will need to be decompressed on average. Because of decompression overhead, this variant may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to be computed.
| |
| | |
| Another alternative to the classic Bloom filter is based on space-efficient variants of [[cuckoo hashing]]. In this case, once the hash table is constructed, the keys stored in the hash table are replaced with short signatures of the keys. Those signatures are strings of bits computed using a hash function applied to the keys.
| |
| | |
| ==Extensions and applications==
| |
| | |
| ===Counting filters===
| |
| | |
| Counting filters provide a way to implement a ''delete'' operation on a Bloom filter without recreating the filter afresh. In a counting filter the array positions (buckets) are extended from being a single bit to being an n-bit counter. In fact, regular Bloom filters can be considered as counting filters with a bucket size of one bit. Counting filters were introduced by {{harvtxt|Fan|Cao|Almeida|Broder|1998}}.
| |
| | |
| The insert operation is extended to ''increment'' the value of the buckets and the lookup operation checks that each of the required buckets is non-zero. The delete operation, obviously, then consists of decrementing the value of each of the respective buckets.
| |
| | |
| [[Arithmetic overflow]] of the buckets is a problem and the buckets should be sufficiently large to make this case rare. If it does occur then the increment and decrement operations must leave the bucket set to the maximum possible value in order to retain the properties of a Bloom filter.
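A sketch of a counting filter whose buckets saturate at a maximum value instead of overflowing; once a bucket reaches the maximum it is never changed again, which preserves the no-false-negative property at the cost of possibly leaving some stale counts. The bucket width, hashing, and class layout below are illustrative choices only.

<syntaxhighlight lang="python">
import hashlib

class CountingBloomFilter:
    MAX_COUNT = 15   # 4-bit buckets saturate here

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.buckets = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            if self.buckets[pos] < self.MAX_COUNT:
                self.buckets[pos] += 1       # saturated buckets stay at the maximum

    def remove(self, item):
        for pos in self._positions(item):
            if 0 < self.buckets[pos] < self.MAX_COUNT:
                self.buckets[pos] -= 1       # never decrement a saturated (or empty) bucket

    def __contains__(self, item):
        return all(self.buckets[pos] > 0 for pos in self._positions(item))
</syntaxhighlight>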
| |
| | |
| The size of counters is usually 3 or 4 bits. Hence counting Bloom filters use 3 to 4 times more space than static Bloom filters. In theory, an optimal data structure equivalent to a counting Bloom filter should not use more space than a static Bloom filter.
| |
| | |
| Another issue with counting filters is limited scalability. Because the counting Bloom filter table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted.
| |
| | |
| {{harvtxt|Bonomi|Mitzenmacher|Panigrahy|Singh|2006}} introduced a data structure based on d-left hashing that is functionally equivalent but uses approximately half as much space as counting Bloom filters. The scalability issue does not occur in this data structure. Once the designed capacity is exceeded, the keys could be reinserted in a new hash table of double size.
| |
| | |
| The space efficient variant by {{harvtxt|Putze|Sanders|Singler|2007}} could also be used to implement counting filters by supporting insertions and deletions.
| |
| | |
| {{harvtxt|Rottenstreich|Kanizo|Keslassy|2012}} introduced a new general method based on variable increments that significantly improves the false positive probability of counting Bloom filters and their variants, while still supporting deletions. Unlike counting Bloom filters, at each element insertion, the hashed counters are incremented by a hashed variable increment instead of a unit increment. To query an element, the exact values of the counters are considered, not just whether they are positive. If a sum represented by a counter value cannot be composed of the corresponding variable increment for the queried element, a negative answer can be returned to the query.
| |
| | |
| ===Data synchronization===
| |
| Bloom filters can be used for approximate [[data synchronization]] as in {{harvtxt|Byers|Considine|Mitzenmacher|Rost|2004}}. Counting Bloom filters can be used to approximate the number of differences between two sets and this approach is described in {{harvtxt|Agarwal|Trachtenberg|2006}}.
| |
| | |
| ===Bloomier filters===
| |
| | |
| {{harvtxt|Chazelle|Kilian|Rubinfeld|Tal|2004}} designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an [[associative array]]. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a ''false positive'' is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that ''is'' in the map.
| |
| <!-- too wordy for an alternative to the article topic. commented out in case someone wants to move it to a new article.
| |
| The simplest Bloomier filter is near-optimal and fairly simple to describe. Suppose initially that the only possible values are 0 and 1. We create a pair of Bloom filters ''A''<sub>0</sub> and ''B''<sub>0</sub> which contain, respectively, all keys mapping to 0 and all keys mapping to 1. Then, to determine which value a given key maps to, we look it up in both filters. If it is in neither, then the key is not in the map. If the key is in ''A''<sub>0</sub> but not ''B''<sub>0</sub>, then it does not map to 1, and has a high probability of mapping to 0. Conversely, if the key is in ''B''<sub>0</sub> but not ''A''<sub>0</sub>, then it does not map to 0 and has a high probability of mapping to 1.
| |
| | |
| A problem arises, however, when ''both'' filters claim to contain the key. We never insert a key into both, so one or both of the filters is lying (producing a false positive), but we don't know which. To determine this, we have another, smaller pair of filters ''A''<sub>1</sub> and ''B''<sub>1</sub>. ''A''<sub>1</sub> contains keys that map to 0 and which are false positives in ''B''<sub>0</sub>; ''B''<sub>1</sub> contains keys that map to 1 and which are false positives in ''A''<sub>0</sub>. But whenever ''A''<sub>0</sub> and ''B''<sub>0</sub> both produce positives, at most one of these cases must occur, and so we simply have to determine which if any of the two filters ''A''<sub>1</sub> and ''B''<sub>1</sub> contains the key, another instance of our original problem.
| |
| | |
| It may so happen again that both filters produce a positive; we apply the same idea recursively to solve this problem. Because each pair of filters only contains keys that are in the map ''and'' produced false positives on all previous filter pairs, the number of keys is extremely likely to quickly drop to a very small quantity that can be easily stored in an ordinary deterministic map, such as a pair of small arrays with linear search. Moreover, the average total search time is low, because almost all queries will be resolved by the first pair, almost all remaining queries by the second pair, and so on. The total space required is in practice independent of ''n'', and is almost entirely occupied by the first filter pair.
| |
| | |
| Now that we have the structure and a search algorithm, we also need to know how to insert new key/value pairs. The program must not attempt to insert the same key with both values. If the value is 0, insert the key into ''A''<sub>0</sub> and then test if the key is in ''B''<sub>0</sub>. If so, this is a false positive for ''B''<sub>0</sub>, and the key must also be inserted into ''A''<sub>1</sub> recursively in the same manner. If we reach the last level, we simply insert it. When the value is 1, the operation is similar but with ''A'' and ''B'' reversed.
| |
| | |
| Now that we can map a key to the value 0 or 1, how does this help us map to general values? This is simple. We create a single such Bloomier filter for each bit of the result. If the values are large, we can instead map keys to hash values that can be used to retrieve the actual values. The space required for a Bloomier filter with ''n''-bit values is typically slightly more than the space for 2''n'' Bloom filters.
| |
| | |
| A very simple way to implement Bloomier filters is by means of minimal [[perfect hashing]]. A minimal perfect hash function h is first generated for the set of n keys. Then an array is filled with n pairs (signature,value) associated with each key at the positions given by function h when applied on each key. The signature of a key is a string of r bits computed by applying a hash function g of range <math>2^r</math> on the key. The value of r is chosen such that <math>2^r>=1/\epsilon</math>, where <math>\epsilon</math> is the requested false positive rate. To query for a given key, hash function h is first applied on the key. This will give a position into the array from which we retrieve a pair (signature,value). Then we compute the signature of the key using function g. If the computed signature is the same as retrieved signature we return the retrieved value. The probability of false positive is <math>1/2^r</math>.
| |
| | |
| Another alternative to implement static bloomier and bloom filters based on matrix solving has been simultaneously proposed in {{harvtxt|Porat|2008}}, {{harvtxt|Dietzfelbinger|Pagh|2008}} and {{harvtxt|Charles|Chellapilla|2008}}. The space usage of this method is optimal as it needs only <math>\log_2(\epsilon)</math> bits per key for a bloom filter. However time to generate the bloom or bloomier filter can be very high. The generation time can be reduced to a reasonable value at the price of a small increase in space usage.
| |
| | |
| Dynamic Bloomier filters have been studied by {{harvtxt|Mortensen|Pagh|Pătraşcu|2005}}. They proved that any dynamic Bloomier filter needs at least around
| |
| <math>\log(l)</math> bits per key where l is the length of the key. A simple dynamic version of Bloomier filters can be implemented using two dynamic data structures. Let the two data structures be noted S1 and S2. S1 will store keys with their associated data while S2 will only store signatures of keys with their associated data. Those signatures are simply hash values of keys in the range <math>[0,n/\varepsilon]</math> where n is the maximal number of keys to be stored in the Bloomier filter and <math>\epsilon</math> is the requested false positive rate. To insert a key in the Bloomier filter, its hash value is first computed. Then the algorithm checks if a key with the same hash value already exists in S2. If this is not the case, the hash value is inserted in S2 along with data associated with the key. If the same hash value already exists in S2 then the key is inserted into S1 along with its associated data. The deletion is symmetric: if the key already exists in S1 it will be deleted from there, otherwise the hash value associated with the key is deleted from S2. An issue with this algorithm is on how to store efficiently S1 and S2. For S1 any hash algorithm can be used. To store S2 [[golomb coding]] could be applied to compress signatures to use a space close to <math>\log2(1/\epsilon)</math> per key. -->
| |
| | |
| ===Compact approximators===
| |
| | |
| {{harvtxt|Boldi|Vigna|2005}} proposed a [[lattice (order)|lattice]]-based generalization of Bloom filters. A '''compact approximator''' associates to each key an element of a lattice (the standard Bloom filters being the case of the Boolean two-element lattice). Instead of a bit array, they have an array of lattice elements. When adding a new association between a key and an element of the lattice, they compute the maximum of the current contents of the <var>k</var> array locations associated to the key with the lattice element. When reading the value associated to a key, they compute the minimum of the values found in the <var>k</var> locations associated to the key. The resulting value approximates from above the original value.
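For a concrete (and arbitrary) choice of lattice, the non-negative integers ordered by ≤, the two operations reduce to an elementwise maximum on insertion and a minimum on lookup:

<syntaxhighlight lang="python">
import hashlib

class CompactApproximator:
    """Sketch over the lattice of non-negative integers: a query always
    returns a value that approximates the stored value from above."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.cells = [0] * m    # 0 is the bottom element of this lattice

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key, value):
        for pos in self._positions(key):
            self.cells[pos] = max(self.cells[pos], value)

    def query(self, key):
        return min(self.cells[pos] for pos in self._positions(key))
</syntaxhighlight>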
| |
| | |
| ===Stable Bloom filters===
| |
| {{harvtxt|Deng|Rafiei|2006}} proposed Stable Bloom filters as a variant of Bloom filters for streaming data. The idea is that since there is no way to store the entire history of a stream (which can be infinite), Stable Bloom filters continuously evict stale information to make room for more recent elements. Since stale information is evicted, the Stable Bloom filter introduces false negatives, which do not appear in traditional Bloom filters. The authors show that a tight upper bound of false positive rates is guaranteed, and the method is superior to standard Bloom filters in terms of false positive rates and time efficiency when a small space and an acceptable false positive rate are given.
| |
| | |
| ===Scalable Bloom filters===
| |
| {{harvtxt|Almeida|Baquero|Preguica|Hutchison|2007}} proposed a variant of Bloom filters that can adapt dynamically to the number of elements stored, while assuring a minimum false positive probability. The technique is based on sequences of standard Bloom filters with increasing capacity and tighter false positive probabilities, so as to ensure that a maximum false positive probability can be set beforehand, regardless of the number of elements to be inserted.
| |
| | |
| ===Attenuated Bloom filters===
| |
| An attenuated Bloom filter of depth D can be viewed as an array of D normal Bloom filters. In the context of service discovery in a network, each node stores regular and attenuated Bloom filters locally. The regular or local Bloom filter indicates which services are offered by the node itself. The attenuated filter of level i indicates which services can be found on nodes that are i hops away from the current node. The i-th value is constructed by taking a union of local Bloom filters for nodes i hops away from the node.<ref name="kgsb09">{{harvtxt|Koucheryavy|Giambene|Staehle|Barcelo-Arroyo|2009}}</ref>
| |
| | |
| [[File:AttenuatedBloomFilter.png|thumb|Attenuated Bloom Filter Example]]
| |
| | |
| As an example, take the small network shown in the figure. Say we are searching for a service A whose id hashes to bits 0, 1, and 3 (pattern 11010). Let node n1 be the starting point. First, we check whether service A is offered by n1 by checking its local filter. Since the patterns don't match, we check the attenuated Bloom filter in order to determine which node should be the next hop. We see that n2 doesn't offer service A but lies on the path to nodes that do. Hence, we move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence the destination is located.<ref>{{harvtxt|Kubiatowicz|Bindel|Czerwinski|Geels|2000}}</ref>
| |
| | |
| By using attenuated Bloom filters consisting of multiple layers, services at more than one hop distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting out) bits set by sources further away.<ref name="kgsb09"/>
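One simple way to compute the layers is bottom-up: level 0 is a node's own filter, and level ''i'' is the bitwise OR of the neighbours' level ''i'' − 1 filters (strictly speaking this aggregates nodes reachable by walks of length ''i'', which is the usual approximation of "''i'' hops away"). The sketch below represents filters as integer bit masks; the graph representation and function name are arbitrary.

<syntaxhighlight lang="python">
def build_attenuated_filters(local_filters, neighbours, depth):
    """local_filters: node -> bit mask of the node's own services.
    neighbours: node -> list of adjacent nodes.
    Returns node -> list of `depth` bit masks; level i summarizes services
    roughly i hops away (level 0 is the node's own local filter)."""
    levels = {node: [bits] for node, bits in local_filters.items()}
    for i in range(1, depth):
        for node in local_filters:
            combined = 0
            for neighbour in neighbours[node]:
                combined |= levels[neighbour][i - 1]
            levels[node].append(combined)
    return levels
</syntaxhighlight>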
| |
| | |
| ===Chemical structure searching===
| |
| | |
| Bloom filters are commonly used to search large databases of chemicals (see [[chemical similarity]]). Each molecule is represented with a Bloom filter (called a fingerprint in this field) which stores substructures of the molecule. Commonly, the [[Jaccard index|Tanimoto]] similarity is used to quantify the similarity between molecules' Bloom filters.
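For two fingerprints of the same length, the Tanimoto (Jaccard) similarity is the number of bits set in both divided by the number of bits set in either; a small sketch using integer bit masks:

<syntaxhighlight lang="python">
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto similarity of two bit-mask fingerprints: |A AND B| / |A OR B|."""
    both = bin(fingerprint_a & fingerprint_b).count("1")
    either = bin(fingerprint_a | fingerprint_b).count("1")
    return both / either if either else 1.0
</syntaxhighlight>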
| |
| | |
| ==See also==
| |
| *[[Feature hashing]]
| |
| *[[MinHash]]
| |
| *[[Quotient filter]]
| |
| *[[Skip list]]
| |
| | |
| ==Notes==
| |
| {{Reflist|colwidth=30em}}
| |
| | |
| ==References==
| |
| {{More footnotes|date=November 2009}}
| |
| {{Refbegin|colwidth=30em}}
| |
| *{{Citation
| |
| | first1 = Y. | last1 = Koucheryavy
| |
| | first2 = G. | last2 = Giambene
| |
| | first3 = D. | last3 = Staehle
| |
| | first4 = F. | last4 =Barcelo-Arroyo
| |
| | first5 = T. | last5 = Braun
| |
| | first6 = V. | last6 = Siris
| |
| | title = Traffic and QoS Management in Wireless Multimedia Networks
| |
| | journal = COST 290 Final Report
| |
| | location = USA
| |
| | year = 2009
| |
| | doi = <!-- XVI, 312 -->
| |
| | pages = 111}}
| |
| *{{Citation
| |
| | first1 = J. | last1 = Kubiatowicz
| |
| | first2 = D. | last2 = Bindel
| |
| | first3 = Y. | last3 = Czerwinski
| |
| | first4 = S. | last4 = Geels
| |
| | first5 = D. | last5 = Eaton
| |
| | first6 = R. | last6 = Gummadi
| |
| | first7 = S. | last7 = Rhea
| |
| | first8 = H. | last8 = Weatherspoon
| |
| | first9 = W. | last9 = Weimer
| |
| | title = Oceanstore: An architecture for global-scale persistent storage
| |
| | journal = ACM SIGPLAN Notices
| |
| | location = USA
| |
| | year = 2000
| |
| | doi = <!-- 2000, VOL 35; PART 11 -->
| |
| | url = http://ftp.csd.uwo.ca/courses/CS9843b/papers/OceanStore.pdf
| |
| | pages = 190–201}}
| |
| *{{Citation
| |
| | first1 = Sachin | last1 = Agarwal
| |
| | first2 = Ari | last2 = Trachtenberg
| |
| | title = Approximating the number of differences between remote sets
| |
| | journal = IEEE Information Theory Workshop
| |
| | location = Punta del Este, Uruguay
| |
| | year = 2006
| |
| | doi = 10.1109/ITW.2006.1633815
| |
| | url = http://www.deutsche-telekom-laboratories.de/~agarwals/publications/itw2006.pdf
| |
| | pages = 217
| |
| | isbn = 1-4244-0035-X}}
| |
| *{{Citation
| |
| | doi = 10.1109/ICON.2007.4444089
| |
| | first1 = Mahmood| last1 = Ahmadi
| |
| | first2 = Stephan| last2 = Wong | contribution = A Cache Architecture for Counting Bloom Filters
| |
| | title = 15th international Conference on Networks (ICON-2007)
| |
| | year = 2007
| |
| | pages = 218 | url = http://www.ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4444031&arnumber=4444089&count=113&index=57
| |
| | isbn = 978-1-4244-1229-7}}
| |
| *{{Citation
| |
| | first1 = Paulo | last1 = Almeida
| |
| | first2 = Carlos | last2 = Baquero
| |
| | first3 = Nuno | last3 = Preguica
| |
| | first4 = David | last4 = Hutchison
| |
| | title = Scalable Bloom Filters
| |
| | journal = Information Processing Letters
| |
| | volume = 101
| |
| | issue = 6
| |
| | year = 2007
| |
| | doi = 10.1016/j.ipl.2006.10.007
| |
| | url = http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
| |
| | pages = 255–261}}
| |
| *{{Citation
| |
| | first1 = John W. | last1 = Byers
| |
| | first2 = Jeffrey | last2 = Considine
| |
| | first3 = Michael | last3 = Mitzenmacher
| |
| | first4 = Stanislav | last4 = Rost
| |
| | journal = [[IEEE/ACM Transactions on Networking]]
| |
| | title = Informed content delivery across adaptive overlay networks
| |
| | volume = 12
| |
| | year = 2004
| |
| | doi = 10.1109/TNET.2004.836103
| |
| | pages = 767
| |
| | issue = 5}}
| |
| *{{Citation
| |
| | first = Burton H. | last = Bloom
| |
| | title = Space/Time Trade-offs in Hash Coding with Allowable Errors
| |
| | url= https://dl.acm.org/citation.cfm?doid=362686.362692
| |
| | journal = [[Communications of the ACM]]
| |
| | volume = 13 | issue = 7 | year = 1970 | pages = 422–426
| |
| | doi = 10.1145/362686.362692}}
| |
| *{{Citation
| |
| | first1 = Paolo | last1 = Boldi
| |
| | first2 = Sebastiano | last2 = Vigna
| |
| | title = Mutable strings in Java: design, implementation and lightweight text-search algorithms
| |
| | journal = Science of Computer Programming
| |
| | volume = 54 | issue = 1 | year = 2005 | pages = 3–23
| |
| | doi = 10.1016/j.scico.2004.05.003}}
| |
| *{{Citation
| |
| | first1 = Flavio | last1 = Bonomi
| |
| | first2 = Michael | last2 = Mitzenmacher
| |
| | first3 = Rina | last3 = Panigrahy
| |
| | first4 = Sushil | last4 = Singh
| |
| | first5 = George |last5 = Varghese
| |
| | contribution = An Improved Construction for Counting Bloom Filters
| |
| | title = Algorithms – ESA 2006, 14th Annual European Symposium
| |
| | year = 2006 | pages = 684–695 | doi = 10.1007/11841036_61
| |
| | url = http://theory.stanford.edu/~rinap/papers/esa2006b.pdf
| |
| | volume = 4168
| |
| | series = Lecture Notes in Computer Science
| |
| | isbn = 978-3-540-38875-3}}
| |
| *{{Citation
| |
| | first1 = Andrei | last1 = Broder | authorlink1 = Andrei Broder
| |
| | first2 = Michael | last2 = Mitzenmacher
| |
| | title = Network Applications of Bloom Filters: A Survey
| |
| | journal = Internet Mathematics
| |
| | volume = 1 | issue = 4 | pages = 485–509 | year = 2005
| |
| | url = http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
| |
| | doi = 10.1080/15427951.2004.10129096}}
| |
| *{{Citation
| |
| | first1 = Fay| last1 = Chang
| |
| | first2 = Jeffrey| last2 = Dean
| |
| | first3 = Sanjay | last3 = Ghemawat
| |
| | first4 = Wilson | last4 = Hsieh
| |
| | first5 = Deborah | last5 = Wallach
| |
| | first6 = Mike | last6 = Burrows
| |
| | first7 = Tushar | last7 = Chandra
| |
| | first8 = Andrew | last8 = Fikes
| |
| | first9 = Robert | last9 = Gruber
| |
| | contribution = Bigtable: A Distributed Storage System for Structured Data
| |
| | title = Seventh Symposium on Operating System Design and Implementation
| |
| | year = 2006 | url = http://research.google.com/archive/bigtable.html}}
| |
| *{{Citation
| |
| | first1 = Denis| last1 = Charles
| |
| | first2 = Kumar| last2 = Chellapilla
| |
| | contribution = Bloomier Filters: A second look
| |
| | title = The Computing Research Repository (CoRR)
| |
| | year = 2008 | arxiv = 0807.0928
| |
| }}
| |
| *{{Citation
| |
| | first1 = Bernard | last1 = Chazelle | authorlink1 = Bernard Chazelle
| |
| | first2 = Joe | last2 = Kilian
| |
| | first3 = Ronitt | last3 = Rubinfeld
| |
| | first4 = Ayellet | last4 = Tal
| |
| | contribution = The Bloomier filter: an efficient data structure for static support lookup tables
| |
| | title = Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms
| |
| | year = 2004 | pages = 30–39
| |
| | url = http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf}}
| |
| *{{Citation
| |
| | first1 = Saar | last1 = Cohen
| |
| | first2 = Yossi | last2 = Matias
| |
| | contribution = Spectral Bloom Filters
| |
| | title = Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data
| |
| | year = 2003 | doi = 10.1145/872757.872787 | pages = 241–252
| |
| | url = http://www.sigmod.org/sigmod03/eproceedings/papers/r09p02.pdf
| |
| | isbn = 158113634X
| |
| }} {{Dead link|date=June 2010}}
| |
| *{{Citation
| |
| | first1 = Fan | last1 = Deng
| |
| | first2 = Davood | last2 = Rafiei
| |
| | contribution = Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters
| |
| | title = Proceedings of the ACM SIGMOD Conference
| |
| | year = 2006 | pages = 25–36 | url = http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf}}
| |
| *{{Citation
| |
| | first1 = Sarang | last1 = Dharmapurikar
| |
| | first2 = Haoyu | last2 = Song
| |
| | first3 = Jonathan | last3 = Turner
| |
| | first4 = John | last4 = Lockwood
| |
| | contribution = Fast packet classification using Bloom filters
| |
| | title = Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems
| |
| | year = 2006 | pages = 61–70 | doi = 10.1145/1185347.1185356
| |
| | url = http://www.arl.wustl.edu/~sarang/ancs6819-dharmapurikar.pdf
| |
| | isbn = 1595935800}}
| |
| *{{Citation
| |
| | first1 = Martin| last1 = Dietzfelbinger
| |
| | first2 = Rasmus| last2 = Pagh
| |
| | contribution = Succinct Data Structures for Retrieval and Approximate Membership
| |
| | title = The Computing Research Repository (CoRR)
| |
| | year = 2008 | arxiv = 0803.3693
| |
| }}
| |
| *{{Citation
| |
| | first1=S. Joshua| last1 = Swamidass | first2 = Pierre | last2 = Baldi
| |
| title=Mathematical correction for fingerprint similarity measures to improve chemical retrieval| journal=Journal of Chemical Information and Modeling| year=2007| volume=47| number=3| pages=952–964| publisher=ACS Publications}}
| |
| *{{Citation
| |
| | first1 = Peter C. | last1 = Dillinger
| |
| | first2 = Panagiotis | last2 = Manolios
| |
| | contribution = Fast and Accurate Bitstate Verification for SPIN
| |
| | title = Proceedings of the 11th International Spin Workshop on Model Checking Software
| |
| | publisher = Springer-Verlag, Lecture Notes in Computer Science 2989
| |
| | year = 2004a
| |
| | url = http://www.ccs.neu.edu/home/pete/research/spin-3spin.html}}
| |
| *{{Citation
| |
| | first1 = Peter C. | last1 = Dillinger
| |
| | first2 = Panagiotis | last2 = Manolios
| |
| | contribution = Bloom Filters in Probabilistic Verification
| |
| | title = Proceedings of the 5th International Conference on Formal Methods in Computer-Aided Design
| |
| | publisher = Springer-Verlag, Lecture Notes in Computer Science 3312
| |
| | year = 2004b
| |
| | url = http://www.ccs.neu.edu/home/pete/research/bloom-filters-verification.html}}
| |
| *{{Citation
| |
| | first1 = Benoit | last1 = Donnet
| |
| | first2 = Bruno | last2 = Baynat
| |
| | first3 = Timur | last3 = Friedman
| |
| | contribution = Retouched Bloom Filters: Allowing Networked Applications to Flexibly Trade Off False Positives Against False Negatives
| |
| | title = CoNEXT 06 – 2nd Conference on Future Networking Technologies | year = 2006
| |
| | url = http://www.adetti.iscte.pt/events/CONEXT06/Conext06_Proceedings/papers/13.html}}
| |
| *{{Citation
| |
| | first1 = David | last1 = Eppstein | authorlink1 = David Eppstein
| |
| | first2 = Michael T. | last2 = Goodrich | author2-link = Michael T. Goodrich
| |
| | contribution = Space-efficient straggler identification in round-trip data streams via Newton's identities and invertible Bloom filters
| |
| | title = [[Workshop on Algorithms and Data Structures|Algorithms and Data Structures, 10th International Workshop, WADS 2007]]
| |
| | publisher = Springer-Verlag, Lecture Notes in Computer Science 4619
| |
| | year = 2007 | pages = 637–648
| |
| | arxiv = 0704.3313
| |
| }}
| |
| *{{Citation
| |
| | first1 = Li | last1 = Fan | first2 = Pei | last2 = Cao
| |
| | first3 = Jussara | last3 = Almeida
| |
| | first4 = Andrei | last4 = Broder | authorlink4 = Andrei Broder
| |
| | title = Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
| |
| | journal = IEEE/ACM Transactions on Networking
| |
| | volume = 8 | issue = 3 | year = 2000 | pages = 281–293 | doi = 10.1109/90.851975}}. A preliminary version appeared at SIGCOMM '98.
| |
| *{{Citation
| |
| | first1 = Ashish | last1 = Goel
| |
| | first2 = Pankaj | last2 = Gupta
| |
| | title = Small subset queries and bloom filters using ternary associative memories, with applications
| |
| | journal = ACM Sigmetrics 2010
| |
| | volume = 38
| |
| | pages = 143
| |
| | year = 2010
| |
| | doi = 10.1145/1811099.1811056
| |
| }}
| |
| *{{Citation
| |
| | first1 = Adam |last1 = Kirsch | first2 = Michael | last2 = Mitzenmacher
| |
| | contribution = Less Hashing, Same Performance: Building a Better Bloom Filter
| |
| | title = Algorithms – ESA 2006, 14th Annual European Symposium
| |
| | publisher = Springer-Verlag, Lecture Notes in Computer Science 4168
| |
| | year = 2006 | pages = 456–467 | doi = 10.1007/11841036
| |
| | url = http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
| |
| | volume = 4168
| |
| | editor1-last = Azar
| |
| | editor1-first = Yossi
| |
| | editor2-last = Erlebach
| |
| | editor2-first = Thomas
| |
| | series = Lecture Notes in Computer Science
| |
| | isbn = 978-3-540-38875-3}}
| |
| *{{Citation
| |
| | first1 = Christian Worm | last1 = Mortensen
| |
| | first2 = Rasmus | last2 = Pagh
| |
| | first3 = Mihai | last3 = Pătraşcu | author3-link = Mihai Pătraşcu
| |
| | contribution = On dynamic range reporting in one dimension
| |
| | title = Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing
| |
| | year = 2005 | pages = 104–111 | doi = 10.1145/1060590.1060606
| |
| | isbn = 1581139608}}
| |
| *{{Citation
| |
| | first1 = Anna | last1 = Pagh
| |
| | first2 = Rasmus | last2 = Pagh
| |
| | first3 = S. Srinivasa | last3 = Rao
| |
| | contribution = An optimal Bloom filter replacement
| |
| | title = Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms
| |
| | year = 2005 | pages = 823–829
| |
| | url = http://www.it-c.dk/people/pagh/papers/bloom.pdf}}
| |
| *{{Citation
| |
| | first1 = Ely| last1 = Porat
| |
| | contribution = An Optimal Bloom Filter Replacement Based on Matrix Solving
| |
| | title = The Computing Research Repository (CoRR)
| |
| | year = 2008 | arxiv = 0804.1845
| |
| }}
| |
| *{{Citation
| |
| | first1 = F. | last1 = Putze
| |
| | first2 = P. | last2 = Sanders
| |
| | first3 = J. | last3 = Singler
| |
| | contribution = Cache-, Hash- and Space-Efficient Bloom Filters
| |
| | title = Experimental Algorithms, 6th International Workshop, WEA 2007
| |
| | publisher = Springer-Verlag, Lecture Notes in Computer Science 4525
| |
| | year = 2007 | pages = 108–121 | doi = 10.1007/978-3-540-72845-0
| |
| | url = http://algo2.iti.uni-karlsruhe.de/singler/publications/cacheefficientbloomfilters-wea2007.pdf
| |
| | volume = 4525
| |
| | editor1-last = Demetrescu
| |
| | editor1-first = Camil
| |
| | series = Lecture Notes in Computer Science
| |
| | isbn = 978-3-540-72844-3}}
| |
| *{{Citation
| |
| | first1 = Simha | last1 = Sethumadhavan
| |
| | first2 = Rajagopalan | last2 = Desikan
| |
| | first3 = Doug | last3 = Burger
| |
| | first4 = Charles R. | last4 = Moore
| |
| | first5 = Stephen W. | last5 = Keckler
| |
| | contribution = Scalable hardware memory disambiguation for high ILP processors
| |
| | title = 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003, MICRO-36
| |
| | year = 2003 | pages = 399–410 | doi = 10.1109/MICRO.2003.1253244
| |
| | url = http://www.cs.utexas.edu/users/simha/publications/lsq.pdf
| |
| | isbn = 0-7695-2043-X}}
| |
| *{{Citation
| |
| | first1 = Kulesh | last1 = Shanmugasundaram
| |
| | first2 = Hervé | last2 = Brönnimann
| |
| | first3 = Nasir | last3 = Memon
| |
| | contribution = Payload attribution via hierarchical Bloom filters
| |
| | title = Proceedings of the 11th ACM Conference on Computer and Communications Security
| |
| | year = 2004 | pages = 31–41 | doi = 10.1145/1030083.1030089
| |
| | isbn = 1581139616}}
| |
| *{{Citation
| |
| | first1 = David | last1 = Starobinski
| |
| | first2 = Ari | last2 = Trachtenberg
| |
| | first3 = Sachin | last3=Agarwal
| |
| | title = Efficient PDA Synchronization
| |
| | journal = IEEE Transactions on Mobile Computing
| |
| | volume = 2
| |
| | year = 2003
| |
| | doi = 10.1109/TMC.2003.1195150
| |
| | pages = 40
| |
| | issue = 1}}
| |
| *{{Citation
| |
| | first1 = Ulrich | last1 = Stern
| |
| | first2 = David L. | last2 = Dill
| |
| | contribution = A New Scheme for Memory-Efficient Probabilistic Verification
| |
| | title = Proceedings of Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification: IFIP TC6/WG6.1 Joint International Conference
| |
| | publisher = Chapman & Hall, IFIP Conference Proceedings
| |
| | year = 1996
| |
| | id = {{citeseerx|10.1.1.47.4101}}
| |
| | pages = 333–348}}
| |
| *{{Citation
| |
| | first1 = Mohammad Hashem | last1 = Haghighat
| |
| | first2 = Mehdi | last2 = Tavakoli
| |
| | first3 = Mehdi | last3 = Kharrazi
| |
| | title = Payload Attribution via Character Dependent Multi-Bloom Filters
| |
| | journal = Transaction on Information Forensics and Security, IEEE
| |
| | volume = 99
| |
| | issue =
| |
| | year = 2013
| |
| | doi = 10.1109/TIFS.2013.2252341
| |
| | url = http://dx.doi.org/10.1109/TIFS.2013.2252341
| |
| | pages = }}
| |
| *{{Citation
| |
| | publisher = Cambridge University Press
| |
| | last1 = Mitzenmacher
| |
| | first1 = Michael
| |
| | last2 = Upfal
| |
| | first2 = Eli
| |
| | title = Probability and computing: Randomized algorithms and probabilistic analysis
| |
| | year = 2005
| |
| | url = http://books.google.com/books?id=0bAYl6d7hvkC&pg=PA110
| |
| | pages = 107–112
| |
| }}
| |
| *{{Citation
| |
| | volume = 16
| |
| | issue = 5
| |
| | pages = 558–560
| |
| | last = Mullin
| |
| | first = James K.
| |
| | title = Optimal semijoins for distributed database systems
| |
| | journal = Software Engineering, IEEE Transactions on
| |
| | year = 1990
| |
| }}
| |
| *{{Citation
| |
| | first1 = Ori | last1 = Rottenstreich
| |
| | first2 = Yossi | last2 = Kanizo
| |
| | first3 = Isaac | last3 = Keslassy
| |
| | contribution = The Variable-Increment Counting Bloom Filter
| |
| | title = 31st Annual IEEE International Conference on Computer Communications, 2012, Infocom 2012
| |
| | year = 2012 | pages = 1880-1888 | doi = 10.1109/INFCOM.2012.6195563
| |
| | url = http://webee.technion.ac.il/people/or/publications/Infocom12_VICBF.pdf
| |
| | isbn = 978-1-4673-0773-4}}
| |
| {{Refend}}
| |
| | |
| ==External links==
| |
| {{Commons category|Bloom filter}}
| |
| *[http://www.michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/ Why Bloom filters work the way they do (Michael Nielsen, 2012)]
| |
| *[http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html Table of false-positive rates for different configurations] from a [[University of Wisconsin–Madison]] website
| |
| *[http://tr.ashcan.org/2008/12/bloomers.html Interactive Processing demonstration] from ashcan.org
| |
| *[http://www.youtube.com/watch?v=947gWqwkhu0 "More Optimal Bloom Filters," Ely Porat (Nov/2007) Google TechTalk video] on [[YouTube]]
| |
| *[http://www.perl.com/pub/2004/04/08/bloom_filters.html "Using Bloom Filters"] Detailed Bloom Filter explanation using [[Perl]]
| |
| *[http://matthias.vallentin.net/blog/2011/06/a-garden-variety-of-bloom-filters/ "A Garden Variety of Bloom Filters] - Explanation and Analysis of Bloom filter variants
| |
| | |
| === Implementations ===
| |
| {{External links|date=April 2013}}
| |
| {{div col}}
| |
| *[http://en.literateprograms.org/Bloom_filter_(C) Implementation in C] from literateprograms.org
| |
| *[https://github.com/mavam/libbf/ Implementation in C++11] on github.com
| |
| *[http://codeplex.com/bloomfilter Implementation in C#] from codeplex.com
| |
| *[http://sites.google.com/site/scalablebloomfilters/ Implementation in Erlang] from sites.google.com
| |
| *[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bloomfilter Implementation in Haskell] from haskell.org
| |
| *[https://github.com/DivineTraube/Orestes-Bloomfilter Implementation in Java] on github.com
| |
| *[http://la.ma.la/misc/js/bloomfilter/ Implementation in Javascript] from la.ma.la
| |
| *[https://github.com/wiedi/node-bloem Implementation in JS for node.js] on github.com
| |
| *[http://lemonodor.com/archives/000881.html Implementation in Lisp] from lemonodor.com
| |
| *[https://metacpan.org/module/Bloom::Filter Implementation in Perl] from [[CPAN]]
| |
| *[http://code.google.com/p/php-bloom-filter/ Implementation in PHP] from code.google.com
| |
| *[https://pypi.python.org/pypi/drs-bloom-filter/ Implementation in Python, Traditional Bloom Filter] from pypi.python.org
| |
| *[http://pypi.python.org/pypi/pybloom/1.0.2 Implementation in Python, Scalable Bloom Filter] from pypi.python.org
| |
| *[http://www.rubyinside.com/bloom-filters-a-powerful-tool-599.html Implementation in Ruby] from rubyinside.com
| |
| *[http://www.codecommit.com/blog/scala/bloom-filters-in-scala Implementation in Scala] from codecommit.com
| |
| *[http://www.kocjan.org/tclmentor/61-bloom-filters-in-tcl.html Implementation in Tcl] from kocjan.org
| |
| {{div col end}}
| |
| | |
| {{DEFAULTSORT:Bloom Filter}}
| |
| [[Category:Hashing]]
| |
| [[Category:Probabilistic data structures]]
| |
| [[Category:Lossy compression algorithms]]
| |