|
|
Line 1: |
Line 1: |
| In [[computer science]], '''tabulation hashing''' is a method for constructing [[Universal hashing|universal families of hash functions]] by combining [[lookup table|table lookup]] with [[exclusive or]] operations. It is simple and fast enough to be usable in practice, and has theoretical properties that (in contrast to some other universal hashing methods) make it usable with [[linear probing]], [[cuckoo hashing]], and the [[MinHash]] technique for estimating the size of set intersections. The first instance of tabulation hashing is [[Zobrist hashing]] (1969). It was later rediscovered by {{harvtxt|Carter|Wegman|1979}} and studied in more detail by {{harvtxt|Pătraşcu|Thorup|2011}}.
| | Let me first start by introducing myself. My name is Maurine Cadet. For years I've been living in Puerto Rico and my family members enjoys it. The factor she adores most is solving puzzles and she would by no means quit performing it. Supervising is what I do in my day occupation. You can always discover her web site here: http://brandshoeshunter.tumblr.com<br><br>My web page - [http://brandshoeshunter.tumblr.com Nike Womens Shoes] |
| | |
| ==Method==
| |
| Let ''p'' denote the number of [[bit]]s in a key to be hashed, and ''q'' denote the number of bits desired in an output hash function. Let ''r'' be a number smaller than ''p'', and let ''t'' be the smallest integer that is at least as large as ''p''/''r''. For instance, if ''r'' = 8, then an ''r''-bit number is a [[byte]], and ''t'' is the number of bytes per key.
| |
| | |
| The key idea of tabulation hashing is to view a key as a [[Row vector|vector]] of ''t'' ''r''-bit numbers, use a [[lookup table]] filled with random values to compute a hash value for each of the ''r''-bit numbers representing a given key, and combine these values with the bitwise binary [[exclusive or]] operation. The choice of ''r'' should be made in such a way that this table is not too large; e.g., so that it fits into the computer's [[cache memory]]. | |
| | |
| The initialization phase of the algorithm creates a two-dimensional array ''T'' of dimensions 2<sup>''r''</sup> by ''q'', and fills the array with random numbers. Once the array ''T'' is initialized, it can be used to compute the hash value ''h''(''x'') of any given key ''x''. To do so, partition ''x'' into ''r''-bit values, where ''x''<sub>0</sub> consists of the low order ''r'' bits of ''x'', ''x''<sub>1</sub> consists of the next ''r'' bits, etc. (E.g., again, with ''r'' = 8, ''x''<sub>''i''</sub> is just the ''i''th byte of ''x'').
| |
| Then, use these values as indices into ''T'' and combine them with the exclusive or operation:
| |
| :''h''(''x'') = ''T''[''x''<sub>0</sub>] ⊕ ''T''[''x''<sub>1</sub>] ⊕ ''T''[''x''<sub>2</sub>] ⊕ ...
| |
| | |
| ==Universality==
| |
| {{harvtxt|Carter|Wegman|1979}} define a randomized scheme for generating hash functions to be [[universal hashing|universal]] if, for any two keys, the probability that they [[Collision (computer science)|collide]] (that is, they are mapped to the same value as each other) is 1/''m'', where ''m'' is the number of values that the keys can take on. They defined a stronger property in the subsequent paper {{harvtxt|Wegman|Carter|1981}}: a randomized scheme for generating hash functions is [[K-independent hashing|''k''-independent'']] if, for every ''k''-tuple of keys, and each possible ''k''-tuple of values, the probability that those keys are mapped to those values is 1/''m''<sup>''k''</sup>. 2-independent hashing schemes are automatically universal, and any universal hashing scheme can be converted into a 2-independent scheme by storing a random number ''x'' in the initialization phase of the algorithm and adding ''x'' to each hash value, so universality is essentially the same as 2-independence, but ''k''-independence for larger values of ''k'' is a stronger property, held by fewer hashing algorithms.
| |
| | |
| As {{harvtxt|Pătraşcu|Thorup|2011}} observe, tabulation hashing is 3-independent but not 4-independent. For any single key ''x'', ''T''[''x''<sub>0</sub>,0] is equally likely to take on any hash value, and the exclusive or of ''T''[''x''<sub>0</sub>,0] with the remaining table values does not change this property. For any two keys ''x'' and ''y'', ''x'' is equally likely to be mapped to any hash value as before, and there is at least one position ''i'' where ''x<sub>i</sub>'' ≠ ''y<sub>i</sub>''; the table value ''T''[''y''<sub>''i''</sub>,''i''] is used in the calculation of ''h''(''y'') but not in the calculation of ''h''(''x''), so even after the value of ''h''(''x'') has been determined, ''h''(''y'') is equally likely to be any valid hash value. Similarly, for any three keys ''x'', ''y'', and ''z'', at least one of the three keys has a position ''i'' where its value ''z''<sub>''i''</sub> differs from the other two, so that even after the values of ''h''(''x'') and ''h''(''y'') are determined, ''h''(''z'') is equally likely to be any valid hash value.
| |
| | |
| However, this reasoning breaks down for four keys because there are sets of keys ''w'', ''x'', ''y'', and ''z'' where none of the four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes each, and ''w'', ''x'', ''y'', and ''z'' are the four keys that have either zero or one as their byte values, then each byte value in each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation hashing will always satisfy the equation {{nowrap|1=''h''(''w'') ⊕ ''h''(''x'') ⊕ ''h''(''y'') ⊕ ''h''(''z'') = 0}}, whereas for a 4-independent hashing scheme the same equation would only be satisfied with probability 1/''m''. Therefore, tabulation hashing is not 4-independent.
| |
| | |
| {{harvtxt|Siegel|2004}} uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on [[expander graph]]s for transforming the key bits into table indices, to define hashing schemes that are ''k''-independent for any constant or even logarithmic value of ''k''. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive.
| |
| | |
| One limitation of tabulation hashing is that it assumes that the input keys have a fixed number of bits. {{harvtxt|Lemire|2012}} has studied variations of tabulation hashing that can be applied to variable-length strings, and shown that they can be universal (2-independent) but not 3-independent.
| |
| | |
| ==Application==
| |
| Because tabulation hashing is a universal hashing scheme, it can be used in any hashing-based algorithm in which universality is sufficient. For instance, in [[Hash table#Separate chaining|hash chaining]], the expected time per operation is proportional to the sum of collision probabilities, which is the same for any universal scheme as it would be for truly random hash functions, and is constant whenever the load factor of the hash table is constant. Therefore, tabulation hashing can be used to compute hash functions for hash chaining with a theoretical guarantee of constant expected time per operation.<ref name="cw79">{{harvtxt|Carter|Wegman|1979}}.</ref>
| |
| | |
| However, universal hashing is not strong enough to guarantee the performance of some other hashing algorithms. For instance, for [[linear probing]], 5-independent hash functions are strong enough to guarantee constant time operation, but there are 4-independent hash functions that fail.<ref>For the sufficiency of 5-independent hashing for linear probing, see {{harvtxt|Pagh|Pagh|Ružić|2009}}. For examples of weaker hashing schemes that fail, see {{harvtxt|Pătraşcu|Thorup|2010}}.</ref> Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing.<ref name="pt11">{{harvtxt|Pătraşcu|Thorup|2011}}.</ref>
| |
| | |
| [[Cuckoo hashing]], another technique for implementing [[hash table]]s, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used.<ref name="pt11"/> | |
| | |
| Algorithms such as [[Karp-Rabin]] requires the efficient computation of hashing all consecutive sequences of <math>k</math> characters. We typically use [[rolling hash]] functions for these problems. Tabulation hashing is used to construct families of strongly [[universal hashing|universal functions]] (for example, [[rolling hash#Cyclic_polynomial|hashing by cyclic polynomials]]).
| |
| | |
| ==Notes==
| |
| {{reflist}}
| |
| | |
| ==References==
| |
| *{{citation
| |
| | last1 = Carter | first1 = J. Lawrence
| |
| | last2 = Wegman | first2 = Mark N. | author2-link = Mark N. Wegman
| |
| | doi = 10.1016/0022-0000(79)90044-8
| |
| | issue = 2
| |
| | journal = Journal of Computer and System Sciences
| |
| | mr = 532173
| |
| | pages = 143–154
| |
| | title = Universal classes of hash functions
| |
| | volume = 18
| |
| | year = 1979}}.
| |
| *{{citation
| |
| | last = Lemire | first = Daniel
| |
| | arxiv = 1008.1715
| |
| | journal = Discrete Applied Mathematics
| |
| | volume = 160
| |
| | pages = 604–617
| |
| | doi = 10.1016/j.dam.2011.11.009
| |
| | title = The universality of iterated hashing over variable-length strings
| |
| | year = 2012}}.
| |
| *{{citation
| |
| | last1 = Pagh | first1 = Anna
| |
| | last2 = Pagh | first2 = Rasmus
| |
| | last3 = Ružić | first3 = Milan
| |
| | doi = 10.1137/070702278
| |
| | issue = 3
| |
| | journal = SIAM Journal on Computing
| |
| | mr = 2538852
| |
| | pages = 1107–1120
| |
| | title = Linear probing with constant independence
| |
| | volume = 39
| |
| | year = 2009}}.
| |
| *{{citation
| |
| | last1 = Pătraşcu | first1 = Mihai | author1-link = Mihai Pătraşcu
| |
| | last2 = Thorup | first2 = Mikkel | author2-link = Mikkel Thorup
| |
| | contribution = On the k-independence required by linear probing and minwise independence
| |
| | doi = 10.1007/978-3-642-14165-2_60
| |
| | pages = 715–726
| |
| | publisher = Springer
| |
| | series = Lecture Notes in Computer Science
| |
| | title = Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I
| |
| | url = http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf
| |
| | volume = 6198
| |
| | year = 2010}}.
| |
| *{{citation
| |
| | last1 = Pătraşcu | first1 = Mihai | author1-link = Mihai Pătraşcu
| |
| | last2 = Thorup | first2 = Mikkel | author2-link = Mikkel Thorup
| |
| | arxiv = 1011.5200
| |
| | contribution = The power of simple tabulation hashing
| |
| | doi = 10.1145/1993636.1993638
| |
| | pages = 1–10
| |
| | title = Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11)
| |
| | year = 2011}}.
| |
| *{{citation
| |
| | last = Siegel | first = Alan
| |
| | doi = 10.1137/S0097539701386216
| |
| | issue = 3
| |
| | journal = SIAM Journal on Computing
| |
| | mr = 2066640
| |
| | pages = 505–543 (electronic)
| |
| | title = On universal classes of extremely random constant-time hash functions
| |
| | volume = 33
| |
| | year = 2004}}.
| |
| *{{citation
| |
| | last1 = Wegman | first1 = Mark N. | author1-link = Mark N. Wegman
| |
| | last2 = Carter | first2 = J. Lawrence
| |
| | doi = 10.1016/0022-0000(81)90033-7
| |
| | issue = 3
| |
| | journal = Journal of Computer and System Sciences
| |
| | mr = 633535
| |
| | pages = 265–279
| |
| | title = New hash functions and their use in authentication and set equality
| |
| | volume = 22
| |
| | year = 1981}}.
| |
| | |
| [[Category:Hashing]]
| |
| [[Category:Hash functions]]
| |