{| class="infobox" style="width: 22em"
! colspan="3" style="font-size: 125%; text-align: center;" | Suffix array
|-
! [[List of data structures|Type]]
| colspan="2" | [[Array data structure|Array]]
|-
! Invented by
| colspan="2" | {{harvtxt|Manber|Myers|1990}}
|-
! colspan="3" class="navbox-abovebelow" | [[Time complexity]]<br />in [[big O notation]]
|-
|
| Average
| Worst case
|-
! Space
| <math>\mathcal{O}(n)</math>
| <math>\mathcal{O}(n)</math>
|-
! Construction
| <math>\mathcal{O}(n)</math>
| <math>\mathcal{O}(n)</math>
|}
 
In [[computer science]], a '''suffix array''' is a sorted [[Array data structure|array]] of all [[Suffix (computer science)|suffixes]] of a [[String (computer science)|string]]. It is a simple yet powerful data structure that is used, among other applications, in full-text indices, data compression algorithms and [[bioinformatics]].{{sfn|Abouelhoda|Kurtz|Ohlebusch|2002}}
 
Suffix arrays were introduced by {{harvtxt|Manber|Myers|1990}} as a simple, space-efficient alternative to [[suffix tree]]s. They were independently discovered by {{harvtxt|Gonnet|Baeza-Yates|Snider|1992}} under the name ''PAT array''.
 
== Definition ==
Let <math>S=s_1 s_2 \ldots s_n</math> be a string and let <math>S[i,j]</math> denote the substring of <math>S</math> ranging from position <math>i</math> to position <math>j</math>, both inclusive.
 
The suffix array <math>A</math> of <math>S</math> is defined to be an array of integers providing the starting positions of the [[Suffix (computer science)|suffixes]] of <math>S</math> in [[lexicographical order]]. That is, an entry <math>A[i]</math> contains the starting position of the <math>i</math>-th smallest suffix in <math>S</math>, and thus for all <math>1 < i \leq n</math>: <math>S[A[i-1],n] < S[A[i],n]</math>.
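 
The defining property can be checked mechanically. The following is a minimal sketch in Python, not part of any cited work: the function name <code>is_suffix_array</code> is hypothetical, positions are 1-based to match the notation above, and the lexicographical order is assumed to be Python's built-in string comparison (by code point).

```python
def is_suffix_array(S, A):
    """Check that A lists the 1-based suffix start positions of S in sorted order."""
    n = len(S)
    # A must be a permutation of the positions 1..n.
    if sorted(A) != list(range(1, n + 1)):
        return False
    # Adjacent entries must satisfy S[A[i-1], n] < S[A[i], n].
    # The Python list A is 0-indexed, so A[i - 1] and A[i] are neighbours here.
    return all(S[A[i - 1] - 1:] < S[A[i] - 1:] for i in range(1, n))
```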
 
== Example ==
 
Consider the text {{mvar|S}}=<code>banana$</code> to be indexed:
{| class="wikitable"
|-
! {{left header}} | i
| 1 || 2 || 3 || 4 || 5 || 6 || 7
|-
! {{left header}} | S[i]
| b || a || n || a || n || a || $
|}
 
The text ends with the special sentinel letter <code>$</code> that is unique and lexicographically smaller than any other character. The text has the following suffixes:
 
{|  class="wikitable"
!  align="left" | Suffix !!  align="left" | i
|-  class="odd"
|  align="left" | banana$ ||  align="left" | 1
|-  class="even"
|  align="left" | anana$ ||  align="left" | 2
|-  class="odd"
|  align="left" | nana$ ||  align="left" | 3
|-  class="even"
|  align="left" | ana$ ||  align="left" | 4
|-  class="odd"
|  align="left" | na$ ||  align="left" | 5
|-  class="even"
|  align="left" | a$ ||  align="left" | 6
|-  class="odd"
|  align="left" | $ ||  align="left" | 7
|}
 
These suffixes can be sorted:
 
{|  class="wikitable"
!  align="left" | Suffix !!  align="left" | i
|-  class="odd"
|  align="left" | $ ||  align="left" | 7
|-  class="even"
|  align="left" | a$ ||  align="left" | 6
|-  class="odd"
|  align="left" | ana$ ||  align="left" | 4
|-  class="even"
|  align="left" | anana$ ||  align="left" | 2
|-  class="odd"
|  align="left" | banana$ ||  align="left" | 1
|-  class="even"
|  align="left" | na$ ||  align="left" | 5
|-  class="odd"
|  align="left" | nana$ ||  align="left" | 3
|}
 
The suffix array {{mvar|A}} contains the starting positions of these sorted suffixes:
 
{| class="wikitable"
! {{left header}} | i
| 1 || 2 || 3 || 4 || 5 || 6 || 7
|-
! {{left header}} | A[i]
| 7 || 6 || 4 || 2 || 1 || 5 || 3
|}
 
The complete array, with each suffix written out vertically below its entry:
 
{| class="wikitable"
! {{left header}} | i
| 1 || 2 || 3 || 4 || 5 || 6 || 7
|-
! {{left header}} | A[i]
| 7 || 6 || 4 || 2 || 1 || 5 || 3
|-
! {{left header}} | 1
| $ || a || a || a || b || n || n
|-
! {{left header}} | 2
|  || $ || n || n || a || a || a
|-
! {{left header}} | 3
|  ||  || a || a || n || $ || n
|-
! {{left header}} | 4
|  ||  || $ || n || a ||  || a
|-
! {{left header}} | 5
|  ||  ||  || a || n ||  || $
|-
! {{left header}} | 6
|  ||  ||  || $ || a ||  ||
|-
! {{left header}} | 7
|  ||  ||  ||  || $ ||  ||
 
|}
 
So for example, <code>A[3]</code> contains the value 4, and therefore refers to the suffix starting at position 4 within {{mvar|S}}, which is the suffix <code>ana$</code>.
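 
The example can be reproduced with a few lines of Python. This is only a sketch of the direct "sort all suffixes" approach: the function name <code>naive_suffix_array</code> is hypothetical, positions are 1-based as in the tables above, and the code relies on <code>$</code> comparing lower than the letters in Python's code point ordering.

```python
def naive_suffix_array(S):
    """Return the 1-based suffix array of S by directly sorting its suffixes."""
    # Sort the starting positions by the suffix that begins at each position.
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])


A = naive_suffix_array("banana$")
print(A)          # [7, 6, 4, 2, 1, 5, 3]
print(A[3 - 1])   # entry A[3] in the 1-based notation of the text: 4
```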
 
== Correspondence to suffix trees ==
 
Suffix arrays are closely related to [[suffix tree]]s:
 
* Suffix arrays can be constructed by performing a [[depth-first traversal]] of a suffix tree. The suffix array corresponds to the leaf-labels given in the order in which these are visited during the traversal, if edges are visited in the lexicographical order of their first character.
* A suffix tree can be constructed in linear time by using a combination of the suffix array and the [[LCP array]]. For a description of the algorithm, see the [[LCP array#Suffix Tree Construction|corresponding section]] in the [[LCP array]] article.
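 
The first correspondence can be illustrated with a small sketch. For brevity it assumes an uncompressed suffix ''trie'' (one node per character) built in quadratic time rather than a true linear-size suffix tree, and it assumes the text ends in a unique sentinel. The function name <code>suffix_array_via_trie</code> and the use of <code>None</code> as the key for leaf labels are illustrative choices only.

```python
def suffix_array_via_trie(text):
    """Read the 1-based suffix array off a depth-first traversal of a suffix trie."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1  # leaf label: 1-based start of this suffix

    order = []

    def visit(node):
        # Record the leaf label, if this node ends a suffix.
        if None in node:
            order.append(node[None])
        # Follow outgoing edges in lexicographical order of their character.
        for ch in sorted(k for k in node if k is not None):
            visit(node[ch])

    visit(root)
    return order


print(suffix_array_via_trie("banana$"))  # [7, 6, 4, 2, 1, 5, 3]
```

The recursion depth grows with the text length, so this form is only suitable for short inputs.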
 
It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information (such as the [[LCP array]]) and solves the same problem in the same time complexity.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2004}}
Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (e.g., compared to [[Ukkonen's algorithm]]) and improved cache locality.{{sfn|Abouelhoda|Kurtz|Ohlebusch|2002}}
 
== Space efficiency ==
 
Suffix arrays were introduced by {{harvtxt|Manber|Myers|1990}} in order to improve over the space requirements of [[suffix tree]]s: Suffix arrays store <math>n</math> integers. Assuming an integer requires <math>4</math> bytes, a suffix array requires <math>4n</math> bytes in total. This is significantly less than the <math>20n</math> bytes which are required by a careful suffix tree implementation.{{sfn|Kurtz|1999}}
 
However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a suffix array requires <math>\mathcal{O}(n \log n)</math> space, whereas the original text over an alphabet of size <math>\sigma</math> requires only <math>\mathcal{O}(n \log \sigma)</math> bits.
For a human genome with <math>\sigma = 4</math> and <math>n = 3.4 \times 10^9</math> the suffix array would therefore occupy about 16 times more memory than the genome itself.
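 
The factor of 16 can be checked with a short calculation. The sketch below assumes that each suffix array entry is stored in <math>\lceil \log_2 n \rceil</math> bits and each character in <math>\lceil \log_2 \sigma \rceil</math> bits; the function name <code>space_ratio</code> is hypothetical.

```python
import math


def space_ratio(n, sigma):
    """Ratio of suffix array size to text size, both measured in bits."""
    bits_per_entry = math.ceil(math.log2(n))      # one text position per entry
    bits_per_char = math.ceil(math.log2(sigma))   # one character of the text
    return bits_per_entry / bits_per_char


# Human genome figures from the text: n = 3.4 * 10^9, sigma = 4.
print(space_ratio(3_400_000_000, 4))  # 16.0, i.e. 32 bits versus 2 bits
```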
 
Such discrepancies motivated a trend towards [[compressed suffix array]]s and [[BWT]]-based compressed full-text indices such as the [[FM-index]]. These data structures require space proportional to the size of the text, or even less.
 
== Construction algorithms ==
 
A naive approach to construct a suffix array is to use a [[Comparison sort|comparison-based sorting algorithm]]. These algorithms require <math>\mathcal{O}(n \log n)</math> suffix comparisons, but a suffix comparison runs in <math>\mathcal{O}(n)</math> time, so the overall runtime of this approach is <math>\mathcal{O}(n^2 \log n)</math>.
 
More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals:{{sfn|Puglisi|Smyth|Turpin|2007}}
* minimal asymptotic complexity <math>\Theta(n)</math>
* lightweight in space, meaning little or no working memory beside the text and the suffix array itself is needed
* fast in practice
 
One of the first algorithms to achieve all of these goals is the SA-IS algorithm of {{harvtxt|Nong|Zhang|Chan|2009}}. The algorithm is also rather simple (fewer than 100 [[Source lines of code|LOC]]) and can be enhanced to simultaneously construct the [[LCP array]].{{sfn|Fischer|2011}} The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A careful [https://sites.google.com/site/yuta256/sais implementation by Yuta Mori] outperforms most other linear or super-linear construction approaches.
 
Besides time and space requirements, suffix array construction algorithms are also differentiated by their supported [[Alphabet (computer science)|alphabet]]: ''constant alphabets'', where the alphabet size is bounded by a constant; ''integer alphabets'', where characters are integers in a range depending on <math>n</math>; and ''general alphabets'', where only character comparisons are allowed.{{sfn|Burkhardt|Kärkkäinen|2003}}
 
Most suffix array construction algorithms are based on one of the following approaches:{{sfn|Puglisi|Smyth|Turpin|2007}}
* ''Prefix doubling'' algorithms are based on a strategy of {{harvtxt|Karp|Miller|Rosenberg|1972}}. The idea is to find prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of the algorithm until a prefix is unique and provides the rank of the associated suffix.
* ''Recursive'' algorithms follow the approach of the suffix tree construction algorithm by {{harvtxt|Farach|1997}} to recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both of these suffix arrays are then merged to compute the final suffix array.
* ''Induced copying'' algorithms are similar to recursive algorithms in the sense that they use an already sorted subset to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together by {{harvtxt|Puglisi|Smyth|Turpin|2007}}.
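 
The prefix doubling idea can be sketched as follows. This is a simplified variant that re-sorts with a general comparison sort in every round, giving <math>\mathcal{O}(n \log^2 n)</math> time; it is not one of the tuned algorithms from the literature. The function name <code>suffix_array_prefix_doubling</code> is hypothetical, and the result is 1-based to match the example above.

```python
def suffix_array_prefix_doubling(text):
    """Build the 1-based suffix array by sorting on prefixes of length 1, 2, 4, ..."""
    n = len(text)
    rank = [ord(c) for c in text]  # rank by first character only
    sa = list(range(n))            # 0-based positions, converted at the end
    k = 1
    while True:
        # A suffix is keyed by the ranks of its first k and next k characters;
        # -1 marks "past the end", which sorts before every real rank.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Assign new ranks: equal keys share a rank, larger keys get the next one.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        # Stop once every suffix has a unique rank.
        if n == 0 or rank[sa[-1]] == n - 1:
            break
        k *= 2
    return [i + 1 for i in sa]


print(suffix_array_prefix_doubling("banana$"))  # [7, 6, 4, 2, 1, 5, 3]
```

For <code>banana$</code> the loop finishes after two rounds (prefix lengths 2 and 4), since by then all seven suffixes have distinct ranks.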
 
A well-known recursive algorithm for integer alphabets is the ''DC3 / skew'' algorithm of {{harvtxt|Kärkkäinen|Sanders|2003}}. It runs in linear time and has successfully been used as the basis for parallel{{sfn|Kulla|Sanders|2007}} and [[External memory algorithm|external memory]]{{sfn|Dementiev|Kärkkäinen|Mehnert|Sanders|2008}} suffix array construction algorithms.
 
{{harvtxt|Salson|Lecroq|Léonard|Mouchard|2010}} propose an algorithm for updating the suffix array of a text that has been edited, instead of rebuilding the suffix array from scratch. Although the theoretical worst-case time complexity is <math>\mathcal{O}(n \log n)</math>, the algorithm appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when a reasonable number of letters is inserted into the original text.
 
== Applications ==
 
The suffix array of a string can be used as an [[Index (search engine)|index]] to quickly locate every occurrence of a substring pattern <math>P</math> within the string <math>S</math>. Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two [[binary search]]es. The first search locates the starting position of the interval, and the second one determines the end position:
 
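<source lang="python">
def search(P, S, A):
    """Return the half-open interval (s, r) of indices into A whose suffixes start with P.

    A is assumed to hold 0-based starting positions into S.
    """
    n = len(A)
    m = len(P)
    # First binary search: leftmost suffix whose first m characters are >= P.
    l, r = 0, n
    while l < r:
        mid = (l + r) // 2
        if P > S[A[mid]:A[mid] + m]:
            l = mid + 1
        else:
            r = mid
    s = l
    # Second binary search: leftmost suffix whose first m characters are > P.
    r = n
    while l < r:
        mid = (l + r) // 2
        if P < S[A[mid]:A[mid] + m]:
            r = mid
        else:
            l = mid + 1
    return (s, r)
</source>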
Finding the substring pattern <math>P</math> of length <math>m</math> in the string <math>S</math> of length <math>n</math> takes <math>\mathcal{O}(m \log n)</math> time, given that a single suffix comparison needs to compare <math>m</math> characters. {{harvtxt|Manber|Myers|1990}} describe how this bound can be improved to <math>\mathcal{O}(m + \log n)</math> time using [[LCP array|LCP]] information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. {{harvtxt|Abouelhoda|Kurtz|Ohlebusch|2004}} improve the bound even further and achieve a search time of <math>\mathcal{O}(m)</math> as known from [[suffix tree]]s.
 
Suffix sorting algorithms can be used to compute the [[Burrows–Wheeler transform|Burrows–Wheeler transform (BWT)]]. The [[Burrows–Wheeler transform|BWT]] requires sorting all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other characters (e.g., $), then the order of the rows in the sorted [[Burrows–Wheeler transform|BWT]] matrix corresponds to the order of suffixes in a suffix array. The [[Burrows–Wheeler transform|BWT]] can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the [[Burrows–Wheeler transform|BWT]] string: <math>BWT[i] = S[A[i]-1]</math>, where the index wraps around so that <math>A[i] = 1</math> yields the final character <math>S[n]</math>.
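 
As a sketch of this derivation, the following Python code reads the BWT off a suffix array. The function names are hypothetical, the suffix array is built naively rather than in linear time, and positions are 1-based. When <math>A[i] = 1</math>, the preceding character is taken to be the last character of the text, which Python's negative indexing provides directly.

```python
def naive_suffix_array(S):
    """1-based suffix array by direct sorting (not linear time)."""
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])


def bwt_from_suffix_array(S, A):
    """Return the BWT string: the character preceding each sorted suffix."""
    # 1-based position p is S[p - 1] in Python, so its predecessor is S[p - 2].
    # For p = 1 this is S[-1], the last character, which is the wrap-around case.
    return "".join(S[p - 2] for p in A)


text = "banana$"
print(bwt_from_suffix_array(text, naive_suffix_array(text)))  # annb$aa
```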
 
Suffix arrays can also be used to look up substrings in [[Example-Based Machine Translation]], demanding much less storage than a full [[phrase table]] as used in [[Statistical machine translation]].
 
Many additional applications of the suffix array require the [[LCP array]]. Some of these are detailed in the [[LCP array#Applications|application section]] of the latter.
 
== Notes ==
{{Reflist}}
 
== References ==
* {{cite journal|ref=harv
| doi=10.1016/S1570-8667(03)00065-0
| title=Replacing suffix trees with enhanced suffix arrays
| year=2004
| last1=Abouelhoda | first1=Mohamed Ibrahim
| last2=Kurtz | first2=Stefan
| last3=Ohlebusch | first3=Enno
| journal=Journal of Discrete Algorithms
| volume=2
| pages=53}}
* {{cite conference|ref=harv
| title = Suffix arrays: a new method for on-line string searches
| year = 1990
| conference = First Annual ACM-SIAM Symposium on Discrete Algorithms
| pages = 319–327
| url = http://dl.acm.org/citation.cfm?id=320176.320218
| last1 = Manber | first1 =  Udi | author1-link = Udi_Manber
| last2 = Myers | first2 =  Gene | author2-link = Gene_Myers
}}
* {{cite journal|ref=harv
| title = Suffix arrays: a new method for on-line string searches
| year = 1993
| journal = SIAM Journal on Computing
| volume = 22
| pages = 935–948
| doi = 10.1137/0222058
| url = http://dl.acm.org/citation.cfm?id=320176.320218
| last1 = Manber | first1 =  Udi | author1-link = Udi_Manber
| last2 = Myers | first2 =  Gene | author2-link = Gene_Myers
}}
* {{cite journal|ref=harv
| title = New indices for text: PAT trees and PAT arrays
| year = 1992
| journal = Information retrieval: data structures and algorithms
| last1 = Gonnet | first1 =  G.H
| last2 =  Baeza-Yates | first2 =  R.A
| last3 = Snider        | first3 = T
}}
* {{cite journal|ref=harv
| title = Reducing the space requirement of suffix trees
| year = 1999
| journal = Software-Practice and Experience
| volume = 29
| issue = 13
| last1 = Kurtz | first1 =  S
| doi=10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O
| pages=1149
}}
* {{cite journal|ref=harv
| doi = 10.1007/3-540-45784-4_35|chapter=The Enhanced Suffix Array and Its Applications to Genome Analysis|title=Algorithms in Bioinformatics|series=Lecture Notes in Computer Science|year=2002|last1=Abouelhoda|first1=Mohamed Ibrahim|last2=Kurtz|first2=Stefan|last3=Ohlebusch|first3=Enno|isbn=978-3-540-44211-0|volume=2452|pages=449
}}
* {{cite journal|ref=harv
| doi = 10.1145/1242471.1242472|title=A taxonomy of suffix array construction algorithms|year=2007|last1=Puglisi|first1=Simon J.|last2=Smyth|first2=W. F.|last3=Turpin|first3=Andrew H.|journal=ACM Computing Surveys|volume=39|issue=2|pages=4
}}
* {{cite journal|ref=harv
| doi = 10.1109/DCC.2009.42|chapter=Linear Suffix Array Construction by Almost Pure Induced-Sorting|title=2009 Data Compression Conference|year=2009|last1=Nong|first1=Ge|last2=Zhang|first2=Sen|last3=Chan|first3=Wai Hong|isbn=978-0-7695-3592-0|pages=193
}}
* {{cite journal|ref=harv
| doi = 10.1007/978-3-642-22300-6_32|chapter=Inducing the LCP-Array|title=Algorithms and Data Structures|series=Lecture Notes in Computer Science|year=2011|last1=Fischer|first1=Johannes|isbn=978-3-642-22299-3|volume=6844|pages=374
}}
* {{cite journal|ref=harv
| doi = 10.1016/j.jda.2009.02.007|title=Dynamic extended suffix arrays|year=2010|last1=Salson|first1=M.|last2=Lecroq|first2=T.|last3=Léonard|first3=M.|last4=Mouchard|first4=L.|journal=Journal of Discrete Algorithms|volume=8|issue=2|pages=241
}}
* {{cite journal|ref=harv
  | doi = 10.1007/3-540-44888-8_5|chapter=Fast Lightweight Suffix Array Construction and Checking|title=Combinatorial Pattern Matching|series=Lecture Notes in Computer Science|year=2003|last1=Burkhardt|first1=Stefan|last2=Kärkkäinen|first2=Juha|isbn=978-3-540-40311-1|volume=2676|pages=55
}}
* {{cite journal|ref=harv
  | doi = 10.1145/800152.804905|chapter=Rapid identification of repeated patterns in strings, trees and arrays|title=Proceedings of the fourth annual ACM symposium on Theory of computing  - STOC '72|year=1972|last1=Karp|first1=Richard M.|last2=Miller|first2=Raymond E.|last3=Rosenberg|first3=Arnold L.|pages=125
}}
* {{cite journal|ref=harv
  | doi = 10.1109/SFCS.1997.646102|chapter=Optimal suffix tree construction with large alphabets|title=Proceedings 38th Annual Symposium on Foundations of Computer Science|year=1997|last1=Farach|first1=M.|isbn=0-8186-8197-7|pages=137
}}
* {{cite journal|ref=harv
  | doi = 10.1007/3-540-45061-0_73|chapter=Simple Linear Work Suffix Array Construction|title=Automata, Languages and Programming|series=Lecture Notes in Computer Science|year=2003|last1=Kärkkäinen|first1=Juha|last2=Sanders|first2=Peter|isbn=978-3-540-40493-4|volume=2719|pages=943
}}
* {{cite journal|ref=harv
  | doi = 10.1145/1227161.1402296|title=Better external memory suffix array construction|year=2008|last1=Dementiev|first1=Roman|last2=Kärkkäinen|first2=Juha|last3=Mehnert|first3=Jens|last4=Sanders|first4=Peter|journal=Journal of Experimental Algorithmics|volume=12|pages=1
}}
* {{cite journal|ref=harv
  | doi = 10.1016/j.parco.2007.06.004|title=Scalable parallel suffix array construction|year=2007|last1=Kulla|first1=Fabian|last2=Sanders|first2=Peter|journal=Parallel Computing|volume=33|issue=9|pages=605
}}
 
== External links ==
* [http://algs4.cs.princeton.edu/63suffix/SuffixArray.java.html Suffix Array in Java]
* [http://code.google.com/p/compression-code/downloads/list Suffix sorting module for BWT in C code]
* [http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/1845 Suffix Array Implementation in Ruby]
* [http://sary.sourceforge.net/index.html.en Suffix array library and tools]
* [http://pizzachili.dcc.uchile.cl/ Project containing various Suffix Array c/c++ Implementations with a unified interface]
* [http://code.google.com/p/libdivsufsort/ A fast, lightweight, and robust C API library to construct the suffix array]
* [http://code.google.com/p/pysuffix/ Suffix Array implementation in Python]
 
[[Category:Arrays]]
[[Category:Substring indices]]
[[Category:String data structures]]
