Five-dimensional space: Difference between revisions

[[File:Las Vegas slot machines.jpg|thumb|right|A row of slot machines in Las Vegas.]]
 
In [[probability theory]], the '''multi-armed bandit problem''' (sometimes called the '''''K''-<ref name="doi10.1023/A:1013689704352">{{cite doi|10.1023/A:1013689704352}}</ref> or ''N''-armed bandit problem'''<ref>{{cite doi|10.1287/moor.12.2.262}}</ref>) is the problem a gambler faces at a row of [[slot machines]], sometimes known as "one-armed bandits", when deciding which machines to play, how many times to play each machine, and in which order to play them.<ref name="weber">{{cite jstor|2959678}}</ref> When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.<ref name="Gittins89"/><ref name="BF"/>
 
[[Herbert Robbins|Robbins]] in 1952, realizing the importance of the problem, constructed convergent population selection strategies in "Some aspects of the sequential design of experiments".<ref>{{Cite doi|10.1090/S0002-9904-1952-09620-8}}</ref>
 
The [[Gittins index]] theorem, first published by [[John C. Gittins]], gives an optimal policy in the [[Markov chain|Markov setting]] for maximizing the expected discounted reward.<ref>{{cite jstor|2985029}}</ref>
 
In practice, multi-armed bandits have been used to model the problem of managing research projects in a large organization, like a science foundation or a [[Pharmaceutical_industry|pharmaceutical company]]. Given a fixed budget, the problem is to allocate resources among the competing projects, whose properties are only partially known at the time of allocation, but which may become better understood as time passes.<ref name="Gittins89" /><ref name="BF"/>
 
In early versions of the multi-armed bandit problem, the gambler has no initial knowledge about the machines. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more [[Bayes' theorem|information]] about the expected payoffs of the other machines.
 
==Empirical motivation==
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge and optimize his or her decisions based on existing knowledge. There are many practical applications:
* [[clinical trial]]s investigating the effects of different experimental treatments while minimizing patient losses,<ref name="Gittins89" /><ref name="BF"/><ref name="WHP"/><ref name="KD">Press (1986)</ref>  and
* [[adaptive routing]] efforts for minimizing delays in a network.
In these practical examples, the problem requires balancing reward maximization based on the knowledge already acquired with attempting new actions to further increase knowledge. This is known as the ''exploitation vs. exploration tradeoff'' in [[reinforcement learning]].
 
The model can also be used to control dynamic allocation of resources to different projects, answering the question "which project should I work on" given uncertainty about the difficulty and payoff of each possibility.
 
Originally considered by Allied scientists in [[World War II]], it proved so intractable that, according to [[Peter Whittle]], it was proposed the problem be dropped over Germany so that German scientists could also waste their time on it.<ref name="Whittle79"/>
 
The version of the problem now commonly analyzed was formulated by [[Herbert Robbins]] in 1952.
 
==The multi-armed bandit model==
The multi-armed bandit (or just bandit for short) can be seen as a set of real distributions <math>B = \{R_1, \dots ,R_K\}</math>, each distribution being associated with the rewards delivered by one of the ''K'' levers. Let <math>\mu_1, \dots , \mu_K</math> be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon ''H'' is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state [[Markov decision process]].

The regret <math>\rho</math> after ''T'' rounds is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards: <math>\rho = T \mu^* - \sum_{t=1}^T \widehat{r}_t</math>, where <math>\mu^*</math> is the maximal reward mean, <math>\mu^* = \max_k \{ \mu_k \}</math>, and <math>\widehat{r}_t</math> is the reward at time ''t''. A strategy whose average regret per round <math>\rho / T</math> tends to zero with probability 1 when the number of played rounds tends to infinity is a ''zero-regret strategy''. Intuitively, zero-regret strategies are guaranteed to converge to an optimal strategy, not necessarily unique, if enough rounds are played.
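
The regret formula can be illustrated with a small simulation. The following is a minimal sketch, assuming Bernoulli reward distributions and a gambler who picks a lever uniformly at random each round; the function names and the example means are illustrative and do not come from the cited literature. It computes the realized quantity <math>T \mu^* - \sum_{t=1}^T \widehat{r}_t</math> for a single run rather than its expectation.

```python
import random

def pull(mean, rng):
    """Draw a Bernoulli reward (0 or 1) with success probability `mean`."""
    return 1 if rng.random() < mean else 0

def regret_of_uniform_play(means, rounds, seed=0):
    """Play a uniformly random lever each round and return the realized
    regret rho = T * mu_star - (sum of collected rewards)."""
    rng = random.Random(seed)
    mu_star = max(means)              # maximal reward mean
    collected = 0
    for _ in range(rounds):
        k = rng.randrange(len(means)) # uniform choice, no learning at all
        collected += pull(means[k], rng)
    return rounds * mu_star - collected

# Hypothetical three-lever bandit.
means = [0.2, 0.5, 0.8]
rounds = 10_000
rho = regret_of_uniform_play(means, rounds)
print("average regret per round:", rho / rounds)
```

For uniform play the expected average regret per round is <math>\mu^*</math> minus the average of the means (0.3 for the values assumed here), so it does not tend to zero; uniform play is therefore not a zero-regret strategy.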
 
==Variations==
A common formulation is the ''binary multi-armed bandit'' or ''Bernoulli multi-armed bandit'', in which each arm issues a reward of one with probability ''p'', and otherwise a reward of zero.
 
Another formulation of the multi-armed bandit has each arm representing an independent Markov machine. Each time a particular arm is played, the state of that machine advances to a new one, chosen according to the Markov state evolution probabilities. There is a reward depending on the current state of the machine. In a generalisation called the "restless bandit problem", the states of non-played arms can also evolve over time.<ref name="Whittle88"/> There has also been discussion of systems where the number of choices (about which arm to play) increases over time.<ref name="Whittle81"/>
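
The Markov formulation can be sketched as follows. This is a minimal illustration under stated assumptions: the class name <code>MarkovArm</code>, the function <code>play_round</code>, and the convention that the reward is read from the state the arm is in when it is played (before the transition) are choices made for this example, not definitions taken from the cited papers.

```python
import random

class MarkovArm:
    """One arm modelled as a finite Markov chain with a reward per state."""

    def __init__(self, transition, rewards, state=0):
        self.transition = transition  # transition[s][s2] = P(next = s2 | current = s)
        self.rewards = rewards        # rewards[s] = reward paid in state s
        self.state = state

    def step(self, rng):
        """Advance the state according to the transition probabilities."""
        probs = self.transition[self.state]
        self.state = rng.choices(range(len(probs)), weights=probs)[0]

    def play(self, rng):
        """Collect the reward of the current state, then advance (assumed order)."""
        reward = self.rewards[self.state]
        self.step(rng)
        return reward

def play_round(arms, chosen, rng, restless=False):
    """Play one arm; in the restless variant the other arms evolve too."""
    reward = arms[chosen].play(rng)
    if restless:
        for i, arm in enumerate(arms):
            if i != chosen:
                arm.step(rng)
    return reward

# Two identical two-state arms with illustrative numbers.
rng = random.Random(0)
arms = [MarkovArm([[0.9, 0.1], [0.5, 0.5]], [1.0, 0.0]) for _ in range(2)]
total = sum(play_round(arms, 0, rng, restless=True) for _ in range(100))
print("reward collected from arm 0:", total)
```

With <code>restless=False</code> only the played arm changes state, which corresponds to the basic Markov formulation described above.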
 
Computer science researchers have studied multi-armed bandits under worst-case assumptions, obtaining positive{{Clarify|date=September 2010}} results for finite numbers of trials with both stochastic<ref name="doi10.1023/A:1013689704352"/> and non-stochastic<ref>{{cite doi|10.1137/S0097539701398375}}</ref> arm payoffs.
 
==Bandit strategies==
A major breakthrough was the construction, in the work described below, of optimal population selection strategies, or policies, that possess a uniformly maximum convergence rate to the population with the highest mean.
 
===Optimal solutions===
In the paper "Asymptotically efficient adaptive allocation rules", Lai and Robbins<ref>{{cite journal | last1 = Lai | first1 = T.L. | last2 = Robbins | first2 = H. | year = 1985 | title = Asymptotically efficient adaptive allocation rules | url = | journal = Advances in applied mathematics | volume = 6 | issue = 1| pages =4 | doi = 10.1016/0196-8858(85)90002-8  }}</ref> (following many papers of Robbins and his co-workers going back to Robbins (1952)) constructed convergent population selection policies that possess the fastest rate of convergence (to the population with highest mean) for the case in which the population reward distributions belong to the one-parameter exponential family. Then, in [[Michael Katehakis|Katehakis]] and [[Herbert Robbins|Robbins]],<ref>{{cite journal | last1 = Katehakis | first1 = M.N. | last2 = Robbins | first2 = H.  | year = 1995 | title = Sequential choice from several populations | url = | journal = Proceedings of the National Academy of Sciences of the United States of America | volume = 92 | issue = 19| pages =8584–5 | doi = 10.1073/pnas.92.19.8584 | pmid = 11607577 | pmc = 41010  }}</ref> simplifications of the policy and the main proof were given for the case of normal populations with known variances. The next notable progress was obtained by Burnetas and [[Michael Katehakis|Katehakis]] in "Optimal adaptive policies for sequential allocation problems",<ref>{{cite journal | last1 = Burnetas | first1 = AN | last2 = Katehakis | first2 = MN | year = 1996 | title = Optimal adaptive policies for sequential allocation problems | url = | journal = Advances in Applied Mathematics | volume = 17 | issue = 2| pages =122 | doi = 10.1006/aama.1996.0007  }}</ref> where index-based policies with uniformly maximum convergence rate were constructed under more general conditions that include the case in which the distributions of outcomes from each population depend on a vector of unknown parameters. Burnetas and Katehakis (1996) also provided an explicit solution for the important case in which the distributions of outcomes follow arbitrary (i.e., nonparametric) discrete, univariate distributions.
 
Later, in "Optimal adaptive policies for Markov decision processes",<ref>{{cite journal | last1 = Burnetas | first1 = AN | last2 = Katehakis | first2 = MN | year = 1997 | title = Optimal adaptive policies for Markov decision processes | url = | journal = Math. Oper. Res. | volume = 22 | issue = 1| pages =222 | doi = 10.1287/moor.22.1.222  }}</ref> Burnetas and Katehakis studied the much larger model of Markov decision processes under partial information, where the transition law and/or the expected one-period rewards may depend on unknown parameters. In this work, the explicit form for a class of adaptive policies that possess uniformly maximum convergence rate properties for the total expected finite-horizon reward was constructed under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of these policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations. These inflations have recently been called the optimistic approach in the work of Tewari and Bartlett,<ref>{{cite journal | last1 = Tewari | first1 = A. | last2 = Bartlett | first2 = P.L. | year = 2008 | title = Optimistic linear programming gives logarithmic regret for irreducible MDPs | url = http://books.nips.cc/papers/files/nips20/NIPS2007_0673.pdf | format=PDF| journal = Advances in Neural Information Processing Systems | volume = 20 | issue = | pages =  | id={{citeseerx|10.1.1.69.5482}} }}</ref> Ortner,<ref>{{cite journal | last1 = Ortner | first1 = R. | year = 2010 | title = Online regret bounds for Markov decision processes with deterministic transitions | url = | journal = Theoretical Computer Science | volume = 411 | issue = 29| pages =2684 | doi = 10.1016/j.tcs.2010.04.005  }}</ref> Filippi, Cappé, and Garivier,<ref>Filippi, S. and Cappé, O. and Garivier, A. (2010). "Optimism in reinforcement learning and Kullback-Leibler divergence", ''Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on'', pp. 115–122</ref> and Honda and Takemura.<ref>{{cite journal | last1=Honda | first1= J.|last2= Takemura  | first2= A. |year=2011|title=An asymptotically optimal policy for finite support models in the multiarmed bandit problem|journal=Machine learning|volume=85|issue=3|pages= 361–391 | arxiv=0905.2776 |doi=10.1007/s10994-011-5257-4}}</ref>
 
===Approximate solutions===
Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the four broad categories detailed below.
 
====Semi-uniform strategies====
Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All those strategies have in common a [[Greedy algorithm|greedy]] behavior where the ''best'' lever (based on previous observations) is always pulled except when a (uniformly) random action is taken.
 
* '''Epsilon-greedy strategy''': The best lever is selected for a proportion <math>1 - \epsilon</math> of the trials, and another lever is randomly selected (with uniform probability) for a proportion <math>\epsilon</math>. A typical parameter value might be <math>\epsilon = 0.1</math>, but this can vary widely depending on circumstances and predilections.
 
* '''Epsilon-first strategy''': A pure exploration phase is followed by a pure exploitation phase. For <math>N</math> trials in total, the exploration phase occupies <math>\epsilon N</math> trials and the exploitation phase <math>(1 - \epsilon) N</math> trials. During the exploration phase, a lever is randomly selected (with uniform probability); during the exploitation phase, the best lever is always selected.
 
* '''Epsilon-decreasing strategy''': Similar to the epsilon-greedy strategy, except that the value of <math>\epsilon</math> decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
 
* '''Adaptive epsilon-greedy strategy based on value differences (VDBE)''': Similar to the epsilon-decreasing strategy, except that epsilon is reduced on the basis of the learning progress instead of manual tuning (Tokic, 2010).<ref name="Tokic2010"/> High fluctuations in the value estimates lead to a high epsilon (exploration); low fluctuations lead to a low epsilon (exploitation). Further improvements can be achieved by a softmax-weighted action selection in the case of exploratory actions (Tokic & Palm, 2011).<ref name="TokicPalm2011"/>
 
* '''Contextual-Epsilon-greedy strategy''': Similar to the epsilon-greedy strategy, except that the value of <math>\epsilon</math> is computed from the current situation in the experiment, which makes the algorithm context-aware. It is based on dynamic exploration/exploitation and adaptively balances the two by deciding which situations are most relevant for exploration or exploitation, resulting in highly explorative behavior when the situation is not critical and highly exploitative behavior in critical situations (Bouneffouf et al., 2012).<ref name="Bouneffouf2012"/>
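
The first three semi-uniform strategies differ only in how the probability of exploring changes over the trials, which the following minimal sketch tries to make explicit. It assumes Bernoulli levers; the function names are illustrative, the random lever is drawn uniformly from all levers (so it may coincide with the current best one), and the decay schedule <code>1 / (1 + epsilon * t)</code> used for the epsilon-decreasing strategy is only one hypothetical choice among many.

```python
import random

def best_lever(sums, counts):
    """Index of the lever with the highest empirical mean (unplayed levers count as 0)."""
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return max(range(len(means)), key=means.__getitem__)

def exploration_probability(strategy, t, total, epsilon=0.1):
    """Probability of taking a uniformly random lever at trial t (0-based)."""
    if strategy == "epsilon-greedy":
        return epsilon                                # constant exploration rate
    if strategy == "epsilon-first":
        return 1.0 if t < epsilon * total else 0.0    # explore first, exploit after
    if strategy == "epsilon-decreasing":
        return 1.0 / (1.0 + epsilon * t)              # assumed schedule, decays from 1
    raise ValueError(f"unknown strategy: {strategy}")

def run(strategy, probs, total, epsilon=0.1, seed=0):
    """Simulate `total` trials; return (total reward, pulls per lever)."""
    rng = random.Random(seed)
    sums = [0.0] * len(probs)
    counts = [0] * len(probs)
    for t in range(total):
        if rng.random() < exploration_probability(strategy, t, total, epsilon):
            lever = rng.randrange(len(probs))         # exploration
        else:
            lever = best_lever(sums, counts)          # exploitation
        reward = 1 if rng.random() < probs[lever] else 0
        sums[lever] += reward
        counts[lever] += 1
    return sum(sums), counts

for name in ("epsilon-greedy", "epsilon-first", "epsilon-decreasing"):
    print(name, run(name, [0.1, 0.9], 2000))
```

The adaptive (VDBE) and contextual variants would replace <code>exploration_probability</code> with a rule driven by value differences or by context; they are not sketched here because their exact update rules are given in the cited papers rather than in this article.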
 
====Probability matching strategies====
Probability matching strategies reflect the idea that the number of pulls for a given lever should ''match'' its actual probability of being the optimal lever. Probability matching strategies, also known as [[Thompson sampling]] or Bayesian bandits,<ref name="Scott2010"/> are surprisingly easy to implement if one can sample from the posterior for the mean value of each alternative.
 
Probability matching strategies also admit solutions to so-called contextual bandit problems.
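
For Bernoulli rewards, sampling from the posterior is particularly simple. The sketch below is one possible illustration, assuming a uniform Beta(1, 1) prior on each lever's success probability so that the posterior after ''s'' successes and ''f'' failures is Beta(''s'' + 1, ''f'' + 1); the function name and the example probabilities are hypothetical. It does not cover the contextual case.

```python
import random

def thompson_sampling(probs, rounds, seed=0):
    """Simulate Thompson sampling on Bernoulli levers with true success
    probabilities `probs`; return per-lever (successes, failures)."""
    rng = random.Random(seed)
    k = len(probs)
    successes = [0] * k
    failures = [0] * k
    for _ in range(rounds):
        # Draw one plausible mean per lever from its Beta posterior
        # (Beta(1, 1) prior assumed), then play the lever with the largest draw.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        lever = max(range(k), key=samples.__getitem__)
        if rng.random() < probs[lever]:
            successes[lever] += 1
        else:
            failures[lever] += 1
    return successes, failures

s, f = thompson_sampling([0.2, 0.8], 2000)
pulls = [s[i] + f[i] for i in range(2)]
print("pulls per lever:", pulls)
```

Because a lever is played exactly when its posterior draw is the largest, each lever is chosen with the current posterior probability that it is the best one, which is the matching idea described above.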
 
====Pricing strategies====
Pricing strategies establish a ''price'' for each lever. For example, as illustrated by the POKER algorithm,<ref name="Vermorel2005"/> the price can be the sum of the expected reward and an estimate of the extra future rewards that will be gained through the additional knowledge. The lever with the highest price is always pulled.
 
====Strategies with ethical constraints====
These strategies minimize the assignment of any patient to an inferior arm ([[Medical ethics|"physician's duty"]]). In a typical case, they minimize expected successes lost (ESL), that is, the expected number of favorable outcomes that were missed because of assignment to an arm later shown to be inferior. Another version minimizes resources wasted on any inferior, more expensive treatment.<ref name="WHP" />
 
==See also==
* [[Gittins index]]&nbsp;— a powerful, general strategy for analyzing bandit problems.
* [[Optimal stopping]]
* [[Search theory]]
* [[Greedy algorithm]]
 
==References==
<references>
 
<ref name="Gittins89">
{{citation
| last = Gittins | first = J. C. | author-link = John C. Gittins
| isbn = 0-471-92059-2
| location = Chichester
| publisher = John Wiley & Sons, Ltd.
| series = Wiley-Interscience Series in Systems and Optimization.
| title = Multi-armed bandit allocation indices
| year = 1989}}
</ref>
 
<ref name="BF">
{{citation
| last1 = Berry | first1 = Donald A. | author1-link = Don Berry (statistician)
| last2 = Fristedt | first2 = Bert
| isbn = 0-412-24810-7
| location = London
| publisher = Chapman & Hall
| series = Monographs on Statistics and Applied Probability
| title = Bandit problems: Sequential allocation of experiments
| year = 1985}}
</ref>
 
<ref name="Whittle79">
{{citation
| last = Whittle | first = Peter | author-link = Peter Whittle
| journal = [[Journal of the Royal Statistical Society]]
| series = Series B
| page = 165
| title = Discussion of Dr Gittins' paper
| volume = 41 |issue=2
| year = 1979 | jstor=2985029}}
</ref>
 
<ref name="Whittle81">
{{citation
| last = Whittle | first = Peter | author-link = Peter Whittle
| doi = 10.1214/aop/1176994469
| journal = Annals of Probability
| pages = 284–292
| title = Arm-acquiring bandits
| volume = 9
| year = 1981
| issue = 2}}
</ref>
 
<ref name="Whittle88">
{{citation
| last = Whittle | first = Peter | author-link = Peter Whittle
| mr = 974588
| journal = Journal of Applied Probability
| pages = 287–298
| title = Restless bandits: Activity allocation in a changing world
| volume = 25A
| year = 1988}}
</ref>
 
<ref name="WHP">
{{Citation| first=William H.|last=Press|year=2009 |url=http://www.pnas.org/content/106/52/22387
|title=Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research
|journal= Proceedings of the National Academy of Sciences| volume=106|pages=22387–22392| pmid=20018711| doi=10.1073/pnas.0912378106| issue=52| pmc=2793317| postscript=.}}
</ref>
 
<ref name="Scott2010">
{{citation
| last = Scott | first = S.L.
| doi = 10.1002/asmb.874
| journal = Applied Stochastic Models in Business and Industry
| pages = 639–658
| title = A modern Bayesian look at the multi-armed bandit
| volume = 26
| year = 2010
| issue = 2}}
</ref>
 
<ref name="Vermorel2005">
{{citation
| url = http://bandit.sourceforge.net/Vermorel2005poker.pdf
| last1 = Vermorel | first1 = Joannes
| last2 = Mohri | first2 = Mehryar
| publisher = Springer
| series = In European Conference on Machine Learning
| pages = 437–448
| title = Multi-armed bandit algorithms and empirical evaluation
| year = 2005
}}
</ref>
 
<ref name="Bouneffouf2012">
{{cite doi|10.1007/978-3-642-34487-9_40}}
</ref>
<ref name="Tokic2010">
{{citation
| last = Tokic | first = Michel
| chapter = Adaptive ε-greedy exploration in reinforcement learning based on value differences
| doi = 10.1007/978-3-642-16111-7_23
| pages = 203–210
| publisher = Springer-Verlag
| series = Lecture Notes in Computer Science
| title = KI 2010: Advances in Artificial Intelligence
| volume = 6359
| year = 2010
| url = http://www.tokic.com/www/tokicm/publikationen/papers/AdaptiveEpsilonGreedyExploration.pdf
| isbn = 978-3-642-16110-0}}.
</ref>
<ref name="TokicPalm2011">
{{citation
| last1 = Tokic | first1 = Michel
| last2 = Palm | first2 = Günther
| chapter = Value-Difference Based Exploration: Adaptive Control Between Epsilon-Greedy and Softmax
| pages = 335–346
| publisher = Springer-Verlag
| series = Lecture Notes in Computer Science
| title = KI 2011: Advances in Artificial Intelligence
| volume = 7006
| year = 2011
| url = http://www.tokic.com/www/tokicm/publikationen/papers/KI2011.pdf
| isbn = 978-3-642-24455-1}}.
</ref>
 
</references>
 
==Further reading==
*{{cite doi|10.1145/1870103.1870106 }}
*{{citation
| last1 = Dayanik | first1 = S.
| last2 = Powell | first2 = W.
| last3 = Yamazaki | first3 = K.
| doi = 10.1239/aap/1214950209
| issue = 2
| journal = Advances in Applied Probability
| pages = 377–400
| title = Index policies for discounted bandit problems with availability constraints
| volume = 40
| year = 2008}}.
*{{citation
| last = Powell | first = Warren B.
| contribution = Chapter 10
| isbn = 0-470-17155-3
| location = New York
| publisher = John Wiley and Sons
| title = Approximate Dynamic Programming: Solving the Curses of Dimensionality
| year = 2007}}.
*{{citation
| last = Robbins | first = H. | author-link = Herbert Robbins
| doi = 10.1090/S0002-9904-1952-09620-8
| journal = [[Bulletin of the American Mathematical Society]]
| pages = 527–535
| title = Some aspects of the sequential design of experiments
| volume = 58  | year = 1952  | issue = 5}}.
*{{citation
| last1 = Sutton | first1 = Richard
| last2 = Barto | first2 = Andrew
| isbn = 0-262-19398-1
| publisher = MIT Press
| title = Reinforcement Learning
| url = http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
| year = 1998}}.
*{{citation
| last = Bouneffouf  | first = Djallel
| contribution = A Contextual-Bandit Algorithm for Mobile Context-Aware Recommender System
| pages = 324–331
| publisher = Springer
| series = Lecture Notes in Computer Science
| title = Neural Information Processing - 19th International Conference, ICONIP 2012, Doha, Qatar, November 12-15,2012, Proceedings, Part III
| volume = 7665
| year = 2012
| url = http://link.springer.com/chapter/10.1007%2F978-3-642-34487-9_40
| isbn = 978-3-642-34486-2}}.
* {{citation
| last = Weber | first = Richard
| issue = 4
| journal = [[Annals of Applied Probability]]
| pages = 1024–1033
| title = On the Gittins index for multiarmed bandits
| volume = 2
| year = 1992
| jstor=2959678
| doi = 10.1214/aoap/1177005588}}.<!-- "The proof from God" according to Whittle's survey of applied probability -->
* {{Citation
|author=[[Michael N. Katehakis|Katehakis, M.]] and C. Derman
|title=Computing Optimal Sequential Allocation Rules in Clinical Trials
|journal=IMS Lecture Notes-Monograph Series
|volume=8
|year=1986
|pages=29–39
|jstor= 4355518
|postscript=.
}}
* {{Citation
|author=[[Michael N. Katehakis|Katehakis, M.]] and  A. F. Veinott, Jr.
|title=The multi-armed bandit problem: decomposition and computation
|journal=Mathematics of Operations Research
|volume=12
|year=1987
|pages=262–268
|jstor= 3689689
|issue=2
|doi=10.1287/moor.12.2.262
|postscript=.
}}
 
==External links==
*[http://mloss.org/software/view/415/ PyMaBandits], recent open source implementation of many of the best bandits strategies in Python and Matlab on mloss.org
*[http://bandit.sourceforge.net bandit.sourceforge.net Bandit project ], open source implementation of many bandit strategies at sourceforge.net
* [http://www.cs.washington.edu/research/jair/volume4/kaelbling96a-html/node6.html Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case]
* Tutorial: Introduction to Bandits: Algorithms and Theory. [http://techtalks.tv/talks/54451/ Part1]. [http://techtalks.tv/talks/54455/ Part2].
* [http://www.feynmanlectures.info/exercises/Feynmans_restaurant_problem.html Feynman's restaurant problem], a classic example (with known answer) of the exploitation vs. exploration tradeoff.
* [http://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html Bandit algorithms vs. A-B testing].
* [http://homes.di.unimi.it/~cesabian/Pubblicazioni/banditSurvey.pdf A survey on Bandits]
 
{{DEFAULTSORT:Multi-Armed Bandit}}
[[Category:Sequential methods]]
[[Category:Sequential experiments]]
[[Category:Stochastic optimization]]
[[Category:Machine learning]]

Latest revision as of 06:55, 4 December 2014
