|
|
(One intermediate revision by one other user not shown) |
Line 1: |
Line 1: |
| In [[data mining]], '''''k''-means++'''<ref>{{Cite conference
| | This particular Tv programs had been a alternative to your 2010 Samsung 32" inspired Tv programs we promoted. Purchasing the Television applied by just Amazon . com Warehouse offers, e was able to uncover the product just for pertaining to $350 after offering my favorite 2010 Samsung to make $250, we.at the. exclusively $100 for all the rise and is titanic. For $100, e moved through 32" in order to 39" and display quality is so so much better that it's making myself need to observe more videos. The two video's are usually 1080p alongside 60hz recharge. I can't think adequate with regards to how astounded e am because of the LG display quality. Information technology makes us reckon that there must have been huge improvements as part of processor design within the last few several years. On same valuable time costs come process up. As I got my 32" Samsung as part of 2011, this item fees $650. gold watches Nowadays concerning $350, i'm able to find a powerful albeit put TV that may be certainly option greater to make only $350. |
| | title=''k''-means++: the advantages of careful seeding
| |
| | url=http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
| |
| | author=Arthur, D. and Vassilvitskii, S.
| |
| | booktitle=Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
| |
| | pages=1027–1035
| |
| | year=2007
| |
| | publisher=Society for Industrial and Applied Mathematics Philadelphia, PA, USA
| |
| }}</ref><ref>http://theory.stanford.edu/~sergei/slides/BATS-Means.pdf Slides for presentation of method by Arthur, D. and Vassilvitskii, S.
| |
| </ref> is an algorithm for choosing the initial values (or "seeds") for the [[k-means clustering|''k''-means clustering]] algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the [[NP-hard]] ''k''-means problem—a way of avoiding the sometimes poor clusterings found by the standard ''k''-means algorithm. It is similar to the first of three seeding methods proposed, in independent work, in 2006<ref>{{Cite conference
| |
| | title=The Effectiveness of Lloyd-Type Methods for the k-Means Problem
| |
| | author=Ostrovsky, R., Rabani, Y., Schulman, L. J. and Swamy, C.
| |
| | booktitle=Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06)
| |
| | pages=165–174
| |
| | year=2006
| |
| | organization=IEEE
| |
| }}</ref> by Rafail Ostrovsky, Yuval Rabani, Leonard Schulman and Chaitanya Swamy. (The distribution of the first seed is different.)
| |
|
| |
|
| ==Background==
| | An email here's about the manufacture date regarding the LG would be this summer 2013, only one period i collected the TV. Easily put, this 'used' [http://www.youtube.com/watch?v=qb7a5wntp88 flat tv] had been unique over producing line in United mexican states. When purchasing utilized goodies in the course of The amazon online marketplace facility deals, my favorite advice will be pay out the other $10 or even $20 to achieve the merchandise in great position however not which include Brand new. Exploit was displayed just as really Good it would be now unique about the defensive vinyl pieces remained found on the side of TV. The trouble by using fancy New is costs are essentially the comparable to Latest. If you prefer a great deal and also do not need a overused device, opt for good. <br><br> Acquired this to exchange simple Moms' familiar-fashion pipe TV. Throughout design immediately after putting in among the many principal dish service providers, winning a hot in order to connect to your outdated Tv programs ended up being via RGB bond. The unfortunate danger in this situation hook up process had been there became a "degree mp3" hassle in which advertisements and also songs played with a quite noticed high volume (besides the fact that this problem had been allegedly taken out by federal government designate). Anyway, after examining Amazons' products, we went with the LG 32LN5300 make. Some of the HDMI hookup extracted the seem problems and contains a terrific photo. She is fairly happy together with her new TV. A person know to express: should you be planning employ added in stand up, be certain the particular Phillips screwdriver is normally magnetic. Mounting the might the television can be very confusing otherwise like the two higher tighten positions are generally great and do not take care of the specific screws into their growing slots including the bottow couple might. <br><br> Here people are several weeks belated and additionally the brand-new establish came secured and sound, and additionally works well great. the went from the Samsung plasma to this very, but still use Samsung by the house, therefore I definitely will compare him or her on further foundation. I assume it does not material very much since plasmas are being phased out, the imagine to the Samsung plasma is very much better than the most important driven picture located on the VIZIO. |
| The ''k''-means problem is to find cluster centers that minimize the intra-class variance, i.e. the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it).
| |
| Although finding an exact solution to the ''k''-means problem for arbitrary input is NP-hard,<ref>{{cite journal
| |
| | doi=10.1023/B:MACH.0000033113.59016.96 | |
| | title=Clustering Large Graphs via the Singular Value Decomposition
| |
| | author=Drineas, P. and Frieze, A. and Kannan, R. and Vempala, S. and Vinay, V.
| |
| | journal=Machine Learning | |
| | volume=56
| |
| | issue=1–3
| |
| | pages=9–33
| |
| | year=2004
| |
| | publisher=Kluwer Academic Publishers Hingham, MA, USA
| |
| }}</ref> the standard approach to finding an approximate solution (often called [[Lloyd's algorithm]] or the ''k''-means algorithm) is used widely and frequently finds reasonable solutions quickly.
| |
|
| |
|
| However, the ''k''-means algorithm has at least two major theoretic shortcomings:
| | It simply a great deal more potent, smoother, blacker image. Although, each VIZIO is gold watches exactly fine to observe - the hues are generally good, each sharpness is wonderful, shine [http://www.youtube.com/watch?v=-0Hn-ExmUPg 12 volt tv] Volt tvs - [http://www.youtube.com/watch?v=DMDu9mxxsUM youtube.Com], is significantly quite a bit less versus plasma, the seem seriously is not thus horrendous, plus the tuner may seem to work efficiently by means of each of them conventional and additionally raised-def internet tells specifically from connection (simply no connection bundle). the would purchase it yet again. |
| * First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.<ref>{{Cite
| |
| | title=How slow is the ''k''-means method? | |
| | author=Arthur, D. and Vassilvitskii, S.
| |
| | booktitle=Proceedings of the twenty-second annual symposium on Computational geometry
| |
| | pages=144–153
| |
| | year=2006
| |
| | organization=ACM New York, NY, USA
| |
| }}</ref>
| |
| * Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.
| |
| | |
| The ''k''-means++ algorithm addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard ''k''-means optimization iterations.
| |
| With the ''k''-means++ initialization, the algorithm is guaranteed to find a solution that is O(log ''k'') competitive to the optimal ''k''-means solution.
| |
| | |
| ==Example bad case==
| |
| To illustrate the potential of the ''k''-means algorithm to perform arbitrarily poorly with respect to the objective function of minimizing the sum of squared distances of cluster points to the centroid of their assigned clusters, consider the example of four points in '''R'''<sup>2</sup> that form an axis-aligned rectangle whose width is greater than its height.
| |
| | |
| If ''k'' = 2 and the two initial cluster centers lie at the midpoints of the top and bottom line segments of the rectangle formed by the four data points, the ''k''-means algorithm converges immediately, without moving these cluster centers. Consequently, the two bottom data points are clustered together and the two data points forming the top of the rectangle are clustered together—a suboptimal clustering because the width of the rectangle is greater than its height.
| |
| | |
| Now, consider stretching the rectangle horizontally to an arbitrary width. The standard ''k''-means algorithm will continue to cluster the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm perform arbitrarily poorly with respect to the ''k''-means objective function.
| |
| | |
| ==Initialization algorithm==
| |
| The intuition that spreading out the ''k'' initial cluster centers is a good thing is behind this approach: the first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
| |
| | |
| The exact algorithm is as follows:
| |
| # Choose one center uniformly at random from among the data points.
| |
| # For each data point {{math|<var>x</var>}}, compute D({{math|<var>x</var>}}), the distance between {{math|<var>x</var>}} and the nearest center that has already been chosen.
| |
| # Choose one new data point at random as a new center, using a weighted probability distribution where a point {{math|<var>x</var>}} is chosen with probability proportional to D({{math|<var>x</var>}})<sup>2</sup>.
| |
| # Repeat Steps 2 and 3 until {{math|<var>k</var>}} centers have been chosen.
| |
| # Now that the initial centers have been chosen, proceed using standard [[k-means clustering|''k''-means clustering]].
| |
| | |
| This seeding method yields considerable improvement in the final error of {{math|<var>k</var>}}-means. Although the initial selection in the algorithm takes extra time, the {{math|<var>k</var>}}-means part itself converges very quickly after this seeding and thus the algorithm actually lowers the computation time. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets, close to 1000-fold improvements in error. In these simulations the new method almost always performed at least as well as [[Vanilla (computing)|vanilla]] {{math|<var>k</var>}}-means in both speed and error.
| |
| | |
| Additionally, the authors calculate an approximation ratio for their algorithm. The {{math|<var>k</var>}}-means++ algorithm guarantees an approximation ratio O(log ''k'') in expectation (over the randomness of the algorithm), where <math>k</math> is the number of clusters used. This is in contrast to vanilla {{math|<var>k</var>}}-means, which can generate clusterings arbitrarily worse than the optimum.<ref name=kanungo>T. Kanungo, D. Mount, N. Netanyahux, C. Piatko, R. Silverman, A. Wu [http://www.cs.umd.edu/~mount/Papers/kmlocal.pdf "A Local Search Approximation Algorithm for ''k''-Means Clustering"] 2004 Computational Geometry: Theory and Applications.</ref>
| |
| | |
| ==Applications==
| |
| The ''k''-means++ approach has been applied since its initial proposal. In a review by Shindler,<ref>http://web.archive.org/web/20110927100642/http://www.cs.ucla.edu/~shindler/shindler-kMedian-survey.pdf Approximation Algorithms for the Metric ''k''-Median
| |
| Problem</ref> which includes many types of clustering algorithms, the method is said to successfully overcome some of the problems associated with other ways of defining initial cluster-centres for ''k''-means clustering. Lee et al.<ref>http://sir-lab.usc.edu/publications/2008-ICWSM2LEES.pdf Discovering Relationships among Tags and Geotags, 2007</ref> report an application of ''k''-means++ to create geographical cluster of photographs based on the latitude and longitude information attached to the photos. An application to financial diversification is reported by Howard and Johansen.<ref>http://www.cse.ohio-state.edu/~johansek/clustering.pdf Clustering Techniques for Financial Diversification, March 2009</ref> Other support for the method and ongoing discussion is also available online.<ref>http://lingpipe-blog.com/2009/03/23/arthur-vassilvitskii-2007-kmeans-the-advantages-of-careful-seeding/ Lingpipe Blog</ref>
| |
| Since the k-means++ initialization needs k passes over the data, it does not scale very well to large data sets. Bahman Bahmani et al. have proposed a scalable variant of k-means++ called k-means|| which provides the same theoretical guarantees and yet is highly scalable. <ref name=kmeans||>B. Bahmani, B. Moseley, A. Vattani, R. Kumar, S. Vassilvitskii [http://vldb.org/pvldb/vol5/p622_bahmanbahmani_vldb2012.pdf "Scalable K-means++"] 2012 Proceedings of the VLDB Endowment.</ref>
| |
| | |
| == Software ==
| |
| {{Wikibooks|K-Means++}}
| |
| * [[ELKI]] data-mining framework contains multiple k-means variations, including k-means++ for seeding.
| |
| * [[GNU R]] includes k-means, and the "flexclust" package can do k-means++
| |
| * [[OpenCV]] [http://docs.opencv.org/modules/core/doc/clustering.html#kmeans implementation]
| |
| * [[Weka (machine learning)|Weka]] contains k-means (with optional k-means++) and x-means clustering.
| |
| * [http://www.stanford.edu/~darthur/kmpp.zip David Arthur's implementation] {{dead link|date=May 2013}}
| |
| * [http://commons.apache.org/math/api-2.1/index.html?org/apache/commons/math/stat/clustering/KMeansPlusPlusClusterer.html Apache Commons Math Java implementation]
| |
| * [http://docs.graphlab.org/clustering.html CMU's GraphLab] Efficient, open source clustering on multicore.
| |
| | |
| ==References==
| |
| {{Reflist}}
| |
| | |
| [[Category:Data clustering algorithms]]
| |
| [[Category:Statistical algorithms]]
| |
This particular Tv programs had been a alternative to your 2010 Samsung 32" inspired Tv programs we promoted. Purchasing the Television applied by just Amazon . com Warehouse offers, e was able to uncover the product just for pertaining to $350 after offering my favorite 2010 Samsung to make $250, we.at the. exclusively $100 for all the rise and is titanic. For $100, e moved through 32" in order to 39" and display quality is so so much better that it's making myself need to observe more videos. The two video's are usually 1080p alongside 60hz recharge. I can't think adequate with regards to how astounded e am because of the LG display quality. Information technology makes us reckon that there must have been huge improvements as part of processor design within the last few several years. On same valuable time costs come process up. As I got my 32" Samsung as part of 2011, this item fees $650. gold watches Nowadays concerning $350, i'm able to find a powerful albeit put TV that may be certainly option greater to make only $350.
An email here's about the manufacture date regarding the LG would be this summer 2013, only one period i collected the TV. Easily put, this 'used' flat tv had been unique over producing line in United mexican states. When purchasing utilized goodies in the course of The amazon online marketplace facility deals, my favorite advice will be pay out the other $10 or even $20 to achieve the merchandise in great position however not which include Brand new. Exploit was displayed just as really Good it would be now unique about the defensive vinyl pieces remained found on the side of TV. The trouble by using fancy New is costs are essentially the comparable to Latest. If you prefer a great deal and also do not need a overused device, opt for good.
Acquired this to exchange simple Moms' familiar-fashion pipe TV. Throughout design immediately after putting in among the many principal dish service providers, winning a hot in order to connect to your outdated Tv programs ended up being via RGB bond. The unfortunate danger in this situation hook up process had been there became a "degree mp3" hassle in which advertisements and also songs played with a quite noticed high volume (besides the fact that this problem had been allegedly taken out by federal government designate). Anyway, after examining Amazons' products, we went with the LG 32LN5300 make. Some of the HDMI hookup extracted the seem problems and contains a terrific photo. She is fairly happy together with her new TV. A person know to express: should you be planning employ added in stand up, be certain the particular Phillips screwdriver is normally magnetic. Mounting the might the television can be very confusing otherwise like the two higher tighten positions are generally great and do not take care of the specific screws into their growing slots including the bottow couple might.
Here people are several weeks belated and additionally the brand-new establish came secured and sound, and additionally works well great. the went from the Samsung plasma to this very, but still use Samsung by the house, therefore I definitely will compare him or her on further foundation. I assume it does not material very much since plasmas are being phased out, the imagine to the Samsung plasma is very much better than the most important driven picture located on the VIZIO.
It simply a great deal more potent, smoother, blacker image. Although, each VIZIO is gold watches exactly fine to observe - the hues are generally good, each sharpness is wonderful, shine 12 volt tv Volt tvs - youtube.Com, is significantly quite a bit less versus plasma, the seem seriously is not thus horrendous, plus the tuner may seem to work efficiently by means of each of them conventional and additionally raised-def internet tells specifically from connection (simply no connection bundle). the would purchase it yet again.