'''BIRCH''' (balanced iterative reducing and clustering using hierarchies) is an unsupervised [[data mining]] algorithm used to perform [[Data clustering|hierarchical clustering]] over particularly large data sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric [[data point]]s in an attempt to produce the best-quality clustering for a given set of resources (memory and [[time constraint]]s). In most cases, BIRCH requires only a single scan of the database. In addition, BIRCH is recognized<ref>{{cite journal
| author = Tian Zhang, Raghu Ramakrishnan, Miron Livny
| year = 1996
| title = An Efficient Data Clustering Method for Very Large Databases
}}</ref> as the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively".
 
==Problem with previous methods==
Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case wherein a data set was too large to fit in [[Primary storage|main memory]]. As a result, they incurred substantial overhead in maintaining high clustering quality while minimizing the cost of additional I/O (input/output) operations. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each clustering decision and do not perform heuristic weighting based on the distance between these data points.
 
==Advantages with BIRCH==
* It is local, in that each clustering decision is made without scanning all data points and currently existing clusters.
* It exploits the observation that the data space is not usually uniformly occupied and that not every data point is equally important.
* It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs.
* It is also an incremental method that does not require the whole [[data set]] in advance.
 
==BIRCH Clustering Algorithm==
 
Given a set of <math>N</math> <math>d</math>-dimensional data points <math>\{\vec{X_i}\}</math>, the '''clustering feature''' <math>CF</math> of the set is defined as the triple <math>CF = (N,LS,SS)</math>, where <math>LS = \sum_{i=1}^N \vec{X_i}</math> is the linear sum and <math>SS = \sum_{i=1}^N \vec{X_i}^{\,2}</math> is the square sum of the data points.
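
A key property of clustering features is additivity: the CF of the union of two disjoint subclusters is the componentwise sum of their CF triples, and summary statistics such as a subcluster's centroid and radius can be derived from its CF alone, without the raw points. The following is a minimal Python sketch of these computations; the function names are illustrative, and <math>SS</math> is kept as the scalar sum of squared norms.

<syntaxhighlight lang="python">
import numpy as np

def clustering_feature(points):
    """CF triple (N, LS, SS) of an array of points: LS is the
    per-dimension linear sum, SS the scalar sum of squared norms."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), float((points ** 2).sum())

def merge_cf(cf1, cf2):
    """CF additivity: the CF of two merged subclusters is the
    componentwise sum of their CF triples."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf):
    """Centroid and radius of a subcluster, derived from its CF alone."""
    n, ls, ss = cf
    c = ls / n
    r = np.sqrt(max(ss / n - c @ c, 0.0))  # R^2 = SS/N - ||C||^2
    return c, r
</syntaxhighlight>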
 
Clustering features are organized in a '''CF tree''', a height-[[Self-balancing binary search tree|balanced tree]] with two parameters: [[branching factor]] <math>B</math> and threshold <math>T</math>. Each non-leaf node contains at most <math>B</math> entries of the form <math>[CF_i,child_i]</math>, where <math>child_i</math> is a pointer to its <math>i</math>th [[Tree (data structure)|child node]] and <math>CF_i</math> is the clustering feature representing the associated subcluster. A [[leaf node]] contains at most <math>L</math> entries, each of the form <math>[CF_i]</math>, and also has two pointers, prev and next, which chain all leaf nodes together. The tree size depends on the parameter <math>T</math>. A node is required to fit in a page of size <math>P</math>; <math>B</math> and <math>L</math> are determined by <math>P</math>, so <math>P</math> can be varied for [[performance tuning]]. The CF tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.
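
The node layout and insertion behavior described above might be sketched as follows. This is an illustrative fragment, not the paper's exact procedure: each point <code>x</code> is assumed to be a NumPy array, and node splits on overflow (governed by <math>B</math> and <math>L</math>) are omitted.

<syntaxhighlight lang="python">
import numpy as np

class CFNode:
    """One CF-tree node. Non-leaf nodes hold up to B (CF, child) entry
    pairs; leaf nodes hold up to L bare CF triples and are chained
    together through the prev/next pointers described above."""
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []      # CF triples (N, LS, SS)
        self.children = []     # parallel child pointers (non-leaf only)
        self.prev = None
        self.next = None

def insert(node, x, threshold):
    """Descend toward the entry whose centroid is closest to x; absorb
    x into a leaf CF if the merged radius stays within the threshold,
    otherwise start a new leaf entry."""
    if node.entries:
        i = int(np.argmin([np.linalg.norm(ls / n - x)
                           for (n, ls, ss) in node.entries]))
        if not node.is_leaf:
            insert(node.children[i], x, threshold)
            n, ls, ss = node.entries[i]
            node.entries[i] = (n + 1, ls + x, ss + x @ x)  # update path CF
            return
        n, ls, ss = node.entries[i]
        c = (ls + x) / (n + 1)
        r = np.sqrt(max((ss + x @ x) / (n + 1) - c @ c, 0.0))
        if r <= threshold:                     # merge stays compact enough
            node.entries[i] = (n + 1, ls + x, ss + x @ x)
            return
    node.entries.append((1, x.copy(), float(x @ x)))
</syntaxhighlight>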
 
In the first step, the algorithm scans all data and builds an initial in-memory CF tree using the given amount of memory. In the second step, it scans all the leaf entries in the initial CF tree to rebuild a smaller CF tree, removing outliers and grouping crowded subclusters into larger ones. In the third step, an existing clustering algorithm is used to cluster all leaf entries: here, an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their CF vectors. This also gives the user the flexibility to specify either the desired number of clusters or the desired diameter threshold for clusters. After this step, a set of clusters is obtained that captures the major distribution patterns in the data. However, there might exist minor and localized inaccuracies, which can be handled by an optional fourth step: the centroids of the clusters produced in step 3 are used as seeds, and the data points are redistributed to their closest seeds to obtain a new set of clusters. Step 4 also provides the option of discarding outliers; that is, a point that is too far from its closest seed can be treated as an outlier.
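
For experimentation, an implementation of the algorithm is available in scikit-learn, whose parameters mirror the threshold <math>T</math> and branching factor <math>B</math> described above. A usage sketch, assuming scikit-learn is installed:

<syntaxhighlight lang="python">
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# 10,000 synthetic points in 5 groups, as a stand-in for a large database.
X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)

# threshold corresponds to T and branching_factor to B above;
# n_clusters drives the global clustering (step 3) of the leaf entries.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
</syntaxhighlight>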
 
==Awards==
In 2006, BIRCH received the [[SIGMOD]] 10-year Test of Time Award.<ref>{{cite web
| url = http://www.sigmod.org/sigmod-awards/citations/2006-sigmod-test-of-time-award-1
| title = 2006 SIGMOD Test of Time Award
}}</ref>
 
==External links==
* http://people.cs.ubc.ca/~rap/teaching/504/2005/slides/Birch.pdf
* http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.2504
 
==Notes==
{{Reflist}}
 
{{DEFAULTSORT:Birch (Data Clustering)}}
[[Category:Data clustering algorithms]]
