Different Types of Clustering Algorithm

Course Curriculum

Different Types of Clustering Algorithm

Different Types of Clustering Algorithm

The introduction to clustering is discussed in this article ans is advised to be understood first.

The clustering Algorithms are of many types. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized.

Distribution based methods

It is a clustering model in which we will fit the data on the probability that how it may belong to the same distribution. The grouping done may be normal or gaussian . Gaussian distribution is more prominent where we have fixed number of distributions and all the upcoming data is fitted into it such that the distribution of data may get maximized . This result in grouping which is shown in figure:-

 

This model works good on synthetic data and diversely sized clusters. But this model may have problem if the constraints are not used to limit model’s complexity. Furthermore, Distribution-based clustering produces clusters which assume concisely defined mathematical models underlying the data, a rather strong assumption for some data distributions.
For Ex- Expectation-maximization algorithm which uses multivariate normal distributions is one of popular example of this algorithm .

Centroid based methods

This is basically one of iterative clustering algorithm in which the clusters are formed by the closeness of data points to the centroid of clusters. Here , the cluster center i.e. centroid is formed such that the distance of data points is minimum with the center. This problem is basically one of NP- Hard problem and thus solutions are commonly approximated over a number of trials.
For Ex- K – means algorithm is one of popular example of this algorithm .

The biggest problem with this algorithm is that we need to specify K in advance. It also has problem in clustering density based distributions.

Connectivity based methods

The core idea of connectivity based model is similar to Centroid based model which is basically defining clusters on the basis of closeness of data points .Here we work on a notion that the data points which are closer have similar behavior as compared to data points that are farther .
It is not a single partitioning of the data set , instead it provides an extensive hierarchy of clusters that merge with each other at certain distances. Here the choice of distance function is subjective. These models are very easy to interpret but it lacks scalability .

For Ex- hierarchical algorithm and it’s variants .

Density Models

In this clustering model there will be a searching of data space for areas of varied density of data points in the data space . It isolates various density regions based on different densities present in the data space .
For Ex- DBSCAN and OPTICS.

 

Subspace clustering

Subspace clustering is an unsupervised learning problem that aims at grouping data points into multiple clusters so that data point at single cluster lie approximately on a low-dimensional linear subspace. Subspace clustering is an extension of feature selection just as with feature selection subspace clustering requires a search method and evaluation criteria but in addition subspace clustering limit the scope of evaluation criteria. Subspace clustering algorithm localize the search for relevant dimension and allow to them to find cluster that exist in multiple overlapping subspaces. Subspace clustering was originally purpose to solved very specific computer vision problem having a union of subspace structure in the data but it gains increasing attention in the statistic and machine learning community. People use this tool in social network, movie recommendation, and biological dataset. Subspace clustering raise the concern of data privacy as many such application involve dealing with sensitive information. Data points are assumed to be incoherentas it only protects the differential privacy of any feature of a user rather than the entire profile user of the database.
There are two branches of subspace clustering based on their search strategy.

  • Top-down algorithms find an initial clustering in the full set of dimension and evaluate the subspace of each cluster.
  • Bottom-up approach finds dense region in low dimensional space then combine to form clusters.
Clustering in Machine Learning (Prev Lesson)
(Next Lesson) K means Clustering – Introduction