# Cluster Analysis

2. SMAK Example

## Introduction to Cluster Analysis

Clustering analyses seek to categorize the pixels in the sample image into different groups according to how similar their spectral signals are.

We can consider the sample image to be a matrix of n pixels, each with m spectral attributes. The m attributes may correspond to the different energies collected across an absorption edge (for a multi-energy map), or they could be PCA components if PCA is applied to the sample image before cluster analysis. Our goal is then to classify each of the n pixels according to its m attributes so that pixels with similar attributes are grouped into the same cluster.
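As a minimal sketch of this data layout, the snippet below rearranges a small multi-energy image stack into an n-by-m pixel matrix using only the Python standard library. The stack values, the 2 x 3 image size, and the helper name `stack_to_matrix` are assumptions made up for illustration; they are not taken from SMAK.

```python
# Hypothetical 2 x 3 pixel map measured at m = 4 energies (made-up values).
# stack[e][row][col] is the intensity of pixel (row, col) at energy index e.
stack = [
    [[1.0, 1.1, 5.0], [0.9, 5.2, 5.1]],
    [[2.0, 2.1, 4.0], [1.9, 4.2, 4.1]],
    [[3.0, 3.1, 3.0], [2.9, 3.2, 3.1]],
    [[4.0, 4.1, 2.0], [3.9, 2.2, 2.1]],
]


def stack_to_matrix(stack):
    """Return (matrix, shape): one row per pixel, one column per attribute."""
    n_rows = len(stack[0])
    n_cols = len(stack[0][0])
    # Walk the image row by row; for each pixel, collect its value in every plane.
    matrix = [
        [plane[r][c] for plane in stack]
        for r in range(n_rows)
        for c in range(n_cols)
    ]
    return matrix, (n_rows, n_cols)


pixels, shape = stack_to_matrix(stack)
print(len(pixels), "pixels with", len(pixels[0]), "attributes each")
```

Keeping the original image shape alongside the matrix makes it possible to map the cluster label of each row back onto its pixel position afterwards.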

There are many types of cluster analyses. Many of them, such as k-means, perform the following basic steps:

(1) Define the number of clusters

(2) Identify cluster centers

(3) Classify each pixel based on how far it is from each cluster center, assigning it to the nearest one (the distance is often the Euclidean distance, i.e., for a pixel at (x, y, z) and a cluster center at (a, b, c), the square root of (x-a)^2 + (y-b)^2 + (z-c)^2)

(4) Iteratively move the cluster centers until clusters are identified that minimize the distance from each pixel to the center of its own group and maximize the distance between pixels from different groups
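The four steps above can be sketched as a basic k-means loop. This is one common algorithm that follows the pattern, shown as an illustration under stated assumptions rather than as the method any particular package uses. The function name `kmeans`, its parameters, the random choice of initial centers, and the sample data are all hypothetical.

```python
import math
import random


def kmeans(pixels, k, n_iter=100, seed=0):
    """Basic k-means sketch: returns (labels, centers) for a list of pixel rows."""
    rng = random.Random(seed)
    # Steps 1 and 2: k is chosen by the caller; the initial centers are
    # guessed by picking k distinct pixels at random (an assumption here).
    centers = [list(p) for p in rng.sample(pixels, k)]
    labels = [-1] * len(pixels)
    for _ in range(n_iter):
        # Step 3: assign each pixel to the nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in pixels
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Step 4: move each center to the mean of the pixels assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == j]
            if members:  # leave a center in place if it attracted no pixels
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up pixels with two attributes each, forming two well-separated groups.
pixels = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
          [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]]
labels, centers = kmeans(pixels, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary: which group ends up labelled 0 or 1 depends on the initial guess, so only the grouping is meaningful.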

The following image illustrates how clustering works. In the image on the left, open crosses indicate initial guesses for cluster centers. The distance of all pixels to the cluster centers is calculated, and then each cluster center is adjusted accordingly (e.g., the open cross in the upper right quadrant is shifted to the position of the black cross). The final, optimized locations for the cluster centers are given by the black crosses in the right-hand image. Different algorithms have different methods of guessing the initial positions of the cluster centers and of optimizing those positions. The end result, illustrated in the image on the right, is that all pixels are grouped into distinct clusters.

M. Lerotic et al., Ultramicroscopy 100 (2004) 35–57

The following image shows examples of how different types of cluster analyses group data points with different distributions together. The name of the cluster analysis is given at the top of each column. The number in the bottom right-hand corner of each plot gives the time needed to perform the cluster analysis. You can see that there is often a trade-off between computational time and how well the cluster analysis represents the different groups of data points.  