Details on PCA and Cluster Analyses

SMAK offers a number of statistical tools to help you classify areas of your sample that are chemically distinct from one another. The goal of this section is to provide you with background on each of these tools so that you understand when you should employ them and how to interpret the results.

These statistical tools include principal component analysis (PCA), which is one of the most common and widely used tools across all types of data analysis, as well as several types of cluster analyses.

PCA

The goal of PCA is to identify those areas of your sample that exhibit the most variation in their signal. The assumption here is that the areas that exhibit the greatest variation in signal are the areas that are chemically distinct.

For example, suppose you have just collected a multi-energy map across the Fe K-edge in order to identify regions of your sample that contain Fe(II) and regions that contain Fe(III). You may have collected maps at two different energies, e.g. 7122 eV [Fe(II)] and 7132 eV [Fe(III)]. Areas with more Fe(II) will have more intensity in the map collected at 7122 eV, whereas areas with more Fe(III) will have more intensity in the map collected at 7132 eV. You could simply compare the 7122 eV map and the 7132 eV map visually to identify these different areas of interest. For some regions there will be enough contrast in the signal that it will be obvious by eye where Fe(II) and Fe(III) are concentrated.

However, the vast majority of the time you try this approach, you will run into two problems. First, the total concentration of Fe in your sample will vary, making it difficult to ascertain by eye whether an increase in intensity in an area of your 7122 eV map (for instance) is caused by an increase in the amount of Fe(II) relative to Fe(III) or simply by more Fe overall in that area. Second, it will be nearly impossible to discriminate between areas with more or less Fe(II) when the areas in question contain a significant proportion of both Fe(II) and Fe(III). This is all to say that you really need some sort of quantitative, statistical tool to discriminate between Fe(II)- and Fe(III)-containing regions of your sample.

For this, we can turn to PCA. The goal of PCA is to re-express the data set (i.e. the map of n pixels, each with m measured intensities) on a new set of axes (a new basis), such that every data point can be written as a linear combination of the new basis vectors.

The new basis is superior because each successive axis (the axes are mutually orthogonal vectors) lies along the direction of greatest remaining variation within the measurements. PCA thus offers a way to view the variation in the signal (i.e. intensities of the wavelengths of interest) in the area mapped by XRF.

Mathematical Basis of PCA

Let’s define a matrix X that comprises your XRF data set. X has n columns, where n is the number of measurements that you collected, i.e. the number of pixels in your map. X has m rows, where m is the number of measurement types, in our case the intensities of each measured wavelength:

\[
X = \begin{bmatrix}
x_{11} & \cdots & x_{1n} \\
\vdots & \ddots & \vdots \\
x_{m1} & \cdots & x_{mn}
\end{bmatrix}
\]

Note that we actually use the mean-adjusted values of the entries of X: we subtract the mean of each row (i.e. of each measurement type) from every entry in that row.
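As a minimal sketch of how X might be assembled, the snippet below flattens a stack of channel maps into an m x n matrix and subtracts each channel's mean. The function name maps_to_matrix and the random stand-in data are assumptions made for illustration; this is not how SMAK itself stores or prepares its channels.

```python
import numpy as np

def maps_to_matrix(channel_maps):
    """Flatten a stack of 2-D maps with shape (m, rows, cols) into a
    mean-centered m x n matrix, where n = rows * cols."""
    stack = np.asarray(channel_maps, dtype=float)
    m = stack.shape[0]
    X = stack.reshape(m, -1)                      # row = channel, column = pixel
    return X - X.mean(axis=1, keepdims=True)      # subtract each channel's mean

# Hypothetical data: two 3 x 4 maps standing in for the 7122 eV and 7132 eV channels
rng = np.random.default_rng(0)
maps = rng.random((2, 3, 4))
X = maps_to_matrix(maps)
print(X.shape)            # m rows (channels), n columns (pixels)
print(X.mean(axis=1))     # each row mean is zero to within rounding error
```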

The goal of PCA is to find an orthonormal matrix, P, which linearly transforms X into a new matrix (which we’ll call Y) under the special condition that the covariance matrix of Y is diagonal: all of its entries are zero except those along the matrix diagonal. The transformation can be expressed in the following equation:

Y = PX

The covariance matrix of X is, by definition, XX^T/(n-1), where X^T is the transpose of X. This means that the ij-th element of the covariance matrix is the dot product of the vector of the i-th measurement type (row i of X) with the vector of the j-th measurement type (row j of X), normalized by n-1. So, each entry is:

\[
\frac{\mathbf{x}_i \cdot \mathbf{x}_j}{n-1}
\]

Elements along the diagonal of the covariance matrix are the variances of the individual measurement types (the condition where i = j), and elements not along the diagonal are the covariances between pairs of measurement types.
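The definition above can be checked numerically. The sketch below builds the covariance matrix by hand from synthetic, mean-centered data and compares it with NumPy's own np.cov; the random data and variable names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))                 # m = 3 channels, n = 500 pixels (synthetic)
X = X - X.mean(axis=1, keepdims=True)         # mean-center each row

n = X.shape[1]
C = X @ X.T / (n - 1)                         # m x m covariance matrix

variances = np.diag(C)                        # diagonal: variance of each channel
cov_01 = C[0, 1]                              # off-diagonal: covariance of channels 0 and 1

# np.cov treats each row as one variable and also normalizes by n - 1
print(np.allclose(C, np.cov(X)))
```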

If we specify that the covariance matrix of Y, YY^T/(n-1), must meet the condition that all non-diagonal entries are zero, what we are really saying is that we are searching for a basis in which the covariance between the transformed measurement types is zero; that is, in the new basis the measurement types are uncorrelated.

P is made up of a set of orthonormal vectors, p_i, which form its rows. To find the P that transforms X into Y, we first find the normalized direction (in m dimensions) along which the variance in X is maximized. This is the vector p_1.

We then find the direction along which the next largest variance is observed, restricting ourselves to directions that are orthogonal to p_1. This is the vector p_2. We repeat this process m times, each time restricting the search to directions orthogonal to all previously chosen vectors, yielding m vectors p_1, ..., p_m, which are our principal components.

Each vector in P is an eigenvector of the covariance matrix of X. An eigenvector is a vector that is mapped to a scalar multiple of itself when operated on by a linear operator; that scalar is the corresponding eigenvalue. Here, the eigenvalue associated with p_i equals the variance of the data along the direction p_i, so the greater the variation along p_i, the greater the eigenvalue. This means that the most important information in the sample is contained within the principal components with the largest eigenvalues. Components with small eigenvalues are typically dominated by noise.
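A short worked derivation may help connect the diagonal-covariance condition to the eigenvectors. The symbols C_X, C_Y, E and \Lambda are introduced here for illustration only and do not appear elsewhere on this page.

```latex
% Covariance of the transformed data Y = PX, in terms of the covariance of X
C_Y = \frac{1}{n-1} Y Y^{T}
    = \frac{1}{n-1} (PX)(PX)^{T}
    = P \left( \frac{1}{n-1} X X^{T} \right) P^{T}
    = P \, C_X \, P^{T}

% C_X is symmetric, so it can be written with an orthonormal matrix E whose
% columns are its eigenvectors and a diagonal matrix \Lambda of its eigenvalues
C_X = E \Lambda E^{T}, \qquad E^{T} E = I

% Choosing the rows of P to be those eigenvectors, P = E^{T}, gives
C_Y = E^{T} \left( E \Lambda E^{T} \right) E = \Lambda
```

The covariance matrix of Y is then diagonal, and its diagonal entries are the eigenvalues, i.e. the variances along each principal component.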

So, let’s review: principal component analysis linearly transforms the data (X) using a matrix, P, whose rows are a set of orthonormal eigenvectors of the covariance matrix of X. In the new basis, (1) each successive axis captures as much of the remaining variance as possible and (2) the covariance between axes is zero.
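The steps reviewed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not SMAK's implementation: the function name pca and the synthetic two-channel data are hypothetical, and the eigenvectors are obtained with np.linalg.eigh rather than by the one-direction-at-a-time search described in the text (both routes give the same components).

```python
import numpy as np

def pca(X):
    """PCA of an m x n matrix (rows = channels, columns = pixels).

    Returns P (rows are the principal components), the eigenvalues in
    descending order, and the transformed data Y = P X.
    """
    Xc = X - X.mean(axis=1, keepdims=True)        # mean-center each channel
    C = Xc @ Xc.T / (Xc.shape[1] - 1)             # covariance matrix of X
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # re-sort to descending variance
    eigvals = eigvals[order]
    P = eigvecs[:, order].T                       # eigenvectors become the rows of P
    return P, eigvals, P @ Xc

# Synthetic data: two channels that share a common signal, so they are strongly correlated
rng = np.random.default_rng(2)
shared = rng.normal(size=1000)
X = np.vstack([shared + 0.1 * rng.normal(size=1000),
               shared + 0.1 * rng.normal(size=1000)])

P, eigvals, Y = pca(X)
C_Y = Y @ Y.T / (Y.shape[1] - 1)                  # covariance of the transformed data
print(np.round(C_Y, 6))                           # off-diagonal entries are zero to rounding error
```

In this example nearly all of the variance ends up in the first component, because the two channels mostly carry the same signal.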

Ultimately, that means we transform the wavelength intensities (in SMAK, the channels) into a set of principal components, which allow us to see the areas of the sample that are the most different from one another. We can usually discard components with small eigenvalues, as these mostly encode noise.
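One way to decide which components to keep is to look at the fraction of the total variance each eigenvalue accounts for, and then reshape the retained components back into maps. The sketch below assumes synthetic data and an arbitrary 99 % threshold; neither the threshold nor the variable names come from SMAK.

```python
import numpy as np

rng = np.random.default_rng(3)
rows, cols, m = 20, 30, 4

# Synthetic stack: one shared spatial signal, weighted per channel, plus weak noise
signal = rng.random((rows, cols))
weights = np.array([1.0, 0.8, 0.5, 0.2])
maps = weights[:, None, None] * signal + 0.01 * rng.normal(size=(m, rows, cols))

X = maps.reshape(m, -1)
X = X - X.mean(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eigh(X @ X.T / (X.shape[1] - 1))
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], eigvecs[:, order].T

explained = eigvals / eigvals.sum()               # fraction of variance per component
cumulative = np.cumsum(explained)
# Smallest number of components whose cumulative variance reaches the (arbitrary) threshold
n_keep = int(np.searchsorted(cumulative, 0.99) + 1)

# Project onto the retained components and fold each one back into a 2-D map
score_maps = (P[:n_keep] @ X).reshape(n_keep, rows, cols)
print(n_keep, score_maps.shape)
```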

Non-negative Matrix Factorization (NMF)

PCA is a common and robust way to examine variation in a data set, while minimizing redundancy (i.e. components that have overlapping information) due to the orthogonality constraint. However, the components can be hard to interpret because they do not have any chemical meaning; they simply map the data onto a new set of vectors that are aligned with the direction of greatest variance in the data.

To obtain components that do hold chemical meaning, we can apply NMF instead. NMF is a matrix factorization problem in which the goal is to come up with a set of components that can be thought of as chemical endmembers.

This problem can be expressed as:

X = WH

X is the data matrix, which has dimensions m x n. m represents different types of measurements. In our case, these are intensities of different wavelengths that we measure (in SMAK, this is the number of channels you select). n is the number of measurements; in our case, this is the number of pixels in the map.

W is a matrix of dimensions m x k. W comprises a set of basis vectors that describe the loading of the data set’s variables onto k bases.

H is a matrix of dimensions k x n. The elements of this matrix give the weight of each of the k basis vectors in each of the n measurements (pixels).

W and H are constrained to be non-negative, hence the “non-negative” in NMF.

X can be written, column by column, as:

x = Wh

That is to say, each data vector x (a column of X) can be expressed as a linear combination of the columns of W, weighted by the components of h (the corresponding column of H).

Importantly, since k < m, WH only represents a good approximation of X if the basis vectors of W represent the most important chemical constituents of X, that is, if they serve as appropriate chemical endmembers.
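To make the factorization concrete, here is a minimal sketch that uses the classic multiplicative update rules to minimize the squared error between X and WH. It is one common way to compute an NMF and is not necessarily the solver SMAK uses; the function name nmf, the iteration count and the synthetic data are assumptions. Note that, unlike PCA, the input must be the non-negative raw intensities rather than mean-centered values.

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative m x n matrix X as W @ H,
    with W (m x k) and H (k x n) both non-negative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps                  # random non-negative starting point
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic data: 5 channels, 200 pixels, built from 2 hypothetical endmembers
rng = np.random.default_rng(4)
W_true = rng.random((5, 2))
H_true = rng.random((2, 200))
X = W_true @ H_true

W, H = nmf(X, k=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.shape, H.shape, rel_error)
```

The recovered columns of W need not match the true endmembers one-to-one: NMF solutions are generally not unique, and the result can depend on the random starting point.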

Simplex Volume Maximization (SiVM)

SiVM employs the same relationship as NMF (where W and H are non-negative):

X = W H

However, SiVM relies on a geometric solution to obtain W and H, which greatly reduces the computational expense. In geometry, a simplex is the simplest shape that can be built from k vertices: a line segment for k = 2, a triangle for k = 3, a tetrahedron for k = 4, and so on. In our case, k is the number of bases in W. SiVM seeks a set of k bases of W that are actual data points. The k data points are chosen such that they maximize the volume of the simplex they span, so that the simplex encloses as much of the data set as possible; each of the k data points thus represents an endmember (or “archetype”) of the whole data set. The coefficients of the matrix H give a measure of the similarity of each data point in X to the corresponding archetypes in W.

The following image shows how a simplex is constructed with k = 2, 3 or 4.

Note how the points are chosen to span the largest amount of the data possible.
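The idea of picking data points that span a large simplex can be illustrated with a simple greedy heuristic: each new vertex is the data point farthest from the flat (point, line, plane, ...) through the vertices already chosen, which is the point that adds the most volume at that step. This is a simplified sketch for intuition only; it is not the published SiVM algorithm or SMAK's implementation, and the function name and test data are hypothetical. It assumes k - 1 does not exceed the number of channels m and that the data are not degenerate.

```python
import numpy as np

def greedy_simplex_vertices(X, k):
    """Greedily pick the indices of k columns of X (m x n) that span a
    large-volume simplex."""
    pts = X.T.astype(float)                       # n points in m dimensions
    # Start from the point farthest from the centroid of the data
    first = int(np.argmax(np.linalg.norm(pts - pts.mean(axis=0), axis=1)))
    chosen = [first]
    residual = pts - pts[first]                   # offsets from the first vertex
    for _ in range(k - 1):
        dist = np.linalg.norm(residual, axis=1)   # height above the current simplex
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        direction = residual[nxt] / dist[nxt]
        # Project out the new direction so later picks are measured against the larger simplex
        residual = residual - np.outer(residual @ direction, direction)
    return chosen

# Synthetic data: three known endmembers plus 200 random mixtures of them
endmembers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rng = np.random.default_rng(5)
mixing = rng.dirichlet(np.ones(3), size=200)      # non-negative weights that sum to 1
points = np.vstack([endmembers, mixing @ endmembers])
X = points.T                                      # m = 2 channels, n = 203 pixels

chosen = greedy_simplex_vertices(X, k=3)
print(sorted(chosen))                             # indices of the selected data points
```

Because every mixture lies inside the triangle formed by the three endmembers, the heuristic selects exactly those three points (indices 0, 1 and 2) in this example. Obtaining H would additionally require fitting each pixel as a non-negative combination of the chosen columns, which is not shown here.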

SiVM clusters data in a very intuitive way: it selects endmembers and then expresses all other points in the data set as a combination of these endmembers. This is the utility and power of SiVM (and NMF, too).

The drawback of SiVM is that there is a trade-off between adequately describing all data points in the data set and choosing redundant endmembers. If too few endmembers are chosen (i.e. too low a value of k), then not all points in the data set will be adequately described (i.e. not all data points will lie within the simplex). In the figure above it can be seen that k = 2 and k = 3 are not adequate to describe the majority of the data set. However, as more endmembers are chosen, the endmembers will start to share characteristics and will no longer signify unique chemical species. In any case, the choice of k matters for the analysis, and it is wise to attempt the same SiVM calculation with a few different values of k.

Romer et al., Functional Plant Biology, 2012, 39, 878–890
