Principal Component Analysis
SMAK offers a number of statistical tools to help you classify areas of your sample that are chemically distinct from one another. The goal of this section is to provide you with background on these tools so that you understand when you should employ them and how to interpret the results.
These statistical tools include principal component analysis (PCA), which is one of the most common and widely used tools across all types of data analysis.
The goal of PCA is to identify those areas of your sample that exhibit the most variation in their signal. The assumption here is that the areas that exhibit the greatest variation in signal are the areas that are chemically distinct.
For example, suppose you have just collected a multi-energy map across the Fe K-edge in order to identify regions of your sample that contain Fe(II) and regions that contain Fe(III). You may have collected maps at two different energies, e.g. 7122 eV [Fe(II)] and 7132 eV [Fe(III)]. Areas with more Fe(II) will have more intensity in the map collected at 7122 eV, whereas areas with more Fe(III) will have more intensity in the map collected at 7132 eV. You could simply compare the 7122 eV and 7132 eV maps visually to identify these different areas of interest. For some regions there will be enough contrast in the signal that it will be obvious by eye where Fe(II) and Fe(III) are concentrated. However, the vast majority of the time you try this approach, you will run into two problems. First, the total concentration of Fe in your sample will vary, making it difficult to ascertain by eye whether an increase in intensity in an area of your 7122 eV map (for instance) is caused by an increase in the amount of Fe(II) relative to Fe(III) or simply by more Fe overall in that particular area. Second, it will be nearly impossible to discriminate between areas with more or less Fe(II) when the areas in question contain a significant proportion of both Fe(II) and Fe(III). This is all to say that you really need some sort of quantitative, statistical tool to discriminate between the Fe(II)- and Fe(III)-containing regions of your sample.
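The first problem above can be made concrete with a small numerical sketch. The per-unit-concentration responses below are illustrative numbers chosen for this example, not real absorption cross sections:

```python
import numpy as np

# Hypothetical per-unit-concentration intensities at the two map energies.
# Columns: [response at 7122 eV, response at 7132 eV] (illustrative values only).
fe2 = np.array([1.0, 0.3])   # Fe(II) responds most strongly at 7122 eV
fe3 = np.array([0.3, 1.0])   # Fe(III) responds most strongly at 7132 eV

# Pixel A: a 50/50 redox mix at low total Fe concentration
pixel_a = 0.5 * fe2 + 0.5 * fe3
# Pixel B: mostly Fe(III), but with twice the total Fe
pixel_b = 2.0 * (0.2 * fe2 + 0.8 * fe3)

print(pixel_a)  # [0.65 0.65]
print(pixel_b)  # [0.88 1.72]
# Pixel B is brighter at 7122 eV than pixel A even though it contains
# proportionally *less* Fe(II): total concentration confounds the contrast.
```

The raw 7122 eV intensity alone therefore cannot tell you which pixel is richer in Fe(II), which is exactly why a statistical decomposition is needed.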
For this, we can turn to PCA. The goal of PCA is to re-express the data set (i.e. the map of n pixels, each with m measured intensities) in a new set of axes (a new basis), chosen such that each original measurement can be recovered as a linear combination of the new basis vectors.
The new basis is superior because each axis is an orthogonal vector that lies along the direction in which there is the greatest amount of variation within the measurements. PCA thus offers a way to view the variation in the signal (i.e. intensities of the wavelengths of interest) in the area mapped by XRF.
Mathematical Basis of PCA
The text below is adapted from "A Tutorial on Principal Component Analysis" by J. Shlens (2005).
Let’s define a matrix X, that comprises your XRF data set. X has n columns, where n is the number of measurements that you collected, i.e. the number of pixels in your map. X has m rows, where m is the measurement type, in our case, the intensities of each measured wavelength:
X = [ x11 … x1n
       ⋮  ⋱  ⋮
      xm1 … xmn ]
Note that we actually use the mean-adjusted values of xmn; that is, we subtract the mean of each measurement type (each row) from every entry in that row.
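The mean-adjustment step is a one-liner in practice. A minimal NumPy sketch, using random numbers in place of real map intensities:

```python
import numpy as np

# X: m measurement types (rows) x n pixels (columns), as defined above.
# Random values stand in for measured intensities here.
rng = np.random.default_rng(0)
X = rng.random((2, 5))

# Subtract the mean of each measurement type (each row) from its entries
X_centered = X - X.mean(axis=1, keepdims=True)

# Every row of the centered matrix now has zero mean
print(X_centered.mean(axis=1))
```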
The goal of PCA is to find an orthonormal matrix, P, which linearly transforms X into a new matrix (which we'll call Y) under the special condition that the covariance matrix of Y is diagonalized: all of its entries are zero except those along the matrix diagonal. The transformation can be expressed in the following equation:
Y = PX
The covariance matrix of X is, by definition, CX = XXᵀ/(n−1). This means that the ijth element of the covariance matrix is the dot product of the vector of the ith measurement type with the vector of the jth measurement type, all normalized by n−1. So, each entry is:

(CX)ij = (1/(n−1)) Σk xik xjk
Elements along the diagonal of XXᵀ/(n−1) are the variances of the individual measurement types (the case where i = j), and elements off the diagonal are the covariances between measurement types.
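This definition can be checked directly in NumPy (with random numbers standing in for mean-centered map data): the diagonal of XXᵀ/(n−1) matches the per-row variances, and the whole matrix matches NumPy's built-in covariance estimator.

```python
import numpy as np

# Mean-centered data: m = 2 measurement types, n = 1000 pixels
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 1000))
X = X - X.mean(axis=1, keepdims=True)

n = X.shape[1]
C = X @ X.T / (n - 1)          # covariance matrix, as defined above

# Diagonal entries are the per-row variances...
assert np.allclose(np.diag(C), X.var(axis=1, ddof=1))
# ...and the whole matrix matches NumPy's built-in estimator
assert np.allclose(C, np.cov(X))
```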
If we specify that Y must meet the condition that all non-diagonal entries are zero, what we are really saying is that we are searching for a basis in which the covariance between the measurement types is zero; that is, the measurement types are uncorrelated.
P is made up of a set of orthogonal vectors, pi. To find P that transforms X into Y, we find the normalized direction (in m-dimensions) along which the variance in X is maximized. This is the vector p1.
We then find another direction in which the next largest variance is observed, restricting ourselves to directions that are orthogonal to the p1 direction. This is the vector p2. We repeat this process m times, yielding m p vectors, which are our principal components.
Each vector in P is an eigenvector of the covariance matrix. An eigenvector is a vector that gives a scalar multiple of itself when operated on by a linear operator; that scalar multiple is called the eigenvalue. The greater the variation along the direction pi, the greater the corresponding eigenvalue. This means that the most important information in the sample is contained within the principal components with the greatest eigenvalues; the pi with the lowest eigenvalues largely capture noise.
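The eigenvector/eigenvalue relationship can be verified numerically. For a small symmetric matrix (the same form as a covariance matrix), each eigenvector v returned by NumPy satisfies Cv = λv:

```python
import numpy as np

# A small symmetric matrix, standing in for a covariance matrix
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is the eigensolver for symmetric matrices;
# columns of eigvecs are the (orthonormal) eigenvectors
eigvals, eigvecs = np.linalg.eigh(C)

# Each eigenvector v satisfies C v = lambda v
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(C @ v, lam * v)

print(eigvals)  # [1. 3.]
```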
So, let’s review: principal component analysis linearly transforms the data (X) using a matrix, P, which comprises a set of orthogonal eigenvectors. The transformation (1) maximizes the variance captured along each successive component and (2) eliminates the covariance between components.
Ultimately, that means we transform the wavelength intensities (in SMAK, the channels) into a set of principal components, which allow us to visualize the areas of the sample that are most distinct from one another. We can discard the components with low eigenvalues, as these encode mostly noise.
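The whole procedure can be sketched end to end in NumPy on a synthetic two-channel map like the Fe(II)/Fe(III) example above. This is a minimal illustration, not SMAK's actual implementation, and the spectral responses are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-channel XRF map: 7122 eV and 7132 eV intensities.
# Total Fe varies pixel to pixel; the redox ratio varies independently.
# (The 1.0/0.3 response factors are illustrative, not real cross sections.)
n_pixels = 500
total_fe = rng.uniform(0.5, 2.0, n_pixels)
frac_fe2 = rng.uniform(0.0, 1.0, n_pixels)
I_7122 = total_fe * (1.0 * frac_fe2 + 0.3 * (1 - frac_fe2))
I_7132 = total_fe * (0.3 * frac_fe2 + 1.0 * (1 - frac_fe2))
X = np.vstack([I_7122, I_7132])          # m = 2 rows, n = 500 columns

# 1. mean-center each measurement type
Xc = X - X.mean(axis=1, keepdims=True)
# 2. covariance matrix and its eigendecomposition
C = Xc @ Xc.T / (n_pixels - 1)
eigvals, eigvecs = np.linalg.eigh(C)
# 3. sort components by decreasing eigenvalue; the rows of P are the p_i
order = np.argsort(eigvals)[::-1]
P = eigvecs[:, order].T
# 4. transform: Y = PX
Y = P @ Xc

# The covariance matrix of Y is diagonal: the components are uncorrelated,
# and the first row of Y carries the largest share of the variance.
assert abs(np.cov(Y)[0, 1]) < 1e-8
```

Each row of Y can then be reshaped back to the map dimensions and displayed as an image; pixels that stand out in a high-eigenvalue component are the chemically distinct areas PCA was asked to find.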
Graphical Basis of PCA
In the graph shown below, a data set (X) exists in three-dimensional space (denoted by variables, "var", 1, 2 and 3). The first principal component, vector p1, is selected to align with the direction of greatest variance within the three-dimensional space comprising X:
This image was taken from: Wold et al., Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987): 37-52.