# Principal Component Analysis


SMAK offers a number of statistical tools to help you classify areas of your sample that are chemically distinct from one another. The goal of this section is to provide you with background on these tools so that you understand when you should employ them and how to interpret the results.

These statistical tools include principal component analysis (PCA), which is one of the most common and widely used tools across all types of data analysis.

The goal of PCA is to identify those areas of your sample that exhibit the most variation in their signal. The assumption here is that the areas that exhibit the greatest variation in signal are the areas that are chemically distinct.

For example, suppose you have just collected a multi-energy map across the Fe K-edge in order to identify regions of your sample that contain Fe(II) and regions that contain Fe(III). You may have collected two different energies, e.g. 7122 eV [Fe(II)] and 7132 eV [Fe(III)]. Areas with more Fe(II) will have more intensity in the map collected at 7122 eV, whereas areas with more Fe(III) will have more intensity in the map collected at 7132 eV. You could simply compare the 7122 eV and 7132 eV maps visually to identify these areas of interest. For some regions there will be enough contrast in the signal that it will be obvious by eye where Fe(II) and Fe(III) are concentrated.

However, most of the time you try this approach, you will run into two problems. First, the total concentration of Fe in your sample will vary, making it difficult to ascertain by eye whether an increase in intensity in an area of your 7122 eV map (for instance) is caused by an increase in the amount of Fe(II) relative to Fe(III) or simply by more Fe overall in that area. Second, it will be nearly impossible to discriminate between areas with more or less Fe(II) when the areas in question contain a significant proportion of both Fe(II) and Fe(III). In short, you need a quantitative, statistical tool to discriminate between Fe(II)- and Fe(III)-containing regions of your sample.

For this, we can turn to PCA. The goal of PCA is to re-express the data set (i.e. the map of n pixels, each with m measured intensities) on a new set of axes (a new basis). Every measurement in the original data set can still be recovered as a linear combination of the new basis vectors, so no information is lost by the change of axes.

The new basis is useful because its axes are mutually orthogonal and are ordered by variance: the first axis lies along the direction of greatest variation within the measurements, the second along the direction of greatest remaining variation, and so on. PCA thus offers a way to view the variation in the signal (i.e. intensities of the wavelengths of interest) in the area mapped by XRF.

## Mathematical Basis of PCA

The text below is adapted from "A Tutorial on Principal Component Analysis" by J. Shlens (2005).

Let’s define a matrix X that comprises your XRF data set. X has n columns, where n is the number of measurements that you collected, i.e. the number of pixels in your map. X has m rows, where m is the number of measurement types, in our case the intensities at each measured wavelength:

$$
X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}
$$

Note that we actually use the mean-adjusted values of each entry: the mean of each measurement type (each row of X) is subtracted from every entry in that row.
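As a minimal sketch of how such a matrix could be assembled, the NumPy snippet below flattens a stack of channel maps into an m x n array and subtracts each row's mean. The data are synthetic and the variable names (`maps`, `X_raw`, `X`) are assumptions made for illustration; this is not SMAK's own code.

```python
import numpy as np

# Hypothetical stack of m = 3 channel maps, each 4 x 5 pixels (synthetic values).
rng = np.random.default_rng(0)
maps = rng.random((3, 4, 5))

# Flatten each map so that rows are measurement types and columns are pixels.
m = maps.shape[0]
X_raw = maps.reshape(m, -1)            # shape (m, n) with n = 4 * 5 = 20

# Subtract each row's mean so every measurement type is centered on zero.
X = X_raw - X_raw.mean(axis=1, keepdims=True)

print(X.shape)                          # (3, 20)
print(np.allclose(X.mean(axis=1), 0))   # True
```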

The goal of PCA is to find an orthonormal matrix, P, which linearly transforms X into a new matrix, Y, under the special condition that the covariance matrix of Y is diagonal: all of its entries are zero except those along the matrix diagonal. The transformation can be expressed in the following equation:

$$
Y = PX
$$

The covariance matrix of X is, by definition, $XX^T/(n-1)$. This means that the $ij$th element of the covariance matrix is the dot product of the vector of the $i$th measurement type with the vector of the $j$th measurement type, normalized by $n-1$. So, each entry is:

$$
\frac{\mathbf{x}_i \cdot \mathbf{x}_j}{n-1}
$$

where $\mathbf{x}_i$ is the $i$th row of X. Elements along the diagonal of the covariance matrix are the variances of the individual measurement types (the condition where $i = j$), and elements off the diagonal are the covariances between measurement types.
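The definition above can be checked numerically. The following sketch uses synthetic data and compares the hand-computed matrix against NumPy's `np.cov`, which treats rows as variables and normalizes by n-1 by default. The variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X_raw = rng.random((3, 50))                      # synthetic: m = 3 rows, n = 50 pixels
X = X_raw - X_raw.mean(axis=1, keepdims=True)    # mean-adjusted rows

n = X.shape[1]
C = X @ X.T / (n - 1)                            # m x m covariance matrix

# Diagonal entries are the per-row variances; off-diagonal entries are covariances.
variances = np.diag(C)

# np.cov subtracts the row means itself, so it is given the raw data.
matches_numpy = np.allclose(C, np.cov(X_raw))
```

Because the covariance matrix is symmetric, the entry for measurement types i and j equals the entry for j and i.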

If we specify that the covariance matrix of Y must have all off-diagonal entries equal to zero, what we are really saying is that we are searching for a basis in which the covariance between the measurement types is zero; that is, the measurement types are uncorrelated.
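The link between the transformation $Y = PX$ and this condition can be written out in one line. The symbols $C_X$ and $C_Y$ (the covariance matrices of X and Y), $E$ and $D$ are introduced here for convenience and follow the standard argument in the tutorial cited above:

$$
C_Y = \frac{1}{n-1} Y Y^T = \frac{1}{n-1} (PX)(PX)^T = P \left( \frac{1}{n-1} X X^T \right) P^T = P \, C_X \, P^T
$$

Since $C_X$ is symmetric, it can be factored as $C_X = E D E^T$, where the columns of $E$ are orthonormal eigenvectors of $C_X$ and $D$ is the diagonal matrix of the corresponding eigenvalues. Choosing $P = E^T$ then gives $C_Y = E^T E D E^T E = D$, which is diagonal as required.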

P is made up of a set of orthonormal vectors, $p_i$ (the rows of P). To find the P that transforms X into Y, we first find the normalized direction (in m dimensions) along which the variance in X is maximized. This is the vector $p_1$.

We then find another direction in which the next largest variance is observed, restricting ourselves to directions that are orthogonal to the $p_1$ direction. This is the vector $p_2$. We repeat this process until m vectors have been selected; these vectors, $p_1, \dots, p_m$, are our principal components.

Each vector in P is an eigenvector of the covariance matrix of X. An eigenvector is a vector that gives a scalar multiple of itself when operated on by a linear operator; that scalar is the corresponding eigenvalue. Here, the eigenvalue associated with $p_i$ equals the variance of the data along the direction $p_i$, so the greater the variation along $p_i$, the greater the eigenvalue. This means that the most important information in the sample is contained within the principal components with the largest eigenvalues. Components with low eigenvalues typically arise from noise.

So, let’s review: principal component analysis linearly transforms the data (X) using a matrix, P, which comprises a set of orthogonal eigenvectors. P has the properties that it (1) maximizes variance and (2) minimizes covariance.
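This review can be sketched numerically with NumPy. The example below is a minimal illustration on synthetic data, not SMAK's implementation; the variable names (`C_X`, `C_Y`, `eigvals`, `P`) are assumptions introduced here. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they are re-sorted from largest to smallest.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 3 measurement types, 200 pixels, with two correlated rows.
base = rng.normal(size=(1, 200))
X_raw = np.vstack([
    3.0 * base + 0.1 * rng.normal(size=(1, 200)),
    1.0 * base + 0.5 * rng.normal(size=(1, 200)),
    0.2 * rng.normal(size=(1, 200)),
])
X = X_raw - X_raw.mean(axis=1, keepdims=True)
n = X.shape[1]
C_X = X @ X.T / (n - 1)

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C_X)
order = np.argsort(eigvals)[::-1]      # largest variance first
eigvals = eigvals[order]
P = eigvecs[:, order].T                # rows of P are the principal components

# Transform the data and compute the covariance in the new basis.
Y = P @ X
C_Y = Y @ Y.T / (n - 1)

is_orthonormal = np.allclose(P @ P.T, np.eye(3))
is_diagonal = np.allclose(C_Y, np.diag(eigvals))
# Each row of P is mapped by C_X onto a scalar multiple of itself.
is_eigenvector = np.allclose(C_X @ P[0], eigvals[0] * P[0])
```

The diagonal of `C_Y` holds the eigenvalues, i.e. the variance captured by each principal component, and the off-diagonal entries vanish to within floating-point error.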

Ultimately, that means we transform the wavelength intensities (in SMAK, the channels) into a set of principal components, which allow us to visualize the areas of the sample that are the most different from one another. We can throw away components with low eigenvalues, as these mostly encode noise.
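One common heuristic for deciding how many components to keep is to look at the fraction of total variance each eigenvalue accounts for. The sketch below uses made-up eigenvalues and an arbitrary 90% cutoff; both are assumptions for illustration only, and this is not something this section says SMAK computes for you. Visual inspection of the component maps remains the primary guide.

```python
import numpy as np

# Hypothetical eigenvalues from a five-channel PCA, sorted largest first.
eigvals = np.array([8.0, 1.5, 0.3, 0.1, 0.1])

# Fraction of the total variance carried by each component, and the running total.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Keep the smallest number of components whose cumulative share reaches the cutoff.
threshold = 0.90                       # assumed cutoff, chosen for illustration
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
```

With these example numbers the first component alone carries 80% of the variance, so two components are needed to pass the 90% cutoff.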

## Graphical Basis of PCA

In the graph shown below, a data set (X) exists in three-dimensional space (denoted by variables, "var", 1, 2 and 3). The first principal component, vector p1, is selected to align with the direction of greatest variance within the three-dimensional space comprising X:

This image was taken from: Wold et al., Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987): 37-52.

## Example of PCA for Image Analysis in SMAK

The following example illustrates how PCA can be used to identify distinct regions in a sample that exhibit different chemical forms of Fe. The sample consists of a biofilm of Fe(II)-oxidizing bacteria, which was collected from a pond and drop-deposited on a plastic microscope slide. Fe(II)-oxidizing bacteria produce Fe(III) oxy(hydr)oxides. In this example, the researcher wanted to know whether all of the Fe present in the biofilm was Fe(III), or whether some Fe(II) remained. Maps were collected at 7122 eV, 7124 eV, 7126 eV, 7130 eV, and 7150 eV.

(1) Select the green "PCA Analysis" button on the main SMAK window. A window will pop up where you can select the channels on which you want to perform PCA. You will also have the option to select what type of PCA or other cluster analysis you would like to perform. PCA is denoted "sPCA" (the "s" stands for standard):

(2) This will generate new channels in the main window of SMAK: "PCA1", "PCA2", etc. The number of PCA channels is equal to the number of input channels: we ran PCA on five energies, so we have five PCA channels. Typically, the first PCA component expresses differences in concentration across the image, and the second component (and sometimes the third, fourth, etc.) expresses chemical differences across the image. The higher-numbered PCA components are more likely to encode noise instead of meaningful signal. Recall that this is because PCA is designed to pick out the direction of highest variation first, followed by the next highest, and so on. You will need to decide which channels encode useful information (these are your principal components) and which encode noise (these components can be neglected). The PCA component maps are shown below. You can see how PCA1 and PCA2 exhibit distinct patterns in intensity, whereas PCA4 and PCA5 appear largely splotchy, with no clear bright spots.
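To make the relationship between input channels and PCA channels concrete, here is a minimal NumPy sketch that turns a stack of channel maps into a stack of component maps of the same shape. SMAK's internals are not described in this section, so the helper `pca_component_maps` is hypothetical and the input data are synthetic.

```python
import numpy as np

def pca_component_maps(maps):
    """Return PCA score images for a stack of channel maps (hypothetical helper).

    maps: array of shape (k, ny, nx), one 2-D map per channel.
    Returns an array of the same shape, ordered by decreasing variance.
    """
    k, ny, nx = maps.shape
    X = maps.reshape(k, -1).astype(float)          # rows = channels, columns = pixels
    X = X - X.mean(axis=1, keepdims=True)          # mean-adjust each channel
    C = X @ X.T / (X.shape[1] - 1)                 # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # reorder to descending
    P = eigvecs[:, order].T
    return (P @ X).reshape(k, ny, nx)              # one score image per component

rng = np.random.default_rng(3)
maps = rng.random((5, 8, 10))                      # synthetic stand-in for five energy maps
components = pca_component_maps(maps)
score_variances = components.reshape(5, -1).var(axis=1, ddof=1)
```

Five input maps yield five component maps, and `score_variances` decreases from the first component to the last, which is why the higher-numbered components tend to look like noise.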

(3) SMAK will also generate a plot of eigenvectors for each component if "Plot PCA vectors" under the "Analyze" menu is checked. Note that this is checked by default. The x-axis displays the channels you have chosen for the PCA analysis, in this case, the energies for our multienergy map. However, these will display as integers from 0 through 4 instead of 7122, 7124, 7126, 7130 and 7150 eV, unless you first assign the values for the x-axis. BEFORE you run PCA:

(i) Under the "Analyze" menu, select "Spectrum Maker"

(ii) A pop up window will appear where you can select the channels on which you will perform PCA (in this example "Fe7122", "Fe7124", etc).

(iii) A pop up window will appear that will ask if the default value is correct. SMAK should auto-read the energy from your channel names. However, if it cannot read the energy correctly, it will try to set the default value to 1. If this is the case, select no. The window will then ask if the energies are equally spaced. That's not true in this example, so select no again. You will then be prompted to enter each energy individually.

(iv) After you follow these steps, you can then run PCA and your eigenvector plot will look like the following:

For PCA, this plot is not particularly intuitive because the eigenvectors are mathematical abstractions that do not yield easily interpretable chemical information. However, this plot can be extremely helpful when running NMF or SiVM.
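As an illustration of the kind of energy assignment described in steps (i) through (iv), the sketch below pulls the trailing number out of channel names such as "Fe7122" and checks whether the resulting energies are equally spaced. The function `energy_from_channel` and the regular expression are assumptions made for this example; they are not SMAK's actual parsing code.

```python
import re

def energy_from_channel(name):
    """Return the trailing number in a channel name as a float, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)$", name)
    return float(match.group(1)) if match else None

# Channel names following the pattern used in this example.
channels = ["Fe7122", "Fe7124", "Fe7126", "Fe7130", "Fe7150"]
energies = [energy_from_channel(c) for c in channels]

# Differences between neighboring energies; equal spacing means a single unique value.
spacings = [b - a for a, b in zip(energies, energies[1:])]
equally_spaced = len(set(spacings)) == 1
```

For these five energies the spacings are 2, 2, 4 and 20 eV, which is why "no" is the right answer when asked whether the energies are equally spaced.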

(4) An important point that is apparent when viewing the above plot is that EIGENVECTORS CAN BE NEGATIVE. However, the SMAK display window plots negative values as zero (this is convenient for XRF images, where we can't have negative intensities), which means that the PCA component images only show the positive part of the signal. To see the negative part, we need to use "map math" to invert the sign of each PCA component. To do this, select the green "map math" button on the main SMAK window. Select each principal component, select multiply, select scalar, and put a "-1" in the box on the right:

Then "Do" and "Save" your calculation and it will appear as a new channel in SMAK:
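The effect of the display clipping and the sign inversion in step (4) can be illustrated with a small NumPy sketch. The array values are made up, and `np.clip` is used only as a stand-in for a display that shows negative values as zero; it is not SMAK's code.

```python
import numpy as np

# Hypothetical PCA component map containing both positive and negative scores.
component = np.array([[ 2.0, -1.5],
                      [-0.5,  0.0]])

# A display that shows negative values as zero keeps only the positive part.
shown = np.clip(component, 0, None)

# Multiplying by -1 first makes the formerly negative regions visible instead.
shown_inverted = np.clip(-1 * component, 0, None)
```

Here `shown` retains only the 2.0 pixel, while `shown_inverted` reveals the pixels that held -1.5 and -0.5, which were invisible in the original display.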

The objective of performing PCA is to find locations that are chemically distinct and then measure XANES at these spots. To do this, we select spots that are bright in each of the principal components, as these are the spots that are most different from one another. Remember that component 1 usually encodes information about concentration, whereas component 2 (and maybe higher) encodes information about different chemical species (this is valid when performing PCA on multienergy maps). So, it is most useful to select bright spots in component 2 (and maybe higher), and a representative point or two from component 1. In this example, two bright spots on PCA 2 were chosen, along with one bright spot on PCA 3 and one dark spot on PCA 2 that corresponded to a moderately bright spot on PCA 1 (it was undesirable to choose very bright spots on PCA 1, so as to avoid highly concentrated Fe areas that might have been affected by self-absorption). The following image shows the selected spots (as big empty circles):
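In SMAK the spots are chosen by eye, but the selection logic described above can also be sketched in code. The example below ranks pixels by their PCA 2 score while skipping the brightest PCA 1 pixels. The maps are synthetic, and the 90th-percentile cutoff and the choice of two spots are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
pca1 = rng.random((6, 6))              # synthetic stand-in for the PCA1 map
pca2 = rng.normal(size=(6, 6))         # synthetic stand-in for the PCA2 map

# Exclude the brightest 10% of PCA1 pixels (assumed cutoff) to steer away
# from highly concentrated areas.
allowed = pca1 < np.quantile(pca1, 0.9)

# Rank the remaining pixels by PCA2 score and take the two brightest.
masked = np.where(allowed, pca2, -np.inf)
flat_order = np.argsort(masked, axis=None)[::-1][:2]
rows, cols = np.unravel_index(flat_order, pca2.shape)
spots = list(zip(rows.tolist(), cols.tolist()))
```

Each entry of `spots` is a (row, column) pixel index that could serve as a candidate location for a XANES measurement.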

The following plot shows the XANES spectra for the four selected points. You can see that we have managed to find distinct Fe spectra:

These spectra were used in fitting of the multienergy (ME) maps to determine the fraction of the different Fe species present across the entire image. It is instructive to compare the PCA results to the XANES fitting results. You can see in the image below that PCA component 2 exhibits hot spots in the locations where Fe(II) was observed (green and red), whereas PCA component 1 was most sensitive to the major Fe(III) species (blue). This underscores the utility of using PCA to choose where to collect XANES spectra: it can pick out small regions of the sample image that are chemically distinct. It is highly unlikely that one would select the small Fe(II)-containing spots for analysis without it.