Non-Negative Matrix Factorization & Simplex Volume Maximization
Non-negative Matrix Factorization (NMF)
PCA is a common and robust way to examine variation in a data set, while minimizing redundancy (i.e. components that have overlapping information) due to the orthogonality constraint. However, the components can be hard to interpret because they do not have any chemical meaning; they simply map the data onto a new set of vectors that are aligned with the direction of greatest variance in the data.
To obtain components that do hold chemical meaning, we can apply NMF instead. NMF is a matrix factorization problem in which the goal is to come up with a set of components that can be thought of as chemical endmembers.
This problem can be expressed as:
X = WH
X is the data matrix, which has dimensions m x n. m represents different types of measurements. In our case, these are intensities of different wavelengths that we measure (in SMAK, this is the number of channels you select). n is the number of measurements; in our case, this is the number of pixels in the map.
W is a matrix of dimensions m x k. W comprises a set of basis vectors that describe the loading of the data set’s variables onto k bases.
H is a matrix of k x n dimensions. The elements of this matrix represent the loading of each k base onto n.
W and H are constrained to be non-negative, hence the term “NMF”
X can be written, column by column, as:
x = Wh
That is to say, that each data vector x can be expressed as a linear combination of the columns of W weighted by the components of h.
Importantly, since k < m, WH only represents a good approximation of X if the basis vectors of W represent the most important chemical constituents of X, that is, if they serve as appropriate chemical endmembers.
Simplex Volume Maximization (SiVM)
SiVM employs the same relationship as NMF (where W and H are non-negative):
X = W H
However, SiVM relies on a geometric solution to obtain W and H, which greatly reduces the computational expense. In geometry, a simplex is a polyhedral shape with k vertices. For instance, a triangle is a simplex with k = 3. In our case, k is the number of bases in W. SiVM seeks to find a set of bases (k) of W that are actual data points. The k data points are chosen such that they maximize the number of data points the simplex encompasses; each k data point thus represents an endmember (or “archetype”) of the whole data set. The coefficients of the matrix H give a measure of the similarity of each element in X to the corresponding archetypes in W.
The following image shows how a simplex is constructed with k = 2, 3 or 4.
Note how the points are chosen to span the largest amount of the data possible.
SiVM clusters data in a very intuitive way: it selects endmembers and then expresses all other points in the data set as a combination of these endmembers. This is the utility and power of SiVM (and NMF, too).
The draw back in using SiVM is that there is a trade-off between adequately describing all data point in the data set and choosing redundant endmembers. If too few endmembers are chosen (i.e., too low a value of k is chosen), then not all points in the data set will be adequately described (i.e. not all data points will lie within the simplex). In the figure above it can be seen that k = 2 and k = 3 are not adequate to describe the majority of the data set. However, as more endmembers are chosen, the endmembers will start to share characteristics; they will not signify unique chemical species. In any case, the choice of k matters for the analysis and it is wise to attempt the same SiVM calculation with a few different values of k.
Romer et al., Functional Plant Biology, 2012, 39, 878–890
m x n
m x k
k x n