Principal Component Analysis
Theory Behind Principal Component Analysis
Residual Standard Deviation (RSD or Real Error RE)
Target Factor Analysis
Target Testing
Principal Component Analysis by Example
Principal Component Analysis and Real Data
PCA and CasaXPS
Viewing the Data in Factor Space

Principal Component Analysis
XPS provides chemical information about a sample, and this sets it apart from other analytical tools. However, the key information sought by the analyst is locked into a data envelope, so powerful algorithms are needed to reduce the data to chemically meaningful quantities. Two approaches have been employed on XPS data:
Curve synthesis and fitting (see Section on Quantification).
Techniques from multivariate statistical analysis of which Principal Component Analysis (PCA) is the most common form.
Curve synthesis is probably the most widely used method for data analysis practised by XPS researchers. Unfortunately, statistically good curve fits are not always physically meaningful and, in many cases, great care must be exercised when choosing the model to describe the data. Any assistance in understanding the model is therefore of great value and it is with this end that Principal Component Analysis is offered as a supplementary tool.
Factor analysis is a field that is as broad as it is deep. It is a mathematically challenging tool that requires knowledge of matrix algebra coupled with a feel for a statistical approach to data interpretation. A true understanding of the subject can only be obtained by studying the literature and through practical experience. The material presented here is therefore only an introduction rather than a complete treatment.
Theory Behind Principal Component Analysis
Factor analysis is a multivariate technique for reducing matrices of data to their lowest dimensionality by use of orthogonal factor space. The challenge is to identify the number of significant factors (principal components) and use this information to model the data using techniques such as Target Transformations or curve fitting.
In XPS the data matrix is composed of spectra where each acquisition channel is viewed as a co-ordinate in an r-dimensional space; r is equal to the number of acquisition channels per spectrum. The problem addressed by PCA is that of determining the number of distinct spectroscopic features present in a particular set of c spectra.
The following example tries to illustrate the nature of the problem. Consider a set of three spectra; each spectrum has three acquisition channels:
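s1 = (4, 3, 6), s2 = (2, 3, 2), s3 = (2, 0, 4)
The data matrix, with one spectrum per column, is given by

$$D = \begin{pmatrix} 4 & 2 & 2 \\ 3 & 3 & 0 \\ 6 & 2 & 4 \end{pmatrix}$$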
These three vectors belong to a 3-dimensional space; however, they do not span 3-dimensional space, for the following reason. If a linear combination of the vectors s1, s2 and s3 is used to construct a new vector v, then v always lies in a plane (a 2-dimensional sub-space of 3-dimensional space). The fact that v lies in a plane is a consequence of the following relationship between the three spectra.
s3 = s1 – s2,
so v = a s1 + b s2 + c s3
= a s1 + b s2 + c (s1 – s2)
= (a + c) s1 + (b – c) s2.
Thus, two principal components exist for the set of three spectra.
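The observation above can be checked mechanically. The following is a minimal sketch in plain Python, not part of CasaXPS; the helper name `matrix_rank` is an illustrative assumption. It confirms that s3 = s1 – s2 and that the three spectra span only two dimensions.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank by Gaussian elimination using exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((i for i in range(rank, n_rows) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for i in range(rank + 1, n_rows):
            factor = m[i][col] / m[rank][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank

s1, s2, s3 = (4, 3, 6), (2, 3, 2), (2, 0, 4)

# One spectrum per row here; the rank is the same for the transposed matrix.
spectra = [s1, s2, s3]
difference = tuple(a - b for a, b in zip(s1, s2))   # s1 - s2
rank = matrix_rank(spectra)

print("s1 - s2 =", difference)
print("rank =", rank)
```

Exact fractions are used so that the rank is not affected by rounding; real spectra contain noise, and for them a tolerance-based method is needed.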
The analysis of the data matrix in the above simple example has been performed by observation. Unfortunately real spectra are not so simple and spotting the linear relationships between the columns of the data matrix requires a more sophisticated approach.
PCA, also known as Eigenanalysis, provides a method for identifying the underlying spectra that form the building blocks for the entire set of spectra. The data matrix is transformed into a new set of r-dimensional vectors. These new vectors span the same subspace as the original columns of the data matrix, however they are now characterised by a set of eigenvalues and eigenvectors. The eigenvalues provide a measure for the significance of the abstract factors with respect to the original data. Various statistics can be computed from these values that aid in identifying the dimensionality of the subspace spanned by the spectra.
The procedure for calculating the abstract factors has its roots in linear least squares theory. In fact, the preferred method is to form a Singular Value Decomposition (SVD) of the data matrix:

$$D = U S V^{T}$$

where D is the data matrix formed from c spectra, each containing r channels. U has the same dimensions as D, while S and V are c by c matrices. S is a diagonal matrix; its diagonal elements are the square roots of the eigenvalues of the correlation matrix

$$Z = D^{T} D$$

The abstract factors are computed from US. The columns of V (that is, the rows of $V^{T}$) are the corresponding eigenvectors of Z; the co-ordinates of the eigenvectors represent the loading for the abstract factors and specify how linear combinations of these factors can be used to reproduce the original data. Including all of the abstract factors with the appropriate loading enables the data to be reproduced to an accuracy limited only by the precision of the Eigenanalysis procedure.
The essential feature of the SVD procedure is to compute the abstract factors so that the factor corresponding to the largest eigenvalue accounts for a maximum of the variation in the data. Subsequent abstract factors are generated such that 1) as much variance as possible is accounted for by each new factor and 2) the newest axis is mutually orthogonal to the set of axes already located. The procedure therefore computes an orthogonal basis set for the subspace spanned by the original data matrix that is oriented with respect to the data in a linear least square sense.
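As an illustration of the decomposition described above, here is a minimal NumPy sketch applied to the three-spectrum example. It assumes the spectra are stored as the columns of D; the variable names are illustrative and this is not the CasaXPS implementation.

```python
import numpy as np

# Columns are the spectra s1, s2, s3 (r = 3 channels, c = 3 spectra).
D = np.array([[4.0, 2.0, 2.0],
              [3.0, 3.0, 0.0],
              [6.0, 2.0, 4.0]])

# Thin SVD: U is r x c, s holds the singular values in descending order,
# and Vt is the transpose of V.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

eigenvalues = s ** 2          # eigenvalues of Z = D^T D
abstract_factors = U * s      # columns of U S, one abstract factor per column
loadings = Vt                 # row k: loading of factor k in each spectrum

# Using every factor reproduces D to within rounding error.
D_full = abstract_factors @ loadings

# Two factors are already sufficient here, because s3 = s1 - s2.
n = 2
D_n = abstract_factors[:, :n] @ loadings[:n, :]

print("eigenvalues:", eigenvalues)
print("max error with 2 factors:", np.abs(D - D_n).max())
```

The third eigenvalue is zero apart from rounding error, which is the numerical counterpart of the statement that only two principal components exist for this data matrix.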
In principle, the number of non-zero eigenvalues is equal to the number of linearly independent vectors in the original data matrix. This is true for well posed problems, but even the presence of errors due to numerical operations will result in small eigenvalues that theoretically should be zero. Numerical errors are an insignificant problem compared to the one presented by the inclusion of experimental error in the calculation. Noise in the data changes the underlying vectors so that almost every data matrix of c spectra with r acquisition channels, where c <= r, will span a c-dimensional subspace. This is true even though the underlying vectors should only span fewer than c dimensions.
Various statistics are available for identifying the most likely dimensionality of a data matrix. These statistics are designed to aid partitioning the abstract factors into primary and secondary factors. The primary factors are those corresponding to the largest n eigenvalues and represent the set of abstract factors that span the true subspace for the data. The secondary factors are those factors that can be associated with the noise and, in principle, can be omitted from subsequent calculations. It is not possible to completely disassociate the true data from the error within the measured data; however, the statistics guide the analyst in choosing the most appropriate number of abstract factors that describe the data and therefore the "best guess" dimensionality for the data matrix.
In the case of XPS spectra the experimental error is known to be the square root of the number of counts in an acquisition channel. Under these circumstances where the experimental error is known, a number of statistics have been proposed for determining the size of the true factor space.
Residual Standard Deviation (RSD or Real Error RE)
An alternative name for the RSD (used by Malinowski) is the Real Error (RE).
The RSD is defined to be:

$$RSD_n = \sqrt{\frac{\sum_{j=n+1}^{c} E_j}{r\,(c-n)}}$$

where $E_j$ is the jth largest eigenvalue and n is the number of abstract factors used to reproduce the data; c spectra, each with r channels, are used to construct the data matrix.
RSDn must be compared against the estimated experimental error. If the value computed for RSDn is approximately equal to the estimated error then the first n abstract factors span the factor space. The dimensionality of the original data matrix is therefore n.
Two further statistics may be derived from RSDn, namely IEn (Imbedded Error) and INDn (Indicator Function), given by:

$$IE_n = RSD_n \sqrt{\frac{n}{c}}, \qquad IND_n = \frac{RSD_n}{(c-n)^2}$$
IEn and INDn are statistics that should decrease as the number of primary abstract factors is increased. Once all the primary factors have been included, these statistics should begin to increase since at this point factors from the noise subspace start to interfere with the accuracy of the data description. This minimum is therefore an indicator of the dimensionality of the data subspace.
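The three statistics can be tabulated directly from a list of eigenvalues. The sketch below uses the standard Malinowski forms, RSDn = sqrt(sum of the secondary eigenvalues / (r(c – n))), IEn = RSDn sqrt(n/c) and INDn = RSDn/(c – n)^2, and assumes r >= c. The function name and the eigenvalues are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math

def pca_error_statistics(eigenvalues, r):
    """Return a list of (n, RSD_n, IE_n, IND_n) for n = 1 .. c-1.

    eigenvalues: all c eigenvalues of the data matrix
    r:           number of channels per spectrum (r >= c is assumed)
    """
    ev = sorted(eigenvalues, reverse=True)
    c = len(ev)
    results = []
    for n in range(1, c):
        secondary_sum = sum(ev[n:])                 # E_(n+1) + ... + E_c
        rsd = math.sqrt(secondary_sum / (r * (c - n)))
        ie = rsd * math.sqrt(n / c)
        ind = rsd / (c - n) ** 2
        results.append((n, rsd, ie, ind))
    return results

# Hypothetical eigenvalues for c = 4 spectra of r = 8 channels.
stats = pca_error_statistics([90.0, 40.0, 1.0, 1.0], r=8)
for n, rsd, ie, ind in stats:
    print(n, round(rsd, 4), round(ie, 4), round(ind, 4))

# The minimum of IND is one indicator of the number of primary factors.
best_n_by_ind = min(stats, key=lambda row: row[3])[0]
```

For these made-up eigenvalues both IE and IND reach their minimum at n = 2. With real data the minimum can be shallow or misplaced, so the statistics should be read alongside the eigenvalues themselves.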
Bartlett proposed using the chi-square criterion for situations similar to XPS data, where the standard deviation varies from one data channel to the next.
The procedure involves reproducing the data matrix using the abstract factors. Each abstract factor is progressively included in a linear combination in the order defined by the size of the eigenvalues and weighted by the co-ordinates of the corresponding eigenvectors. The chi-square value for a set of n abstract factors is computed using:

$$\chi_n^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(d_{ij} - \bar{d}_{ij})^2}{\sigma_{ij}^2}$$

where $d_{ij}$ is an element of the data matrix and $\bar{d}_{ij}$ is the corresponding approximation to the data point constructed from the first n abstract factors with the largest eigenvalues. The standard deviation for XPS data, $\sigma_{ij}$, is the square root of $d_{ij}$.
The expected value for each n is given by $\chi_n^2$ (expected) = (r – n)(c – n). A comparison between the expected value and the computed value is the basis for determining the number of principal components. Both $\chi_n^2$ and its expected value decrease as n increases. $\chi_n^2$ is initially larger than $\chi_n^2$ (expected), but as n increases a crossover occurs. The true dimensionality of the data matrix is chosen to be the value of n for which $\chi_n^2$ is closest to its expected value.
Note that smoothing the data will alter the characteristics of the noise. Performing such pre-processing therefore invalidates the $\chi_n^2$ statistic.
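The chi-square comparison can be sketched as follows. This is a minimal NumPy illustration, assuming Poisson counting statistics (the variance of each point equals the count) and spectra stored as columns. The function names and the small noise-free data set are hypothetical.

```python
import numpy as np

def chi_square(D, n):
    """Chi-square between D and its reproduction from the n largest factors.

    The variance of each data point is taken to be the count d_ij itself.
    """
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    D_n = (U[:, :n] * s[:n]) @ Vt[:n, :]
    variance = np.maximum(D, 1.0)     # guard against empty channels
    return float(np.sum((D - D_n) ** 2 / variance))

def expected_chi_square(r, c, n):
    """Expected chi-square for n factors: (r - n)(c - n)."""
    return (r - n) * (c - n)

# Hypothetical noise-free data: 4 spectra of 6 channels built from 2 shapes.
shape_a = np.array([10.0, 40.0, 90.0, 40.0, 10.0, 5.0])
shape_b = np.array([5.0, 10.0, 30.0, 80.0, 30.0, 10.0])
weights = np.array([[1.0, 0.7, 0.4, 0.1],
                    [0.2, 0.5, 0.8, 1.0]])
D = np.column_stack([shape_a, shape_b]) @ weights   # r = 6, c = 4

r, c = D.shape
for n in range(1, c):
    print(n, chi_square(D, n), expected_chi_square(r, c, n))
```

Because this data set contains no noise, the computed chi-square falls to zero (apart from rounding) once n reaches 2 and never settles near its expected value. That is the behaviour the text warns about when the noise characteristics are absent or have been altered.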
Target Factor Analysis
Principal Component Analysis provides a set of basis vectors that describe the original set of spectra. Although useful as a means of characterising the data, these abstract factors are in general not physically meaningful. Target Factor Analysis is concerned with identifying vectors that can also describe the data, but with the additional property that they are recognisable as spectra rather than simply abstract vectors in an r-dimensional space.
There are numerous methods for transforming the PCA abstract factors to provide vectors that are more open to chemical interpretation. These involve constructing abstract rotation transformations that map the abstract factors into one of the infinite number of alternative basis sets for the factor space. Fortunately, there is a technique which, when coupled with curve synthesis, lends itself to the analysis of XPS data, namely Target Testing.
Once a Principal Component Analysis has been performed, the mathematical bridge between abstract and real solutions is Target Testing. Individual spectra can be evaluated to assess whether the corresponding vector lies in the subspace spanned by the chosen primary abstract factors. The essential feature of Target Testing is to form the projection of the target vector onto the subspace spanned by the primary factors, then compute the predicted target vector using this projection. Statistical tests applied to the predicted and test vectors determine whether these two vectors are one and the same. These tests serve as a means of accepting or rejecting possible fundamental components of the sample.
Ultimately, the goal of target testing is to identify a set of spectra that span the same subspace as the primary abstract factors. Complete models of real factors are tested in the target-combination step. In the combination step the data matrix is reproduced from the real factors (spectra) rather than from abstract factors and by comparing the results for different sets of real factors, the best TFA solution to a problem can be determined.
Testing a target vector x with respect to the chosen set of primary abstract factors involves forming the projection t onto the subspace spanned by the PCA primary abstract factors. The predicted vector $\hat{x}$, calculated using the co-ordinate values of t to load the corresponding abstract factors, is compared to the original target vector. A target vector that belongs to the subspace spanned by the primary abstract factors should result in a predicted vector that is identical to the initial target vector. Errors in the original data matrix and similar errors in the measured target vector mean that the predicted and target vectors differ from each other as well as from the pure target vector x* (x but without error). Estimates for these differences allow a comparison to be made between the predicted and target vectors and a decision as to which targets to include in the target combination step.
The apparent error in the test vector (AET) measures the difference between the test and predicted vectors in a root mean square (RMS) sense. Similarly two other RMS quantities estimate the real error in the target vector (RET) and the real error in the predicted vector (REP). These error estimates form the basis for the SPOIL function defined to be approximately equal to the ratio RET/REP.
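The projection step of Target Testing can be sketched with NumPy as below. This is an illustration only. It uses the orthonormal columns of U as the basis for the primary factor space, which spans the same subspace as the abstract factors US, and it reports only the RMS difference between the target and predicted vectors. RET, REP and the SPOIL function are not reproduced, because their formulas are not given in this text. The function name and the test vectors are hypothetical.

```python
import numpy as np

def target_test(D, n, target):
    """Project `target` onto the span of the n primary abstract factors.

    Returns the predicted vector and the RMS difference between the
    target and the prediction.
    """
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    basis = U[:, :n]                  # orthonormal basis for the factor space
    t = basis.T @ target              # co-ordinates of the projection
    predicted = basis @ t
    rms_difference = float(np.sqrt(np.mean((predicted - target) ** 2)))
    return predicted, rms_difference

# Data matrix from the three-spectrum example (columns s1, s2, s3).
D = np.array([[4.0, 2.0, 2.0],
              [3.0, 3.0, 0.0],
              [6.0, 2.0, 4.0]])

inside = np.array([8.0, 9.0, 10.0])    # s1 + 2*s2: lies in the factor space
outside = np.array([-6.0, 2.0, 3.0])   # perpendicular to both s1 and s2

_, rms_inside = target_test(D, 2, inside)
_, rms_outside = target_test(D, 2, outside)
print("RMS difference, target inside the subspace: ", rms_inside)
print("RMS difference, target outside the subspace:", rms_outside)
```

A target that belongs to the factor space is reproduced to within rounding error, while a target perpendicular to it is not reproduced at all. Real targets fall between these extremes, which is why error estimates are needed to decide whether a target is acceptable.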
Principal Component Analysis by Example
The first example illustrating the characteristics of PCA uses a set of artificial data.
Three sets of spectra prepared from synthetic components are used in the PCA. The structure of the artificial data derives from Carbon 1s states within three compounds, namely, PMMA, PVA and PVC (Figure 1). The proportion of each compound varies throughout each set of ten VAMAS blocks. The data is located in the files c1stest1.vms, c1stest2.vms and c1stest3.vms. The underlying trends introduced into each file are as follows: peaks corresponding to PMMA and PVC obey quadratic adjustments in intensity over the set of ten spectra (PMMA decreases while PVC increases). The difference between the three files is the proportion of PVA in each data envelope. The first file (c1stest1.vms) has a constant level of PVA (Figure 2); the second file (c1stest2.vms) varies linearly, first increasing then decreasing; the third file (c1stest3.vms) includes a linear increase in the level of PVA.
The objective is to show how the statistics used in PCA behave for a known problem. Data matrices constructed from the three sets of spectra should have a dimensionality of three.
Figure 1: Artificial C 1s Data
Note that, although each compound is constructed from a number of C 1s peaks (PMMA 4, PVA 4 and PVC 2), the stoichiometry of these compounds masks the true number of synthetic components actually present in the data. Hence the dimensionality of the data should be three not ten (4+4+2). An additional twist to this example is that two of the underlying envelopes are similar in shape to each other, though not identical (see Figure 1).
The trend throughout the first data set may be seen in Figure 2.
Figure 2: c1s_test1.vms C 1s Spectra
No noise is present in the data; therefore eigenvalues belonging to the primary set of three abstract factors should be non-zero, while the remaining seven eigenvalues should be zero. The results of applying PCA to these data sets (Table 1) illustrate the uncertainty associated with estimating the dimensionality of the data matrix from the statistics. The fourth largest eigenvalue in each case is small but non-zero. Also, the statistics for IE and IND indicate a minimum at eigenvalues other than the expected result. Chi-square is not a valid statistic since no noise is present in the data; however, it does show that three abstract factors are sufficient to reproduce the data to within reason.
Table 1: PCA report for file c1stest.vms
It is interesting to see how the eigenvalues change with respect to the three data sets (Figure 3 and Figure 4). The same spectra varied in different ways results in slightly different orientations for the principal component axes and hence different eigenvalues.
The PCA statistics IE and IND have implied a dimensionality other than three (Table 1). The clue to the correct dimensionality of the data lies in the relative size of the eigenvalues. In two cases the fourth eigenvalue is more than five orders of magnitude smaller than the third eigenvalue. This statement has been made with the benefit of a good understanding of what is present in the data. In real situations such statements are themselves suspect and so require support from other data reduction techniques. For example, curve fitting using three sets of synthetic peaks, all linked with the appropriate stoichiometric relationships, would lend support to the hypothesis. Curve fitting such structures is not an exact science, and such fits should themselves be supported by statistics gathered from the fitting parameters.
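The drop between the third and fourth eigenvalues can be reproduced with a small synthetic data set. The sketch below does not use the c1stest files. The Gaussian envelopes, the energy axis and the trend functions are assumptions chosen to mimic the description above: three envelopes, two of them similar in shape, with one quadratic decrease, one constant level and one quadratic increase over ten spectra.

```python
import numpy as np

energy = np.linspace(280.0, 295.0, 151)   # hypothetical binding-energy axis (eV)

def gaussian(centre, width):
    return np.exp(-0.5 * ((energy - centre) / width) ** 2)

# Three hypothetical envelopes; the first two are deliberately similar.
envelope_a = gaussian(285.0, 0.8) + 0.5 * gaussian(288.8, 0.8)
envelope_b = gaussian(285.2, 0.8) + 0.5 * gaussian(288.4, 0.8)
envelope_c = gaussian(286.5, 0.8)
envelopes = np.column_stack([envelope_a, envelope_b, envelope_c])

x = np.linspace(0.0, 1.0, 10)             # position within the ten spectra
proportions = np.vstack([(1.0 - x) ** 2,        # quadratic decrease
                         np.full_like(x, 0.5),  # constant level
                         x ** 2])               # quadratic increase

D = 1000.0 * envelopes @ proportions      # r = 151 channels, c = 10 spectra
eigenvalues = np.linalg.svd(D, compute_uv=False) ** 2

# Guard the division in case the fourth eigenvalue is exactly zero.
ratio_3_to_4 = eigenvalues[2] / max(eigenvalues[3], np.finfo(float).tiny)
print("eigenvalues:", eigenvalues)
print("E3 / E4:", ratio_3_to_4)
```

In double precision the fourth eigenvalue here is pure rounding error, so the gap is far larger than the five orders of magnitude quoted for the VAMAS files, where the stored data have more limited precision.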
Figure 3: Abstract Factors and Eigenvalues.
Figure 4: Eigenvalues for c1s_test2.vms Abstract Factors
The second example illustrates the effects of experimental error on a PCA calculation.
Table 2: PCA Applied to Artificial Data with Simulated Noise (IND values are reported multiplied by 1000).
Real data includes noise. The effect of noise on a PCA calculation can be seen from Figure 5 together with the report in Table 2. The data in the file c1stest1.vms has been used together with a pseudorandom number generator to simulate noise that would typically be found in XPS data. The consequence of including a random element in the data is that the eigenvalues increase in size and lead to further uncertainty with regard to which eigenvalues belong to the set of primary abstract factors. Note that the abstract factors in Figure 5 are plotted in the reverse order to the ones in Figure 3 and Figure 4.
Figure 5: c1stest1.vms data plus artificial noise.
Fortunately, the chi-square statistic becomes more meaningful when noise is introduced into the problem. A comparison between the computed chi-square and its expected values does seem to point to a 3-dimensional subspace. The crossover between the two quantities suggests the need for three abstract factors when approximating the data matrix using the results of PCA.
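The effect of counting noise on the chi-square comparison can be explored with simulated data. The sketch below is not the calculation behind Table 2. The peak shapes, count levels and random seed are assumptions. Poisson noise is used so that the standard deviation of each point is the square root of the count.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the sketch is repeatable

energy = np.linspace(280.0, 295.0, 101)   # hypothetical energy axis (eV)

def peak(centre):
    return np.exp(-0.5 * ((energy - centre) / 0.8) ** 2)

x = np.linspace(0.0, 1.0, 10)
shapes = np.column_stack([peak(285.0), peak(286.5), peak(288.5)])
proportions = np.vstack([(1.0 - x) ** 2, np.full_like(x, 0.5), x ** 2])

# The flat background is constant across the set, so it merges with the
# constant component and the noise-free dimensionality remains three.
clean = 200.0 + 5000.0 * shapes @ proportions
noisy = rng.poisson(clean).astype(float)

def chi_square(D, n):
    """Chi-square for a reproduction from n factors, with variance = counts."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    D_n = (U[:, :n] * s[:n]) @ Vt[:n, :]
    return float(np.sum((D - D_n) ** 2 / np.maximum(D, 1.0)))

clean_eigenvalues = np.linalg.svd(clean, compute_uv=False) ** 2
noisy_eigenvalues = np.linalg.svd(noisy, compute_uv=False) ** 2

r, c = noisy.shape
for n in range(1, 6):
    print(n, round(chi_square(noisy, n), 1), (r - n) * (c - n))
```

With fewer than three factors the computed chi-square is far above its expected value, and from three factors onward the two quantities are of comparable size. The secondary eigenvalues of the noisy matrix are also much larger than those of the noise-free matrix, which is the increase in eigenvalue size described above.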
Principal Component Analysis and Real Data
XPS depth profiles generate sets of spectra that are ideal for examination via PCA. The spectra are produced by repeatedly performing etch cycles followed by measuring the count rate over an identical energy region. The resulting data set therefore varies in chemical composition with respect to etch time, and the common acquisition conditions provide data in a form that is well suited to PCA.
An aluminium foil, when profiled using a Kratos Analytical Axis Ultra, provides a good example of a data set that can be analysed using some of the features on offer in CasaXPS. The data is not chemically or structurally interesting, but does show how trends can be identified and anomalies isolated.
Figure 6 shows a plot of the Al 2p energy-region profiled against etch time. The data envelopes change in shape as the surface oxide layer is removed by the etch cycles to reveal the homogeneous bulk aluminium metal.
It should also be noted from Figure 6 that the data contains an imperfection. One of the energy scans includes data acquired during an instrumental event. Noise spikes are superimposed on the true data and these should be removed before techniques such as curve synthesis are applied. In this case the spikes appear in the background and are therefore obvious to the eye; however, similar non-physical structure that occurs on the side of a peak is less obvious and could be missed.
The first step in performing the Principal Component Analysis is to define a quantification region for each of the spectra to be used in the analysis. These regions specify the acquisition channels that will be used in the data matrix. Also any shifts in the data due to charging can be removed from the calculation by using an offset in the energy region for those spectra affected.
Next, select the set of spectra in the Browser View and display the data in the active tile. The processing property page labelled "PCA" offers a button labelled "PCA Apply". On pressing this button, those spectra displayed in the active tile are transformed into abstract factors. Figure 7 displays the spectra before the PCA transformation while Figure 8 shows the abstract factors generated from the eigenanalysis.
Note that the abstract factors are truly abstract. The first factor (Figure 8) looks like an Al 2p metal doublet, however this is because the Al 2p metal envelope dominates the data set and therefore a vector having a similar shape accounts for most of the variation in the overall set of spectra. A more even weighting between the underlying line-shapes would produce abstract factors that are less physically meaningful in appearance.
The only real use for the abstract factors is judging their significance with respect to the original data. Abstract vectors that derive from noise look like noise, whereas factors that contribute to the description of the data contain structure. The dividing line between the primary and secondary abstract factors can sometimes be assessed based on the appearance of the abstract factors.
Analysing the Al 2p spectra generates abstract factors and eigenvalues that represent the PCA fingerprint for the data. Table 3 is a report of the Al 2p data set generated by CasaXPS and formatted using a spreadsheet program. Each row of the report is labelled by the VAMAS block name that contains the abstract factor corresponding to the listed eigenvalue.