Introduction

The bulk of this article piggy backs from the work done in this Jupyter notebook [1]. While reviewing the PCA results found in the notebook, I noticed that the standardization process used altered the data in such a way that some information was lost. I’ll explain what I mean by that below.

Required tech

Considering Standardization

First, consider the following matrix:

A = np.array([[0,  1,  6, 15, 12,  1,  0,  0],
              [0,  7, 16,  6,  6, 10,  0,  0],
              [0,  8, 16,  2,  0, 11,  2,  0],
              [0,  5, 16,  3,  0,  5,  7,  0],
              [0,  7, 13,  3,  0,  8,  7,  0],
              [0,  4, 12,  0,  1, 13,  5,  0],
              [0,  0, 14,  9, 15,  9,  0,  0],
              [0,  1,  6, 15, 12,  1,  0,  0]])

This matrix is very similar to the ones constructed from the well-known hand written digit repository found at the UCI machine learning repo [2]. I have made one alteration to the data, which is to make the first and last row contain the same 8 elements. This is akin to the top and bottom of the 0 (the digit this matrix represents) containing the exact same pixel elements.

Have a look and see if you can spot the 0 in the image:

Example of 0-Digit Matrix

The heatmap of the handwritten digit 0 as a matrix.

Let’s standardize just this single matrix:

A_standardized = StandardScaler().fit_transform(A)

>> print(A_standardized)
>> [[0.  -1.06503581 -1.62006839  1.53587336  1.04621744 -1.48609575 -0.87576054  0.]
    [0.   0.97983294  0.92121536 -0.11461742  0.0418487   0.65388213 -0.87576054  0.]
    [0.   1.3206444   0.92121536 -0.84816887 -0.96252004  0.89165745 -0.20851441  0.]
    [0.   0.29821003  0.92121536 -0.66478101 -0.96252004 -0.53499447  1.4596009   0.]
    [0.   0.97983294  0.15883023 -0.66478101 -0.96252004  0.17833149  1.4596009   0.]
    [0.  -0.04260143 -0.09529814 -1.2149446  -0.79512525  1.36720809  0.79235477  0.]
    [0.  -1.40584727  0.41295861  0.43554618  1.54840181  0.41610681 -0.87576054  0.]
    [0.  -1.06503581 -1.62006839  1.53587336  1.04621744 -1.48609575 -0.87576054  0.]]

The data within this matrix has been standardized, and as expected, the first and last rows of the matrix are identical. This makes sense.

However, when the standardization is applied to a data frame where each row is 64 elements (not yet reshaped into the matrices), then we will obtain different values for the first and last row, and thus lose the fact that they were the same to begin with. If you look closely you’ll also see that just about every value is modified in some way. Look at how many different values a 0 is mapped to. This results in information loss and inconsistencies. Here is data and visualizations to show you what I mean:

# standardization code applied to original matrix
# df has shape (3824, 64) - it's loaded from the UCI repo linked above
df_standardized = StandardScaler().fit_transform(df)

# standardized matrix (after applying reshape)
>> print(df_standardized)
>> [[ 0.          0.80578762  0.11187895  0.7497641   0.12088469 -0.80258942 -0.41150365 -0.13531434]
    [-0.02362608  1.65043237  0.99751962 -1.42414335 -0.96579085  0.28715148 -0.54158643 -0.15369919]
    [-0.04146545  1.56407133  1.09041258 -0.79999761 -1.16971881  0.46912891 -0.01285329 -0.11292825]
    [-0.03235924  0.86332244  1.10309145 -1.03900115 -1.53914462 -0.47773704  1.28524666 -0.04857066]
    [-0.03618347  1.54299058  0.85310324 -1.00759414 -1.74717613 -0.20412106  1.17220826  0.        ]
    [-0.08686248  0.88406629  0.85202866 -1.11084563 -1.09563608  0.74489842  0.34094767 -0.09304005]
    [-0.06609584 -0.40816886  1.08179006 -0.17226062  0.98485692 -0.04768401 -0.76374843 -0.19317477]
    [-0.01617327  0.77217283  0.02894122  0.7051445   0.10769183 -0.98683408 -0.52270658 -0.17571686]]

Notice that after the standardization occurs the first and last rows are no longer the same, even though they had the same data to start with.

You can see this visually here:

Pre-Standardized 0-Digit Matrix

The heatmap of the matrix pre-standardization.

Post-Standardized 0-Digit Matrix

The heatmap of the matrix post-standardization.

Notice how the colors change in different places (look at the first and last row closely), indicating that the standardization has altered the matrix in a way that did not retain the original properties of the data.

I’m going to apply a row-based standardization calculation that scales the numbers but keeps their relative magnitudes by row the same.

# new standardization code applied to original matrix
# df has shape (3824, 64) - it's loaded from the UCI repo linked above
df_standardized = (df.T / df.T.sum()).T

# standardized matrix (after applying reshape)
>> [[0.  0.00322581  0.01935484  0.0483871   0.03870968  0.00322581  0.          0.]
    [0.  0.02258065  0.0516129   0.01935484  0.01935484  0.03225806  0.          0.]
    [0.  0.02580645  0.0516129   0.00645161  0.          0.03548387  0.00645161  0.]
    [0.  0.01612903  0.0516129   0.00967742  0.          0.01612903  0.02258065  0.]
    [0.  0.02258065  0.04193548  0.00967742  0.          0.02580645  0.02258065  0.]
    [0.  0.01290323  0.03870968  0.          0.00322581  0.04193548  0.01612903  0.]
    [0.  0.          0.04516129  0.02903226  0.0483871   0.02903226  0.          0.]
    [0.  0.00322581  0.01935484  0.0483871   0.03870968  0.00322581  0.          0.]]

Let’s consider the performance of the PCA’d data using a SVC just as the author in [1] did.

PCA Performance Review

In [1], using an SVC model, the author achieved a 2-Component PCA Accuracy of 55.20%. It’s interesting to note that without using any data standardization technique the 2-Component PCA Accuracy is 61.38%. This indicates that the standardization process might be hurting more than it’s helping. This is the effect I was referring to in the Introduction as “information loss”.

When the row-based standardization process is used the 2-Component PCA Accuracy is 61.83%. I double-checked the numbers and yes, the digits did indeed swap places ( .38 -> .83).

This standardization might be a better choice for the data we are considering for this problem. Let’s apply the loop written in [1] to non-standardized, and row-standardized data.

Best Accuracy according to [1]:   29-component PCA: MSE = 0.1177, Accuracy = 97.05%.
Best Accuracy according to no-standardization:   26-component PCA: MSE = 7391, Accuracy = 98.55%.
Best Accuracy according to row-standardization:   24-component PCA: MSE = 0.08607, Accuracy = 98.66%.

There is only a small difference between standardizing the data and not standardizing the data. However, it appears that on average when considering the number of components used in PCA the technique applied in [1] results in lower accuracy levels overall.

Conclusion

The purpose of [1] wasn’t to push the performance of their model, but rather to help explain the concept of PCA with a classic example. The author did a great job of that. I just wanted to explore the standardization technique they chose, and I wanted to see how things changed when using an algorithm that retained the relative values of the original data. It’s best to always consider any and all transformations we make to our data, and scrutinize whether or not they are making a positive impact on our results.

Sources

[1] https://charlesreid1.github.io/circe/Digit%20Classification%20-%20PCA%20and%20SVC.html
[2] https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Project Code

# this project comes from here: https://charlesreid1.github.io/circe/Digit%20Classification%20-%20PCA%20and%20SVC.html

import matplotlib.pyplot as plt
from matplotlib.pyplot import *
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

desired_width = 320
pd.set_option('display.width', desired_width)
np.set_printoptions(linewidth=desired_width)

testing_df = pd.read_csv('c:/users/JOSEPH~1/desktop/root/projects/_data_warehouse/optdigits.tes', header=None)
X_testing,  y_testing = testing_df.loc[:, 0:63],  testing_df.loc[:, 64]

training_df = pd.read_csv('c:/users/JOSEPH~1/desktop/root/projects/_data_warehouse/optdigits-mod.tra', header=None)
X_training, y_training = training_df.loc[:, 0:63], training_df.loc[:, 64]

print('X_training', type(X_training), X_training.shape)
print('y_training', type(y_training), y_training.shape)

matrix = X_training.loc[0, :].values.reshape(8, 8)
print(matrix, y_training[0])
plt.imshow(matrix)
gca().grid(False)
plt.show()


def get_normed_mean_cov(X):
    #X_standardized = StandardScaler().fit_transform(X)
    X_standardized = (X.T / X.T.sum()).T
    #X_standardized = X

    X_mean = np.mean(X_standardized, axis=0)

    # Automatic:
    # X_cov = np.cov(X_standardized.T)

    # Manual:
    X_cov = (X_standardized - X_mean).T.dot((X_standardized - X_mean)) / (X_standardized.shape[0] - 1)

    return X_standardized, X_mean, X_cov


X_standardized, X_mean, X_cov = get_normed_mean_cov(X_training)
X_standardized_validation, _, _ = get_normed_mean_cov(X_testing)

print(len(X_standardized), len(X_mean), len(X_cov))

mat2 = X_standardized.loc[0, :].values.reshape(8, 8)
#mat2 = X_standardized[0].reshape(8, 8)
print(mat2)
plt.imshow(mat2)
gca().grid(False)
plt.show()

# Make PCA
pca2 = PCA(n_components=2, whiten=True)
pca2.fit(X_standardized)
X_reduced = pca2.transform(X_standardized)

print(X_reduced)

# Make SVC
linclass2 = SVC()
linclass2.fit(X_reduced, y_training)

# To color each point by the digit it represents,
# create a color map with 10 elements (10 RGB values).
# Then, use the system response (y_training), which conveniently
# is a digit from 0 to 9.


def get_cmap(n):
    #colorz = plt.cm.cool
    colorz = plt.get_cmap('Set1')
    return [colorz(float(i)/n) for i in range(n)]


colorz = get_cmap(10)
colors = [colorz[yy] for yy in y_training]

scatter(X_reduced[:, 0], X_reduced[:, 1],
        color=colors, marker='*')
xlabel("Principal Component 0")
ylabel("Principal Component 1")
title("Handwritten Digit Data in\nPrincipal Component Space",size=14)
show()

# Now see how good it is:
# Use the PCA to fit the validation data,
# then use the classifier to classify digits.
X_reduced_validation = pca2.transform(X_standardized_validation)
yhat_validation = linclass2.predict(X_reduced_validation)

y_validation = y_testing

pca2_cm = confusion_matrix(y_validation, yhat_validation)
sns.heatmap(pca2_cm, square=True, cmap='inferno')
title('Confusion Matrix:\n2-Component PCA + SVC')
ylabel('True')
xlabel('Predicted')
show()

total = pca2_cm.sum(axis=None)
correct = pca2_cm.diagonal().sum()
print("2-Component PCA Accuracy: %0.2f %%" % (100.0*correct/total))

# MSE associated with back-projection:
X_orig = X_standardized
X_hat = pca2.inverse_transform(pca2.transform(X_orig))
print(X_hat)
print(X_hat.shape, X_orig.shape)

mse = sum(((X_hat - X_orig)**2).mean(axis=None))*100
#mse = ((X_hat - X_orig)**2).mean(axis=None)
print(mse)

# Make PCA
pca5 = PCA(n_components=5, whiten=True)
pca5.fit(X_standardized)
X_reduced = pca5.transform(X_standardized)

# Make SVC
linclass5 = SVC()
linclass5.fit(X_reduced, y_training)

# Use the PCA to fit the validation data,
# then use the classifier to classify digits.
X_reduced_validation = pca5.transform(X_standardized_validation)
yhat_validation = linclass5.predict(X_reduced_validation)

y_validation = y_testing

pca5_cm = confusion_matrix(y_validation, yhat_validation)
sns.heatmap(pca5_cm, square=True, cmap='inferno')
title('Confusion Matrix:\n5-Component PCA + SVC')
show()

total = pca5_cm.sum(axis=None)
correct = pca5_cm.diagonal().sum()
print("5-Component PCA Accuracy: %0.2f %%" % (100.0*correct/total))


def pca_mse_accuracy(n):
    """
    Creates a PCA model with n components,
    reduces the dimensionality of the validation data set,
    then computes two error metrics:
    * Percent correct predictions using linear classifier
    * Back-projection error (MSE of back-projected X minus original X)
    """

    X_std, _, _ = get_normed_mean_cov(X_training)
    X_std_validation, _, _ = get_normed_mean_cov(X_testing)

    #########
    # Start by making PCA/SVC
    # Train on training data

    # Make PCA
    pca = PCA(n_components=n, whiten=True)
    pca.fit(X_std)
    X_red = pca.transform(X_std)

    # Make SVC
    linclass = SVC()
    linclass.fit(X_red, y_training)

    ########
    # Now transform validation data

    # Transform inputs and feed them to linear classifier
    y_validation = y_testing
    X_red_validation = pca.transform(X_std_validation)
    yhat_validation = linclass.predict(X_red_validation)

    # Evaluate confusion matrix for predictions
    cm = confusion_matrix(y_validation, yhat_validation)
    total = cm.sum(axis=None)
    correct = cm.diagonal().sum()
    accuracy = 1.0 * correct / total

    # MSE associated with back-projection:
    X_orig = X_std
    X_hat = pca.inverse_transform(pca.transform(X_orig))
    mse = sum(((X_hat - X_orig) ** 2).mean(axis=None))*100

    return mse, accuracy


mses = []
accuracies = []
N = 33
for i in range(1, N):
    m, a = pca_mse_accuracy(i)

    print("%d-component PCA: MSE = %0.4g, Accuracy = %0.2f" % (i, m, a * 100.0))

    mses.append((i, m))
    accuracies.append((i, a))

mses = np.array(mses)
accuracies = np.array(accuracies)

# Now plot the accuracy
plt.plot(accuracies[:, 0],accuracies[:, 1], 'bo-')
plt.title("Percent Correct: Accuracy of Predictions", size=14)
plt.xlabel("Number of Principal Components")
plt.ylabel("Percent Correct")
plt.show()

# MSE, PCA Plot
Nmax = 64
pcafull = PCA(n_components=Nmax, whiten=True)
_ = pcafull.fit(X_standardized)
explained_var_sum = pcafull.explained_variance_.sum()

explained_var = np.cumsum(pcafull.explained_variance_[:N])
explained_var /= explained_var_sum

plot(mses[:, 0], mses[:, 1], '-o', label='MSE')
plot(mses[:, 0], 1.0-mses[:, 1], '-o', label='1.0 - MSE')
plot(range(1, len(explained_var)+1), explained_var, '-o', label='Expl Var')

xlabel('Number of Principal Components')
ylabel('')
title('Mean Square Error, Principal Components Analysis')

legend(loc='best')
show()