Measuring and Reducing Dimensionality#

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import os.path
import subprocess

We will use the sklearn decomposition module below:

from sklearn import decomposition

This module includes matrix decomposition algorithms such as PCA, NMF, and ICA. Most of the algorithms in this module can be regarded as dimensionality reduction techniques.

We will also use the wpca package:

!pip install wpca
import wpca

Helpers for Getting, Loading and Locating Data

def wget_data(url: str):
    local_path = './tmp_data'
    p = subprocess.Popen(["wget", "-nc", "-P", local_path, url], stderr=subprocess.PIPE, encoding='UTF-8')
    rc = None
    while rc is None:
      line = p.stderr.readline().strip('\n')
      if len(line) > 0:
        print(line)
      rc = p.poll()

def locate_data(name, check_exists=True):
    local_path='./tmp_data'
    path = os.path.join(local_path, name)
    if check_exists and not os.path.exists(path):
        raise RuntimeError('No such data file: {}'.format(path))
    return path

Get Data#

wget_data('https://raw.githubusercontent.com/illinois-mlp/MachineLearningForPhysics/main/data/line_data.csv')
wget_data('https://raw.githubusercontent.com/illinois-mlp/MachineLearningForPhysics/main/data/pong_data.hf5')
wget_data('https://raw.githubusercontent.com/illinois-mlp/MachineLearningForPhysics/main/data/cluster_3d_data.hf5')
wget_data('https://raw.githubusercontent.com/illinois-mlp/MachineLearningForPhysics/main/data/cosmo_targets.hf5')
wget_data('https://raw.githubusercontent.com/illinois-mlp/MachineLearningForPhysics/main/data/spectra_data.hf5')

Load Data#

line_data     = pd.read_csv(locate_data('line_data.csv'))
pong_data     = pd.read_hdf(locate_data('pong_data.hf5'))
cluster_3d    = pd.read_hdf(locate_data('cluster_3d_data.hf5'))
cosmo_targets = pd.read_hdf(locate_data('cosmo_targets.hf5'))
spectra_data  = pd.read_hdf(locate_data('spectra_data.hf5'))

Data Dimensionality#

We call the number of features (columns) in a dataset its “dimensionality”. In order to learn how different features are related, we need enough samples to get a complete picture.

For example, imagine splitting each feature at its median value: at a minimum, we would like at least one sample in each of the resulting \(2^D\) bins (D = dimensionality = # of features = # of columns; splitting each of the \(D\) axes into \(r\) pieces gives \(r^D\) bins, just as \(r^D\) is the volume of a D-dimensional hypercube with edge length \(r\), with \(r=2\) in our case). This is a very low bar: it only requires 8 samples with \(D=3\), but requires \(2^{30} > 1\) billion samples with \(D=30\).
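
The same count, evaluated for a few values of \(D\), shows how quickly this minimal requirement becomes impractical (a purely illustrative calculation):

# Minimum number of samples needed to put one sample in each median-split bin.
for D in (3, 10, 30):
    print('D = {:2d}: at least {:,} median-split bins'.format(D, 2 ** D))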

To get a feel for how well sampled your dataset is, estimate how many bins you could split each feature (axis) into and end up with 1 sample per bin (assuming that features are uncorrelated). A value < 2 fails our minimal test above and anything < 5 is a potential red flag.

stats = []
for name in 'line_data', 'cluster_3d', 'cosmo_targets', 'pong_data', 'spectra_data':
    N, D = eval(name).shape
    stats.append([name, N, D, N ** (1 / D)])
pd.DataFrame(stats, columns=('Dataset', 'N', 'D', 'N**(1/D)')).round(3)
         Dataset      N    D  N**(1/D)
0      line_data   2000    3    12.599
1     cluster_3d    500    3     7.937
2  cosmo_targets  50000    6     6.070
3      pong_data   1000   20     1.413
4   spectra_data    200  500     1.011

However, not all features carry equal information and the effective dimensionality of a dataset might be lower than the number of columns. As an extreme example, consider the following 2D data which is effectively 1D since one column has a constant value (zero):

gen = np.random.RandomState(seed=123)
data = pd.DataFrame()
data['x'] = gen.uniform(-1, +1, 50)
data['y'] = np.zeros_like(data['x'])
sns.jointplot(data=data, x='x', y='y');

DISCUSS: Is this data still 1D if (refer to the plots below):

  • we add some small scatter in the \(2^\mathrm{nd}\) dimension?

  • we perform a coordinate rotation so that \(y \sim m x\)?

  • \(y \sim f(x)\) where \(f(x)\) is nonlinear?

The scatter adds new information in a second dimension, but we can approximately ignore it under two assumptions:

  • The relative scaling of the \(x\) and \(y\) columns is meaningful (which is almost certainly not true if these columns have different dimensions - recall our earlier comments about normalizing data).

  • The origin of the scatter is due to measurement error or some other un-informative process.

The rotation does not change the effective dimensionality of the data.

A non-linear relationship between \(x\) and \(y\) also does not change the underlying dimensionality since we could, in principle, perform a non-linear change of coordinates to undo it. However, we can expect that non-linear relationships will be harder to deal with than linear ones.

# Add some scatter in the 2nd dimension.
data['y'] = gen.normal(scale=0.04, size=len(data))
sns.jointplot(data=data, x='x', y='y');
plt.ylim(-1, +1);
# Rotate by 30 deg counter-clockwise.
theta = np.deg2rad(30.)
rotated = data.copy()
rotated['x'] = np.cos(theta) * data['x'] - np.sin(theta) * data['y']
rotated['y'] = np.sin(theta) * data['x'] + np.cos(theta) * data['y']
sns.jointplot(data=rotated, x='x', y='y');
# Use the nonlinear y ~ x ** 2 + x instead of y ~ x.
nonlinear = rotated.copy()
nonlinear['y'] = rotated['y'] + rotated['x'] ** 2
sns.jointplot(data=nonlinear, x='x', y='y');

We will use spectra_data below. Note from the table above that it appears to be severely undersampled with \(N=200\), \(D=500\).

EXERCISE: Plot some rows (samples) of spectra_data using plt.plot(spectra_data.iloc[i], '.') to get a feel for this dataset. What do you think the effective dimensionality of this data is? (Hint: how many independent parameters would you need to generate this data?)

for i in range(10):
    plt.plot(spectra_data.iloc[i], '.')

Each sample is a graph of a smooth function with some noise added. The smooth function has three distinct components:

  • two peaks, with fixed locations and shapes, whose normalizations vary independently.

  • a smooth background with no free parameters.

Since the data could be reproduced from just the two normalization parameters (except for the noise), it has an effective dimensionality of \(d=2\).

Note that the relative normalization of each feature is significant here, so we would not want to normalize this data and lose that information. We refer to each sample as a “spectrum” since it looks similar to spectra obtained in different areas of physics (astronomy, nuclear physics, particle physics, …).

Linear Decompositions#

The goal of a linear decomposition is to automatically identify linear combinations of the original features that account for most of the variance in the data. Note that we are using variance (spread) as a proxy for “useful information”, so it is essential that the relative normalization of our features is meaningful.

If we represent our data with the \(N\times D\) matrix \(X\), then a linear decomposition can be represented as the following matrix multiplication:

https://raw.githubusercontent.com/illinois-mlp/MachineLearningForPhysics/main/img/Dimensionality-LinearDecomposition.png

The \(N\times d\) matrix \(Y\) is a reduced representation of the original data \(X\), with \(d < D\) new features that are linear combinations of the original \(D\) features. We call the new features “latent variables”, since they were already present in \(X\) but only implicitly.

The \(d\times D\) matrix \(M\) specifies the relationship between the old and new features: each row is a unit vector expressing one new feature in terms of the old features. Note that \(M\) is not square when \(d < D\), and its unit vectors are generally not mutually orthogonal (except for the PCA method).
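
As a purely illustrative shape check (random matrices, not a real decomposition), the sketch below builds \(Y\) and \(M\) with these dimensions and confirms that their product has the shape of \(X\):

# Illustrative shape check only: Y is N x d, M is d x D, and Y M is N x D like X.
N, D, d = 200, 500, 2
Y = np.random.normal(size=(N, d))   # reduced representation (latent variables)
M = np.random.normal(size=(d, D))   # each row: one new feature in terms of the old features
print(Y.shape, M.shape, np.dot(Y, M).shape)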

A linear decomposition is not exact (hence the \(\simeq\) above) and there is no “best” prescription for determining \(Y\) and \(M\). Below we review the most popular prescriptions implemented in the sklearn.decomposition module (links are to wikipedia and sklearn documentation):

| Method | sklearn |
| --- | --- |
| Principal Component Analysis | PCA |
| Factor Analysis | FactorAnalysis |
| Non-negative Matrix Factorization | NMF |
| Independent Component Analysis | FastICA |

All methods require that you specify the number of latent variables \(d\) (but you can easily experiment with different values) and are called using (method = PCA, FactorAnalysis, NMF, FastICA):

fit = decomposition.method(n_components=d).fit(X)

The resulting decomposition into \(Y\) and \(M\) is given by:

M = fit.components_
Y = fit.transform(X)

except for FastICA, where   M = fit.mixing_.T.

When \(d < D\), we refer to the decomposition as a “dimensionality reduction”. A useful visualization of how effectively the latent variables capture the interesting information in the original data is to reconstruct the original data using:

\[ \Large X' = Y\, M \]

and compare rows (samples) of \(X'\) with the original \(X\). They will not agree exactly, but if the differences seem uninteresting (e.g., look like noise), then the dimensionality reduction was successful and you can use \(Y\) instead of \(X\) for subsequent analysis.
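
Before wrapping this workflow in the helper function below, here is a minimal sketch of the full cycle on spectra_data (assuming it has been loaded above), using PCA with \(d=2\):

# Minimal fit / transform / reconstruct cycle (a sketch; the demo() helper below is more careful).
X = spectra_data.values
mu = np.mean(X, axis=0)                    # PCA centers the data internally, so add the mean back
fit = decomposition.PCA(n_components=2).fit(X)
M = fit.components_                        # shape (2, 500)
Y = fit.transform(X)                       # shape (200, 2)
Xr = np.dot(Y, M) + mu                     # reconstructed X'
print('RMS reconstruction error: {:.3f}'.format(np.std(Xr - X)))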

We will use the function below to demonstrate each of these in turn (but you can ignore its details unless you are interested):

def demo(method='PCA', d=2, data=spectra_data):
    
    X = data.values
    N, D = X.shape
    
    if method == 'NMF':
        # All data must be positive.
        assert np.all(X > 0)
        # Analysis includes the mean.
        mu = np.zeros(D)
        fit = eval('decomposition.' + method)(n_components=d, init='random').fit(X)
    else:
        mu = np.mean(X, axis=0)
        fit = eval('decomposition.' + method)(n_components=d).fit(X)
    
    # Check that decomposition has the expected shape.
    if method == 'FastICA':
        M = fit.mixing_.T
    else:
        M = fit.components_
    assert M.shape == (d, D)
    Y = fit.transform(X)
    assert Y.shape == (N, d)
    
    # Reconstruct X - mu from the fitted Y, M.
    Xr = np.dot(Y, M) + mu
    assert Xr.shape == X.shape
    
    # Plot pairs of latent vars.
    columns = ['y{}'.format(i) for i in range(d)]
    sns.pairplot(pd.DataFrame(Y, columns=columns))
    plt.show()
    
    # Compare a few samples from X and Xr.
    fig = plt.figure(figsize=(8.5, 4))
    for i,c in zip((0, 6, 7), sns.color_palette()):
        plt.plot(X[i], '.', c=c, ms=5)
        plt.plot(Xr[i], '-', c=c)
    plt.xlim(-0.5, D+0.5)
    plt.xlabel('Feature #')
    plt.ylabel('Feature Value')
    if (data is spectra_data):
        label = r'{}(d={}): $\sigma = {:.3f}$'.format(method, d, np.std(Xr - X))
    else:
        label = r'{}(d={}): $\sigma = {:.1e}$'.format(method, d, np.std(Xr - X))
    plt.text(0.95, 0.9, label, horizontalalignment='right',
             fontsize='x-large', transform=plt.gca().transAxes)

Principal Component Analysis#

PCA is the most commonly used method for dimensionality reduction. The decomposition is uniquely specified by the following prescription (more details here):

  • Find the eigenvectors and eigenvalues of the sample covariance matrix \(C\)

\[ \Large C = \frac{1}{N-1}\, X^T X \]

          which is an empirical estimate of the true covariance matrix using the data \(X\) comprised of \(N\) observations (samples) of \(D\) features (i.e. \(X\) is a \(N \times D\) matrix)

  • Construct \(M\) from the eigenvectors ordered by decreasing eigenvalue (which are all positive) and solve the resulting linear equations for \(Y\). At this point the decomposition is exact with \(d = D\); and \(Y\) and \(M\) are \(N \times D\) and \(D \times D\) matrices, respectively.

  • Shrink \(Y\) and \(M\) from \(D\) to \(d\) rows (\(M\)) or columns (\(Y\)), which makes the decomposition approximate while discarding the least amount of variance in the original data (which we use as a proxy for “useful information”).

https://raw.githubusercontent.com/illinois-mlp/MachineLearningForPhysics/main/img/Dimensionality-PCAdecomposition.png

The full \(M\) matrix (before shrinking \(D\rightarrow d\)) is orthogonal, which means that

\[ \Large M^T = M^{-1} ~~~~~~~~~~~~~~ \text{and} ~~~~~~~~~~~~~~ M M^T = \tilde{1} \]

and satisfies

\[ \Large X^T X = M^T \Lambda M \]

where \(\Lambda\) is a diagonal matrix of the decreasing eigenvalues. Note that this description glosses over some details that you will explore in your homework.

The resulting latent variables are statistically uncorrelated (which is a weaker statement than statistically independent – see below), i.e., the correlation coefficients between different columns of \(Y\) are approximately zero:

\[ \Large \rho(j,k) = \frac{Y_j\cdot Y_k}{|Y_j|\,|Y_k|} \simeq 0 \; . \]
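
The following sketch (not part of the demonstration below) checks this prescription directly on spectra_data: it diagonalizes the sample covariance of the centered data with np.linalg.eigh and compares the leading eigenvectors with sklearn’s PCA components. The overall sign of each eigenvector is arbitrary, so the comparison is only meaningful up to a sign.

# Hand-rolled PCA via eigendecomposition of the sample covariance (sketch).
X = spectra_data.values
Xc = X - np.mean(X, axis=0)                # center each feature
C = np.dot(Xc.T, Xc) / (len(Xc) - 1)       # sample covariance matrix, shape (D, D)
evals, evecs = np.linalg.eigh(C)           # eigh returns eigenvalues in increasing order
order = np.argsort(evals)[::-1]            # reorder by decreasing eigenvalue
M_manual = evecs[:, order[:2]].T           # top-2 eigenvectors as rows, shape (2, D)
fit = decomposition.PCA(n_components=2).fit(X)
print(np.abs(np.abs(M_manual) - np.abs(fit.components_)).max())  # should be tiny
# The latent variables should also be (nearly) uncorrelated:
Y = fit.transform(X)
print(np.corrcoef(Y.T).round(3))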

The PCA demonstration below shows a pairplot of the latent variables from a \(d=2\) decomposition, followed by a reconstruction of some samples (solid curves) compared with the originals (points of the same color).

Note that the reconstructed samples are in some sense better than the originals, since the original noise was associated with small eigenvalues that were trimmed!

demo('PCA', d=2);

DISCUSS: How many clusters do you expect to see in the scatter plot of y0 versus y1 above, based on what you know about this dataset? Can you identify these clusters in the plot above?

We expect to see 4 clusters, corresponding to spectra with:

  • No peaks.

  • Only the lower peak.

  • Only the upper peak.

  • Both peaks.

We already saw that this data can be generated from two flux values, giving the normalization of each peak. Let's assume that y0 and y1 are related to these fluxes to identify the clusters:

  • Points near (-2000, -2000), with very little spread.

  • Points along the horizontal line with y0 ~ -2000.

  • Points along the diagonal line.

  • Points scattered between the two lines.

Factor Analysis#

Factor analysis (FA) often produces similar results to PCA, but is conceptually different.

Both PCA and FA implicitly assume that the data is approximately sampled from a multidimensional Gaussian. PCA then finds the principal axes of the resulting multidimensional ellipsoid, while FA is based on a model for how the original data is generated from the latent variables. Specifically, FA seeks latent variables that are uncorrelated unit Gaussians and allows for different noise levels in each feature, while assuming that this noise is uncorrelated with the latent variables. PCA does not distinguish between “signal” and “noise” and implicitly assumes that the large eigenvalues are more signal-like and small ones more noise-like.

When the FA assumptions about the data (Gaussian latent variables with uncorrelated noise added) are correct, it is certainly the better choice, in principle. In practice, the FA decomposition is more expensive and requires an iterative Expectation-Maximization (EM) algorithm. You should normally try both, but prefer the simpler PCA when the results are indistinguishable.
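
A small sketch of what FA adds beyond PCA: sklearn’s FactorAnalysis exposes the fitted per-feature noise level as noise_variance_ (PCA has no analogue). This assumes spectra_data is loaded as above.

# Fit FA and inspect the estimated per-feature noise variance (sketch).
fa = decomposition.FactorAnalysis(n_components=2).fit(spectra_data.values)
plt.plot(fa.noise_variance_, '.')
plt.xlabel('Feature #')
plt.ylabel('Estimated noise variance');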

demo('FactorAnalysis', d=2)

Non-negative Matrix Factorization#

Most linear factorizations start by centering each feature about its mean over the samples:

\[ \Large X_{ij} \rightarrow X_{ij} - \mu_j \quad , \quad \mu_j \equiv \frac{1}{N} \sum_{i}\, X_{ij} \; . \]

As a result, latent variables are equally likely to be positive or negative.

Non-negative matrix factorization (NMF) assumes that the data are a (possibly noisy) linear superposition of different components, which is often a good description of data resulting from a physical process. For example, the spectrum of a galaxy is a superposition of the spectra of its constituent stars, and the spectrum of a radioactive sample is a superposition of the decays of its constituent unstable isotopes.

These processes can only add to the data, so the elements of \(Y\) and \(M\) should all be \(\ge 0\) if the latent variables describe additive mixtures of different processes. The NMF factorization guarantees that both \(Y\) and \(M\) are non-negative, and requires that the input \(X\) is also non-negative. However, there is no guarantee that the non-negative latent variables found by NMF correspond to physical mixtures.
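
A quick sketch of these guarantees in action on spectra_data (which, as asserted in demo() above, is strictly positive): both the latent variables and the components returned by NMF come out non-negative.

# Check the NMF non-negativity guarantees (sketch).
nmf = decomposition.NMF(n_components=3, init='random', max_iter=500, random_state=0)
Y_nmf = nmf.fit_transform(spectra_data.values)
M_nmf = nmf.components_
print(np.all(Y_nmf >= 0), np.all(M_nmf >= 0))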

Since NMF does not internally subtract out the means \(\mu_j\), it generally needs an additional component to model the mean. For spectra_data, then, we should use d=3 for NMF to compare with PCA using d=2:

demo('NMF', d=3)

To see the importance of the extra latent variable, try with d=2 and note how poorly samples are reconstructed:

demo('NMF', d=2)

Independent Component Analysis#

The final linear decomposition we will consider is ICA, where the goal is to find latent variables \(Y\) that are statistically independent, which is a stronger statement than the statistically uncorrelated guarantee of PCA. We will formalize the definition of independence soon but the basic idea is that the joint probability of a sample occurring with latent variables \(y_1, y_2, y_3, \ldots\) can be factorized into independent probabilities for each component:

\[ \Large P(y_1, y_2, y_3, \ldots) = P(y_1) P(y_2) P(y_3) \ldots \]

ICA has some inherent ambiguities: both the ordering and the scaling of the latent variables are arbitrary, unlike with PCA. However, in practice, samples reconstructed with ICA often look similar to PCA and FA reconstructions.

ICA is also used for blind signal separation, which is a common task in digital signal processing, where usually \(d = D\).
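
As a small blind-signal-separation sketch (synthetic data, not part of this dataset), FastICA can approximately unmix two independent source signals from their linear mixtures; the recovered sources come back in arbitrary order and scale.

# Mix two independent signals and let FastICA try to recover them (sketch).
gen = np.random.RandomState(seed=0)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(3 * t), np.sign(np.sin(5 * t))])   # two independent sources
A = np.array([[1.0, 0.5], [0.4, 1.0]])                         # mixing matrix
X_mix = np.dot(S, A.T) + 0.02 * gen.normal(size=(len(t), 2))   # observed mixtures
ica = decomposition.FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X_mix)                               # recovered sources (up to order and scale)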

demo('FastICA', d=2)

Comparisons of Linear Methods#

To compare the four methods above, plot their normalized “unit vectors” (rows of the \(M\) matrix). Note that only the NMF curves are always positive, as expected. However, while all methods give excellent reconstructions of the original data, they also all mix the two peaks together. In other words, none of the methods has discovered the natural latent variables of the underlying physical process: the individual peak normalizations.

def compare_linear(data=spectra_data):
    X = data.values
    N, D = X.shape
    fig = plt.figure(figsize=(8.5, 4))
    for j, method in enumerate(('PCA', 'FactorAnalysis', 'NMF', 'FastICA')):
        if method == 'NMF':
            d = 3
            mu = np.zeros(D)
        else:
            d = 2
            mu = np.mean(X, axis=0)
        fit = eval('decomposition.' + method)(n_components=d).fit(X - mu)
        M = fit.mixing_.T if method == 'FastICA' else fit.components_
        for i in range(d):
            unitvec = M[i] / M[i].sum()
            plt.plot(unitvec, label=(method if i == 0 else None),
                     c=sns.color_palette()[j], ls=('-', '--', ':')[i])
    plt.legend()
    plt.xlim(-0.5, D + 0.5)
    
compare_linear()

EXERCISE Use the demo() function with data=pong_data and \(d = 1, 2, 3,\ldots\) to determine how many latent variables are necessary to give a good reconstruction. Do the 1D plots of individual pong_data samples make sense?

demo('PCA', d=1, data=pong_data)
demo('PCA', d=2, data=pong_data)
demo('PCA', d=3, data=pong_data)

Two latent variables (\(d=2\)) are sufficient for a good reconstruction. The abrupt transition at feature 10 in the reconstructed samples is because features 0-9 are ~linearly increasing x values, while features 10-19 are the corresponding ~parabolic y values. Note that this dataset has negligible noise compared with spectra_data.

Weighted PCA#

The linear algorithms presented above work fine with noisy data, but have no way to take advantage of data that includes its own estimate of the noise level. In the most general case, each element of the data matrix \(X\) has a corresponding RMS error estimate \(\delta X\), with values \(\rightarrow\infty\) used to indicate missing data. In practice, it is convenient to replace \(\delta X\) with a matrix of weights \(W\) where zero values indicate missing data. For data with Gaussian errors, \(X_{ij} \pm \delta X_{ij}\), the appropriate weight is usually the inverse variance \(W_{ij} = \delta X_{ij}^{-2}\).

The wpca package implements two different schemes to incorporate weights into PCA, which give similar results to each other. Both schemes are used almost identically to sklearn PCA, but with an additional weights argument (method = wpca.WPCA or wpca.EMPCA):

fit = method(n_components=d).fit(X, weights=W)

Unfortunately, wpca.WPCA expects weights=np.sqrt(W) but this might be fixed soon.

To study these schemes, we will assign weights to spectra_data by assuming that each value \(X_{ij}\) is the result of a Poisson process so has inverse variance \(W_{ij} = X_{ij}^{-1}\).
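
Here is a minimal sketch of that weighting, using wpca.WPCA directly (and passing sqrt(W) as noted above), before the fuller helper function below:

# Poisson-motivated inverse-variance weights for spectra_data (sketch).
Xw = spectra_data.values
Ww = Xw ** -1
wfit = wpca.WPCA(n_components=2).fit(Xw, weights=np.sqrt(Ww))
Yw = wfit.transform(Xw, weights=np.sqrt(Ww))
print(Yw.shape)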

The function below allows us to optionally add extra noise that varies across the spectra and remove random chunks of data (by setting their weights to zero). As usual, ignore the details of this function unless you are interested.

def weighted_pca(d=2, add_noise=None, missing=None, data=spectra_data, seed=123):
    gen = np.random.RandomState(seed=seed)
    X = data.values.copy()
    N, D = X.shape
    # Add some Gaussian noise with a linearly varying RMS, if requested, and
    # calculate inverse variances assuming Poisson fluctuations in X.
    if add_noise:
        start, stop = add_noise
        assert start >= 0 and stop >= 0
        extra_rms = np.linspace(start, stop, D)
        W = (X + extra_rms ** 2) ** -1
        X += gen.normal(scale=extra_rms, size=X.shape)
    else:
        W = X ** -1
    # Remove some fraction of data from each sample, if requested.
    if missing:
        assert 0 < missing < 0.5
        start = gen.uniform(high=(1 - missing) * D, size=N).astype(int)
        stop = (start + missing * D).astype(int)
        for i in range(N):
            X[i, start[i]:stop[i]] = W[i, start[i]:stop[i]] = 0.
    # Perform the fit.
    fit = wpca.WPCA(n_components=d).fit(X, weights=np.sqrt(W))
    Y = fit.transform(X, weights=np.sqrt(W))
    Xr = fit.inverse_transform(Y)
    # Show the reconstruction of some samples.
    fig = plt.figure(figsize=(8.5, 4))
    for i,c in zip((0, 6, 7), sns.color_palette()):
        plt.plot(X[i], '.', c=c, ms=5)
        plt.plot(Xr[i], '-', c=c)
    plt.xlim(-0.5, D+0.5)
    plt.xlabel('Feature #')
    plt.ylabel('Feature Value')

First check that we recover similar results to the standard PCA with no added noise or missing data. The results are not identical (but presumably better now) because we are accounting for the fact that the relative errors are larger for lower fluxes.

weighted_pca(d=2)

Next, increase the noise across the spectrum. The larger errors make the principal components harder to nail down, leading to noisier reconstructions:

weighted_pca(d=2, add_noise=(0, 100))
weighted_pca(d=2, add_noise=(200, 100))

Finally, remove 10% of the data from each sample (but, crucially, a different 10% from each sample). Note how this allows us to make a sensible guess at the missing data! (Statisticians call this imputation.)

weighted_pca(d=2, missing=0.1, seed=2)

Acknowledgments#

  • Initial version: Mark Neubauer

© Copyright 2024