# Distribution of Data

In this notebook we will look at how patterns emerge even in completely random data, especially when the data is high-dimensional.

In [None]:
%pylab inline
import numpy as np
from sklearn import linear_model
cls = linear_model.SGDClassifier(max_iter=1000)

# Some sklearn versions spam warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

The following function generates `N` data points of dimension `d`, splits them completely at random into two classes, and fits a linear classifier to the result. Intuitively, because the data is generated and split randomly, we wouldn't expect a linear classifier to be able to separate the two classes. However, as we will see, when the dimension of the data is large, the classifier is able to find directions in the data that separate the two classes, even though those classes have no semantic meaning to humans.

In [None]:
def separate_points(N=2000, d=2):
    
    # Generate N random d-dimensional data points
    X = np.random.normal(0,1,size=(N, d))
    # Separate the data randomly into two classes where the data is in class
    # 0 with probability prob and class 1 with probability 1-prob.
    prob = 0.5
    y = np.random.rand(N) >= prob

    # Fit the first classifier
    cls.fit(X, y)
    # Project all points onto the unit normal of the first classifier's
    # decision boundary (the direction of its weight vector)
    p1 = cls.coef_ / np.linalg.norm(cls.coef_)
    d1 = X.dot(p1.T)
    score = cls.score(X, y)

    # Fit a second classifier after removing each point's component along p1.
    cls.fit(X - d1*p1, y)
    p2 = cls.coef_ / np.linalg.norm(cls.coef_)
    d2 = X.dot(p2.T)

    # Plot the two classifiers
    figure(figsize=(10,5))
    subplot(1,2,1)
    # Plot the data projected onto its first two dimensions
    scatter(*X[:,:2].T, c=y.flat, cmap='Paired')
    axis('equal')
    subplot(1,2,2)
    # Now plot the projection of the data along the axes defined by the two classifiers.
    scatter(d1.flat, d2.flat, c=y.flat, cmap='Paired')
    axis('equal')

    print( score )    
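
The projection step inside `separate_points` can be checked in isolation. The sketch below assumes a hypothetical weight vector `w` standing in for `cls.coef_` (it is not taken from a fitted classifier) and confirms that subtracting `d1*p1` leaves points with no remaining component along `p1`, which is why the second classifier has to find a different direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 5))

# Hypothetical stand-in for cls.coef_: any nonzero row vector of shape (1, d)
w = rng.normal(0, 1, size=(1, 5))
p1 = w / np.linalg.norm(w)      # unit vector, shape (1, d)

d1 = X.dot(p1.T)                # coordinate of each point along p1, shape (N, 1)
X_residual = X - d1 * p1        # remove the component along p1

# The residual points are orthogonal to p1 up to floating-point error
max_abs_component = np.abs(X_residual.dot(p1.T)).max()
print(max_abs_component)
```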

First, we will look at some low-dimensional data.

In [None]:
separate_points(2000, 2)

In this case, we see random-looking clouds of points in both images. This is what we intuitively expect when we generate random data. But now let's see what happens when we increase the number of dimensions.

In [None]:
separate_points(4000, 1000)

Notice that in the second figure, the two classes are starting to separate. One class is clearly more heavily represented on the left and the other on the right. This is because even though the data is still totally random, the classifier is starting to overfit. You'll also notice that the accuracy of the linear classifier, shown above the graphs, is significantly higher than 0.5.
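
One way to confirm that this separation is overfitting rather than real structure is to score the classifier on fresh random data that it was not trained on. The following is a minimal sketch, not part of the original notebook: the helper name `train_and_heldout_accuracy`, the sizes, and the fixed seed are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn import linear_model

def train_and_heldout_accuracy(N=1000, d=500, seed=0):
    """Fit on random data, then score on the training set and on fresh data."""
    rng = np.random.default_rng(seed)
    # Training set: random points with random labels
    X_train = rng.normal(0, 1, size=(N, d))
    y_train = rng.random(N) >= 0.5
    # Held-out set drawn the same way, independent of the training set
    X_test = rng.normal(0, 1, size=(N, d))
    y_test = rng.random(N) >= 0.5

    clf = linear_model.SGDClassifier(max_iter=1000, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.score(X_train, y_train), clf.score(X_test, y_test)

train_acc, test_acc = train_and_heldout_accuracy()
print(train_acc, test_acc)
```

Because the held-out labels are independent of the held-out points, the second number should stay close to 0.5 no matter how well the classifier fits the training set.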

In the following cell, we go to an even higher dimension.

In [None]:
separate_points(4000, 4000)

Here we have as many dimensions as we have data points, and a linear classifier is able to almost perfectly separate the data.

In the following cell, we plot the accuracy of linear classifiers over datasets of different sizes, where the data has different dimensions.

In [None]:
def plot_sep(d=2, N=(10,25,50,100,250,500,1000,2000)):
    S = []
    mnS, mxS = [], []
    for n in N:
        s = []
        # Use more repetitions for smaller datasets
        for _ in range(5000 // n + 1):
            X = np.random.normal(0,1,size=(n, d))
            # Random labels, split at the median so the classes are balanced
            y = np.random.rand(n)
            y = y > np.median(y)

            # Fit the classifier and record its training accuracy
            s.append( cls.fit(X, y).score(X,y) )
        S.append(np.mean(s))
        # Envelope of one standard deviation around the mean
        mnS.append(np.mean(s) - np.std(s))
        mxS.append(np.mean(s) + np.std(s))
    fill_between(N, mnS, mxS, alpha=0.5)
    plot(N,S, label='d = %d'%d)
    xlabel('# data points')
    ylabel('classification accuracy')
    xscale('log')
    legend()

figure(figsize=(10,10))
plot_sep(2)
plot_sep(10)
plot_sep(100)
plot_sep(1000)
plot_sep(32*32*3)

In this chart, each colored line corresponds to a different data dimension. For each dimension, we generate datasets of different sizes and fit a linear classifier. We repeat each experiment several times: the line shows the mean accuracy over the repetitions, and the shaded envelope shows one standard deviation on each side of the mean. Notice that in higher dimensions, the accuracy of the linear classifier stays very high until the number of data points becomes large. In fact, when the number of dimensions is greater than the number of data points, a linear classifier can *perfectly* separate the two classes, even though the data are completely random.

In practice, the datasets we work with in computer vision are very high-dimensional. For example, a well-known classification network, ResNet18, takes 224x224 images as inputs. Let's see how many dimensions that is:

In [None]:
224*224*3

This means that for images of this size, we would still need more than 150,000 data points just to prevent a linear separator from *perfectly* separating randomly distributed data. This is why overfitting happens -- it is simply unavoidable that there will be exploitable patterns in the data, especially in high-dimensional spaces.
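
The reason perfect separation is possible comes down to linear algebra: with `n` points in `d >= n` dimensions (and the points in general position, which holds almost surely for Gaussian data), the system `X w = t` has an exact solution for any vector of +1/-1 labels `t`, so `sign(X w)` reproduces every label. The sketch below illustrates this with ordinary least squares via `np.linalg.lstsq` instead of the `SGDClassifier` used in this notebook; the helper name `least_squares_train_accuracy` and the sizes are assumptions chosen to keep the example small.

```python
import numpy as np

def least_squares_train_accuracy(n, d, seed=0):
    """Training accuracy of a least-squares linear separator on random labels."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0, 1, size=(n, d))
    # Random labels encoded as +1 / -1
    t = np.where(rng.random(n) >= 0.5, 1.0, -1.0)
    # Least-squares solution of X w = t (minimum-norm when d > n, no intercept)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return np.mean(np.sign(X.dot(w)) == t)

acc_wide = least_squares_train_accuracy(n=100, d=200)   # more dimensions than points
acc_tall = least_squares_train_accuracy(n=2000, d=10)   # far more points than dimensions
print(acc_wide, acc_tall)
```

In the wide case the labels are fit exactly, while in the tall case the accuracy falls back toward chance.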