[Study Notes] Stanford CS231n - Q1: kNN Classifier

The kNN classifier is a very simple, brute-force classifier with two steps:

  • Train: read the training data and store it.
  • Test: for each test image, kNN computes its distance to every image in the training set, finds the k closest training images, and assigns the test image the label that occurs most often among those k images.

There are two common ways to measure the distance between images: the L1 distance and the L2 distance.
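For reference, both metrics compare images pixel by pixel (here $p$ ranges over all pixels of images $I_1$ and $I_2$):

$$d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|, \qquad d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}$$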


The value of k is chosen via cross-validation.

For the kNN algorithm the choice of k matters a great deal. As the figure shows, a small k is easily disturbed by noisy samples, while a large k makes the classification of samples near decision boundaries ambiguous.

(Figure: kNN decision boundaries for different values of k)

To find a reasonable value of k, we use cross-validation.

The assignment

# A __future__ import must come before any other statement in the cell
from __future__ import print_function

import random
import numpy as np
import matplotlib.pyplot as plt
from cs231n.data_utils import load_CIFAR10
from past.builtins import xrange

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2
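The autoreload magics make the notebook pick up edits to modules such as cs231n/classifiers/k_nearest_neighbor.py without restarting the kernel, which is convenient here because we will be filling in that file as we go.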

Load the dataset and print the shapes of the data

cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(50000, 32, 32, 3)
(50000,)
(10000, 32, 32, 3)
(10000,)

Randomly sample a few images from each class and visualize them

# Randomly sample a few training images per class
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

(Figure: 7 random training images from each of the 10 CIFAR-10 classes)

# Subsample the data: take a smaller training and test set for this exercise
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

# Reshape each image into a row vector (3072 = 32 * 32 * 3)
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)
(5000, 3072) (500, 3072)

Create a kNN classifier instance

Remember that training a kNN classifier is a no-op: it simply stores the training data.
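Internally, train just memorizes the data. A minimal sketch of what the method in cs231n/classifiers/k_nearest_neighbor.py does (the real file adds a docstring, but the body is this simple):

def train(self, X, y):
    # kNN "training" is pure memorization; all the work happens at predict time
    self.X_train = X
    self.y_train = y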

from cs231n.classifiers import KNearestNeighbor
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

We can now use the kNN classifier to classify the test data. The test procedure breaks into two steps:

  • First, compute the distance between every test sample and every training sample.
  • Then, given the distance matrix, find the k training samples nearest to each test sample and take the most frequent label among them as the prediction.

Let's start by computing the distance matrix. If there are Ntr training samples and Nte test samples, the distance matrix has size Nte * Ntr, and its element [i, j] is the distance from the ith test sample to the jth training sample. Below we complete the vectorized compute_distances_no_loops method in k_nearest_neighbor.py; the naive compute_distances_two_loops version computes the same matrix with two nested loops over all test/training pairs and is extremely inefficient.

Testing the computation

def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops

    Inputs:
    - X: test data, a numpy array of shape (num_test, D)

    Returns:
    - dists: a numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance from test sample i to training sample j
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    partx = np.sum(np.square(X), axis=1)              # squared norms of the test rows
    partxt = np.sum(np.square(self.X_train), axis=1)  # squared norms of the training rows
    partxxt = np.dot(X, self.X_train.T)               # cross terms between test and training rows
    dists = np.sqrt(np.matrix(partx).T + partxt - 2 * partxxt)
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

Personal note 1:

  • partx: square every element of the test matrix X, then sum along the second axis to get a 1-D vector holding the squared norm of each test sample.
  • partxt: the same operation on the training matrix, giving a 1-D vector of squared norms of the training samples.
  • partxxt: the matrix product of the test matrix X and the transpose of the training matrix X_train, i.e. the cross terms between every test/training pair.
  • dists: add partx (broadcast as a column) to partxt (broadcast as a row), subtract 2 * partxxt, and take the element-wise square root to obtain the test-to-training distance matrix.
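The identity behind this vectorization is the expansion of the squared L2 distance (standard algebra, nothing specific to this assignment):

$$\|x_i - y_j\|_2^2 = \|x_i\|_2^2 + \|y_j\|_2^2 - 2\,x_i \cdot y_j$$

Applied to all pairs at once, the first term is partx broadcast down the columns, the second is partxt broadcast across the rows, and the third is the matrix product partxxt, which is exactly the "one matrix multiplication and two broadcast sums" the hint suggests.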
dists = classifier.compute_distances_no_loops(X_test)
print(dists.shape, type(dists))
(500, 5000) <class 'numpy.matrixlib.defmatrix.matrix'>
# Visualize the distance matrix: each row shows one test sample's
# distances to all training samples
plt.imshow(dists, interpolation='none')
plt.show()

(Figure: the 500 x 5000 distance matrix rendered as an image)

A few rows and columns are visibly brighter than the rest. Dark pixels mean small distances; bright pixels mean large distances. A bright row means that test sample differs strongly from every sample in the training set; it is probably noticeably over- or under-exposed, or has a color cast.

A bright column means every test sample is far from that particular training sample, likely for the same reasons.

Implementing the predict_labels method

def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      # np.asarray(...).ravel() makes this work whether dists is a plain
      # ndarray or the np.matrix returned by compute_distances_no_loops above
      index = np.argsort(np.asarray(dists[i]).ravel())
      kInd = index[:k]
      closest_y = self.y_train[kInd]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      # np.bincount tallies the votes per label; np.argmax returns the first
      # (i.e. smallest) label among ties
      y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################

    return y_pred
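As a quick sanity check that the bincount/argmax combination really breaks ties toward the smaller label (a standalone snippet, not part of the assignment code):

import numpy as np

closest_y = np.array([2, 1, 1, 2])   # two votes each for labels 1 and 2
counts = np.bincount(closest_y)      # array([0, 2, 2])
print(np.argmax(counts))             # prints 1: argmax returns the first
                                     # maximum, i.e. the smaller label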

Here we set k to 1 (which is just the Nearest Neighbor classifier):

y_test_pred = classifier.predict_labels(dists, k=1)

Compute and print the accuracy

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Got 137 / 500 correct => accuracy: 0.274000

The accuracy should be about 27%. Now let's make k a bit larger: k = 5.

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Got 139 / 500 correct => accuracy: 0.278000

The result should be slightly better than with k = 1.

Next, let's make the distance computation more efficient, using a version with a single loop over the test samples.

Implementing the compute_distances_one_loop method

def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      dists[i, :] = np.sqrt(np.sum(np.power(X[i, :] - self.X_train, 2), axis=1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

dists_one = classifier.compute_distances_one_loop(X_test)
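The one-loop version leans on NumPy broadcasting: subtracting the (D,) row X[i, :] from the (num_train, D) matrix self.X_train yields a (num_train, D) array of differences in a single step. A quick shape check with stand-in arrays (sizes chosen to match this exercise):

import numpy as np

row = np.ones(3072)                # one flattened test image, shape (D,)
train = np.zeros((5000, 3072))     # stand-in for self.X_train, shape (N, D)
diff = row - train                 # (D,) broadcasts against (N, D) -> (N, D)
print(np.sqrt(np.sum(diff ** 2, axis=1)).shape)   # (5000,): one distance per training image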

To make sure the vectorized code is correct, we compare its result against the earlier method. There are many ways to check whether two matrices are (nearly) equal; one of the simplest is the Frobenius norm of their difference: the square root of the sum of squared differences over all elements. Equivalently, reshape both matrices into vectors and take the Euclidean distance between them.
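Concretely, for a matrix $A$ with entries $a_{ij}$:

$$\|A\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$$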

difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was : %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

Completing the fully vectorized compute_distances_no_loops method (the same implementation shown earlier):

def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    partx = np.sum(np.square(X), axis=1)
    partxt = np.sum(np.square(self.X_train), axis=1)
    partxxt = np.dot(X, self.X_train.T)
    dists = np.sqrt(np.matrix(partx).T + partxt - 2 * partxxt)
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

Run it in the notebook:

dists_two = classifier.compute_distances_no_loops(X_test)

Compare the result with the earlier computation:

difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was : %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
Difference was : 0.000000
Good! The distance matrices are the same

Next, let's compare how fast the different implementations run.

def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)
Two loop version took 22.814964 seconds
One loop version took 214.188476 seconds
No loop version took 0.257259 seconds

Your exact timings will differ. Note that the one-loop version can come out slower than the two-loop version, as it did here, likely because each iteration of the single loop broadcasts a large (num_train, 3072) intermediate array and the extra memory traffic dominates the runtime.

Cross-validation

We have now implemented the k-Nearest Neighbor classifier, but the value of k was chosen arbitrarily. Next, we use cross-validation to select the best value of this hyperparameter.
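For each candidate k, the score we will compare below is the mean validation accuracy over the folds; writing $a_{k,f}$ for the accuracy of k on fold $f$:

$$\bar{a}_k = \frac{1}{\text{num\_folds}} \sum_{f=1}^{\text{num\_folds}} a_{k,f}$$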

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #

################################################################################
# Your code
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
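# Note: np.array_split also accepts lengths that do not divide evenly
# (e.g. np.array_split(np.arange(7), 3) yields pieces of length 3, 2, 2);
# here the 5000 training samples split cleanly into 5 folds of 1000.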

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.

k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #

################################################################################
# Your code
for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        # Fold i serves as the validation set; the remaining folds form the training set
        Xte = np.array(X_train_folds[i])
        yte = np.array(y_train_folds[i])
        Xtr = np.array(X_train_folds[:i] + X_train_folds[i + 1:])
        Xtr = Xtr.reshape(-1, Xtr.shape[2])
        ytr = np.array(y_train_folds[:i] + y_train_folds[i + 1:])
        ytr = ytr.reshape(-1)
        classifier.train(Xtr, ytr)
        dist = classifier.compute_distances_no_loops(Xte)
        y_test_pred = classifier.predict_labels(dist, k)
        num_correct = np.sum(y_test_pred == yte)
        accuracy = float(num_correct) / yte.shape[0]
        k_to_accuracies[k].append(accuracy)

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000
# Plot the raw accuracy observations for each k
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# Plot the trend line: mean accuracy at each k, with error bars
# showing the standard deviation across folds
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

(Figure: cross-validation accuracy for each k, with mean and standard deviation)

Based on the cross-validation results above, choose the best value of k, retrain the classifier on the full training subset, and run it on the test set; you should get about 28% accuracy.
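Rather than reading the best k off the plot, you can also pick it straight from the k_to_accuracies dictionary built above (a small sketch using those same variables):

best_k = max(k_choices, key=lambda k: np.mean(k_to_accuracies[k]))
print(best_k)   # 10 for the accuracies listed above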

best_k = 10
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Got 141 / 500 correct => accuracy: 0.282000

Copyright notice: unless otherwise stated, all articles on this site are original; please credit the source when reposting.

Permalink: http://tunm.top/article/q1/