[Problem 7]: Face Recognition in Practice (uses an SVM; the dataset is small, so everything is done directly with sklearn)

In this post we work through a hands-on face recognition example. Dataset download: click here to download the dataset

Step 1: Import all the modules needed for this experiment

import time
import logging
from sklearn.datasets import fetch_olivetti_faces
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

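Step 2: Load the dataset and inspect its format

Create a folder named data in the same directory as the script and put the downloaded dataset (the .pkl file) in it.

This step reads the data: the samples go into X and the corresponding labels into y, and then each class (each person) is given a name. Please read the code below carefully.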

if __name__ == '__main__':

    data_home = 'data/'
    faces = fetch_olivetti_faces(data_home=data_home)

    X = faces.data
    print(X)   # each row is one image
    y = faces.target
    print(y)   # classes are labeled 0, 1, 2, ..., 39
    targets = np.unique(faces.target)
    # give each class a name
    target_names = np.array(['c%d'%t for t in targets])  # samples of the same class share a name
    n_targets = target_names.shape[0]    # number of classes
    n_samples, h, w = faces.images.shape  # number of samples, image height, image width
    print('Total samples: {}, classes: {}'.format(n_samples, n_targets))
    print('Image size: {}x{}, dataset shape: {}'.format(w, h, X.shape))

Output of this step: there are 40 classes in total (that is, 40 people). Each person has 10 photos, for 400 images altogether.
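
To make the data layout concrete, here is a minimal sketch using a synthetic array with the same shape as the dataset (400 images of 64x64 pixels, 10 per person). The random pixel values, the variable names, and the label ordering are assumptions for illustration only; the real values come from fetch_olivetti_faces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for faces.images: 400 grayscale images of 64x64 pixels
images = rng.random((400, 64, 64))
n_samples, h, w = images.shape

# Flatten each image into one row, mirroring the layout of faces.data
X = images.reshape(n_samples, h * w)

# Labels 0..39, ten consecutive images per person (assumed ordering for this sketch)
y = np.repeat(np.arange(40), 10)

print(X.shape)             # (400, 4096)
print(np.unique(y).size)   # 40

# A row can be turned back into its image with reshape
restored = X[0].reshape(h, w)
print(np.array_equal(restored, images[0]))  # True
```

This is why the later plotting code calls reshape((h, w)) on each row before showing it.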

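Step 3: We still don't know what the images actually look like, so let's pick one image at random from each class (one person is one class) and plot it

How: each row of the data is one image, so we first use the labels to select a person's 10 images, then pick one of those 10 at random and plot it.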
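
The selection relies on NumPy boolean masking: y == i is True for exactly the rows that belong to person i. A tiny sketch with made-up data (3 people, 2 images each, 4 pixels per image; all values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 6 "images" of 4 pixels each, two per person
X = np.arange(24).reshape(6, 4)
y = np.array([0, 0, 1, 1, 2, 2])

picked = []
for i in np.unique(y):
    person_images = X[y == i]                      # rows whose label equals i
    idx = rng.integers(0, person_images.shape[0])  # random row index for this person
    picked.append(person_images[idx])

picked = np.array(picked)
print(picked.shape)  # (3, 4): one image per person
```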

Continue by adding the following code inside if __name__ == "__main__":

    # pick one photo of each person to display
    n_row = 4   # 4 rows
    n_col = 10  # 10 columns, enough to show all 40 people
    sample_images = []
    sample_titles = []
    for i in range(n_targets):
        people_images = X[y == i]
        people_sample_index = np.random.randint(0, people_images.shape[0])  # pick one image of this person at random
        people_sample_image = people_images[people_sample_index, :]
        sample_images.append(people_sample_image)  # add the randomly chosen image to the samples
        sample_titles.append(target_names[i])  # keep the name so it can be drawn as the title
    plot_gallery(sample_images, sample_titles, h, w, n_row, n_col)

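We also create a function plot_gallery() to draw the images. Define it above the if __name__ == '__main__': block so that it exists when it is called. The code is as follows: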

def plot_gallery(images, titles, h, w, n_row=2, n_col=5):

    plt.figure(figsize=(2*n_col, 2.2*n_row), dpi=144)
    plt.subplots_adjust(bottom=0, left=0.01, right=0.99, top=0.90,  hspace=0.01)
    for i in range(n_row*n_col):
        plt.subplot(n_row, n_col, i+1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i])
        plt.axis('off')  # hide the axes
    plt.show()

Result: c0, c1, and so on are the names we assigned in Step 2.

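Step 4: Next we train an SVM directly on the raw data (this is a failed attempt)

We create a function that splits the data into a training set and a test set, builds and trains an SVM model, and prints the model's classification report. Remember to call the function in main below, i.e. add: model_svm(X, y, target_names)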

def model_svm(X, y, target_names):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

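    # time the training
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    print("Training the model...")
    clf = SVC(class_weight='balanced')
    clf.fit(X_train, y_train)
    print("Training finished, elapsed time: {}".format(time.perf_counter() - start))

    # then predict on the test set
    start = time.perf_counter()
    print("Predicting on the test set...")
    y_pred = clf.predict(X_test)
    print("Prediction finished, elapsed time: {}".format(time.perf_counter() - start))

    # print the classification report; labels lists every class in case the split left some out of y_test
    print("Classification report:\n", classification_report(y_test, y_pred, labels=np.unique(y), target_names=target_names))

    # In the author's run the report shows the worst possible model: precision, recall, etc. are all zero

    # The author attributes this to severe overfitting: there are 4096 features but only 400 samples, some of which are held out as the test set, so the raw pixels make poor training data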

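Output: precision is 0, recall is 0, and the f1 score is 0 as well; no model could be worse than this one. The reason is explained in the last few lines of the code above.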
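
With only 10 images per person, it is also worth checking that the random split leaves every person represented in both sets. The sketch below is not part of the original code; it uses the stratify argument of train_test_split on placeholder data with the same number of samples and classes as the dataset (the feature values are random, and the feature count is kept small because it does not matter here).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((400, 8))             # placeholder features, one row per image
y = np.repeat(np.arange(40), 10)     # 40 people, 10 images each

# Plain random split: some people may be missing from the test set
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=42)
print("people in plain test set:", np.unique(y_test_plain).size)

# Stratified split: class proportions are preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("people in stratified test set:", np.unique(y_test).size)
print("images per person in test set:", np.bincount(y_test))
```

If this were used in model_svm, the stratified call would simply replace the existing split; whether it changes the reported scores is not something the original post measures.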

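Step 5: Use PCA to reduce the data to a suitable number of dimensions

How many dimensions should PCA reduce the data to? One useful quantity is the data retention rate (the fraction of the variance explained by the components that are kept). As the number of dimensions grows, the retention rate gradually approaches 1. Put plainly: projecting high-dimensional data into a lower dimension loses information, and the lower the dimension, the greater the distortion. We try values from 10 up to 300 in steps of 30 (each value is a target dimensionality), look at the retention rate for each, and plot the curve to see which value works best.

Create a function shizhendu() to compute the retention rate. Remember to call it in main.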

def shizhendu(X):

    candidate_components = range(10, 300, 30)   # from 10 up to 300 in steps of 30
    explained_ratios = []
    for c in candidate_components:
        pca = PCA(n_components=c)
        x_pca = pca.fit_transform(X)
        explained_ratios.append(np.sum(pca.explained_variance_ratio_))  # fraction of variance retained with c components

    # plot the curve to choose a suitable value of c
    plt.figure(figsize=(10, 8), dpi=80)
    plt.grid()
    plt.plot(candidate_components, explained_ratios)
    plt.xlabel("c")
    plt.ylabel("shi_zhen_du")  # the plotted value is the retained variance ratio
    plt.yticks(np.arange(0.5, 1.05, 0.05))
    plt.xticks(np.arange(0, 300, 20))
    plt.show()

Output of this step: from the curve below, a value of about 140 or 160 looks ideal. We will go with 140, that is, reduce the data to 140 dimensions.
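
Refitting PCA for every candidate value is not strictly necessary. Because explained_variance_ratio_ is reported per component, a single fit followed by a cumulative sum should give essentially the same curve (small differences can come from the SVD solver). The sketch below shows the idea on synthetic low-rank data; the 0.95 threshold and the data are assumptions chosen for illustration, not values from the original experiment.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 50 features, driven by 5 latent factors plus small noise
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# Fit once with all components, then accumulate the per-component ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print("components needed:", n_components)
print("variance retained:", cumulative[n_components - 1])
```

On the face data the same cumulative array could be plotted against the component index to read off a value such as 140.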

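Step 6: Use the dimensionality chosen in Step 5 together with grid search + SVM to make predictions

In addition to adding the function below, also add a call to it in main.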

def model_grid_svm(X, y, n_components, target_names):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)

    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)

    # grid search for the best parameters
    params = {'C': [1, 5, 10, 50, 100], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01]}
    clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), params, verbose=2, n_jobs=4)

    clf = clf.fit(X_train_pca, y_train)
    print("Best parameters found:", clf.best_params_)

    # then predict on the test set with the best parameters
    y_pred = clf.best_estimator_.predict(X_test_pca)
    # print the classification report
    print('PCA+SVM classification report:', classification_report(y_test, y_pred))

Final output:
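
As a side note on Step 6, and not taken from the original code: putting PCA and the SVM into a single scikit-learn pipeline lets GridSearchCV refit PCA inside each cross-validation fold, so the validation folds never influence the projection. The sketch uses synthetic blobs instead of the face data, and the grid values and component count are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in data: 200 samples, 50 features, 4 classes
X, y = make_blobs(n_samples=200, n_features=50, centers=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# make_pipeline names the steps after their classes: 'pca' and 'svc'
pipe = make_pipeline(
    PCA(n_components=10, whiten=True),
    SVC(kernel='rbf', class_weight='balanced'),
)

# Parameters of a pipeline step are addressed as <step>__<parameter>
params = {'svc__C': [1, 10], 'svc__gamma': [0.01, 0.1]}
search = GridSearchCV(pipe, params, cv=3)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
test_score = search.score(X_test, y_test)
print("test accuracy:", test_score)
```

With this layout the number of PCA components could also be added to the grid as 'pca__n_components' rather than being fixed in advance.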