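Python 3 notes on "Machine Learning in Action": the k-nearest neighbors algorithm
2.1 Implementing the kNN algorithm
This is a Python 3 implementation of the kNN algorithm. The book uses Python 2, so the code here has been converted to Python 3.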
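import numpy as np
# operator module, used to sort the votes by count
import operator

def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# k-nearest neighbors algorithm
def classify0(inX, dataSet, labels, k):
    # first value of shape: the number of rows (samples)
    dataSetSize = dataSet.shape[0]
    # tile repeats inX dataSetSize times along the rows (once along the columns)
    # so it can be subtracted from every sample; the next three lines then
    # compute the Euclidean distance to each sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # argsort returns the indices that sort the array in ascending order
    sortedDistIndices = distances.argsort()
    # count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # return the label with the most votes among the k nearest neighbors
    return sortedClassCount[0][0]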
The test results are as follows:
Input:
import kNN
group, labels = kNN.createDataSet()
print(kNN.classify0([0, 0], group, labels, 3))
Output:
B
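To see why the result is B, the distance and voting steps can be traced by hand on the same four points. The following is only an illustrative sketch, not the book's code: it uses NumPy broadcasting instead of np.tile and collections.Counter for the vote, and the variable names query, nearest, and votes are made up for this example.

```python
import numpy as np
from collections import Counter

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([0, 0])

# Euclidean distance from the query to every sample (broadcasting)
distances = np.sqrt(((group - query) ** 2).sum(axis=1))
# Indices of the 3 closest samples
nearest = distances.argsort()[:3]
# Majority vote among their labels
votes = Counter(labels[i] for i in nearest)

print(distances)
print(nearest)
print(votes.most_common(1)[0][0])
```

The three closest samples are the two B points (distances 0 and 0.1) and one A point (distance about 1.41), so B wins the vote two to one.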
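2.2 Using the k-nearest neighbors algorithm to improve matches on a dating site
Download the supplementary materials for "Machine Learning in Action" from https://github.com/frankstar007/kNN
The data set is in 2.2data and can be downloaded from there. (Note that the file is named datingTestSet2; the name used in the book has no 2.)
2.2.1 Add the following code to kNN.py: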
def file2matrix(filename):
    # open the file; readlines() reads every line up to EOF and returns a list
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    # feature matrix of zeros with numberOfLines rows and 3 columns
    returnMat = np.zeros((numberOfLines, 3))
    # empty list for the class labels
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # strip leading and trailing whitespace, including the newline
        line = line.strip()
        # split the line on tab characters into a list of fields
        listFromLine = line.split('\t')
        # the first three fields are the features; store them in the matrix
        returnMat[index, :] = listFromLine[0:3]
        # the last field is the class label; store it as an int so that it
        # can be used as a list index later
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
Test it in test.py:
import kNN
datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
print(datingDataMat)
print(datingLabels[0:20])
Output:
[[ 4.09200000e+04 8.32697600e+00 9.53952000e-01]
[ 1.44880000e+04 7.15346900e+00 1.67390400e+00]
[ 2.60520000e+04 1.44187100e+00 8.05124000e-01]
...,
[ 2.65750000e+04 1.06501020e+01 8.66627000e-01]
[ 4.81110000e+04 9.13452800e+00 7.28045000e-01]
[ 4.37570000e+04 7.88260100e+00 1.33244600e+00]]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
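As an alternative to parsing the file line by line, NumPy's np.loadtxt can read a tab-separated numeric file in one call. The sketch below is an assumption-laden illustration, not a replacement for file2matrix: it writes a small temporary file containing the first three rows printed above (assuming the real file is tab-separated with an integer label in the last column) and parses that instead of datingTestSet2.txt.

```python
import os
import tempfile
import numpy as np

# Three sample rows in the assumed file layout: 3 features, then the label
sample = (
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name

# Parse every column as float, then split features from labels
data = np.loadtxt(path, delimiter='\t')
os.remove(path)

features = data[:, :3]
labelsInt = data[:, -1].astype(int).tolist()

print(features)
print(labelsInt)
```

This only works when every field is numeric, which is why it relies on the datingTestSet2 variant of the file.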
2.2.2 Analyzing the data: creating scatter plots with Matplotlib
import matplotlib
import os
import matplotlib.pyplot as plt
from numpy import *
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
plt.show()
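The data set has three features: the percentage of time spent playing video games, the liters of ice cream consumed per week, and the frequent flyer miles earned per year. In datingDataMat they are stored as column 1, column 2, and column 0 respectively.
Plot each pair of features in color: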
import matplotlib.pyplot as plt
import numpy as np

# marker size and color both scale with the class label (1, 2, 3),
# which makes the three classes easier to tell apart
labelValues = 15.0 * np.array(list(map(int, datingLabels)))

fig = plt.figure()
ax = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)
# video game time vs. ice cream, without class information
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
# the same two features, sized and colored by class
ax2.scatter(datingDataMat[:, 1], datingDataMat[:, 2], labelValues, labelValues)
# flyer miles vs. ice cream
ax3.scatter(datingDataMat[:, 0], datingDataMat[:, 2], labelValues, labelValues)
# flyer miles vs. video game time
ax4.scatter(datingDataMat[:, 0], datingDataMat[:, 1], labelValues, labelValues)
plt.show()
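2.2.3 Preparing the data: normalizing values
With the Euclidean distance between two points, the feature with the largest numeric differences has the largest influence on the result. When the features have very different ranges, one feature can swamp the others in the overall distance. Features with different ranges should therefore be normalized, usually to the range 0 to 1 or -1 to 1. The following formula maps a feature value from any range into the interval 0 to 1:
newValue = (oldValue - min) / (max - min)
Here newValue and oldValue refer to a single value within one column. In the code the formula is applied to the whole array at once, so the resulting newValue is an array rather than a single number.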
def autoNorm(dataSet):  # input: the feature matrix
    minVals = dataSet.min(0)  # minimum of each column (a NumPy array)
    maxVals = dataSet.max(0)  # maximum of each column (a NumPy array)
    ranges = maxVals - minVals  # range of each column
    m = dataSet.shape[0]  # number of rows
    # subtract the column minimum from every row
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # divide element-wise by the column range
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals  # normalized matrix, ranges, minimums
Test:
import kNN
from numpy import *
import operator
datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
normMat , ranges , minval= kNN.autoNorm(datingDataMat)
print(normMat,'\n' ,ranges,'\n' , minval)
Output:
[[ 0.44832535 0.39805139 0.56233353]
[ 0.15873259 0.34195467 0.98724416]
[ 0.28542943 0.06892523 0.47449629]
...,
[ 0.29115949 0.50910294 0.51079493]
[ 0.52711097 0.43665451 0.4290048 ]
[ 0.47940793 0.3768091 0.78571804]] # normalized matrix
[ 9.12730000e+04 2.09193490e+01 1.69436100e+00] # ranges: max - min
[ 0. 0. 0.001156] # minimum values
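The effect of normalization on the distance can be checked on the rows already printed in these notes. The sketch below is illustrative only: it hard-codes the first three rows of datingDataMat and the minimums and ranges from the output above, and uses NumPy broadcasting instead of np.tile. The names rows, normRows, rawDist, and normDist are made up for this example.

```python
import numpy as np

# First three rows of datingDataMat, as printed earlier in these notes
rows = np.array([
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
])
# Column minimums and ranges of the full data set, from the output above
minVals = np.array([0.0, 0.0, 0.001156])
ranges = np.array([91273.0, 20.919349, 1.694361])

# Broadcasting applies the 1-D minVals and ranges to every row
normRows = (rows - minVals) / ranges

# Distance between the first two samples, before and after normalization
rawDist = np.sqrt(((rows[0] - rows[1]) ** 2).sum())
normDist = np.sqrt(((normRows[0] - normRows[1]) ** 2).sum())

print(normRows)
print(rawDist, normDist)
```

The raw distance is about 26432, which is almost exactly the difference in flyer miles, so the other two features barely matter. After normalization the distance is about 0.52 and all three features contribute.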
2.2.4 Testing the algorithm: validating the classifier as a complete program
def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the samples for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # features and labels
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalized features, ranges, minimums
    m = normMat.shape[0]  # number of samples
    numTestVecs = int(m * hoRatio)  # number of test samples
    errorCount = 0.0  # number of misclassified test samples
    for i in range(numTestVecs):
        # classify test sample i against the remaining 90% of the data
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("The classifier came back with : %d , the real answer is : %d"
              % (int(classifierResult), int(datingLabels[i])))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    # error rate = misclassified samples / test samples
    print("the total error rate is : %f" % (errorCount / float(numTestVecs)))
Test code:
kNN.datingClassTest()
Output:
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
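The hold-out procedure used by datingClassTest can be written as a small standalone function. This is a minimal sketch under stated assumptions, not the book's implementation: knnPredict and holdoutErrorRate are hypothetical helper names, and the data is a made-up toy set of two well-separated clusters rather than the dating data.

```python
import numpy as np
from collections import Counter

def knnPredict(query, trainX, trainY, k):
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((trainX - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    # Majority vote among the k nearest labels
    return Counter(trainY[i] for i in nearest).most_common(1)[0][0]

def holdoutErrorRate(features, labels, hoRatio=0.10, k=3):
    m = features.shape[0]
    numTest = int(m * hoRatio)
    # The first numTest rows are the test set; the rest is the training set
    trainX, trainY = features[numTest:], labels[numTest:]
    errors = sum(
        knnPredict(features[i], trainX, trainY, k) != labels[i]
        for i in range(numTest)
    )
    return errors / numTest

# Made-up data: two clusters far apart, interleaved so that both classes
# appear in the test portion and in the training portion
rng = np.random.default_rng(0)
noise = rng.uniform(-1.0, 1.0, size=(40, 2))
toyLabels = [1, 2] * 20
centers = np.array([[0.0, 0.0] if c == 1 else [10.0, 10.0] for c in toyLabels])
toyFeatures = centers + noise

errorRate = holdoutErrorRate(toyFeatures, toyLabels)
print("error rate: %f" % errorRate)
```

Taking the first 10% of the rows as the test set only gives a fair estimate when the rows are in no particular order. If the file were sorted by class, the rows would need to be shuffled first.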
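The final step: a prediction function for the dating site
This last part wraps the classifier in a function that reads a person's features from user input and prints the prediction.
def classfyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']  # descriptions for labels 1, 2, 3
    precentTats = float(input("percentage of time spent playing video games? "))  # read the features
    ffMiles = float(input("frequent flier miles earned per year? "))
    iceCream = float(input("liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # training set
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize the training set
    inArr = np.array([ffMiles, precentTats, iceCream])  # feature vector, in column order
    # classify0 takes four arguments: the input vector inX, the training set
    # dataSet, the label vector labels, and the number of neighbors k.
    # The input is normalized with the training minimums and ranges.
    classfierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    # labels start at 1, so label - 1 is the index into resultList
    print("You will probably like this person :", resultList[int(classfierResult) - 1])
Output: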
You will probably like this person : in small doses
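Two details of the prediction step are easy to get wrong: a new sample must be scaled with the minimums and ranges of the training data, and the 1-based class label must be shifted to index the description list. The sketch below isolates both. normalizeQuery and describeLabel are hypothetical helper names, the query values are made up, and minVals and ranges are copied from the autoNorm output shown earlier.

```python
import numpy as np

resultList = ['not at all', 'in small doses', 'in large doses']

def describeLabel(label):
    # Labels are 1, 2, 3 (possibly read as strings); list indices start at 0
    return resultList[int(label) - 1]

def normalizeQuery(query, minVals, ranges):
    # Scale a new sample with the statistics of the training data
    return (np.asarray(query, dtype=float) - minVals) / ranges

# Training statistics copied from the autoNorm output in these notes
minVals = np.array([0.0, 0.0, 0.001156])
ranges = np.array([91273.0, 20.919349, 1.694361])

# Made-up person: flyer miles, video game percentage, liters of ice cream
normQuery = normalizeQuery([10000, 10, 0.5], minVals, ranges)

print(normQuery)
print(describeLabel('2'))
```

A query whose raw values fall outside the training range will normalize to a value below 0 or above 1. The classifier still runs in that case, but the sample lies outside the region covered by the training data.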
Thanks to a senior schoolmate's blog for the help.
Reference blog: https://blog.csdn.net/qq_33638791/article/details/53163659
Based on "Machine Learning in Action".