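Mini Batch K-Means

Mini Batch K-Means is better suited to large datasets, especially when computing resources are limited, while standard K-Means is better suited to small datasets or to scenarios that demand higher precision.

Speed

In each iteration, Mini Batch K-Means updates the centroids using only a small random subset (a mini-batch) of the dataset. Standard K-Means uses all of the data in every iteration, so it can be slow to converge, especially on large datasets.

Clustering quality

Inertia is a metric used by both K-Means and Mini Batch K-Means. It is the sum of squared distances from each data point to its nearest cluster center. For a fixed number of clusters, a lower inertia means the points lie closer to their cluster centers, which indicates tighter clustering.

Steps and comparison

Python code in practice

```python
import time

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

# Generate synthetic data
X, y = make_blobs(n_samples=3000, centers=3, cluster_std=1.0, random_state=42)

# Set the number of clusters
n_clusters = 3

# K-Means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
start_time = time.perf_counter()
kmeans.fit(X)
kmeans_time = time.perf_counter() - start_time
kmeans_inertia = kmeans.inertia_

# Mini Batch K-Means clustering
minibatch_kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=100, random_state=42)
start_time = time.perf_counter()
minibatch_kmeans.fit(X)
minibatch_kmeans_time = time.perf_counter() - start_time
minibatch_kmeans_inertia = minibatch_kmeans.inertia_

# Print results comparison
print(f"K-Means training time: {kmeans_time:.4f} seconds, Inertia: {kmeans_inertia}")
print(f"Mini Batch K-Means training time: {minibatch_kmeans_time:.4f} seconds, Inertia: {minibatch_kmeans_inertia}")

# Cluster indices are arbitrary, so label 0 in one model need not be label 0 in
# the other. Map each Mini Batch K-Means cluster to the nearest K-Means center
# before comparing labels.
mapping = pairwise_distances_argmin(minibatch_kmeans.cluster_centers_, kmeans.cluster_centers_)
minibatch_labels = mapping[minibatch_kmeans.labels_]

# Visualize the clustering results
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

# Left plot: K-Means
ax[0].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=1, cmap='viridis')
ax[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
ax[0].set_title(f"K-Means\nTraining time: {kmeans_time:.2f}s\nInertia: {kmeans_inertia:.2f}")

# Middle plot: Mini Batch K-Means (colored with the aligned labels)
ax[1].scatter(X[:, 0], X[:, 1], c=minibatch_labels, s=1, cmap='viridis')
ax[1].scatter(minibatch_kmeans.cluster_centers_[:, 0], minibatch_kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
ax[1].set_title(f"Mini Batch K-Means\nTraining time: {minibatch_kmeans_time:.2f}s\nInertia: {minibatch_kmeans_inertia:.2f}")

# Right plot: Difference
# Highlight points assigned to different clusters by the two methods
diff_labels = kmeans.labels_ != minibatch_labels
ax[2].scatter(X[:, 0], X[:, 1], c='lightgrey', s=1)
ax[2].scatter(X[diff_labels, 0], X[diff_labels, 1], c='magenta', s=10)
ax[2].set_title("Difference")

plt.tight_layout()
plt.show()
```

Summary

MiniBatch K-means is an accelerated variant of K-means that is well suited to large datasets. The key points are:

- Mini-batch updates: unlike standard K-means, which processes the entire dataset in every iteration, MiniBatch K-means draws a small random sample from the dataset and uses only that sample to update the cluster centers.
- Faster convergence: mini-batch updates greatly reduce the computation per iteration, so the algorithm converges faster on large datasets. This makes it a good fit for streaming data and large-scale workloads.
- Lower memory requirements: when the data is fed in batches (for example through scikit-learn's `partial_fit`), only the current mini-batch has to be held in memory rather than the whole dataset.
- Reasonably accurate clustering: the inertia may be slightly higher than with standard K-means, but the algorithm strikes a good balance between speed and quality. It works best when the data is fairly evenly distributed.
- Easy to scale: it suits distributed and online learning, where repeated mini-batch updates gradually improve the clustering.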
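
The memory and online-learning points in the summary can be illustrated with scikit-learn's `MiniBatchKMeans.partial_fit`, which updates the centroids from one chunk of data at a time. The following is a minimal sketch rather than a definitive recipe: the in-memory array merely stands in for a data stream, and `chunk_size` and `streamed_inertia` are names chosen here for illustration, not part of the original example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for a stream (assumption: in a real setting the
# chunks would come from disk or a network source, not from one big array).
X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=1.0, random_state=42)

model = MiniBatchKMeans(n_clusters=3, random_state=42)

# Feed the model one chunk at a time. Each partial_fit call performs one
# mini-batch update, so only the current chunk is needed by the model.
chunk_size = 100  # illustrative value
for start in range(0, X.shape[0], chunk_size):
    chunk = X[start:start + chunk_size]
    model.partial_fit(chunk)

# Assign every point to its nearest learned center.
labels = model.predict(X)

# score() returns the negative K-means objective, so negating it gives the
# inertia of the learned centers on the full dataset.
streamed_inertia = -model.score(X)
print(f"Centers:\n{model.cluster_centers_}")
print(f"Inertia after streaming updates: {streamed_inertia:.2f}")
```

Because every chunk is seen only once here, the resulting inertia may be somewhat higher than what `fit` reaches on the full array. Passing over the chunks several times is one way to trade extra time for a tighter fit.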