2025-02-03 14:20:00 Source: 简单动画制作源码

2. Implementing KMeans in Python (K-means Clustering Algorithm)
6. Python Machine Learning: PCA Dimensionality Reduction and K-means Clustering, with Case Studies

kmeans Python source code


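       Grouping similar samples into the same subset according to a similarity measure, so that differences among elements within a subset are minimized while differences between elements of different subsets are maximized, is the essence of (spatial) clustering. K-Means is a representative of this family of algorithms. It was proposed independently several times in the 1950s and 1960s, and James MacQueen first used the term "K-Means" in 1967; it has been widely applied and refined ever since. Although the K-Means clustering algorithm is now more than 50 years old, it remains one of the most widely used and most central partition-based methods for clustering spatial data. As an unsupervised algorithm it cannot tell us whether a result is correct, but it offers a good entry point for studying the internal structure of a population.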


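       Next, the basic steps of the K-Means algorithm in detail:

       1. Randomly select K objects from the N samples as the initial cluster centers.

       2. Compute the distance from every sample to each cluster center and assign each sample to the cluster whose center is nearest.

       3. Update the K cluster centers, where a center is defined as the per-dimension mean of all objects in its cluster.

       4. Compare the new centers with the K centers from the previous iteration; if any center has changed, return to step 2, otherwise go to step 5.

       5. Once the centers no longer change, stop and output the clustering result.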






Implementing KMeans in Python (K-means Clustering Algorithm)


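       This article implements the classic machine learning algorithm K-means Clustering Algorithm in Python, explaining KMeans in depth and providing a code implementation. KMeans is an unsupervised learning method that partitions a set of data points into several clusters, grouping points by their similarity.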


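       The article focuses on the principle of the K-means Clustering Algorithm, ways to optimize it, and its Python implementation. It skips intricate details and concentrates on the core flow of the algorithm, which makes it suitable for beginners.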

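       ### How the KMeans Algorithm Works


       1. Initialize k random cluster centers.

       2. Assign each data point to the nearest cluster center.

       3. Update each cluster center to the mean of all points currently in that cluster.

       4. Repeat steps 2 and 3 until the cluster centers no longer change significantly or a preset number of iterations is reached.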
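
       One pass through the assignment and update steps can be traced by hand on a tiny dataset. The sketch below assumes NumPy and Euclidean distance; the four points and the fixed starting centers are toy values chosen for illustration, not data from the article.

```python
import numpy as np

# Four 2-D points forming two obvious groups (toy data for illustration).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Initial centers: fixed here instead of random so the trace is reproducible.
centers = np.array([[0.0, 0.0], [10.0, 0.0]])

# Assignment step: distance from every point to every center, shape (n_points, k).
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)          # labels are [0, 0, 1, 1]

# Update step: each center moves to the mean of the points assigned to it.
new_centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
# new_centers are [0, 1] and [10, 1]
```

       Running the two steps again from `new_centers` leaves the assignment unchanged, which is the stopping condition described above.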

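       ### Ways to Optimize KMeans

       1. **Fast KMeans**: speeds up convergence by choosing the initial cluster centers in advance or by random sampling.

       2. **MiniBatchKMeans**: iterates on small batches of data, which reduces the computational cost and suits large datasets.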
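
       The mini-batch idea can be sketched in plain NumPy. This is one common formulation (each center keeps a running mean with a per-center learning rate of 1/count), not necessarily what a particular library does internally; the function name `minibatch_kmeans` and its parameters are assumptions made for this example.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=32, n_iter=100, seed=0):
    """Mini-batch k-means sketch: update centers from small random batches."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign every batch point to its nearest center.
        d = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        for x, j in zip(batch, nearest):
            counts[j] += 1
            lr = 1.0 / counts[j]                      # per-center learning rate
            centers[j] = (1 - lr) * centers[j] + lr * x
    return centers

# Toy data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(5, 0.5, (200, 2))])
centers = minibatch_kmeans(X, k=2)
```

       Each iteration touches only `batch_size` samples instead of the whole dataset, which is where the savings on large datasets come from.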

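       ### Complexity of KMeans


       ### Implementing KMeans


       **1. Imports**



       **2. Define the random seed**



       **3. Define the KMeans model**



       **3.3.1 Model training**



       **3.3.2 Model prediction**



       **3.3.3 Complete definition of the K-means Clustering Algorithm model**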
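
       A minimal sketch of what such a model definition might look like, with a training method and a prediction method. It assumes NumPy, Euclidean distance, and random-sample initialization; the class name `SimpleKMeans` and its parameters are hypothetical and are not the author's original code.

```python
import numpy as np

class SimpleKMeans:
    """Hypothetical minimal k-means model (Lloyd's algorithm) with fit/predict."""

    def __init__(self, n_clusters=3, max_iter=100, tol=1e-6, seed=0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.tol = tol
        self.seed = seed
        self.centers_ = None

    def _distances(self, X):
        # Euclidean distance from every sample to every center, shape (n, k).
        return np.linalg.norm(X[:, None, :] - self.centers_[None, :, :], axis=2)

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.seed)
        # Initialize with k distinct samples drawn at random.
        self.centers_ = X[rng.choice(len(X), self.n_clusters, replace=False)].copy()
        for _ in range(self.max_iter):
            labels = self._distances(X).argmin(axis=1)
            new_centers = self.centers_.copy()
            for j in range(self.n_clusters):
                members = X[labels == j]
                if len(members):          # keep the old center if a cluster is empty
                    new_centers[j] = members.mean(axis=0)
            shift = np.linalg.norm(new_centers - self.centers_)
            self.centers_ = new_centers
            if shift < self.tol:          # centers stopped moving
                break
        return self

    def predict(self, X):
        # Label each sample with the index of its nearest learned center.
        return self._distances(np.asarray(X, dtype=float)).argmin(axis=1)

# Toy usage: two blobs around (0, 0) and (4, 4).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
model = SimpleKMeans(n_clusters=2).fit(X)
pred = model.predict(np.array([[0.1, -0.2], [3.9, 4.2]]))
```

       The two query points fall in different blobs, so they receive different labels; which blob gets label 0 depends on the random initialization.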



       **3.4 Load the data**



       **3.5 Train the model**



       **3.6 Visualize the decision boundary**




       ### Complete source code






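       1. Initialization: randomly select k sample points as the initial cluster centers.

       2. Clustering: compute the distance from each sample point to every cluster center and assign the sample to the class of the nearest center.

       3. Computing new cluster centers: for each cluster, compute the mean of all samples in it and use that mean as the new cluster center.

       4. Convergence check: if the new and old cluster centers are identical, or the iteration condition is satisfied, output the result and stop; otherwise return to step 2 and keep iterating.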







       Clustering is a common unsupervised learning method, which involves categorizing similar data samples into groups (clusters). This process doesn't involve predefined labels; instead, it aims to group similar samples based on their inherent distribution patterns.

       Clustering algorithms can be broadly categorized into traditional clustering algorithms and deep clustering algorithms. Among them, K-means clustering is one of the most widely used techniques, which relies on the partitioning method. The core idea is to initialize k cluster centers and then categorize samples based on their distances to these centers, iteratively minimizing the distance between each sample and its respective cluster center (as defined by a target function).
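
       Assuming squared Euclidean distance, the target function mentioned above is usually written as the within-cluster sum of squares. The symbols $C_j$ (the set of samples in cluster $j$) and $\mu_j$ (its center) are notation introduced here for illustration:

$$
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

       Reassigning samples to their nearest center and then recomputing each $\mu_j$ as a mean can only lower $J$ or leave it unchanged, which is why the iteration settles.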

       The optimization algorithm for K-means clustering comprises several steps:

       1. Randomly select k samples as initial cluster centers, where k is a hyperparameter giving the number of clusters; its value can be set from prior knowledge or chosen with validation techniques.

       2. For each data point, calculate its distance to k cluster centers and assign it to the cluster with the closest center.

       3. Recalculate the position of each cluster center based on the newly assigned cluster memberships.

       4. Repeat steps 2 and 3 until a stopping condition is met (like a predetermined number of iterations or when cluster centers stabilize).

       It's worth noting that K-means clustering's iterative algorithm is closely related to the Expectation-Maximization (EM) algorithm. The EM algorithm tackles the issue of parameter estimation in probabilistic models with unobservable latent variables. In the K-means context, the latent variables are the cluster assignments for each data point. The K-means algorithm's steps of assigning data points to clusters and recalculating cluster centers correspond to the E-step and M-step of the EM algorithm, respectively.

       One of the critical challenges in K-means clustering is the selection of the distance metric. The algorithm relies on measuring the similarity between samples based on distance, which determines their assignment to the nearest cluster center. Commonly used metrics include Manhattan distance and Euclidean distance, as discussed in the article "Comprehensive Overview of Distance and Similarity Methods (7 Types)."

       Both distances are straightforward to compute from the per-feature differences between two samples: the Manhattan distance is the sum of the absolute differences, and the Euclidean distance is the square root of the sum of the squared differences. In a two-dimensional feature space, the Manhattan distance corresponds to driving from one intersection to another along the Manhattan street grid, while the Euclidean distance is the straight line between the two points.
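
       A small worked example of the two metrics, assuming NumPy; the points `a` and `b` are arbitrary illustrative values.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Manhattan: sum of absolute per-feature differences, |0-3| + |0-4| = 7.
manhattan = np.abs(a - b).sum()

# Euclidean: square root of the sum of squared differences, sqrt(9 + 16) = 5.
euclidean = np.sqrt(((a - b) ** 2).sum())
```

       The Manhattan distance is never smaller than the Euclidean one, since the grid path cannot be shorter than the straight line.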

       Deciding the value of k, or the number of clusters, is a crucial aspect of K-means clustering, and the outcome can vary significantly with different k values. Common ways to determine k include the prior knowledge approach and the elbow method.

       First, the prior knowledge approach is relatively simple and relies on domain expertise to determine k. For example, the iris flower dataset contains three categories, so one can set k=3 and validate the clustering against them; the clustering prediction aligns well with the actual iris categories.

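       The elbow method runs the clustering for a range of k values, plots the within-cluster sum of squared errors against k, and picks the k at the bend (the "elbow") of the curve, beyond which adding clusters brings little improvement. Its limitation lies in its subjective nature: reading the elbow off a plot is hard to automate.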
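
       The elbow method can be sketched by computing the within-cluster sum of squared errors (SSE) for several values of k. The helper `kmeans_sse`, the toy three-blob data, and the choice of ten restarts per k are all assumptions made for this example.

```python
import numpy as np

def kmeans_sse(X, k, n_iter=50, seed=0):
    """Run plain k-means and return the within-cluster sum of squared errors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):       # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    return float(((X - centers[labels]) ** 2).sum())

# Toy data with three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in ([0, 0], [5, 0], [0, 5])])

# Best of several restarts per k, to reduce the effect of unlucky initializations.
sse = {k: min(kmeans_sse(X, k, seed=s) for s in range(10)) for k in range(1, 7)}
for k, value in sse.items():
    print(k, round(value, 1))
```

       On data like this the SSE falls sharply up to k=3 and only slightly afterwards, so the bend sits at the true number of blobs; on real data the bend is often much less distinct.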

       The limitations of K-means clustering include:

       One significant issue is the initialization of cluster centers, which can strongly affect the final result. To address this, the K-means++ algorithm initializes the centers so that they tend to lie far from each other: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to the squared distance from a point to its nearest already chosen center.
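
       A minimal sketch of K-means++ seeding, assuming NumPy and squared Euclidean distance. The function name `kmeans_pp_init` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers with k-means++ seeding."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                     # far-away points are more likely
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# Toy data: three tight blobs far apart.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (50, 2)) for c in ([0, 0], [8, 0], [0, 8])])
init = kmeans_pp_init(X, k=3)
```

       Points close to an existing center have a near-zero chance of being picked, so on well-separated data the seeds tend to land in different blobs. The selection is still random, so this is likely rather than guaranteed.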

       Another limitation is the assumption of spherical and isotropic data clusters in Euclidean space, which doesn't always hold in real-world scenarios. To tackle this, we can employ kernel functions, leading to the kernel K-means algorithm, a variant of kernel clustering. This method involves mapping input data points into a higher-dimensional feature space using a nonlinear transformation, where clustering is performed. This transformation increases the likelihood of linear separability, thus enabling more accurate clustering in cases where classical algorithms fail.
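
       Kernel K-means never needs the mapped points explicitly: the squared distance from a mapped sample to a cluster mean in feature space can be computed from the kernel (Gram) matrix alone. The sketch below assumes an RBF kernel with a hand-picked `gamma` and a random initial assignment; the function names are hypothetical, and like ordinary K-means the result depends on initialization.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Cluster using only the Gram matrix K (distances in feature space)."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)              # random initial assignment
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue                          # empty cluster is never nearest
            # ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_j K_ij + mean_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / m
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):    # assignments stopped changing
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])
labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), k=2)
```

       The cost is the n-by-n Gram matrix, which makes this variant noticeably more expensive than classical K-means on large datasets.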

       K-means is designed for numerical features, so categorical features must first be encoded. Alternatively, K-Modes handles categorical data and K-Prototypes handles mixed data types: cluster centers are averages for numerical features and modes for categorical features, and categorical features are compared with the Hamming distance.
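
       The two building blocks for categorical data, a mode-based center and the Hamming distance, can be sketched as follows. The helper names and the toy records are made up for illustration; this is not a full K-Modes implementation.

```python
import numpy as np
from collections import Counter

def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def column_modes(records):
    """Categorical cluster 'center': the most frequent value in each column."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]

cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
center = column_modes(cluster)                    # ['red', 'small', 'round']
d = hamming(("blue", "large", "round"), center)   # differs in 2 of 3 positions
```

       A K-Modes iteration would assign each record to the center with the smallest Hamming distance and then recompute the column modes, mirroring the assign and update steps of K-means.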

       Another challenge is feature scaling, because features with larger numeric ranges can disproportionately influence distance calculations. For instance, in a dataset with age and salary features, the squared difference in age will be vastly smaller than that in salary, which can bias the clustering results. To mitigate this, data is typically standardized or normalized so that all numerical features are on a comparable scale.
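
       Both scaling schemes are a line of NumPy each. The age and salary values below are toy numbers chosen for illustration.

```python
import numpy as np

# Columns: age (years), salary (currency units); toy values for illustration.
X = np.array([[25.0, 30000.0],
              [35.0, 60000.0],
              [45.0, 90000.0]])

# Standardization (z-score): zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: rescale each feature to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

       After either transformation the two columns contribute equally to a distance, whereas on the raw data the salary column would dominate almost completely.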

       K-means treats all features equally when it computes distances. To give features different weights, multiply each (already scaled) feature by a weight. For Manhattan distance, multiplying the feature value by the weight is enough; for Euclidean distance, multiplying by the square root of the weight scales the feature's squared difference by the weight. When categorical features are represented through embeddings, dividing each embedding dimension by the square root of the embedding size ensures that the clustering is not disproportionately influenced by these high-dimensional features.
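
       The equivalence between a weighted distance and a distance on pre-scaled features can be checked directly. The helper names and the weight vector `w` are hypothetical; for Euclidean distance the features are multiplied by the square root of the weight so that each squared difference is scaled by the weight itself.

```python
import numpy as np

def weighted_manhattan(a, b, w):
    return float(np.sum(w * np.abs(a - b)))

def weighted_euclidean(a, b, w):
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
w = np.array([4.0, 1.0])   # hypothetical weights: first feature counts four times as much

# Equivalent pre-scaling: multiply by w for Manhattan, by sqrt(w) for Euclidean.
m_scaled = np.abs(a * w - b * w).sum()                         # 12 + 4 = 16
e_scaled = np.linalg.norm(a * np.sqrt(w) - b * np.sqrt(w))     # sqrt(36 + 16)
```

       Pre-scaling the features once is usually more convenient than changing the distance function, because an unmodified K-means implementation can then be used as is.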

       Feature selection is crucial in unsupervised clustering, as it can dramatically affect the outcome. Including irrelevant or noisy features can lead to misleading results. For instance, in clustering bank customers based on quality, transaction frequency and deposit amount are significant features, whereas gender and age might introduce noise, leading to clusters based on similarities in gender and age rather than customer quality.






Python Machine Learning: PCA Dimensionality Reduction and K-means Clustering, with Case Studies


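       In classic k-means, computing the distance from every sample to every centroid consumes a lot of resources. Mini Batch K-means clusters on randomly sampled subsets of the data, which substantially reduces the computation. To evaluate clustering quality without labels, commonly used metrics are the silhouette coefficient and the Calinski-Harabasz index. The Calinski-Harabasz index is computed as $s(k) = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \cdot \frac{n-k}{k-1}$, where $n$ is the number of samples, $k$ the number of clusters, $B_k$ the between-cluster dispersion matrix, and $W_k$ the within-cluster dispersion matrix. The larger the value, the better the clustering: distances between clusters are large and distances within clusters are small.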
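
       The Calinski-Harabasz index can be sketched in NumPy as the ratio of between-cluster to within-cluster dispersion, scaled by (n-k)/(k-1). The function name and the four-point toy dataset are assumptions made for this example.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, scaled by (n-k)/(k-1)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0   # trace of the between-cluster dispersion matrix
    within = 0.0    # trace of the within-cluster dispersion matrix
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / within) * (n - k) / (k - 1)

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = calinski_harabasz(X, [0, 0, 1, 1])   # split matches the two groups: 200.0
bad = calinski_harabasz(X, [0, 1, 0, 1])    # split cuts across both groups: 0.02
```

       The labeling that matches the two natural groups scores far higher than the one that cuts across them, in line with the rule that larger values indicate better clustering.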
