Principles of k-means cluster analysis, big data computing engine

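Contents

1. Telecom operator: customer value analysis
2. Using a clustering model: project requirements analysis
3. Principles and methods of clustering models
4. Code
   4.1 Data exploration
   4.2 Data preprocessing
   4.3 Model building
   4.4 Probability density plots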

Development environment: Jupyter Notebook

Dataset download: https://download.csdn.net/download/wsp_113886114/10616250

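1. Telecom operator customer value analysis

Start from customer needs: understand what customers want and what characterises them, so that the operator can set up different discount plans for different customers. The goals are:

Acquire more users by offering a range of discount plans

Reduce customer churn

Increase revenue

Raise ARPU (average revenue per user)

Build precisely targeted marketing strategies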

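2. Using a clustering model: project requirements analysis

There are many customers and their consumption behaviour is complex, so labelling them by hand is impractical. In this situation an unsupervised clustering algorithm is a better fit: analysing customer attributes and day-to-day consumption behaviour reveals their preferences, which gives a basis for personalised marketing aimed at reducing churn and winning new users.

Customer types: ordinary users, business users, large accounts.

Preliminary targets: mid-range users, users showing signs of leaving the network, and users with other needs.

Ordinary users are split into several categories by clustering. After clustering, each category is examined by age, gender and consumption.

3. Principles and methods of clustering models

3.1 Clustering versus classification

Clustering: group similar objects together (unsupervised; the categories are not known in advance).

Classification: supervised; the category of each item is known.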

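3.2 Evaluating a clustering: how many clusters are appropriate (hierarchical clustering)

Hierarchical clustering is a very intuitive algorithm that works one level at a time: it either merges small clusters step by step (agglomerative) or splits large clusters step by step (divisive). The agglomerative form is the more common one.

The result is drawn as a dendrogram, a tree diagram showing how the clusters relate to one another.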

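scipy.cluster.hierarchy.linkage performs the hierarchical clustering.

scipy.cluster.hierarchy.dendrogram draws the resulting binary tree; the height at which two branches join represents the distance between the two clusters being merged.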
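As a rough illustration of these two calls, the sketch below runs them on six made-up 2-D points (the data and the variable names are assumptions for the example, not part of the telecom dataset). It prints the merge heights so the link between tree height and cluster distance is visible without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data (hypothetical): two tight groups of 2-D points.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # group A
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # group B
])

# Agglomerative clustering with Ward linkage on Euclidean distances.
# Z has one row per merge: [cluster_i, cluster_j, distance, size_of_new_cluster].
Z = linkage(points, method="ward", metric="euclidean")

# no_plot=True returns the tree layout without needing a display;
# drop it and call plt.show() to draw the figure in a notebook.
tree = dendrogram(Z, no_plot=True)

merge_heights = Z[:, 2]
print("merge heights:", np.round(merge_heights, 2))
print("leaf order:", tree["leaves"])
```

The last merge, which joins the two groups, sits far higher than the merges inside each group. That gap is what one looks for when reading a dendrogram.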

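How to cut the dendrogram: choose a height at which to cut the tree. The branches below the cut become the clusters, so the cut height determines how many clusters there are.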
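One possible way to cut the tree in code is scipy.cluster.hierarchy.fcluster, which is not used in the original post and is shown here only as a sketch on made-up points. It can cut either by a target number of clusters or by a height threshold; the threshold of 5.0 below is an arbitrary value chosen for this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (hypothetical): two tight groups of 2-D points.
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
toy_Z = linkage(toy_points, method="ward")

# Cut so that at most 2 flat clusters remain.
labels_by_count = fcluster(toy_Z, t=2, criterion="maxclust")

# Alternatively cut at a fixed height: merges above distance 5.0 are undone.
labels_by_height = fcluster(toy_Z, t=5.0, criterion="distance")

print("cut by count: ", labels_by_count)
print("cut by height:", labels_by_height)
```

Both cuts should separate the same two groups here; on real data the two criteria can disagree, and the number of clusters read off the dendrogram is then used as k for k-means.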

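4. Code

4.1 Data exploration

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Customer attributes
    custinfo = pd.read_csv(r'.\data\custinfo.csv')
    # Monthly call records; this file name is garbled in the source and is
    # assumed here to be custcall.csv in the same folder
    custcall = pd.read_csv(r'.\data\custcall.csv')
    custcall.head()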

The call records contain the columns Customer_ID, Peak_calls, Peak_mins, OffPeak_calls, OffPeak_mins, Weekend_calls, Weekend_mins, International_mins, Nat_call_cost and month.

4.2 Data preprocessing

    # Aggregate: average each customer's records, then drop the last column (month)
    custcall2 = custcall.groupby('Customer_ID').mean()
    custcall3 = custcall2.drop(columns='month')

    # Join the averaged call data onto the customer attributes
    data = pd.merge(custinfo, custcall3, left_on='Customer_ID', right_index=True)
    data.index = data['Customer_ID']
    data = data.drop(columns='Customer_ID')

    # Summary statistics: count, mean, std, min, 25%, 50%, 75%, max
    desc = data.describe()
    print(desc)

    # Frequency of each handset model
    handset_cnt = data['Handset'].value_counts()
    print(handset_cnt)

    # Histogram of every numeric column
    for col in data.columns:
        if col not in ['Gender', 'Tariff', 'Handset']:
            fig = plt.figure()
            ax = fig.add_subplot(1, 1, 1)
            data[col].hist(bins=20)
            ax.set_title(col)
            plt.show()
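To make the aggregation and join in 4.2 concrete, here is a small sketch on invented data. The two miniature tables, their values and the names custinfo_demo, custcall_demo, per_customer and merged are all hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values invented).
custinfo_demo = pd.DataFrame({
    "Customer_ID": ["K1", "K2"],
    "Age": [30, 45],
})
custcall_demo = pd.DataFrame({
    "Customer_ID": ["K1", "K1", "K2", "K2"],
    "Peak_mins": [10.0, 20.0, 5.0, 7.0],
    "month": [1, 2, 1, 2],
})

# One row per customer: average the monthly records, then drop the month column.
per_customer = custcall_demo.groupby("Customer_ID").mean().drop(columns="month")

# Attach the averages to the customer attributes and index by customer ID.
merged = (
    custinfo_demo
    .merge(per_customer, left_on="Customer_ID", right_index=True)
    .set_index("Customer_ID")
)
print(merged)
```

Each customer ends up with a single row that combines the static attributes with the averaged call behaviour, which is the shape a clustering model expects.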

4.3 Model building

    # Keep only the numeric call features
    data_feature = data.drop(columns=['Age', 'Gender', 'Tariff', 'Handset'])

    # Standardise each column (z-score)
    data_zs = (data_feature - data_feature.mean()) / data_feature.std()

    # Hierarchical clustering (Ward linkage, Euclidean distance) and its dendrogram
    Z = linkage(data_zs, method='ward', metric='euclidean')
    P = dendrogram(Z)
    plt.show()

    k = 4            # number of clusters
    iteration = 500  # maximum number of iterations
    # The original passed n_jobs=1; that parameter was removed in scikit-learn 1.0
    model = KMeans(n_clusters=k, max_iter=iteration)
    model.fit(data_zs)

    r1 = pd.Series(model.labels_).value_counts()   # number of samples in each cluster
    r2 = pd.DataFrame(model.cluster_centers_)      # cluster centres
    r = pd.concat([r2, r1], axis=1)                # side by side: centre plus cluster size
    r.columns = list(data_zs.columns) + ['count']
    print(r)

    # Compare the cluster centres
    r2.columns = list(data_feature.columns)
    r2.plot(figsize=(10, 10))
    plt.show()

    # Attach each sample's cluster label to the original data
    res = pd.concat([data, pd.Series(model.labels_, index=data.index)], axis=1)
    res.columns = list(data.columns) + ['class']
    # Raw string: in a plain string '\r' would be read as a carriage return.
    # Saved as .xlsx because current pandas can no longer write .xls files.
    res.to_excel(r'.\data\result.xlsx')

    # How the clusters break down by tariff, handset and gender
    pd.crosstab(res['Tariff'], res['class'])
    pd.crosstab(res['Handset'], res['class'])
    pd.crosstab(res['Gender'], res['class'])

    # Age distribution and mean age per cluster
    res[['Age', 'class']].hist(by='class')
    res['Age'].groupby(res['class']).mean()
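The code in 4.3 uses KMeans as a black box. The sketch below shows the basic iteration that k-means is built on (Lloyd's algorithm: assign each sample to its nearest centre, then move each centre to the mean of its samples). It is a simplified illustration on made-up data, not scikit-learn's implementation; in particular it assumes plain random initialisation, whereas scikit-learn defaults to k-means++. The function name kmeans_sketch and the toy blobs are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain Lloyd iteration; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its samples
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well-separated blobs.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels_demo, centers_demo = kmeans_sketch(X_demo, k=2)
print(np.round(centers_demo, 1))
```

Because every distance is computed on the raw feature values, columns with large numeric ranges would dominate the result, which is why the data is z-score standardised before fitting.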

4.4 Probability density plots

    def density_plot(data):
        """Draw one kernel density curve per numeric column."""
        plt.rcParams['axes.unicode_minus'] = False   # render minus signs correctly
        axes = data.plot(kind='kde', linewidth=2, subplots=True,
                         sharex=False, figsize=(10, 15))
        for ax in axes:
            ax.set_ylabel('Density')
        plt.legend()
        return plt

    # Density plots show more detail, but the contrast between clusters is less obvious.
    pic_output = r'.\data\pd_'   # file name prefix for the density plots
    for i in range(k):
        density_plot(data[res['class'] == i]).savefig('%s%s.png' % (pic_output, i))
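For readers who want to see what a density curve is numerically, the sketch below fits a kernel density estimate with scipy.stats.gaussian_kde, which pandas relies on for kind='kde'. The sample is randomly generated and merely stands in for one numeric column of one cluster; the grid range and all variable names are assumptions for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample: 200 draws standing in for one numeric column of one cluster.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=10.0, size=200)

# Fit a Gaussian kernel density estimate (default bandwidth: Scott's rule).
kde = gaussian_kde(sample)

# Evaluate the estimated density on a regular grid.
grid = np.linspace(0.0, 100.0, 501)
density = kde(grid)

# The curve should integrate to about 1 and peak near the bulk of the sample.
step = grid[1] - grid[0]
area = float(density.sum() * step)
peak = float(grid[density.argmax()])
print(f"area = {area:.3f}, peak at {peak:.1f}")
```

Overlaying such curves for the same column across clusters is one way to compare how a feature is distributed in each group.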