2019-04-13

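User-to-wifi closeness: a hands-on machine learning tutorial

This post is a hands-on walkthrough of running a machine learning job in Jupyter; the resulting model is not guaranteed to be usable. The environment is Jupyter, and you can skip the data preparation part and go straight to the Python script. It uses wifi usage data, selects two features, and trains a logistic regression model.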

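uid2wifi scoring

uid2wifi model training

The base data is users' wifi usage records. The final trained score represents how closely a user is tied to a given wifi, and the features are how frequently the user uses that wifi.

Data preparation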

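Download the labeled data from Hive: prepare_data -i tmp.uid2wifi_feature -o . --merge true -db bi. In tmp.uid2wifi_feature, user-wifi pairs with a high usage frequency across the ten-day windows get flag 1 (their counts are capped at 32), and pairs used only once get flag 0. There are five features in total: hour_count, ten_days_count, night_count, daytime_count, and weekend_count. Training with all five features performed worse than training with two, so the final model uses only hour_count and ten_days_count.

Extracting the training data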

drop table if exists tmp.uid2wifi_feature;
create table tmp.uid2wifi_feature as
select uid,size(four_hour_set) hour_count
,size(ten_days_set) ten_days_count
,night_count 
,daytime_count
,weekend_count
,0 as label
from tmp.tmp_uid2wifi_flag 
  where uid is not null and size(ten_days_set)=1 and size(four_hour_set)<6 and night_count<3 and weekend_count<3 order by rand() limit 60000;
insert into tmp.uid2wifi_feature 
select uid
, case when size(four_hour_set)>=32  then 32 else size(four_hour_set) end as hour_count
,case when size(ten_days_set)>=32  then 32 else size(ten_days_set) end as ten_days_count
,case when night_count>=32 then 32 else night_count end as night_count 
,case when daytime_count>=32 then 32 else daytime_count end as daytime_count
,case when weekend_count>=32 then 32 else weekend_count end as weekend_count
,1 as label
  from tmp.tmp_uid2wifi_flag 
  where 
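   -- four-hour slots above the 99% level
--   size(four_hour_set)>=90
      -- fewer than 32 four-hour slots, ten-day count above the 99% level
   (size(four_hour_set)<32 and size(ten_days_set)>6)
  -- hour count above the 90% level, ten-day count above the 90% level
   or size(ten_days_set)>5
  -- ten-day count above the 80% level, weekend count above the 95% level
  or  (size(ten_days_set)>=3 and weekend_count>=4 ) 
  -- ten-day count above the 80% level, night slots or daytime slots above the 95% level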
  or (size(ten_days_set)>=3 and night_count>=10 ) 
  order by rand() limit 60000;
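
To make the labeling rules in the two queries easier to check, here is a minimal Python sketch of the same conditions. The function names (is_negative, is_positive, label, cap) are hypothetical and only mirror the where clauses above; the uid null check and the random sampling are left out.

```python
CAP = 32  # the second query caps every count at 32


def cap(value, limit=CAP):
    """Mirror of `case when x >= 32 then 32 else x end`."""
    return min(value, limit)


def is_negative(hours, ten_days, night, weekend):
    """Conditions of the first query (label 0)."""
    return ten_days == 1 and hours < 6 and night < 3 and weekend < 3


def is_positive(hours, ten_days, night, weekend):
    """OR-ed conditions of the second query (label 1)."""
    return (
        (hours < 32 and ten_days > 6)
        or ten_days > 5
        or (ten_days >= 3 and weekend >= 4)
        or (ten_days >= 3 and night >= 10)
    )


def label(hours, ten_days, night, weekend):
    """Return 0, 1, or None when a row matches neither query."""
    if is_negative(hours, ten_days, night, weekend):
        return 0
    if is_positive(hours, ten_days, night, weekend):
        return 1
    return None
```

Written this way, one property stands out: every negative row has a ten-day count of exactly 1 and every positive row has a ten-day count of at least 3, so ten_days_count alone separates the two classes.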

Python model training script


# Load the data
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
df = pd.read_csv('tmp.uid2wifi_feature.csv')
# Hold out 30% of the data for testing; the rest is training data
#feature_names=['hour_count','ten_days_count','night_count','daytime_count','weekend_count']
feature_names=['hour_count','ten_days_count']
label=['label']
x_train, x_test, y_train, y_test = train_test_split(df[feature_names], df[label], test_size=0.3, random_state=111)
# Run logistic regression (newer scikit-learn releases deprecate multi_class; drop it if it is rejected)
clf = LogisticRegressionCV(cv=5, random_state=0, multi_class='multinomial', penalty='l2', class_weight='balanced')
# ravel() turns the single-column label frame into the 1-D array that fit() expects
clf.fit(x_train.values, y_train.values.ravel())
# Helper that computes AUC and KS
def compute_auc_ks(y_true, y_pred):
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred)
    auc = metrics.auc(fpr, tpr)
    ks = (tpr - fpr).max()
    return auc, ks
    
# Predict on the held-out 30%
fn_pred = clf.predict_proba(x_test.values)[:,1]    
# Check AUC and KS. Annoyingly, this returns auc 1.0, ks 1.0 for me.
# The labels were assigned by thresholds on these same features, so the classes are perfectly separable.
compute_auc_ks(y_test.values.ravel(), fn_pred)
# Check accuracy on the held-out 30%
score = clf.score(x_test.values, y_test.values.ravel())
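
To show what AUC and KS measure, and why labels built from thresholds on the features push both to 1.0, here is a small pure-Python sketch. It is an illustration only, not the scikit-learn implementation; roc_points and auc_ks are hypothetical names, and the toy data is made up to mimic the labeling rules (negatives with a ten-day count of 1, positives with 3 or more).

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists with one point per distinct score threshold."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    # Walk the samples from the highest score to the lowest
    ranked = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc_ks(y_true, y_score):
    """AUC by the trapezoidal rule, KS as the largest gap between TPR and FPR."""
    fpr, tpr = roc_points(y_true, y_score)
    auc = sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )
    ks = max(t - f for f, t in zip(fpr, tpr))
    return auc, ks


# Toy data: the score is just the ten-day count, and the classes never overlap
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_ten_days = [1, 1, 1, 1, 3, 4, 6, 7]
print(auc_ks(toy_labels, toy_ten_days))  # (1.0, 1.0)
```

Because no negative ever scores as high as a positive, the ROC curve goes straight up and then straight across, which is the same situation the real data is in.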

uid2wifi model testing

Data preparation

Randomly sample 1000 rows: prepare_data -i tmp.uid2wifi_feature -o . --merge true -db bi

Python test script

# Load the data
import pandas as pd
predict_df = pd.read_csv('tmp.uid2wifi_predict.csv')
# Select the columns
from sklearn.model_selection import train_test_split
predict_feature_names=['hour_count','ten_days_count','night_count','daytime_count','weekend_count']
label=['label']
x_pred_train, x_pred_test, y_pred_train, y_pred_test = train_test_split(predict_df[predict_feature_names], predict_df[label], test_size=0.3, random_state=111)
# Predict, using the same two features the model was trained on
x_pred_test_result = clf.predict_proba(x_pred_test[['hour_count','ten_days_count']].values)
# Inspect the predictions (copy first so the new column is not written into a slice)
pred_test_df = x_pred_test.copy()
pred_test_df['score'] = x_pred_test_result[:,1]
pred_test_df
# Inspect the formula
clf.coef_,clf.intercept_

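Feature weights and intercept: (array([[ 0.94491135, 2.06726915]]), array([-14.6439925])).

The corresponding formula is y = 0.94491135*hour_count + 2.06726915*ten_days_count - 14.6439925

The corresponding score is score = 1 / (1 + Math.pow(Math.E, -1*y)). The formula is simple enough that writing it directly as a method is more convenient than exporting a PMML file.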
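
As a minimal sketch of that "just write a method" approach, the formula can be coded directly in Python. The constants are copied from the clf.coef_ and clf.intercept_ output above; the function names are hypothetical. One assumption to verify: this uses the plain logistic link, while scikit-learn with multi_class='multinomial' on a two-class problem applies a two-class softmax in predict_proba, which may correspond to using 2*y instead of y, so it is worth comparing a few values against clf.predict_proba before relying on it.

```python
import math

# Copied from clf.coef_ and clf.intercept_ in the text above
W_HOUR = 0.94491135
W_TEN_DAYS = 2.06726915
INTERCEPT = -14.6439925


def linear_term(hour_count, ten_days_count):
    """Weighted sum of the two features plus the (negative) intercept."""
    return W_HOUR * hour_count + W_TEN_DAYS * ten_days_count + INTERCEPT


def uid2wifi_score(hour_count, ten_days_count):
    """Logistic (sigmoid) transform of the linear term, giving a value in (0, 1)."""
    y = linear_term(hour_count, ten_days_count)
    return 1.0 / (1.0 + math.exp(-y))
```

For example, uid2wifi_score(1, 1) is close to 0 (a wifi used once), while uid2wifi_score(10, 6) is close to 1 (a wifi used across many four-hour slots and ten-day windows).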

Generating the PMML file

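Install the dependency: pip install sklearn2pmml -t /notebook/shared/extra/
Download the tool that converts the binary file into a PMML file: git clone https://github.com/jpmml/jpmml-sklearn.git
Build jpmml-sklearn: mvn clean install -D skipTests
Use a Python script to write out the binary file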

import sys
sys.path.append('/notebook/shared/extra/')
# Standalone joblib package; sklearn.externals.joblib is gone from newer scikit-learn releases
import joblib
from sklearn2pmml import PMMLPipeline
pipeline = PMMLPipeline([("classifier", clf)])
# Fit on the uid2wifi training data so the exported model is the one evaluated above
pipeline.fit(x_train, y_train.values.ravel())
joblib.dump(pipeline, "uid-wifi-score.pkl.z", compress = 9)

Convert the binary file into a PMML file: java -jar target/jpmml-sklearn-executable-1.5-SNAPSHOT.jar --pkl-input ~/uid-wifi-score.pkl.z --pmml-output uid-wifi-score.pmml
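
If it is more convenient to stay inside the notebook, the same conversion command can be launched from Python. This is only a sketch: build_convert_command and convert are hypothetical helper names, and the flags are exactly the ones from the command above. Since no shell is involved, a leading ~ is not expanded automatically, so the sketch expands it explicitly.

```python
import os
import subprocess


def build_convert_command(jar_path, pkl_path, pmml_path):
    """Build the argument list for the jpmml-sklearn converter shown above."""
    return [
        "java", "-jar", jar_path,
        "--pkl-input", os.path.expanduser(pkl_path),
        "--pmml-output", os.path.expanduser(pmml_path),
    ]


def convert(jar_path, pkl_path, pmml_path):
    """Run the converter and raise CalledProcessError if it fails."""
    subprocess.run(build_convert_command(jar_path, pkl_path, pmml_path), check=True)
```

Calling convert("target/jpmml-sklearn-executable-1.5-SNAPSHOT.jar", "~/uid-wifi-score.pkl.z", "uid-wifi-score.pmml") would then run the same step as the command line above.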

All done.

甲鱼的大数据之旅 (Jiayu's Big Data Journey)

From getting started to running away
