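User-to-WiFi affinity: a hands-on machine learning tutorial

This post is a hands-on walkthrough of running machine learning in Jupyter; the trained result is not guaranteed to be usable. The environment is Jupyter, and you can skip the data preparation part and go straight to the Python scripts. It uses WiFi data, selects 2 features, and trains with logistic regression.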

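uid2wifi scoring

uid2wifi model training

The base data is users' WiFi usage records. The final trained score is the affinity between a user and a given WiFi network, and the features are how often the user uses that WiFi.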

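Data preparation

Download the labeled data from Hive: prepare_data -i tmp.uid2wifi_feature -o . --merge true -db bi. In tmp.uid2wifi_feature, records with a high usage frequency over the 10-day windows (counts capped at 32) are labeled 1, and records where the WiFi was used only once are labeled 0. There are 5 features in total: hour_count, ten_days_count, night_count, daytime_count, and weekend_count. Training with all 5 features performed worse than training with 2, so the final model uses 2 features: hour_count and ten_days_count.

Extracting the training data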

drop table if exists tmp.uid2wifi_feature;
-- Negative samples (label 0): WiFi seen in a single 10-day window and rarely otherwise
create table tmp.uid2wifi_feature as
select uid
    ,size(four_hour_set) as hour_count
    ,size(ten_days_set) as ten_days_count
    ,night_count
    ,daytime_count
    ,weekend_count
    ,0 as label
from tmp.tmp_uid2wifi_flag
where uid is not null and size(ten_days_set)=1 and size(four_hour_set)<6 and night_count<3 and weekend_count<3
order by rand() limit 60000;
-- Positive samples (label 1): frequently used WiFi, with every count capped at 32
insert into tmp.uid2wifi_feature
select uid
    ,case when size(four_hour_set)>=32 then 32 else size(four_hour_set) end as hour_count
    ,case when size(ten_days_set)>=32 then 32 else size(ten_days_set) end as ten_days_count
    ,case when night_count>=32 then 32 else night_count end as night_count
    ,case when daytime_count>=32 then 32 else daytime_count end as daytime_count
    ,case when weekend_count>=32 then 32 else weekend_count end as weekend_count
    ,1 as label
from tmp.tmp_uid2wifi_flag
where
    -- 4-hour segments above the 99th percentile
    -- size(four_hour_set)>=90
    -- fewer than 32 four-hour segments, 10-day count above the 99th percentile
    (size(four_hour_set)<32 and size(ten_days_set)>6)
    -- hour count above the 90th percentile, 10-day count above the 90th percentile
    or size(ten_days_set)>5
    -- 10-day count above the 80th percentile, weekend count above the 95th percentile
    or (size(ten_days_set)>=3 and weekend_count>=4 )
    -- 10-day count above the 80th percentile, night segments (or daytime segments) above the 95th percentile
    or (size(ten_days_set)>=3 and night_count>=10 )
order by rand() limit 60000;
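
To make the labeling rules easier to check by hand, here is a minimal Python sketch of the same conditions. The function names (`cap`, `is_negative`, `is_positive`) and the dictionary keys are hypothetical; `four_hour` and `ten_days` stand for `size(four_hour_set)` and `size(ten_days_set)`, and the `uid is not null` filter and random sampling are left out.

```python
CAP = 32  # same upper bound as the CASE WHEN ... >= 32 THEN 32 expressions


def cap(value, limit=CAP):
    """Clamp a count to the upper limit, like the SQL CASE expression."""
    return min(value, limit)


def is_negative(row):
    """Label 0: WiFi seen in one 10-day window and rarely otherwise."""
    return (row["ten_days"] == 1 and row["four_hour"] < 6
            and row["night_count"] < 3 and row["weekend_count"] < 3)


def is_positive(row):
    """Label 1: any of the frequent-usage conditions from the WHERE clause."""
    return ((row["four_hour"] < 32 and row["ten_days"] > 6)
            or row["ten_days"] > 5
            or (row["ten_days"] >= 3 and row["weekend_count"] >= 4)
            or (row["ten_days"] >= 3 and row["night_count"] >= 10))


# Hypothetical example rows, not taken from the real table
one_off = {"four_hour": 2, "ten_days": 1, "night_count": 0, "weekend_count": 1}
regular = {"four_hour": 40, "ten_days": 7, "night_count": 12, "weekend_count": 5}
in_between = {"four_hour": 10, "ten_days": 2, "night_count": 0, "weekend_count": 0}
```

Rows like `in_between` match neither rule, so they never appear in the training table; the model only ever sees the two extremes.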

Python model training script

# Load the data
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

df = pd.read_csv('tmp.uid2wifi_feature.csv')

# Hold out 30% of the data for testing; the rest is used for training
# feature_names = ['hour_count', 'ten_days_count', 'night_count', 'daytime_count', 'weekend_count']
feature_names = ['hour_count', 'ten_days_count']
label = ['label']
x_train, x_test, y_train, y_test = train_test_split(df[feature_names], df[label], test_size=0.3, random_state=111)

# Run logistic regression
# (multi_class is deprecated since scikit-learn 1.5; omit it on newer versions)
clf = LogisticRegressionCV(cv=5, random_state=0, multi_class='multinomial', penalty='l2', class_weight='balanced')
# ravel() turns the (n, 1) label column into the 1-D array that fit() expects
clf.fit(x_train.values, y_train.values.ravel())

# Helper that computes AUC and KS
def compute_auc_ks(y_true, y_pred):
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred)
    auc = metrics.auc(fpr, tpr)
    ks = (tpr - fpr).max()
    return auc, ks

# Predict on the 30% held-out data
fn_pred = clf.predict_proba(x_test.values)[:, 1]
# Check AUC and KS; to my surprise this returned AUC 1.0 and KS 1.0
compute_auc_ks(y_test.values.ravel(), fn_pred)
# Check accuracy (on the training set)
score = clf.score(x_train.values, y_train.values.ravel())
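
The AUC 1.0 and KS 1.0 are less mysterious once you look back at the extraction SQL: every label-0 row has size(ten_days_set)=1, while every label-1 condition requires at least 3, so ten_days_count alone separates the two classes. The sketch below illustrates this with a pure-Python version of the two metrics. It is not sklearn's implementation; the function names and the toy data are made up for illustration.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists with one point per distinct score threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(y_score), reverse=True):
        # Everything scoring at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def auc_ks(y_true, y_score):
    """AUC by the trapezoidal rule, KS as the largest gap between TPR and FPR."""
    fpr, tpr = roc_points(y_true, y_score)
    auc = sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
              for i in range(1, len(fpr)))
    ks = max(t - f for f, t in zip(fpr, tpr))
    return auc, ks


# Toy data shaped like the SQL rules: label 0 has ten_days_count == 1,
# label 1 has ten_days_count >= 3 (values are invented)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
ten_days_count = [1, 1, 1, 1, 3, 5, 6, 7]
separable = auc_ks(labels, ten_days_count)

# For contrast, a toy case where the classes overlap
overlapping = auc_ks([0, 1, 0, 1], [1, 2, 3, 4])
```

With the separable toy data both metrics come out at 1.0 using the raw feature as the score, before any model is involved. In other words, the perfect metrics here say more about how the labels were built than about how well the model would generalize.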

uid2wifi model testing

Data preparation

Randomly sample 1,000 records: prepare_data -i tmp.uid2wifi_feature -o . --merge true -db bi.

Python test script

# Load the data (clf is the model trained in the training script)
import pandas as pd
from sklearn.model_selection import train_test_split

predict_df = pd.read_csv('tmp.uid2wifi_predict.csv')
# Select the columns
predict_feature_names = ['hour_count', 'ten_days_count', 'night_count', 'daytime_count', 'weekend_count']
label = ['label']
x_pred_train, x_pred_test, y_pred_train, y_pred_test = train_test_split(predict_df[predict_feature_names], predict_df[label], test_size=0.3, random_state=111)
# Predict, using only the two features the model was trained on
x_pred_test_result = clf.predict_proba(x_pred_test[['hour_count', 'ten_days_count']].values)
# Inspect the predictions; copy() avoids writing into a view of x_pred_test
pred_test_df = x_pred_test.copy()
pred_test_df['score'] = x_pred_test_result[:, 1]
pred_test_df
# Inspect the formula
clf.coef_, clf.intercept_

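Feature weights and intercept: (array([[ 0.94491135, 2.06726915]]), array([-14.6439925])).

The corresponding formula is y = 0.94491135*hour_count + 2.06726915*ten_days_count - 14.6439925

The corresponding score is score=(1/( 1 + Math.pow(Math.E,(-1*y)))). The formula is simple enough that writing a small method directly is more convenient than exporting a PMML file.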
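
The score formula above can be written as a small function. This is a minimal sketch: the function name `uid2wifi_score` and the constant names are hypothetical, and the numbers are simply copied from the `clf.coef_` and `clf.intercept_` output shown above.

```python
import math

# Weights and intercept copied from clf.coef_ and clf.intercept_
W_HOUR_COUNT = 0.94491135
W_TEN_DAYS_COUNT = 2.06726915
INTERCEPT = -14.6439925


def uid2wifi_score(hour_count, ten_days_count):
    """Logistic (sigmoid) score for one user/WiFi pair (hypothetical helper)."""
    y = W_HOUR_COUNT * hour_count + W_TEN_DAYS_COUNT * ten_days_count + INTERCEPT
    return 1.0 / (1.0 + math.exp(-y))


low = uid2wifi_score(1, 1)    # WiFi used once: score very close to 0
high = uid2wifi_score(8, 4)   # y is about 1.18, so the score is about 0.77
```

One caveat, as far as I understand scikit-learn's behaviour: for a binary problem fitted with multi_class='multinomial', older versions compute predict_proba as a softmax over (-y, y), which equals the sigmoid of 2y rather than y. It is worth comparing this function against clf.predict_proba on a few rows before relying on it.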

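Generating the PMML file

  • Install the dependency: pip install sklearn2pmml -t /notebook/shared/extra/
  • Download the tool that converts the binary file into a PMML file: git clone https://github.com/jpmml/jpmml-sklearn.git
  • Build jpmml-sklearn: mvn clean install -DskipTests
  • Use a Python script to write out the binary file: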
import sys
sys.path.append('/notebook/shared/extra/')
import joblib  # sklearn.externals.joblib is gone from scikit-learn 0.23 onwards
from sklearn2pmml import PMMLPipeline

# Wrap the classifier and fit it on the uid2wifi training data
# (x_train and y_train come from the training script), not on the iris sample data
pipeline = PMMLPipeline([("classifier", clf)])
pipeline.fit(x_train.values, y_train.values.ravel())
joblib.dump(pipeline, "uid-wifi-score.pkl.z", compress = 9)
  • Convert the binary file into a PMML file: java -jar target/jpmml-sklearn-executable-1.5-SNAPSHOT.jar --pkl-input ~/uid-wifi-score.pkl.z --pmml-output uid-wifi-score.pmml

All done.