Boston House Price Prediction Exercise


1. Data Overview

The dataset contains 14 columns in total: 13 features plus the target variable. They are:

  • CRIM: per-capita crime rate by town
  • ZN: proportion of residential land
  • INDUS: proportion of non-retail business land per town
  • CHAS: Charles River dummy variable, used in the regression (1 if the tract bounds the river, 0 otherwise)
  • NOX: nitric oxide concentration (an environmental index)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built before 1940
  • DIS: weighted distance to five Boston employment centers
  • RAD: index of accessibility to highways
  • TAX: property-tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • B: proportion of Black residents by town
  • LSTAT: percentage of lower-income residents in the area
  • MEDV: median value of owner-occupied homes (the prediction target)

What we need to do:
1. Explore and analyze the known data.
2. Build a model and predict house prices.

2. Processing the Data

2.1 Importing the data

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

# load the dataset (CSV path as in the original project layout)
data = pd.read_csv('file/boston_house_prices.csv')
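Before plotting anything, it's worth a quick sanity check that the file loaded with all 14 columns. A minimal sketch (the expected 506 rows hold for the standard Boston dataset):

# quick sanity check on the loaded data
print(data.shape)              # the standard Boston dataset has 506 rows and 14 columns
print(data.columns.tolist())   # column names should match the field list above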

We can view the pairwise scatter plots of all variables:
pd.plotting.scatter_matrix(data, alpha=0.7, figsize=(10,10), diagonal='kde')

The full matrix is too large to read, so let's look at just three dimensions:

cols = ['RM', 'MEDV', 'LSTAT']
pd.plotting.scatter_matrix(data[cols], alpha=0.7, figsize=(10,10), diagonal='kde')

2.2 Finding correlations in the data

We can already see some correlations in the plots above; for example, RM and MEDV look linearly related. But with many variables, eyeballing scatter plots becomes impractical. Here we use a heat map together with the corr() function to visualize correlation strength directly.

Note that correlations can be positive or negative; here we take the absolute value as the measure of strength. The code:

# compute the correlation matrix
corr = data.corr()
# plot the heat map; take the absolute value so that strong negative
# correlations also show up as hot colours
plt.imshow(np.abs(corr), cmap=plt.cm.hot, vmin=0, vmax=1)
plt.colorbar()
plt.rcParams["font.sans-serif"] = ['SimHei']   # font that can render the Chinese labels used later
plt.title('Correlation heat map')
# tick labels on both axes
ticks = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
plt.xticks(np.arange(14), ticks, rotation=90)
plt.yticks(np.arange(14), ticks)
plt.show()

2.3 Feature selection

The analysis shows that the three attributes most strongly correlated with house price are RM (average number of rooms per dwelling), PTRATIO (pupil-teacher ratio by town), and LSTAT (percentage of lower-income residents).

These correlations also make intuitive sense and are easy to interpret, so we select these three attributes as the training data and plot their pairwise relationships.
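As a cross-check on that choice, the ranking can be computed directly rather than read off the heat map; a small sketch using corr():

# rank features by absolute correlation with the target MEDV
corr_with_medv = data.corr()['MEDV'].abs().sort_values(ascending=False)
print(corr_with_medv.head(4))   # the first entry is MEDV itself (correlation 1.0)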

2.4 特征归一化

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# copy() avoids pandas' SettingWithCopyWarning when adding columns below
features = data[['RM', 'PTRATIO', 'LSTAT']].copy()
for feature in ['RM', 'PTRATIO', 'LSTAT']:
    # scale each feature to the [0, 1] range
    features['标准化'+feature] = scaler.fit_transform(features[[feature]])

# scatter matrix of the normalized features
font = {'family': 'SimHei'}   # SimHei renders the Chinese column names
matplotlib.rc('font', **font)
pd.plotting.scatter_matrix(features[['标准化RM', '标准化PTRATIO', '标准化LSTAT']], alpha=0.7, figsize=(6,6), diagonal='hist')
plt.show()
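One caveat worth noting: fit_transform above learns the min and max from the whole dataset, including rows that will later land in the test set. For this exercise that's acceptable, but a stricter pipeline fits the scaler on training rows only. A hedged sketch of the leakage-free variant (the 350-row cut is arbitrary, for illustration only):

raw = data[['RM', 'PTRATIO', 'LSTAT']]
train_raw, test_raw = raw.iloc[:350], raw.iloc[350:]
leakfree_scaler = MinMaxScaler().fit(train_raw)     # min/max learned from training rows only
train_scaled = leakfree_scaler.transform(train_raw)
test_scaled = leakfree_scaler.transform(test_raw)   # test rows reuse the training min/max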

3. Model Selection and Optimization

3.1 Splitting the data into training and test sets

# sklearn.cross_validation was removed in newer scikit-learn versions;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
y = data[['MEDV']]
x_train, x_test, y_train, y_test = train_test_split(features[['标准化RM', '标准化PTRATIO', '标准化LSTAT']], y, test_size=0.3, random_state=33)
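A quick check that the 70/30 split came out as expected:

# verify the split: roughly 354 training rows vs 152 test rows out of 506
print(x_train.shape, x_test.shape)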

3.2 Model selection: use cross-validation to evaluate each model (split the shuffled training data into folds, train on some folds and score on the held-out fold, and average the resulting scores).

from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
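To make concrete what cv=5 does under the hood, here is a rough sketch of the equivalent manual loop using KFold (illustrative only, not part of the original code):

from sklearn.model_selection import KFold
from sklearn import linear_model

# cv=5 is shorthand for: split the training data into 5 folds,
# fit on 4 of them and score on the held-out fold, 5 times over
kf = KFold(n_splits=5)
for fit_idx, val_idx in kf.split(x_train):
    model = linear_model.LinearRegression()
    model.fit(x_train.iloc[fit_idx], y_train.iloc[fit_idx])
    print(model.score(x_train.iloc[val_idx], y_train.iloc[val_idx]))   # R^2 per fold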

3.3 Trying different models

We try four models: linear regression, support vector regression (SVR), KNN, and a decision tree.

  • Linear regression
from sklearn import linear_model

lr = linear_model.LinearRegression()
lr_predict = cross_val_predict(lr, x_train, y_train, cv=5)
lr_score = cross_val_score(lr, x_train, y_train, cv=5)
lr_meanscore = lr_score.mean()
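Cross-validation only scores the model on folds of the training data; as a complement (not in the original code), we can fit on the full training set and check R² on the held-out test set:

# held-out evaluation: fit on all training data, score on the unseen test set
lr.fit(x_train, y_train)
print(lr.score(x_test, y_test))   # R^2 on the test set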
  • SVR: try three kernels, linear, poly, and rbf
# SVR expects a 1-D target array, so flatten y_train first
y_train = y_train.values.ravel()

from sklearn.svm import SVR
linear_svr = SVR(kernel = 'linear')
linear_svr_predict = cross_val_predict(linear_svr, x_train, y_train, cv=5)
linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)

linear_svr_meanscore = linear_svr_score.mean()
print(linear_svr_meanscore)

poly_svr = SVR(kernel = 'poly')
poly_svr_predict = cross_val_predict(poly_svr, x_train, y_train, cv=5)
poly_svr_score = cross_val_score(poly_svr, x_train, y_train, cv=5)
poly_svr_meanscore = poly_svr_score.mean()
print(poly_svr_meanscore)

rbf_svr = SVR(kernel = 'rbf')
rbf_svr_predict = cross_val_predict(rbf_svr, x_train, y_train, cv=5)
rbf_svr_score = cross_val_score(rbf_svr, x_train, y_train, cv=5)
rbf_svr_meanscore = rbf_svr_score.mean()
print(rbf_svr_meanscore)
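The three blocks above differ only in the kernel name; an equivalent, more compact form is a single loop (same computation, just less repetition):

# equivalent compact form: one loop over the three kernels
for kernel in ['linear', 'poly', 'rbf']:
    svr = SVR(kernel=kernel)
    print(kernel, cross_val_score(svr, x_train, y_train, cv=5).mean())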
  • KNN model
    In KNN there is no way to know the best K in advance, so we test every value from 1 to 20.
from sklearn.neighbors import KNeighborsRegressor
score = []
for n_neighbors in range(1, 21):
    knn = KNeighborsRegressor(n_neighbors, weights='uniform')
    knn_predict = cross_val_predict(knn, x_train, y_train, cv=5)
    knn_score = cross_val_score(knn, x_train, y_train, cv=5)
    knn_meanscore = knn_score.mean()
    score.append(knn_meanscore)
plt.plot(range(1, 21), score)   # plot against K itself, not the list index
plt.xlabel('n-neighbors')
plt.ylabel('mean-score')
plt.show()

The resulting plot shows the score peaking at K = 2, where the model predicts best, so we choose K = 2.

The code:

n_neighbors = 2
knn = KNeighborsRegressor(n_neighbors, weights='uniform')
knn_predict = cross_val_predict(knn, x_train, y_train, cv=5)
knn_score = cross_val_score(knn, x_train, y_train, cv=5)
knn_meanscore = knn_score.mean()
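The manual loop works, but scikit-learn also has a standard tool for this kind of search; an alternative sketch using GridSearchCV (equivalent in spirit to the loop above):

from sklearn.model_selection import GridSearchCV

# grid-search K from 1 to 20 with 5-fold CV, same as the manual loop
param_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsRegressor(weights='uniform'), param_grid, cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)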
  • Decision tree

As with KNN, we cannot know the best depth in advance, so we first test a range of values and pick the best one.

from sklearn.tree import DecisionTreeRegressor
score = []
for n in range(1, 11):
    dtr = DecisionTreeRegressor(max_depth=n)
    dtr_predict = cross_val_predict(dtr, x_train, y_train, cv=5)
    dtr_score = cross_val_score(dtr, x_train, y_train, cv=5)
    dtr_meanscore = dtr_score.mean()
    score.append(dtr_meanscore)
plt.plot(range(1, 11), score)
plt.xlabel('max_depth')
plt.ylabel('mean-score')
plt.show()

The plot shows clearly that max_depth = 5 gives the best score.

n=5
dtr = DecisionTreeRegressor(max_depth = n)
dtr_predict = cross_val_predict(dtr, x_train, y_train, cv=5)
dtr_score = cross_val_score(dtr, x_train, y_train, cv=5)
dtr_meanscore = dtr_score.mean()

4. Model Evaluation

# collect the five fold scores of each model and compare their distributions
evaluating = {
        'lr': lr_score,
        'linear_svr': linear_svr_score,
        'poly_svr': poly_svr_score,
        'rbf_svr': rbf_svr_score,
        'knn': knn_score,
        'dtr': dtr_score
        }
evaluating = pd.DataFrame(evaluating)
print(evaluating)
evaluating.plot.kde()   # density estimate of each model's CV scores
plt.show()
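The KDE plot compares score distributions; for a single summary number per model, the fold means can also be printed directly:

# mean CV score per model, best first
print(evaluating.mean().sort_values(ascending=False))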

4.1 Score summary

The plot shows that the KNN model currently has the highest score.

4.2 Model tuning

  • Linear SVR: vary the penalty parameter C and observe its effect on the model.
lSVR_score = []
C_values = [1, 10, 1e2, 1e3, 1e4]
for i in C_values:
    linear_svr = SVR(kernel='linear', C=i)
    linear_svr_predict = cross_val_predict(linear_svr, x_train, y_train, cv=5)
    linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)
    linear_svr_meanscore = linear_svr_score.mean()
    lSVR_score.append(linear_svr_meanscore)
plt.semilogx(C_values, lSVR_score)   # C spans several orders of magnitude
plt.xlabel('C')
plt.ylabel('mean-score')
plt.show()

Looking at C = 1, 10, 100, 1000, and 10000, the plot shows that beyond C = 10 the score barely changes, so we take C = 10.

Retrain and re-score with this value:

linear_svr = SVR(kernel = 'linear', C=10)
linear_svr_predict = cross_val_predict(linear_svr, x_train, y_train, cv=5)
linear_svr_score = cross_val_score(linear_svr, x_train, y_train, cv=5)
linear_svr_meanscore = linear_svr_score.mean()
  • Similarly, tune the poly kernel by varying the penalty parameter C and the degree
for i in [1, 10, 1e2, 1e3, 1e4]:
    polySVR_score = []
    for j in range(1, 11):
        poly_svr = SVR(kernel='poly', C=i, degree=j)
        poly_svr_predict = cross_val_predict(poly_svr, x_train, y_train, cv=5)
        poly_svr_score = cross_val_score(poly_svr, x_train, y_train, cv=5)
        poly_svr_meanscore = poly_svr_score.mean()
        polySVR_score.append(poly_svr_meanscore)
    plt.plot(range(1, 11), polySVR_score, label='C=%g' % i)   # one curve per C
plt.xlabel('degree')
plt.ylabel('mean-score')
plt.legend()
plt.show()
  • This search runs slowly on my local machine, so I used Baidu's PaddlePaddle platform instead, which is reasonably fast. The results are shown below:

The plot shows that for C > 1000 the poly-kernel score changes little, and the score is highest at degree = 2, so we take C = 1000 and degree = 2.
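As a closing step (my addition, not shown in the original), the tuned models can be compared once on the held-out test set, using the hyperparameters chosen above:

# final R^2 of the tuned models on the held-out test set
for name, model in [('knn', KNeighborsRegressor(n_neighbors=2, weights='uniform')),
                    ('linear_svr', SVR(kernel='linear', C=10)),
                    ('poly_svr', SVR(kernel='poly', C=1000, degree=2))]:
    model.fit(x_train, y_train)
    print(name, model.score(x_test, y_test))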
