机器学习入门之基于Scrapy的网络爬虫和Sklearn的机器学习算法-职坐标

机器学习入门之基于Scrapy的网络爬虫和Sklearn的机器学习算法

小标 2018-11-28 来源：阅读 1450 评论 0

摘要：本文主要向大家介绍了机器学习入门之基于Scrapy的网络爬虫和Sklearn的机器学习算法，通过具体的内容向大家展现，希望对大家学习机器学习入门有所帮助。

本文主要向大家介绍了机器学习入门之基于Scrapy的网络爬虫和Sklearn的机器学习算法，通过具体的内容向大家展现，希望对大家学习机器学习入门有所帮助。

本着对网络爬虫的兴趣，在闲来无事时做了一个有关网络爬虫的项目，本项目用的是Scrapy爬虫框架，同时为了有效利用这些数据，用入门的sklearn对这些数据进行预处理并训练除了一个预测模型，下面开始本项目的介绍。

1、数据准备与爬虫

本项目以房天下网站北京市租房信息为对象，首先确定爬取的房屋属性为：标题、出租方式、户型、建筑面积、朝向、楼层、装修程度等因素。首先我们获取要爬取内容的首页为//zu.fang.com/，可以看到北京市有多个区，为了尽可能多的爬取租房数据，因此我们再每个区所在网页爬取100页。同时用xpath提取器来提取网页的内容，爬虫的代码如下：

import scrapy
from scrapy.http import Request
from zufang_scrapy.items import ZufangScrapyItem
import requests
from lxml import etree
class ZufangSpider(scrapy.Spider):
    name = 'zufang'
    headers = {'User_Agent':
                   'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.168 Safari/537.36'}
    def start_requests(self):
        for i in range(1, 150):
            start_url = "//zu.fang.com/house-a0" + str(i) + "/"
            for j in range(1, 100):
                url = start_url + "i3" + str(j) + "/"
                html=requests.get(url)
                xpath1=etree.HTML(html.text)
                href=xpath1.xpath(
                    '//p[@class="title"]/a/@href'
                )
                for link in href:
                    link1='//zu.fang.com'+link
                    yield Request(link1, headers=self.headers)

    def parse(self, response):
        zufang = response.xpath('//div[@class="wid1200 clearfix"]')
        for fangzi in zufang:
            title=fangzi.xpath('/html/body/div[5]/div[1]/div[1]/text()').extract()
            area = fangzi.xpath('//*[@id="agantzfxq_C01_03"]/text()').extract()

rent_style = fangzi.xpath('/html/body/div[5]/div[1]/div[3]/div[3]/div[1]/div[1]/text()').extract()

house_type= fangzi.xpath('/html/body/div[5]/div[1]/div[3]/div[3]/div[2]/div[1]/text()').extract()

house_area = fangzi.xpath('/html/body/div[5]/div[1]/div[3]/div[3]/div[3]/div[1]/text()').extract()

orientation = fangzi.xpath('/html/body/div[5]/div[1]/div[3]/div[4]/div[1]/div[1]/text()').extract()

floor = fangzi.xpath('/html/body/div[5]/div[1]/div[3]/div[4]/div[2]/div[1]/text()').extract()

decoration = fangzi.xpath('/html/body/div[5]/div[1]/div[3]/div[4]/div[3]/div[1]/text()').extract()

            price = fangzi.xpath('/html/body/div[5]/div[1]/div[3]/div[2]/div/i/text()').extract()
            for i in range(len(title)):
                item = ZufangScrapyItem()
                item['title'] = title[i]
                item['area'] = area[i]
                item['rent_style'] = rent_style[i]
                item['house_type'] = house_type[i]
                item['house_area'] = house_area[i][:-2]
                item['orientation'] = orientation[i]
                item['floor'] = floor[i][:-2]
                item['decoration'] = decoration
                item['price'] = price[i]

yield item

2、数据处理与sklearn

用Python的科学计算库Numpy和Pandas对爬取的数据进行去重，切片等处理操作，数据已经被保存在Mysql数据库中。处理完毕后，用sklearn训练处理后的数据并且用一些测试数据对模型进行评估。

skelarn算法：

import pandas as pd
from pylab import mpl
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['FangSong']
df = pd.read_csv(u'chaoyang.csv',encoding="gbk" )
def change_decoration(x):
    if x == "豪华装修":
        x = 0.9
    elif x == "精装修":
        x = 0.7
    elif x == "中装修":
        x = 0.5
    elif x == "简装修":
        x = 0.3
    else :
        x = 0.1
    return x
def change_floor(x):
    if x=="高层":
        x = 0.8
    elif x=="中层":
        x= 0.5
    else :
        x=0.2
    return x

df["decoration_num"] = df["decoration"].apply(change_decoration)
df["floor_num"] = df["floor"].apply(change_floor)
X = df[['house_area', 'floor_num', 'decoration_num','price']]
y = df[['price']]
sns.pairplot(df,x_vars=['house_area','floor_num','decoration_num'],y_vars=['price'],size=7,aspect=0.8,kind=)
plt.show()
predictors = ["house_area","floor_num","decoration_num"]
X = df[predictors]
y = df["price"].tolist()
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.05, random_state=1)
alg = LinearRegression()
alg.fit(X_train, y_train)