Published 2020-02-20, updated 2021-02-17. Category: study notes.

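Web Crawling Basics

Introduction to Web Crawlers

A web crawler (also called a web spider or web robot, and in the FOAF community more often a web scutter) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.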

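How a Crawler Works

A page request has two stages:
- Request: every page shown to a user starts here, with the client sending an access request to the server.
- Response: after receiving the request, the server checks that it is valid and sends the response content back to the client, which receives it and displays it.
The two request methods used most often are:
- GET: the most common method, generally used to fetch or query resources. Most sites serve their pages this way; the parameters travel in the URL and the response is usually fast.
- POST: unlike GET, it can also upload parameters as form data in the request body, so besides querying information it can modify it.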
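
The difference between the two methods can be seen without touching the network. The sketch below builds one GET and one POST request with the standard library only; the URL and the parameter names are made up for illustration. With requests, the rough equivalents would be `requests.get(url, params=params)` and `requests.post(url, data=params)`.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical query parameters, for illustration only.
params = {"keyword": "python", "page": 1}

# GET: the parameters are appended to the URL as a query string.
get_req = Request("http://example.com/search?" + urlencode(params))

# POST: the same parameters travel in the request body as form data.
post_req = Request("http://example.com/search",
                   data=urlencode(params).encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Nothing is sent here; the objects only describe what would go over the wire.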

Below is the source code I used to crawl a novel from a fiction site with GET requests:

# Import libraries
import re  # regular expressions
import requests

# Index page of the novel
url = "http://www.ddxsku.com/files/article/html/55/55756/index.html"
# Send browser-like headers so the site is less likely to reject the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"}
# Send the GET request
response = requests.get(url, headers=headers)
# Decode the response body as UTF-8
response.encoding = 'utf-8'
# Page source
html = response.text
# Table that holds the chapter list
table = re.findall(r'<table cellspacing="1" cellpadding="0" bgcolor="#E4E4E4" id="at">.*?</table>', html, re.S)[0]
# Novel title
title = re.findall(r'<h1>(.*?)</h1>', html)[0]
# List of (chapter URL, chapter title) pairs
chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', table)
# Text file for the raw, not yet reformatted novel text
raw_path = '%s未格式化完全版本.txt' % title
with open(raw_path, 'w', encoding='utf-8') as f:
    # Visit each chapter link in turn
    for chapter_url, chapter_title in chapter_info_list:
        # Send the same headers as for the index page
        chapter_response = requests.get(chapter_url, headers=headers)
        chapter_response.encoding = 'utf-8'
        chapter_html = chapter_response.text
        chapter_content = re.findall(r'<dd id="contents">(.*?)</dd>', chapter_html, re.S)[0]
        # Clean the data
        chapter_content = chapter_content.replace('<br />', '')
        chapter_content = chapter_content.replace(' ', '')
        chapter_content = chapter_content.replace('&nbsp;', '')
        # Write to the file (persist)
        f.write(chapter_title)
        f.write(chapter_content)
        print(chapter_title)

# Second pass: copy the non-blank lines into the final file, one blank line after each.
# Both files are closed automatically when the with block ends.
with open('%s.txt' % title, 'w', encoding='utf-8') as fn, open(raw_path, 'r', encoding='utf-8') as fo:
    for line in fo:
        if line.strip():
            fn.write(line)
            fn.write('\n')
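
The extraction and cleaning steps can be tried offline on a small sample before pointing the script at a real site. This is a minimal sketch: the sample markup is invented to match the `<dd id="contents">` pattern used above, `extract_chapter` is a hypothetical helper name, and only the `<br />` and `&nbsp;` replacements are reproduced.

```python
import re

# Hypothetical chapter page, shaped like the markup the regex expects.
chapter_html = (
    '<html><body><dd id="contents">'
    '&nbsp;&nbsp;First paragraph.<br />\n'
    '&nbsp;&nbsp;Second paragraph.<br />'
    '</dd></body></html>'
)


def extract_chapter(html):
    """Pull the chapter body out of the page and strip layout markup."""
    content = re.findall(r'<dd id="contents">(.*?)</dd>', html, re.S)[0]
    for junk in ('<br />', '&nbsp;'):
        content = content.replace(junk, '')
    return content


text = extract_chapter(chapter_html)
# Keep only the non-blank lines, as the second pass over the file does.
lines = [line for line in text.splitlines() if line.strip()]
print(lines)  # ['First paragraph.', 'Second paragraph.']
```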

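Note that this code does nothing to keep the IP from being banned when the volume of requests (bulk downloading) gets too high.
There are a few ways to deal with this:
- Add a delay between requests (at the cost of speed):
import time
time.sleep(2)
- Pass the proxies argument of requests:
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
response = requests.get(url, proxies=proxies)
- Build your own pool of proxy IPs and pass one of them to requests as a proxies dict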
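
The delay and the proxy pool can be combined. The sketch below is one possible arrangement, not a tested anti-ban recipe: `proxy_cycle` and `fetch_all` are hypothetical helper names, the pool addresses are placeholders, and `fetch` stands for any callable that accepts a `proxies` keyword the way `requests.get` does.

```python
import itertools
import time


def proxy_cycle(pool):
    """Yield proxies dicts in the shape requests expects, rotating forever."""
    for address in itertools.cycle(pool):
        yield {"http": address, "https": address}


def fetch_all(urls, fetch, pool, delay=2.0, sleep=time.sleep):
    """Fetch each URL through the next proxy, pausing between requests."""
    rotation = proxy_cycle(pool)
    pages = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # no pause is needed before the first request
        pages.append(fetch(url, proxies=next(rotation)))
    return pages


# Placeholder pool; real proxy addresses would go here.
pool = ["http://10.10.1.10:3128", "http://10.10.1.11:3128"]
rotation = proxy_cycle(pool)
print(next(rotation)["http"])  # http://10.10.1.10:3128
```

In the novel crawler this could be called as `fetch_all(chapter_urls, requests.get, pool)`, where `chapter_urls` is the list of chapter links.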

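Simple Use of the Beautiful Soup Module

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code.

Page structure differs from site to site. In this crawl I ran into nested <div> tags, where a simple regular expression may fail to find the matching closing </div> (the same goes for other tag types), so I brought in Beautiful Soup.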

import requests
from bs4 import BeautifulSoup

url = 'http://www.mingchaonaxieshier.com/'
response = requests.get(url)
response.encoding = 'utf-8'
html = response.text
soup = BeautifulSoup(html, 'html.parser')
# find_all is the current name of the older findAll alias
divs = soup.find_all(name='div', attrs={'class': 'main'})

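With the code above, the <div> tags are located accurately, and the matching elements, together with their contents, are stored in the variable divs.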
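
To see why the regular expression trips over nested tags, here is a small offline illustration. It deliberately uses only the standard library rather than Beautiful Soup: the non-greedy regex stops at the first closing tag, while a parser that counts nesting depth (the kind of bookkeeping an HTML parser does for you) finds the right one. The sample markup and the `MainDivText` class are invented for this sketch.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: the target div contains another div.
page = '<div class="main"><div class="title">Chapter 1</div><p>Body text</p></div>'

# The non-greedy match ends at the first </div>, which closes the inner div.
naive = re.findall(r'<div class="main">(.*?)</div>', page, re.S)[0]
print(naive)  # <div class="title">Chapter 1


class MainDivText(HTMLParser):
    """Collect the text inside <div class="main">, tracking nested divs."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a div nested inside the target
        elif ("class", "main") in attrs:
            self.depth = 1  # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = MainDivText()
parser.feed(page)
parser.close()
text = "".join(parser.chunks)
print(text)  # Chapter 1Body text
```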

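Simple Use of the w3lib html Module

w3lib consists of four main modules:
- html: handles issues related to HTML tags
- http: handles issues related to HTTP messages
- url: handles issues related to URLs
- encoding: handles issues related to character encodings

This time I used the html module to fix a problem where the scraped HTML still contained JavaScript tags.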

from w3lib.html import remove_tags_with_content

chapter_content = remove_tags_with_content(chapter_content,which_ones=('script','ins'),encoding='utf-8')

The code above removes the <script> and <ins> tags together with their contents.
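
For readers without w3lib installed, the effect of that call can be approximated with a regular expression. This is a rough standard-library sketch, not w3lib's implementation; `strip_tags_with_content` is a hypothetical name and the sample string is invented. It works for tags such as <script> that do not nest inside themselves.

```python
import re


def strip_tags_with_content(text, which_ones):
    """Remove each listed tag together with everything inside it."""
    for tag in which_ones:
        name = re.escape(tag)
        pattern = r"<%s\b[^>]*>.*?</%s\s*>" % (name, name)
        text = re.sub(pattern, "", text, flags=re.S | re.I)
    return text


sample = '<p>Chapter text</p><script>track();</script><ins class="ad">buy</ins>'
cleaned = strip_tags_with_content(sample, ("script", "ins"))
print(cleaned)  # <p>Chapter text</p>
```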

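The html module offers more than remove_tags_with_content(). Other commonly used functions include:
html.remove_tags()
html.remove_comments()
html.remove_entities()  (deprecated; newer w3lib releases provide replace_entities() instead)
For usage details and more functions in the html module, see the official documentation or this CSDN blogger's post.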

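Using the @retry Decorator in Python Crawlers

When writing a crawler, page requests made with requests often fail or raise errors. The usual approach is a pile of status and timeout checks plus repeated retries; the @retry decorator can take care of this instead.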

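import requests
from retrying import retry


# Wait 3 seconds between attempts; give up after 100 attempts
@retry(stop_max_attempt_number=100, wait_fixed=3000)
def get_chapter_url(chapter_url):
    # headers is the User-Agent dict defined in the first script
    chapter_response = requests.get(chapter_url, headers=headers, timeout=1)
    # Raise on HTTP error statuses so that they trigger a retry as well
    chapter_response.raise_for_status()
    chapter_response.encoding = 'utf-8'
    return chapter_response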

For more detailed usage, see this CSDN blogger's post.
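
What such a decorator does can be shown with a few lines of standard-library code. The sketch below is a simplified stand-in written for illustration, not the retrying library's implementation; its parameter names (`max_attempts`, `wait_seconds`) are made up, and the failing function only simulates network errors.

```python
import functools
import time


def retry(max_attempts=3, wait_seconds=0.0, sleep=time.sleep):
    """Call the wrapped function again when it raises, up to max_attempts times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(wait_seconds)
        return wrapper
    return decorator


calls = []


@retry(max_attempts=5, wait_seconds=0)
def flaky():
    """Fail twice, then succeed, to imitate an unreliable request."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = flaky()
print(result, len(calls))  # ok 3
```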

A Small Crawler I Wrote Recently

Scraping 51job (前程无忧) job listings (position, company, salary)

import re
import requests

key = "python"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"}
# Search URL up to the page number
list_url = "https://search.51job.com/list/090200,000000,0000,00,9,99," + key + ",2,"
response = requests.get(list_url + "1.html", headers=headers)
# The pages are GBK-encoded: decode the raw bytes as GBK to avoid garbled text
data = response.content.decode("gbk", "ignore")
# Total number of postings, taken from the text "共...条职位"
pat_page = "共(.*?)条职位"
total_jobs = int(re.findall(pat_page, data, re.S)[0])
# 50 postings per page, rounded up
allpage = (total_jobs + 49) // 50
job_url_pat = '<em class="check" name="delivery_em" onclick="checkboxClick.this."></em>.*?href="https://jobs.51job.com/(.*?).html.*?"'
pat_title = '<h1 title="(.*?)"'
pat_company = '<p class="cname">.*?title="(.*?)"'
pat_money = '<div class="cn">.*?<strong>(.*?)</strong>'
for i in range(allpage):
    page_num = str(i + 1)
    print("---Crawling page " + page_num + "---")
    response = requests.get(list_url + page_num + ".html", headers=headers)
    data_page = response.content.decode("gbk", "ignore")
    job_url_all = re.findall(job_url_pat, data_page, re.S)
    for job_url in job_url_all:
        thisurl = "https://jobs.51job.com/" + job_url + ".html"
        response = requests.get(thisurl, headers=headers)
        data_job = response.content.decode("gbk", "ignore")
        title = re.findall(pat_title, data_job, re.S)[0]
        company = re.findall(pat_company, data_job, re.S)[0]
        money = re.findall(pat_money, data_job, re.S)[0]
        print('------------------')
        print(title)
        print(company)
        print(money)
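
Two details of this crawler are easy to check offline: the GBK decoding and the page count. Re-encoding response.text with response.encoding just recovers the raw bytes, so it gives the same result as decoding the raw bytes directly. The sketch below simulates that with an invented response body and assumes, as the crawler does, 50 postings per page.

```python
import re

# Simulated raw response body: GBK-encoded bytes (invented sample).
raw = "共120条职位".encode("gbk")

# What response.text would hold if the body were decoded as ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# Round trip through the wrong encoding versus decoding the bytes directly.
via_round_trip = bytes(garbled, "iso-8859-1").decode("gbk", "ignore")
direct = raw.decode("gbk", "ignore")

total = int(re.findall(r"共(.*?)条职位", direct)[0])
pages = (total + 49) // 50  # ceiling division: 120 postings need 3 pages
print(direct, pages)
```

Ceiling division matters when the total is an exact multiple of 50: `100 // 50 + 1` asks for a third, empty page, while `(100 + 49) // 50` gives 2.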

Source: http://tianyuzhou.github.io/2020/02/20/爬虫初步/
Author: TIANYUZHOU
Tag: Python applications (crawlers)
