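Learning Python Web Scraping
Installation
Install the requests library from cmd.exe.
Testing
Open IDLE and check whether the installation succeeded: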
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import requests
>>> r = requests.get("http://www.baibu.com")
>>> r.status_code
200
>>> r.encoding = 'utf-8'
>>> r.text
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">\n<!-- saved from url=(0022)http://localhost:8552/ -->\n<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http-equiv="X-UA-Compatible" content="IE=8">\n\n<title>吉胜科技</title>\n\n<link rel="icon" type="image/x-icon" href="http://localhost:8552/template/images/favicon.ico">\n <link rel="shortcut icon" type="image/x-icon" href="http://localhost:8552/template/images/favicon.ico">\n <link rel="stylesheet" type="text/css" href="./吉胜科技_files/css.css">\n <link rel="stylesheet" type="text/css" href="./吉胜科技_files/sicent_gmyn.css">\n <link rel="stylesheet" type="text/css" href="./吉胜科技_files/niceforms-default.css">\n<script type="text/javascript" src="./吉胜科技_files/jquery-1.10.1.min.js.下载"></script>\n<style type="text/css">\n\n\n</style>\n</head>\n<body style="">\n<div class="header">\n <div class="logo_sicent">\n <div class="logo fl">\n \t<a href="http://jisheng.sicent.com/"><img src="./吉胜科技_files/logo.png"></a>\n </div>\n <a href="http://localhost:8552/index.html" id="an_an1" class="current"> </a>\n <a href="http://localhost:8552/group.html" id="an_an2"></a>\n <a href="http://localhost:8552/company.html" id="an_an3"></a>\n <a href="http://localhost:8552/joinus.html" id="an_an4"></a>\n <div class="select-group">\n\t <div class="select">访问相关网站</div>\n\t <div class="caret dropdown-toggle" data-toggle="dropdown">\n\t </div>\n\t <ul class="dropdown-menu">\n\t \t<li><a href="http://www.300113.com/" target="_blank">访问顺网科技</a></li>\n\t <li><a href="http://www.sicent.com/" target="_blank">吉胜产品中心</a></li>\n\t <li><a href="http://www1.sicent.com/SC_UserLogin.aspx" target="_blank">访问用户中心</a></li>\n\t <li><a href="http://pay.sicent.com/" target="_blank">吉胜充值中心</a></li>\n\t <li><a href="http://agent.sicent.com/BLogin.aspx" target="_blank">吉胜代理平台</a></li>\n\t <li><a href="http://help.sicent.com/" target="_blank" 
class="last">吉胜帮助中心</a></li>\n\t </ul>\n\t </div>\n </div>\n</div>\n<div class="banner">\n <div class="index_view" id="view_img">\n <ul>\n <li id="banner51" style="display: list-item;">\n <a href="javascript:void(0);" style="background:url('http://image.sicent.com/images/1402391207934.jpg') repeat scroll center center transparent;"></a>\n </li>\n <li id="banner52" style="display: none;">\n <a href="javascript:void(0);" style="background:url('http://image.sicent.com/images/1402391233603.jpg') repeat scroll center center transparent;"></a>\n </li>\n </ul>\n <div class="banner_bj "> \n <div class="news_ico w980">\n <ol class="activeOL">\n <li><a href="http://localhost:8552/#" οnclick="s.showIndex(0);" class="active"></a></li>\n <li><a href="http://localhost:8552/#" οnclick="s.showIndex(1);" class=""></a></li>\n </ol>\n </div>\n </div> \n </div>\n</div>\n<!--banner结束-->\n<div class="middlebg">\n<div class="w980 content">\n <div class=" nrkj nrkj_1 text">\n <a href="http://localhost:8552/company.html"><img src="./吉胜科技_files/icon_sicent_1.jpg"></a>\n <p>\n 网吧行业第一营销服务商——成都吉胜科技有限责任公司成立于2001年。自创立之日起,始终坚持"为网吧创造全新价值"的经营理念,为网吧提供与时俱进的经营管理解决方案和增值服务解...\n </p>\n <p><a href="http://localhost:8552/company.html" class="more">详细>></a></p>\n </div>\n <div class="line"></div>\n \n <div class="nrkj nrkj_2">\n <a href="http://localhost:8552/company/news.html"><img src="./吉胜科技_files/icon_sicent_2.jpg"></a>\n <div class="news">\n\t\t\t <a href="http://localhost:8552/company/newsdetail-164.html" class="text_a">\n\t\t\t <span class="title">[官方新闻]万象2004停止维护公告</span>\n\t\t\t <span class="date">02-25</span></a>\n\t\t\t </div>\n <div class="news">\n\t\t\t <a href="http://localhost:8552/company/newsdetail-161.html" class="text_a">\n\t\t\t <span class="title">[官方新闻]吉胜用户中心、充值中心维护公告</span>\n\t\t\t <span class="date">07-02</span></a>\n\t\t\t </div>\n <div class="news">\n\t\t\t <a href="http://localhost:8552/company/newsdetail-158.html" class="text_a">\n\t\t\t <span class="title">[官方新闻]万象网管OL 
V4.2全新上线</span>\n\t\t\t <span class="date">04-25</span></a>\n\t\t\t </div>\n <div class="news">\n\t\t\t <a href="http://localhost:8552/company/newsdetail-157.html" class="text_a">\n\t\t\t <span class="title">[官方新闻]官网全新升级公告</span>\n\t\t\t <span class="date">04-01</span></a>\n\t\t\t </div>\n <p><a href="http://localhost:8552/company/news.html" class="more">更多>></a></p>\n </div>\n <div class="line"></div>\n <div class="nrkj nrkj_3">\n <a href="http://localhost:8552/company/contact.html"><img src="./吉胜科技_files/icon_sicent_3.jpg"></a>\n <img src="./吉胜科技_files/teleno.jpg">\n <p class="text">\n 传真:028-85227531<br>\n 邮编:610041<br>\n 成都市高新孵化园8号楼附2号楼德商国际A座9层<br>\n </p>\n </div>\n </div>\n</div>\n<div class="bottom">\n <div class="bottom_nr w980">\n <!--底部左边开始-->\n <div class="fl text12_hui">\n \n <a href="http://www.300113.com/" target="_blank" class="text12_bai">顺网科技</a>\n | \n <a href="http://www.sicent.com/" target="_blank" class="text12_bai">产品中心</a>\n | \n <a href="http://www1.sicent.com/SC_UserLogin.aspx" target="_blank" class="text12_bai">用户中心</a>\n | \n <a href="http://pay.sicent.com/" target="_blank" class="text12_bai">充值中心</a>\n | \n <a href="http://agent.sicent.com/BLogin.aspx" target="_blank" class="text12_bai">代理平台</a>\n | \n <a href="http://help.sicent.com/" target="_blank" class="text12_bai">帮助中心</a>\n | \n <a href="http://jisheng.sicent.com/sitemap.html" target="_blank" class="text12_bai">网站地图</a>\n <div class="text12_hui">\n <p>成都吉胜科技有限责任公司版权所有 2006~2018. 软件企业认定证书 <a href="http://www.miitbeian.gov.cn/" target="_blank">蜀ICP备05001520号</a><a target="_blank" class="set-color" href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=51019002001639"> <img src="./吉胜科技_files/beian.png">川公网安备 51019002001639号</a> </p>\n <p>Copyright 2006~2018. 
Chengdu Sicent Technology Co.,Ltd all rights reserved </p>\n </div>\n </div>\n <a href="http://jisheng.sicent.com/" id="logo_js"></a>\n <a href="http://www.300113.com/" target="_new" id="logo_sw"></a>\n <!--底部右边结束-->\n </div>\n</div>\n<!--底部结束-->\n<script src="./吉胜科技_files/bootstrap.modal.js.下载" type="text/javascript"></script>\n<script src="./吉胜科技_files/niceforms.js.下载" type="text/javascript"></script>\n<script src="./吉胜科技_files/index.js.下载" type="text/javascript"></script>\n<script type="text/javascript" src="./吉胜科技_files/onload.js.下载"></script>\n<script type="text/javascript" src="./吉胜科技_files/carousel.js.下载"></script>\n\n\n</body></html>'
>>>
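requests methods
requests.request(): constructs a request; it is the base method that underlies all of the methods below.
**kwargs is a variable-length keyword parameter: arguments passed in keyword form are collected by Python into a dictionary.
Some of the most commonly used optional parameters:
params: a dictionary, a list of tuples, or bytes, appended to the URL as the query string; normally used with GET requests, and also allowed (though uncommon) with POST requests.
data: a dictionary, a list of tuples, bytes, or a file object, sent as the body of a POST request.
json: JSON-serializable data, sent as the JSON body of a POST request.
files: a dictionary used to upload files as the file stream of a POST request.
headers: a dictionary of HTTP request headers.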
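requests.get(): the main method for retrieving an HTML page; corresponds to HTTP GET
requests.head(): retrieves only the header information of an HTML page; corresponds to HTTP HEAD
requests.post(): submits a POST request to an HTML page; corresponds to HTTP POST
requests.put(): submits a PUT request to an HTML page; corresponds to HTTP PUT
requests.patch(): submits a partial-modification request to an HTML page; corresponds to HTTP PATCH
requests.delete(): submits a delete request to an HTML page; corresponds to HTTP DELETE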
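The optional parameters can be inspected without sending anything over the network. The sketch below is one way to do that, assuming requests.Request(...).prepare() is used to build the request offline. The helper name build, the URL, and the values are placeholders chosen for illustration and do not come from the original notes.

```python
import requests


def build(method, url, **kwargs):
    """Hypothetical helper: show that **kwargs arrives as a dict, then
    build (but do not send) the corresponding request."""
    print("kwargs received:", kwargs)
    return requests.Request(method, url, **kwargs).prepare()


# params are appended to the URL as a query string
get_req = build("GET", "http://www.baidu.com/s",
                params={"wd": "python"},
                headers={"user-agent": "Mozilla/5.0"})
print(get_req.url)

# json is serialized into the request body and sets the Content-Type header
post_req = build("POST", "http://www.baidu.com/s", json={"key": "value"})
print(post_req.headers["Content-Type"])
```

Because nothing is sent, this is only a way to see how the arguments shape the request; requests.get() and the other methods accept the same keyword arguments.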
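Robots protocol
Using Baidu as an example
Visiting http://www.baidu.com/robots.txt returns the following Robots rules.
Basic Robots syntax: # starts a comment, * means all, / means the root directory, User-agent names the crawler that a group of rules applies to, and Disallow gives a resource path that this crawler is not allowed to access.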
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: *
Disallow: /
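Rules like these can also be checked programmatically. The following is a minimal sketch using the standard library's urllib.robotparser on a shortened copy of the listing (only two groups are kept), so it runs without network access. The agent name MyCrawler and the helper allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules listed above (only two groups kept).
RULES = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Hypothetical helper: may `agent` fetch `path` on www.baidu.com?"""
    return parser.can_fetch(agent, "http://www.baidu.com" + path)


print(allowed("Baiduspider", "/baidu"))       # listed under Disallow for Baiduspider
print(allowed("Baiduspider", "/index.html"))  # not covered by any Disallow line
print(allowed("MyCrawler", "/index.html"))    # falls under "User-agent: *"
```

A crawler that is not named explicitly falls back to the User-agent: * group, which here disallows the whole site.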
Example demonstrations
Example 1: Scraping a JD.com product page
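import requests

class Demo01():
    try:
        r = requests.get("https://item.jd.com/100004404944.html")
        # raise_for_status() raises an exception for 4xx/5xx status codes
        r.raise_for_status()
        # Decode the text with the encoding guessed from the content
        r.encoding = r.apparent_encoding
        # Take the first 1000 characters (the start index 0 is omitted)
        print(r.text[:1000])
    except requests.RequestException:
        print("Scrape failed")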
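The same fetch, check, decode pattern recurs in every example here, so it can be wrapped in a function. This is a minimal sketch rather than the author's code: the name get_html_text and the timeout value are assumptions.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None


# A URL without a scheme is rejected before any network access happens,
# so this call exercises the failure path offline.
print(get_html_text("item.jd.com/100004404944.html"))
```

Catching requests.RequestException instead of using a bare except keeps unrelated errors, such as a typo in the code, from being reported as a failed scrape.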
Example 2: Scraping an Amazon product page
import requests

class Demo02():
    url = "https://www.amazon.cn/dp/B000SACJ74/ref=lp_76564071_1_1?s=grocery&ie=UTF8&qid=1569154117&sr=1-1"
    try:
        # Send the request to Amazon with a browser-like User-Agent header
        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[1000:2000])
    except requests.RequestException:
        print("Scrape failed")
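The header override matters because requests identifies itself by default. The sketch below compares the default User-Agent with the overridden one without contacting Amazon. The shortened URL is a placeholder for illustration.

```python
import requests

# Headers that requests sends when none are supplied.
default_headers = requests.utils.default_headers()
print(default_headers["User-Agent"])

# Overriding the header, as the example does with its kv dictionary.
# The request is only prepared, never sent.
req = requests.Request("GET", "https://www.amazon.cn/",
                       headers={"user-agent": "Mozilla/5.0"}).prepare()
print(req.headers["User-Agent"])
```

The first line printed starts with python-requests/, which a site can easily recognize as a script. The second shows the value the server would actually see.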
Example 3: Submitting a search keyword to Baidu and checking how much content comes back
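import requests

class Demo03():
    keyword = "库里"
    try:
        # params is appended to the URL as the query string ?wd=<keyword>
        kv = {'wd': keyword}
        r = requests.get("http://www.baidu.com/s", params=kv)
        # Print the URL that was actually requested
        print(r.request.url)
        r.raise_for_status()
        # Length of the returned page text, in characters
        print(len(r.text))
    except requests.RequestException:
        print("Scrape failed")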
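The URL printed by r.request.url contains the keyword in percent-encoded form. As a rough illustration of that encoding step, the sketch below uses the standard library's urlencode, not requests itself, so it runs offline.

```python
from urllib.parse import urlencode

keyword = "库里"
# Non-ASCII characters are encoded as UTF-8 bytes, then percent-escaped.
query = urlencode({"wd": keyword})
print("http://www.baidu.com/s?" + query)
```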
Example 4: Scraping and saving an image from the web
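import os
import requests

class Demo04():
    url = "http://img.caixin.com/2018-06-20/1529477438516178_480_320.jpg"
    # Directory where the file is saved
    root = "E:/code/python/firstpython/demo2"
    # Add the separator explicitly so the file lands inside root
    path = root + "/" + url.split('/')[-1]
    try:
        if not os.path.exists(root):
            # makedirs also creates any missing parent directories
            os.makedirs(root)
        if not os.path.exists(path):
            r = requests.get(url)
            r.raise_for_status()
            # Write the response body in binary mode; the with block closes the file
            with open(path, 'wb') as f:
                f.write(r.content)
            print("File saved")
        else:
            print("File already exists")
    except (requests.RequestException, OSError):
        print("Scrape failed")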
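Building the save path is the step most likely to go wrong in this example, because joining a directory and a file name by plain string concatenation drops the separator between them. This small sketch shows the difference. The helper name local_path_for is made up for illustration.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, root):
    """Build the local file path for a downloaded URL (hypothetical helper)."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(root, filename)


url = "http://img.caixin.com/2018-06-20/1529477438516178_480_320.jpg"
root = "E:/code/python/firstpython/demo2"

# Plain concatenation glues the file name onto the directory name.
print(root + url.split("/")[-1])
# os.path.join inserts the separator between directory and file name.
print(local_path_for(url, root))
```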
Example 5: Automatically looking up the location of an IP address
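import requests

class Demo05():
    url = "http://m.ip138.com/ip.asp?ip="
    try:
        # Append the IP address to be looked up to the query URL
        r = requests.get(url + '202.204.80.112')
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        # Print the last 500 characters of the page
        print(r.text[-500:])
    except requests.RequestException:
        print("Scrape failed")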
Iterating over all ancestor nodes of the a tag
>>> for parent in soup.a.parents:
        if parent is None:
            print(parent)
        else:
            print(parent.name)

p
body
html
[document]
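The session above uses a soup object whose construction is not shown. A self-contained sketch is given below, assuming BeautifulSoup from the bs4 package with the built-in html.parser and a tiny made-up document.

```python
from bs4 import BeautifulSoup

# A tiny made-up document so that `soup` is defined.
html = '<html><body><p><a href="http://example.com/">link</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <a> tag to the document root.
names = [parent.name for parent in soup.a.parents]
print(names)
```

The last ancestor is the BeautifulSoup object itself, whose name is [document], which is why that entry appears at the end of the output.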
Sibling traversal of the tag tree
Condition: sibling traversal only takes place among nodes that share the same parent node.
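A short sketch of sibling traversal follows, again assuming bs4 with html.parser and a made-up fragment. Siblings can be plain text as well as tags, so the loop checks the node type before reading its name.

```python
from bs4 import BeautifulSoup, Tag

# All three tags and the text " and " share the same parent <p>.
html = "<p><a>one</a> and <b>two</b><i>three</i></p>"
soup = BeautifulSoup(html, "html.parser")

for sibling in soup.a.next_siblings:
    if isinstance(sibling, Tag):
        print("tag:", sibling.name)
    else:
        print("text:", repr(str(sibling)))

# previous_sibling steps in the opposite direction.
print(soup.i.previous_sibling.name)
```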