[Beauty of Programming, Issue 02] How to Buy Your Girlfriend the Snacks She Really Wants


Double 11 is almost upon us, woo-hoo~~

Time to shop till your hands fall off!

Wait, aren't we forgetting something?

Double 11 originally stood for Singles' Day!!

So, as a great programmer, what an opportunity! If you have a girlfriend this Double 11, make sure you make her happy; if you don't, create the conditions and win one over!

Snacks are something girls have always loved, so asking out your goddess and keeping her happy is a problem you can solve with a programmer's built-in skill set!

Web crawlers can solve a lot of practical problems for us! Python is the go-to language here: easy to pick up and learn, and very practical too!


"Beauty of Programming: making you fall in love with the beauty of code."


Take on the programming challenge below
and experience the fun of coding together!

This issue's challenge:

Use your code to crawl Tmall, JD, or any other e-commerce site for the snacks with the highest purchase rates, and rescue your Double 11! You can search Taobao for snacks whose listings mention "girlfriend", go straight for the best-selling snacks, or get the data through whatever channel you like!
Either way, what really counts is hands-on practice!
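To lower the barrier a little, here is a rough Python starting sketch. Everything in it is an assumption you will need to adapt: the example-shop.com search URL is a placeholder, the CSS selectors are hypothetical, and real sites may require JS rendering, logins, or rate limiting, so treat it as a skeleton rather than a working crawler for any particular site.

# Minimal starter sketch (requires the third-party packages requests and beautifulsoup4).
# The search URL and every CSS selector below are placeholders/assumptions;
# inspect the site you actually target and adjust them.
import requests
from bs4 import BeautifulSoup

def search_snacks(keyword, page=1):
    url = "https://example-shop.com/search"  # placeholder search endpoint
    resp = requests.get(
        url,
        params={"q": keyword, "page": page},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    items = []
    for node in soup.select("li.product-item"):  # hypothetical item selector
        title = node.select_one(".title")
        price = node.select_one(".price")
        sales = node.select_one(".sales")  # e.g. an "N people paid" badge
        items.append({
            "title": title.get_text(strip=True) if title else "",
            "price": price.get_text(strip=True) if price else "",
            "sales": sales.get_text(strip=True) if sales else "",
        })
    return items

if __name__ == "__main__":
    for item in search_snacks("女朋友 零食"):
        print(item["sales"], item["price"], item["title"])

Sort by the sales field, keep the top few entries, and the shopping list writes itself.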

Each issue we will pick one "best code" Nowcoder from the replies and send out a Nowcoder gift pack! Entries are open for one week, which means anything posted before next Thursday counts~~

The pack includes:
Mouse pad
+
Exclusive programmer stickers
+
Custom Nowcoder T-shirt


Of course, the important thing is the process itself: practicing your coding skills, sharing code, and exchanging techniques. Along the way you will gain far more than just a little~

To help fellow Nowcoder users learn more efficiently and effectively, we have set up a group just for this: the Nowcoder Beauty of Programming source-code group, QQ 595665246. It is open only to people who genuinely want to take part in this column and learn, and source code will be shared in the group regularly. Only participants (those who post their own code under this column) may join; please note your Nowcoder nickname when you apply~



About the column
Beauty of Programming is a new column from Nowcoder: every week we post one project for everyone to practice on, discuss, and exchange ideas about.
If there is a project you would like to see implemented, feel free to message Niumei~
Also! Also! If you have a fun project idea, send it to Niumei; if it gets picked, there is a reward~~

If you regularly write a blog or run a WeChat public account, you are also welcome to add Niumei on QQ: 1037532015.
All comments
The code is on GitHub: https://github.com/cooljacket/SnackSpider. It works around the anti-crawling that requires JS rendering; a screenshot of the results is below:
Posted on 2016-11-04 10:24
var keyword = "d3.js";//@input(keyword, 查询关键字, 爬取该关键字搜索出来的京东商品) var comment_count = 100;//@input(comment_count, 爬取的评论数, 最多爬取多少条评论) var page_count = comment_count / 10; keyword = keyword.trim(); var scanUrls = []; scanUrls.push("http://search.jd.com/Search?keyword=" + keyword.replace(/ /g, "+") + "&enc=utf-8&scrolling=y&page=200"); var helperUrlRegexes = []; helperUrlRegexes.push("http://search\\.jd\\.com/Search\\?keyword=" + keyword.replace(/ /g, "\\+").replace(/\./g, "\\.") + "&enc=utf-8&scrolling=y&page=\\d+"); var configs = { domains: ["search.jd.com", "item.jd.com", "club.jd.com"], scanUrls: scanUrls, contentUrlRegexes: ["http://item\\.jd\\.com/\\d+.html"], helperUrlRegexes: helperUrlRegexes, interval: 10000, fields: [ { // 第一个抽取项 name: "title", selector: "//div[@id='name']/h1", required: true }, { // 第一个抽取项 name: "productid", selector: "//div[contains(@class,'fl')]/span[2]", required: true }, { name: "comments", selector: "//div[@id='comment-pages']/span", repeated: true, children: [ { name: "page", selector: "//text()" }, { name: "comments", sourceType: SourceType.AttachedUrl, attachedUrl: "http://club.jd.com/productpage/p-{$.productid}-s-0-t-3-p-{page}.html", selectorType: SelectorType.JsonPath, selector: "$.comments", repeated: true, children:[ { name: "com_content", selectorType: SelectorType.JsonPath, selector: "$.content" }, { name: "com_nickname", selectorType: SelectorType.JsonPath, selector: "$.nickname" } ] } ] } ] }; configs.afterDownloadPage = function(page, site) { var matches = /item\.jd\.com\/(\d+)\.html/.exec(page.url); if (!matches) return page; var commentUrl = "http://club.jd.com/productpage/p-"+matches[1]+"-s-0-t-3-p-0.html"; var result = site.requestUrl(commentUrl); var data = JSON.parse(result); var commentCount = data.productCommentSummary.commentCount; var pages = commentCount / 10; if (pages > page_count) pages = page_count; var pageHtml = "<div id=\"comment-pages\">"; for (var i = 0; i < pages; i++) { pageHtml += "<span>" + i + "</span>"; } pageHtml += "</div>"; var index = page.raw.indexOf("</body>"); page.raw = page.raw.substring(0, index) + pageHtml + page.raw.substring(index); return page; }; var dataSku = 0; configs.onProcessHelperPage = function(page, content, site) { var num = parseInt(extract(content, "//*[@id='J_goodsList']/ul/li[1]/@data-sku")); if (dataSku === 0) { dataSku = isNaN(num) ? 0 : num; } else if (dataSku === num) { dataSku = 0; return false; } var currentPageNum = parseInt(page.url.substring(page.url.indexOf("&page=") + 6)); if (currentPageNum === 0) { currentPageNum = 1; } var pageNum = currentPageNum + 1; var nextUrl = page.url.replace("&page=" + currentPageNum, "&page=" + pageNum); site.addUrl(nextUrl); return true; }; configs.afterExtractPage = function(page, data) { if (data.comments === null || data.comments === undefined) return data; var comments = []; for (var i = 0; i < data.comments.length; i++) { var p = data.comments[i]; for (var j = 0; j < p.comments.length; j++) { comments.push(p.comments[j]); } } data.comments = comments; return data; }; var crawler = new Crawler(configs); crawler.start(); //牛妹双11快乐,大家双11快乐(*^▽^*)
点赞 回复 分享
发布于 2016-11-03 19:34
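A side note on the config above: the club.jd.com/productpage/... URL it attaches returns plain JSON, so the same comment data can also be fetched without a crawler framework at all. Here is a minimal Python sketch of that idea; the product id is made up, and the field names (comments, content, nickname, productCommentSummary.commentCount) are simply taken from the config above, so treat the whole thing as an assumption that the 2016-era endpoint still behaves this way.

# Sketch: pull one page of JD product comments straight from the JSON endpoint
# referenced in the crawler config above. The product id 1217500 is a made-up
# example; the URL format and field names are assumptions based on that config.
import json
import requests

def fetch_comments(product_id, page=0):
    url = "http://club.jd.com/productpage/p-%s-s-0-t-3-p-%d.html" % (product_id, page)
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    data = json.loads(resp.text)
    total = data.get("productCommentSummary", {}).get("commentCount", 0)
    comments = [(c.get("nickname"), c.get("content")) for c in data.get("comments", [])]
    return total, comments

if __name__ == "__main__":
    total, comments = fetch_comments("1217500")
    print("total comments:", total)
    for nickname, content in comments:
        print(nickname, ":", content)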
Project analysis and write-up: http://blog.csdn.net/zhyh1435589631/article/details/53053949
Environment: pyspider + CentOS 7 + MySQL 5.5
pyspider code (excerpt):

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2016-11-05 23:18:55
# Project: taobao_food

from pyspider.libs.base_handler import *
import re
import json
import MySQLdb


class Handler(BaseHandler):
    # database connection settings
    def __init__(self):
        db_host = "127.0.0.1"
        user = "root"
        passwd = "zhyh2010"
        db = "taobao_food"
        charset = "utf8"
        conn = MySQLdb.connect(host=db_host, user=user, passwd=passwd, db=db, charset=charset)
        conn.autocommit(True)
        self.db = conn.cursor()

    # seed url for the crawl
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://tce.taobao.com/api/mget.htm?callback=jsonp221&tce_sid=659631&tce_vid=8,2&tid=,&tab=,&topic=,&count=,&env=online,online',
                   callback=self.json_parser)

    # strip the jsonp wrapper and return the raw json string
    @config(age=24 * 60 * 60)
    def select_json(self, response):
        content = response.text
        pattern = re.compile(r'window.jsonp.*?\((.*?)\)', re.S)
        content_select = re.findall(pattern, content)
        return content_select[0].strip()

    # pick out the fields we need and insert them into the database table
    def product_info(self, response):
        for data in response["result"]:
            res = {
                "item_pic": "https:" + data["item_pic"],
                "item_youhui_price": data["item_youhui_price"],
                "item_title": data["item_title"]
            }
            sql = "insert into food_info(url, price, title) values (%s,%s,%s)"
            values = [(res["item_pic"], res["item_youhui_price"], res["item_title"])]
            self.db.executemany(sql, values)

    # parse the json payload
    @config(age=24 * 60 * 60)
    def json_parser(self, response):
        content = self.select_json(response)
        contents = json.loads(content)
        subres = contents["result"]
        for each in contents["result"]:
            info = self.product_info(subres[each])
Posted on 2016-11-06 13:12
Joining the fun too: scraped it directly with python + selenium + chrome. Results + code below.

#coding=utf-8
# Tested on: ubuntu 14.04, python 2.7.6, chrome
# Requires selenium and chromedriver
from selenium import webdriver
from selenium.webdriver.common.by import By
import sys

GBK = 'gbk'


def logToFile(tag, msg):
    logFile = open('log', 'w')
    out = tag + '--\n' + msg
    logFile.write(out)


def log(tag, msg):
    print tag, ' -- '
    print msg


def defLog(msg):
    log('out', msg)


# holds the info for one snack item
class Item:
    def __init__(self):
        self.CODE = 'utf-8'


# writes the results out to a markdown file
class MarkdownWriter:
    def __init__(self, name='out.md'):
        mdFile = open(name, 'w')
        self.mdFile = mdFile

    def writeContent(self, content):
        self.mdFile.write(content)

    def writeItems(self, title, items):
        # assemble the markdown
        content = '### ' + title + ' \n'
        for item in items:
            content += '#### ' + item.title + ' \n'
            content += '![img](' + item.img + ') \n'
            content += '[goto](' + item.url + ') \n'
            content += 'money: ' + item.money + ' \n'
            content += 'store: ' + item.store + ' \n'
            content += '\n\n'
        self.mdFile.write(content)


class TaoBaoSpider:
    def __init__(self):
        driver = webdriver.Chrome()
        self.driver = driver

    def getUrl(self, url):
        print 'start get ...'
        # let chrome load the url, including the js
        self.driver.get(url)
        print 'get finished ...'

    def getHtmlWithJs(self):
        return self.driver.page_source

    def getElements(self):
        print 'get item ...'
        els = self.driver.find_elements(By.CSS_SELECTOR, "li[class=' item item-border']")
        return els

    def getContent(self, element):
        item = Item()
        # pull the fields we need out of the rendered html and pack them into an Item
        item.img = element.find_element_by_tag_name('img').get_attribute('src')
        item.money = element.find_element(By.CSS_SELECTOR, "div[class='price child-component clearfix']").find_element_by_tag_name('strong').text
        titleElement = element.find_element(By.CSS_SELECTOR, "div[class='title child-component']").find_element_by_class_name('J_AtpLog')
        item.title = titleElement.text
        item.url = titleElement.get_attribute('href')
        item.store = element.find_element(By.CSS_SELECTOR, "div[class='seller child-component clearfix']").find_element_by_tag_name('a').text
        return item

    def start(self, url):
        self.url = url
        self.getUrl(url)
        els = self.getElements()
        items = []
        for e in els:
            item = self.getContent(e)
            items.append(item)
        return items


def main():
    # set the default encoding
    reload(sys)
    sys.setdefaultencoding('utf-8')
    url = 'https://world.taobao.com/search/search.htm?_ksTS=1478358034370_312&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E5%A5%B3%E6%9C%8B%E5%8F%8B%20%E9%9B%B6%E9%A3%9F&cna=Eg9NDplivkkCAXuCB323%2Fsy9&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'
    # run the crawler
    spider = TaoBaoSpider()
    items = spider.start(url)
    # write the results into a markdown file
    writer = MarkdownWriter('taobao.md')
    writer.writeItems('零食列表', items)


main()

For more detail, see: https://github.com/5A59/Practice
Posted on 2016-11-06 16:12
Didn't write a Taobao crawler. My wife won't let me buy anything anyway. Wrote a novel scraper instead.

package years.year2016.months11;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import years.year2016.months10.WebUtil;

public class WebDataGain {

    public static void main(String[] args) {
        WebDataGain w = new WebDataGain();
        String url = "http://www.biqugezw.com/3_3096/";
        String bookname = "一念永恒";
        w.downNovel_Biqugezw(url, bookname);
    }

    /**
     * Download a novel from biqugezw.com
     * @param url
     * @param bookName
     */
    public void downNovel_Biqugezw(String url, String bookName) {
        String url_root = "http://www.biqugezw.com";
        // connect to the site with Jsoup
        Document doc = null;
        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e2) {
            e2.printStackTrace();
        }
        // selector: use the #list container
        Elements elementList = doc.select("#list");
        String query = "a[href~=/[0-9]{1}_[0-9]{4}/.*html]";
        Elements elements = elementList.select(query);
        int size = elements.size();
        System.out.println(size);
        String fileName = "";
        int num = 0;
        int initnum = 371;
        for (int i = initnum; i < size; i++) {
            Element e = elements.get(i);
            String href = e.attr("href");
            String tempurl = url_root + href;
            System.out.println(tempurl);
            Document docInner = null;
            try {
                docInner = Jsoup.connect(tempurl).get();
            } catch (IOException e1) {
                e1.printStackTrace();
                System.out.println(fileName);
                System.out.println(i);
            }
            Elements elementsClass = docInner.select(".bookname");
            Elements elementsH = elementsClass.select("h1");
            String sectionkname = elementsH.text();
            System.out.println(sectionkname);
            Elements elementsContent = docInner.select("#content");
            String content = elementsContent.text();
            System.out.println(content);
            num = i % 20;
            if (num == 0 && i == 0) {
                fileName = "1-20章";
            } else if (num == 0 && i != 0) {
                fileName = i + "-" + (i + 20) + "章节";
            } else if (i == initnum) {
                int temp = initnum - num;
                fileName = temp + "-" + (temp + 20) + "章节";
            }
            try {
                WebUtil.downloadText(sectionkname + " " + content, bookName + "--" + fileName + ".txt",
                        WebUtil.getFileDir() + "//book//" + bookName + "//");
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }
    }
}
Posted on 2016-11-07 15:08
var keyword = "d3.js"; //@input(keyword, 查询关键字, 爬取该关键字搜索出来的京东商品) var comment_count = 100; //@input(comment_count, 爬取的评论数, 最多爬取多少条评论) var page_count = comment_count / 10; keyword = keyword.trim(); var scanUrls = []; scanUrls.push("http://search.jd.com/Search?keyword=" + keyword.replace(/ /g, "+") + "&enc=utf-8&scrolling=y&page=200"); var helperUrlRegexes = []; helperUrlRegexes.push("http://search\\.jd\\.com/Search\\?keyword=" + keyword.replace(/ /g, "\\+").replace(/\./g, "\\.") + "&enc=utf-8&scrolling=y&page=\\d+"); var configs = { domains: ["search.jd.com", "item.jd.com", "club.jd.com"], scanUrls: scanUrls, contentUrlRegexes: ["http://item\\.jd\\.com/\\d+.html"], helperUrlRegexes: helperUrlRegexes, interval: 10000, fields: [ { // 第一个抽取项 name: "title", selector: "//div[@id='name']/h1", required: true }, { // 第一个抽取项 name: "productid", selector: "//div[contains(@class,'fl')]/span[2]", required: true }, { name: "comments", selector: "//div[@id='comment-pages']/span", repeated: true, children: [ { name: "page", selector: "//text()" }, { name: "comments", sourceType: SourceType.AttachedUrl, attachedUrl: "http://club.jd.com/productpage/p-{$.productid}-s-0-t-3-p-{page}.html", selectorType: SelectorType.JsonPath, selector: "$.comments", repeated: true, children:[ { name: "com_content", selectorType: SelectorType.JsonPath, selector: "$.content" }, { name: "com_nickname", selectorType: SelectorType.JsonPath, selector: "$.nickname" } ] } ] } ] }; configs.afterDownloadPage = function(page, site) { var matches = /item\.jd\.com\/(\d+)\.html/.exec(page.url); if (!matches) return page; var commentUrl = "http://club.jd.com/productpage/p-"+matches[1]+"-s-0-t-3-p-0.html"; var result = site.requestUrl(commentUrl); var data = JSON.parse(result); var commentCount = data.productCommentSummary.commentCount; var pages = commentCount / 10; if (pages > page_count) pages = page_count; var pageHtml = "<div id=\"comment-pages\">"; for (var i = 0; i < pages; i++) { pageHtml += "<span>" + i + "</span>"; } pageHtml += "</div>"; var index = page.raw.indexOf("</body>"); page.raw = page.raw.substring(0, index) + pageHtml + page.raw.substring(index); return page; }; var dataSku = 0; configs.onProcessHelperPage = function(page, content, site) { var num = parseInt(extract(content, "//*[@id='J_goodsList']/ul/li[1]/@data-sku")); if (dataSku === 0) { dataSku = isNaN(num) ? 0 : num; } else if (dataSku === num) { dataSku = 0; return false; } var currentPageNum = parseInt(page.url.substring(page.url.indexOf("&page=") + 6)); if (currentPageNum === 0) { currentPageNum = 1; } var pageNum = currentPageNum + 1; var nextUrl = page.url.replace("&page=" + currentPageNum, "&page=" + pageNum); site.addUrl(nextUrl); return true; }; configs.afterExtractPage = function(page, data) { if (data.comments === null || data.comments === undefined) return data; var comments = []; for (var i = 0; i < data.comments.length; i++) { var p = data.comments[i]; for (var j = 0; j < p.comments.length; j++) { comments.push(p.comments[j]); } } data.comments = comments; return data; }; var crawler = new Crawler(configs); crawler.start(); //牛妹双11快乐,大家双11快乐(*^▽^*)
点赞 回复 分享
发布于 2016-11-04 10:40
I have no idea how to do this; I'll just copy it out and follow along...
Posted on 2016-11-05 19:00
https://github.com/huowolf/practice/blob/master/JDSpider/src/com/huowolf/service/SpiderService.java
Posted on 2016-11-07 20:26
What snacks does Niumei like?
Posted on 2016-11-02 20:32
First... you need to...
Posted on 2016-11-03 18:04
When will I be able to write something like this? I'll give it a try when I get home tomorrow.
Posted on 2016-11-05 18:43
The technical types all find girlfriends through their tech skills. Haha
Posted on 2016-11-06 09:18
Here's one approach: use phantomjs to do a meta-search.
Posted on 2016-11-10 14:48
var webpage = require('webpage'),
    page = webpage.create();

page.viewportSize = { width: 1024, height: 800 };
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
page.settings = {
    javascriptEnabled: false,
    loadImages: true,
    userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/19.0'
};

page.open('http://search.jd.com/Search?keyword=%E5%A5%B3%E6%9C%8B%E5%8F%8B%E9%9B%B6%E9%A3%9F&enc=utf-8&pvid=u4eabcvi.cns6qn', function (status) {
    var data;
    if (status === 'fail') {
        console.log('open page fail!');
    } else {
        page.render('./test.png');
        console.log('the mirror page has been saved');
    }
    // release the memory
    page.close();
});
Posted on 2016-11-10 20:14
Double 11 is over. This issue was a Double 11 theme and everyone's entries were excellent, so to give everyone a Double 11 to remember, every participant this issue gets a gift pack! Keep it up, because in the end learning is what matters most~~
Posted on 2016-11-11 16:33
I'm learning Python right now; hoping I can soon write the programs I want, just like the big shots above. Keep at it!
Posted on 2016-11-15 16:04
