Here's a solution using Scrapy. Take a look at the overview and you'll see it's a tool designed for exactly this kind of task:
- it's fast (built on top of Twisted)
- easy to use and understand
- built-in extraction mechanism based on XPath (though you can use BeautifulSoup or lxml too)
- built-in support for pipelines that export the extracted items to a database, XML, JSON, whatever (a minimal pipeline sketch follows this list)
- and lots of other features
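As an aside, here's a rough sketch (not part of the original answer's code) of what such an item pipeline could look like if you wanted to store the items produced by the spider below in SQLite; the file name billboard.db and table charts are made up for the example, and the class would be registered through the ITEM_PIPELINES setting of a Scrapy project:

import sqlite3

class SQLitePipeline(object):
    """Hypothetical pipeline: stores each scraped chart entry in billboard.db."""

    def open_spider(self, spider):
        # called once when the spider starts; create the table if it's missing
        self.conn = sqlite3.connect('billboard.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS charts (date TEXT, song TEXT, artist TEXT)')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # called for every item the spider yields
        self.conn.execute(
            'INSERT INTO charts VALUES (?, ?, ?)',
            (item['date'], item['song'], item['artist']))
        return item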
Here's a working spider that extracts everything you asked for (took me 15 minutes on my old laptop):
import datetime

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        # build one start URL per weekly chart, from 1958-08-09 up to 2013
        date = datetime.date(year=1958, month=8, day=9)
        self.start_urls = []
        while True:
            if date.year >= 2013:
                break
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]
        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # skip chart entries that are missing a song title or artist link
                continue
            yield item
Save it as billboard.py and run it with scrapy runspider billboard.py -o output.json. Then, in output.json you'll see:
...
{"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"}
{"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"}
{"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
{"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"}
{"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"}
{"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"}
{"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"}
{"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"}
{"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"}
{"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"}
...
Also, take a look at grequests as an alternative tool.
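As a rough illustration (not from the original answer), here's how the same weekly chart pages could be fetched concurrently with grequests; parsing is left out, and the URL pattern is the one used by the spider above:

import datetime
import grequests

BASE_URL = "http://www.billboard.com/charts/%s/hot-100"

# build the same list of weekly chart URLs the spider uses
urls = []
date = datetime.date(1958, 8, 9)
while date.year < 2013:
    urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
    date += datetime.timedelta(days=7)

# fire the requests concurrently; size caps the number of parallel connections
responses = grequests.map((grequests.get(url) for url in urls), size=10)
for response in responses:
    if response is not None:
        print(response.url + " -> " + str(response.status_code))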
Hope that helps.
[Scrapy](https://scrapy.readthedocs.org/) would perform much better; it's the tool for the job, for sure. Let me know if you want, and I'll write a sample spider for you. – alecxe
Improvements would include not using urllib2, not using regular expressions to parse HTML, and using multiple threads for your I/O. – roippi
I seriously doubt 'urllib2' has anything to do with any efficiency issues. All it does is make the request and pull down the response; 99.99% of the time is network time, and there's no other way to improve that. The problems are that (a) your parsing code might be slow, (b) you might be doing a lot of duplicate or unnecessary downloads, (c) you need to download in parallel (which you can do with 'urllib2'), (d) you need a faster network connection, or (e) billboard.com is throttling you. – abarnert
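As a rough sketch of point (c) in the comment above (an illustration, not code from the thread), the downloads could be parallelized with a thread pool while still using urllib2; the URL list here is just a stand-in for the full set of chart pages:

import urllib2
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def fetch(url):
    # one blocking download; errors are returned instead of raised so the pool keeps going
    try:
        return url, urllib2.urlopen(url).read()
    except urllib2.URLError as exc:
        return url, exc

# hypothetical list of chart URLs, built the same way as in the spider above
urls = ["http://www.billboard.com/charts/1958-08-09/hot-100",
        "http://www.billboard.com/charts/1958-08-16/hot-100"]

pool = Pool(10)                 # up to 10 concurrent downloads
results = pool.map(fetch, urls)
pool.close()
pool.join()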