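Avoid getting banned from a website when using Scrapy
I am trying to download data from gsmarena. Sample code for downloading the HTC One ME specs from the page "http://www.gsmarena.com/htc_one_me-7275.php" is given below.
The data on the site is organized into tables and table rows. The data has the following format: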
table header > td[@class='ttl'] > td[@class='nfo']
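To make that layout concrete, here is a minimal sketch that pulls (header, ttl, nfo) triples out of a small HTML fragment using only the standard library. The fragment, its values, and the SpecTableParser class are hypothetical and only imitate the structure described above; they are not copied from the site and are not part of the spider.

```python
from html.parser import HTMLParser

# Hypothetical fragment imitating the described layout; not copied from the site.
SAMPLE_HTML = """
<div id="specs-list">
<table>
<tr><th rowspan="2">Network</th><td class="ttl">Technology</td><td class="nfo">GSM / HSPA / LTE</td></tr>
<tr><td class="ttl">2G bands</td><td class="nfo">GSM 850 / 900 / 1800 / 1900</td></tr>
</table>
</div>
"""


class SpecTableParser(HTMLParser):
    """Collects (header, title, info) triples from th / ttl / nfo cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (header, ttl, nfo) triples
        self._cell = None   # kind of cell currently open: "th", "ttl" or "nfo"
        self._header = ""   # text of the most recent <th>
        self._ttl = ""      # text of the most recent ttl cell
        self._buf = []      # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "th":
            self._cell = "th"
        elif tag == "td" and css_class in ("ttl", "nfo"):
            self._cell = css_class
        else:
            return  # nested or unrelated tag: keep collecting into the open cell
        self._buf = []

    def handle_data(self, data):
        if self._cell is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag not in ("th", "td") or self._cell is None:
            return
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if self._cell == "th":
            self._header = text
        elif self._cell == "ttl":
            self._ttl = text
        else:  # an nfo cell completes a ttl/nfo pair
            self.rows.append((self._header, self._ttl, text))
        self._cell = None


parser = SpecTableParser()
parser.feed(SAMPLE_HTML)
for header, ttl, nfo in parser.rows:
    print(f"{header} | {ttl}: {nfo}")
```

The spider itself relies on Scrapy selectors instead; the sketch only shows how each ttl cell pairs with the nfo cell that follows it under a shared table header.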
Items.py file:
import scrapy

class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()
    phoneDetails = scrapy.Field()
    pass
Spider file:
from scrapy import Spider
from scrapy.selector import Selector

from gsmarena_data.items import gsmArenaDataItem


class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # extract whatever you want and yield items here
        hxs = Selector(response)
        phone = gsmArenaDataItem()
        details = []  # "title: info" pairs collected from every spec table
        tables = hxs.css("div#specs-list table")
        for table in tables:  # distinct loop variable, so the selector list is not shadowed
            # each table's <th> text overwrites the previous value
            phone['phoneName'] = table.xpath(".//th/text()").extract()[0]
            for ttl in table.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
        # join is a str method called on the separator; lists have no join()
        phone['phoneDetails'] = ", ".join(details)
        yield phone
However, I get banned as soon as I try to even load the page using the scrapy shell:
"http://www.gsmarena.com/htc_one_me-7275.php"
I even tried using DOWNLOAD_DELAY = 3 in settings.py.
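For reference, a minimal sketch of what such a settings.py fragment could look like. Only DOWNLOAD_DELAY = 3 comes from the attempt described above; the other entries are standard Scrapy settings included as assumptions about related knobs, and none of them is confirmed to prevent the ban on this site.

```python
# settings.py (fragment)

# The only value taken from the attempt above: wait 3 seconds between requests.
DOWNLOAD_DELAY = 3

# Everything below is an assumption for illustration, not something that was tried.
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the wait between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one request in flight per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 3         # initial delay used by AutoThrottle, in seconds
ROBOTSTXT_OBEY = True                # respect the site's robots.txt rules
```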
Please suggest how I should go about this.
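Interesting... I tried playing around on the site with a UserAgent switcher, and now I can't load any page at all! The site probably has a very strict policy of banning the IP addresses of users that show up with any UserAgent that indicates a crawler, so you may be temporarily unable to load it. – FBidu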
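The comment above points at the User-Agent header. As a rough illustration of where that header lives, the sketch below builds a request with an explicit User-Agent using only the standard library, without sending it. The BROWSER_LIKE_UA value and the build_request helper are hypothetical; whether any particular value avoids the ban is not known.

```python
from urllib.request import Request

# Hypothetical browser-like value, for illustration only.
BROWSER_LIKE_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent):
    """Build (but do not send) a request carrying the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.gsmarena.com/htc_one_me-7275.php", BROWSER_LIKE_UA)
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

In a Scrapy project the corresponding knob is the USER_AGENT setting in settings.py; by default Scrapy identifies itself as Scrapy in that header.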