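Problem following links with Scrapy
I'm trying to get my web crawler to follow the links it extracts from a web page. I'm using Scrapy. My spider scrapes data from the start page successfully, but it does not crawl any further. I believe the problem is in my rules section. I'm new to Scrapy. Thanks in advance for your help.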
I am scraping this website:
http://ballotpedia.org/wiki/index.php/Category:2012_challenger
The links I am trying to follow look like this in the page source:
/wiki/index.php/A._Ghani
or
/wiki/index.php/A._Keith_Carreiro
Here is the code for my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from ballot1.items import Ballot1Item


class Ballot1Spider(CrawlSpider):
    name = "stewie"
    allowed_domains = ["ballotpedia.org"]
    start_urls = [
        "http://ballotpedia.org/wiki/index.php/Category:2012_challenger"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=r'w+'), follow=True),
        Rule(SgmlLinkExtractor(allow=r'\w{4}/\w+/\w+'), callback='parse')
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('*')
        items = []
        for site in sites:
            item = Ballot1Item()
            item['candidate'] = site.select('/html/head/title/text()').extract()
            item['position'] = site.select('//table[@class="infobox"]/tr/td/b/text()').extract()
            item['controversies'] = site.select('//h3/span[@id="Controversies"]/text()').extract()
            item['endorsements'] = site.select('//h3/span[@id="Endorsements"]/text()').extract()
            item['currentposition'] = site.select('//table[@class="infobox"]/tr/td[@style="text-align:center; background-color:red;color:white; font-size:100%; font-weight:bold;"]/text()').extract()
            items.append(item)
        return items
Hey, thanks a lot. I'll try it right now. – 2013-02-12 00:52:19
Just tried the above change to the rules. It still only scrapes my start URL. – 2013-02-12 00:55:20
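A possible explanation for why only the start URL gets scraped, offered as a sketch rather than a confirmed diagnosis. The Scrapy documentation warns against using parse as a rule callback in a CrawlSpider, because CrawlSpider implements its own logic in parse; overriding it can stop the rules from being applied at all. The documentation also says that when several rules match the same link, only the first one is used. Separately, the patterns themselves can be checked outside Scrapy with plain re. The snippet below assumes the link extractor tests each allow pattern with a regex search against the absolute URL. CANDIDATE_PATTERN and the matches helper are hypothetical names introduced here for illustration; they are not from the original post.

```python
import re

# Patterns copied from the spider's rules in the question.
FOLLOW_PATTERN = r'w+'
ITEM_PATTERN = r'\w{4}/\w+/\w+'

# Hypothetical replacement (an assumption, not from the original post):
# match pages directly under /wiki/index.php/ and skip namespaced pages
# such as Category:..., which contain a colon.
CANDIDATE_PATTERN = r'/wiki/index\.php/[^:/]+$'

urls = [
    "http://ballotpedia.org/wiki/index.php/Category:2012_challenger",
    "http://ballotpedia.org/wiki/index.php/A._Ghani",
    "http://ballotpedia.org/wiki/index.php/A._Keith_Carreiro",
]


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


for url in urls:
    print(url)
    print("  first rule (w+)       :", matches(FOLLOW_PATTERN, url))
    print("  second rule (\\w{4}/..):", matches(ITEM_PATTERN, url))
    print("  candidate pattern     :", matches(CANDIDATE_PATTERN, url))
```

Under that assumption, the first pattern matches every URL that contains the letter w, so it would claim the candidate links before the second rule sees them, and the second pattern does not match the candidate URLs at all because of the dot in index.php. If this is indeed the cause, a single rule along the lines of Rule(SgmlLinkExtractor(allow=CANDIDATE_PATTERN), callback='parse_candidate'), with the parse method renamed to parse_candidate (a hypothetical name), may be worth trying.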