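I'm new to Scrapy and am trying to learn it by scraping yellowpages.com, but the crawl is not behaving the way I expect.
My goal is to write Python code that fills in the search fields on the yellowpages.com home page (business and location) and then scrapes the resulting URLs.
My code looks like this: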
import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from spider.items import Website


class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "query": "business",
                "location": "78735"},
            callback=self.after_results
        )

    def after_results(self, response):
        self.logger.info("info msg")
I want to search for "business" in the location "78735". However, those are not the values that get passed to the site. My log looks like this:
2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/> (referer: None)
2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/search?search_terms=&geo_location_terms=Los+Angeles%2C+CA&query=business&location=78735> (referer: http://www.yellowpages.com/)
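In the second URL, the term Los+Angeles%2C+CA is somehow inserted, and search_terms is empty. When I fill in the search fields by hand and submit, the URL looks like this, which is what I expect: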
http://www.yellowpages.com/search?search_terms=business&geo_location_terms=78735
Can anyone tell me what is going wrong and how to fix it?
Thanks a lot.
FYI, here is the search form markup from the yellowpages.com home page:
<div class="search-bar"><form id="search-form" action="/search" method="GET"><div><label><span>What do you want to find?</span><input id="query" type="text" value="" placeholder="What do you want to find?" autocomplete="off" data-onempty="recent-searches" name="search_terms" tabindex="1"/></label><ul id="recent-searches" class="search-dropdown recent-searches"><li class="search-hint">Search by<b> business name,</b> or<b> keyword</b></li></ul><ul id="autosuggest-term" data-analytics='{"moi":105}' class="search-dropdown autosuggest-term"></ul></div><em>near</em><div><label><span>Where?</span> <input id="location"type="text" value="78735" placeholder="Where?" autocomplete="off" data-onempty="menu-location" name="geo_location_terms" tabindex="2"/></label>
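One thing that stands out in the markup above: "query" and "location" are the id attributes of the two inputs, while their name attributes are search_terms and geo_location_terms. As far as I understand, FormRequest.from_response matches formdata keys against name, keeps any form field that is not overridden, and appends keys it does not recognise. The sketch below only imitates that merge with the standard library, so it is an illustration under that assumption and not Scrapy's actual code. The helper build_search_url and the FORM_DEFAULTS list are hypothetical, and the "Los Angeles, CA" default is taken from the log above.

```python
from urllib.parse import urlencode


def build_search_url(form_defaults, formdata,
                     action="http://www.yellowpages.com/search"):
    """Merge form defaults with overrides keyed by input name (hypothetical helper)."""
    # Keep every form field whose name is not overridden, then append the overrides.
    merged = [(name, value) for name, value in form_defaults
              if name not in formdata]
    merged.extend(formdata.items())
    return action + "?" + urlencode(merged)


# Assumed defaults, keyed by each input's name attribute (values taken from the log).
FORM_DEFAULTS = [("search_terms", ""), ("geo_location_terms", "Los Angeles, CA")]

# Keys copied from the id attributes, as in my spider: the defaults survive.
by_id = build_search_url(
    FORM_DEFAULTS, {"query": "business", "location": "78735"})

# Keys copied from the name attributes: the defaults are replaced.
by_name = build_search_url(
    FORM_DEFAULTS, {"search_terms": "business", "geo_location_terms": "78735"})

print(by_id)
print(by_name)
```

The first URL reproduces the one in my log and the second matches the URL I get when submitting by hand. If the assumption holds, changing the formdata keys in the spider to "search_terms" and "geo_location_terms" would be the thing to try.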