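Scraping dynamic content from multiple start URLs using Selenium and Scrapy

I have been tasked with building a scraper for a property site, with the results stored for later processing. The site in question is a national site that will not return all of its listings in a single search; it expects a region to be supplied before it shows results. To get around this, I wrote a scraper with Scrapy that uses multiple start URLs, each taking me straight to a region I am interested in. The site is also populated dynamically, so I use Selenium to render the JavaScript on the page and then follow the "next" button until the scraper has worked through every page of each region.

This works when there is only a single start URL, but as soon as there is more than one URL I run into a problem. The scraper starts out fine, but before the webdriver has finished following the "next" button to the end of a region (a single region may have 20 pages, for example), the scraper moves on to the next region (start URL), so only part of the first region's content is scraped.

I have searched extensively for a solution, but I have not yet seen anyone with this particular problem. Any suggestions would be most welcome. Example code below: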
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from selenium_spider.items import DemoSpiderItem


class DemoSpider(CrawlSpider):
    name = "Demo"
    allowed_domains = ['example.com']
    start_urls = ["http://www.example.co.uk/locationIdentifier=REGION 1234",
                  "http://www.example.co.uk/property-for-sale/locationIdentifier=REGION 5678"]

    def __init__(self, *args, **kwargs):
        super(DemoSpider, self).__init__(*args, **kwargs)
        # A single Firefox instance is shared by every start URL.
        self.driver = webdriver.Firefox()

    def __del__(self):
        # Close the browser started in __init__.
        self.driver.quit()

    def parse(self, response):
        # Load the region page in the browser so its JavaScript is rendered.
        self.driver.get(response.url)
        source = 'aTest'
        while True:
            try:
                # Wait for a clickable "next" button; a timeout ends this region,
                # so a page without a clickable button is not scraped.
                element = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, ".pagination-button.pagination-direction.pagination-direction--next"))
                )
                # Select from the page as rendered by the browser.
                result = Selector(text=self.driver.page_source).xpath('//*[@class="l-searchResults"]')
                print("Scraping new site --------------->", result)
                print("This is the result----------->", result)
                for properties in result:
                    # Relative XPaths (.//) keep each lookup inside this result block.
                    saleOrRent = properties.xpath('.//*[@class="property-title"]/text()').extract()
                    addresses = properties.xpath('.//*[@class="property-address"]/text()').extract()
                    if saleOrRent:
                        saleOrRent = saleOrRent[0]
                        if 'for sale' in saleOrRent:
                            saleOrRent = 'For Sale'
                        elif 'to rent' in saleOrRent:
                            saleOrRent = 'To Rent'
                    for a in addresses:
                        item = DemoSpiderItem()
                        item["saleOrRent"] = saleOrRent
                        item["source"] = source
                        item["address"] = a
                        item["response"] = response
                        yield item
                # Move the shared browser on to the next page of this region.
                element.click()
            except TimeoutException:
                break
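
A toy model may help show how the symptom described above could arise. It assumes, without confirmation from the post, that the two parse() generators are consumed in turns while both drive the same browser. FakeDriver, parse_lazy, parse_eager and interleave are hypothetical names used only for illustration; no real Selenium or Scrapy calls are involved, and the round-robin consumer is a rough stand-in rather than Scrapy's actual scheduling.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser: it only remembers the current page."""

    def __init__(self, pages_per_region):
        self.pages_per_region = pages_per_region  # e.g. {"A": 3, "B": 2}
        self.region = None
        self.page = 0

    def get(self, region):
        # Navigating to a region resets the browser to that region's first page.
        self.region, self.page = region, 1

    def has_next(self):
        return self.page < self.pages_per_region[self.region]

    def click_next(self):
        self.page += 1

    def listings(self):
        return ["%s/page%d" % (self.region, self.page)]


def parse_lazy(driver, region):
    """Yields page by page, so the shared driver can be moved between yields."""
    driver.get(region)
    while True:
        for listing in driver.listings():
            yield listing
        if not driver.has_next():
            break
        driver.click_next()


def parse_eager(driver, region):
    """Walks every page of one region before handing anything back."""
    driver.get(region)
    collected = []
    while True:
        collected.extend(driver.listings())
        if not driver.has_next():
            break
        driver.click_next()
    return collected


def interleave(generators):
    """Round-robin consumer: takes one value from each generator in turn."""
    pending = list(generators)
    out = []
    while pending:
        for gen in list(pending):
            try:
                out.append(next(gen))
            except StopIteration:
                pending.remove(gen)
    return out
```

With a region "A" of 3 pages and a region "B" of 2 pages, interleave([parse_lazy(d, "A"), parse_lazy(d, "B")]) never reaches pages 2 and 3 of "A", because the second generator navigates the shared driver away. Calling parse_eager once per region returns all five pages. If the assumption holds for the real spider, finishing one region before yielding, or giving each region its own driver, would be one way to avoid the clash.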
I have the same problem! Did you ever find a solution? I am looking for one at the moment too, and I will let you know if I come across anything.