Using Scrapy with Selenium

I want to scrape all the exhibitors from this dynamic page:

https://greenbuildexpo.com/Attendee/Expohall/Exhibitors

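But Scrapy does not load the page content, so what I am doing now is loading the page with Selenium and then searching for the links with Scrapy: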

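from scrapy.http import TextResponse
from selenium import webdriver

url = 'https://greenbuildexpo.com/Attendee/Expohall/Exhibitors'

# Let Selenium render the page, then grab the resulting HTML.
driver_1 = webdriver.Firefox()
driver_1.get(url)
content = driver_1.page_source

# Wrap the HTML in a Scrapy response so XPath selectors can be used on it.
response = TextResponse(url=url, body=content, encoding='utf-8')
# Count the distinct links that contain "Attendee/".
print(len(set(response.xpath('//*[contains(@href,"Attendee/")]//@href').extract())))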

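The site does not seem to make any new request when the "next" button is pressed, so I expected to get all the links at once, but I only get 43 links with this code. There should be around 500.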

Now I am trying to scrape the pages by clicking the "next" button:

for i in range(10): 
    xpath = '//*[@id="pagingNormalView"]/ul/li[15]' 
    driver_1.find_element_by_xpath(xpath).click()

However, I get an error:

File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 192, in check_response 
    raise exception_class(message, screen, stacktrace) 
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"//*[@id=\"pagingNormalView\"]/ul/li[15]"} 
Stacktrace:

Source

2016-10-13 Luis Ramon Ramirez Rodriguez

You don't need Selenium for that. There is an XHR request that returns all the exhibitors, so simulate it. Here is a demo from the Scrapy shell:

$ scrapy shell https://greenbuildexpo.com/Attendee/Expohall/Exhibitors 
In [1]: fetch("https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors") 
2016-10-13 12:45:46 [scrapy] DEBUG: Crawled (200) <GET https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors> (referer: None) 

In [2]: import json 

In [3]: data = json.loads(response.body) 

In [4]: len(data["Data"]) 
Out[4]: 541 

# printing booth number for demonstration purposes 
In [5]: for item in data["Data"]: 
    ...:  print(item["BoothNumber"]) 
    ...: 
2309 
2507 
... 
1243 
2203 
943
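
Outside the shell, the same parsing step could be wrapped in a small helper. This is only a minimal sketch: the function name `extract_booth_numbers` and the sample payload are hypothetical, and the only things taken from the session above are the "Data" and "BoothNumber" keys. It assumes the endpoint keeps returning JSON in that shape.

```python
import json


def extract_booth_numbers(body):
    """Return the BoothNumber of every entry under "Data" in a JSON payload."""
    # Scrapy's response.body is bytes on Python 3, so decode before parsing.
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    payload = json.loads(body)
    # "Data" and "BoothNumber" are the keys seen in the shell session above;
    # .get() keeps the sketch from crashing if a key is missing.
    return [item.get("BoothNumber") for item in payload.get("Data", [])]


# Hypothetical sample payload for illustration only (not real site data).
sample = b'{"Data": [{"BoothNumber": "100"}, {"BoothNumber": "200"}]}'
print(extract_booth_numbers(sample))
```

In a spider, a callback for the GetAllExhibitors request could presumably call `extract_booth_numbers(response.body)` and yield one item per entry.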

Source

2016-10-13 16:47:31 alecxe
