2016-12-15 · 35 views

Curious Beautiful Soup 4 error when iterating in a loop

I want to scrape a website that loads its data via AJAX. I iterate over a list of URLs with a for loop. Here is my code:

import requests 
from bs4 import BeautifulSoup 
from selenium import webdriver 
import pandas as pd 
import pdb 

listUrls = ['https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya',
            'https://www.flipkart.com/samsung-galaxy-on8-gold-16-gb/p/itmemvarkqg5dyay'] 
PHANTOMJS_PATH = './phantomjs' 
browser = webdriver.PhantomJS(PHANTOMJS_PATH) 

for url in listUrls: 
    browser.get(url) 
    soup = BeautifulSoup(browser.page_source, "html.parser") 
    labels = soup.findAll('li', {'class':"_1KuY3T row"}) 
    print labels 

When I run this code, I get results for the first URL, but the second prints an empty list. I tried printing the soup for both URLs and that worked. The error only shows up when I print the labels: the labels for the first URL are printed, but the second list is empty.

[<truncated>...Formats</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">MP3</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Capacity</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">3300 mAh</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Type</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Li-Ion</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Width</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">75 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Height</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">151.7 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>] 
[] 

Image: Result when I print labels in the loop

I debugged this further with the interactive debugger pdb, and something odd happened: when I set a trace before printing the labels and step through the loop one line at a time, it prints the list of labels for the second URL as well.

for url in listUrls: 
    browser.get(url) 
    soup = BeautifulSoup(browser.page_source, "html.parser") 
    labels = soup.findAll('li', {'class':"_1KuY3T row"}) 
    pdb.set_trace() 
    print labels 

...

[<truncated>..."vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>] 
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(12)<module>() 
-> for url in listUrls: 
(Pdb) n 
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(13)<module>() 
-> browser.get(url) 
(Pdb) n 
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(15)<module>() 
-> soup = BeautifulSoup(browser.page_source, "html.parser") #put all html in soup 
(Pdb) n 
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(16)<module>() 
-> labels = soup.findAll('li', {'class':"_1KuY3T row"}) 
(Pdb) n 
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(17)<module>() 
-> pdb.set_trace() 
(Pdb) 
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(18)<module>() 
-> print labels 
(Pdb) n 
[<li class="_1KuY3T row"><div class="vmXPri col col-3-12">Sales Package</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Handset, Adapter, Earphone, User Manual</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Model Number</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">J710FZDGINS</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Model Name</...<truncated>] 

Image: Result when I run the code with stack trace

I also tested the loop with each URL individually, and it works fine. I am new to programming and am at a loss here; I would really appreciate an explanation of why this happens. Thanks!


It would help if you could add the results/stack trace to the question as text rather than as images. –


@TeemuRisikko Done. – dontpanic

Answer

The fact that it works while debugging is itself a sign that this is a timing issue. When you step through in the debugger, you essentially give the page more time to load, which is why the labels print correctly.

What you need to do to make things more reliable is add an Explicit Wait: wait for at least one label to be present on the page:

from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 

# ... 

for url in listUrls: 
    browser.get(url) 

    # wait for labels to be present/rendered 
    wait = WebDriverWait(browser, 20) 
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "li._1KuY3T.row"))) 

    soup = BeautifulSoup(browser.page_source, "html.parser") 
    labels = soup.select("li._1KuY3T.row") 
    print(labels) 
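Under the hood, `WebDriverWait(...).until(...)` is essentially a polling loop: it re-checks a condition at short intervals until it becomes truthy or a timeout expires. A minimal stdlib sketch of that idea (the `poll_until` helper, its defaults, and the fake `labels_rendered` condition are illustrative, not Selenium's actual implementation):

```python
import time

def poll_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError if `timeout` seconds
    pass first -- the same contract WebDriverWait.until offers.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)

# Example: the condition becomes truthy on the third call,
# mimicking a page whose labels render after a short delay.
calls = {"n": 0}
def labels_rendered():
    calls["n"] += 1
    return ["label"] if calls["n"] >= 3 else []

print(poll_until(labels_rendered, timeout=5.0, interval=0.01))  # → ['label']
```

This is why stepping through in pdb "fixed" the bug: the pauses played the role of the polling loop, giving the AJAX content time to appear.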

Thanks @alecxe! This worked. But it's not obvious to me: if the loop worked for the first URL, why do I need to add an explicit wait for it to work for the second? – dontpanic


@dontpanic Well, if you ran the code, say, a hundred times, I bet you would see it fail on the first URL too. The key point is that the wait makes the code reliable: instead of assuming an element has been rendered by a certain point, you explicitly wait for it. Consider accepting the answer if it resolved the issue, thanks. – alecxe