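Beautiful Soup 4: curious error when iterating in a loop
I want to scrape a website that loads its data via AJAX. I want to do this for a series of URLs that I have put in a list, iterating over them with a for loop. This is my code: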
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import pdb
listUrls = ['https://www.flipkart.com/samsung-galaxy-nxt-gold-32-gb/p/itmemzd4gepexjya','https://www.flipkart.com/samsung-galaxy-on8-gold-16-gb/p/itmemvarkqg5dyay']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    labels = soup.findAll('li', {'class':"_1KuY3T row"})
    print labels
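When I run this code, I get results for the first URL, but the second shows an empty list. I tried printing the soup for both URLs, and that works. The problem only shows up when I print labels: the labels for the first URL are printed, but the list for the second is empty.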
[<truncated>...Formats</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">MP3</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Capacity</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">3300 mAh</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Battery Type</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Li-Ion</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Width</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">75 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Height</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">151.7 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>]
[]
Image: Result when I print labels in a loop
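I debugged this further using the interactive debugger module pdb, and a strange thing happened: when I add a trace (pdb.set_trace()) before printing labels and step through the loop one step at a time, it prints the list of labels for the second URL as well.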
for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    labels = soup.findAll('li', {'class':"_1KuY3T row"})
    pdb.set_trace()
    print labels
...
[<truncated>..."vmXPri col col-3-12">Depth</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">8 mm</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Warranty Summary</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">1 Year Manufacturer Warranty</li></ul></li>]
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(12)<module>()
-> for url in listUrls:
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(13)<module>()
-> browser.get(url)
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(15)<module>()
-> soup = BeautifulSoup(browser.page_source, "html.parser") #put all html in soup
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(16)<module>()
-> labels = soup.findAll('li', {'class':"_1KuY3T row"})
(Pdb) n
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(17)<module>()
-> pdb.set_trace()
(Pdb)
> /Users/aamnasimpl/Desktop/Scraper/web-scraper.py(18)<module>()
-> print labels
(Pdb) n
[<li class="_1KuY3T row"><div class="vmXPri col col-3-12">Sales Package</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">Handset, Adapter, Earphone, User Manual</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Model Number</div><ul class="_3dG3ix col col-9-12"><li class="sNqDog">J710FZDGINS</li></ul></li>, <li class="_1KuY3T row"><div class="vmXPri col col-3-12">Model Name</...<truncated>]
Image: Result when I run the code with stack trace
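I also checked each URL individually, outside the loop, and it works fine. I am new to programming and at a loss right now, so I would really appreciate any insight into why this is happening. Thanks!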
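One possible reading of the behaviour above, which the post does not confirm, is a timing issue: stepping through pdb gives the AJAX request time to finish before page_source is read, while the plain loop reads it too early. The sketch below illustrates polling the page source until the expected rows appear. The helper wait_for_rows and the FakeBrowser class are hypothetical names introduced only for illustration; FakeBrowser stands in for the PhantomJS driver so the example runs without a browser.

```python
import time


def wait_for_rows(get_source, marker, timeout=10.0, interval=0.5):
    """Poll get_source() until `marker` appears in the returned HTML.

    Returns the HTML that contains the marker, or raises RuntimeError
    if the marker has not appeared within `timeout` seconds.
    """
    deadline = time.time() + timeout
    while True:
        html = get_source()
        if marker in html:
            return html
        if time.time() >= deadline:
            raise RuntimeError("timed out waiting for %r" % marker)
        time.sleep(interval)


class FakeBrowser(object):
    """Hypothetical stand-in for the real driver: the spec rows only
    appear in page_source from the third read onward, imitating AJAX."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return "<html><body></body></html>"
        return '<html><body><li class="_1KuY3T row">Depth</li></body></html>'


browser = FakeBrowser()
# Keep re-reading the page source until the class used in the question shows up.
html = wait_for_rows(lambda: browser.page_source, "_1KuY3T row",
                     timeout=5.0, interval=0.01)
```

With the real driver, the same call would go right after browser.get(url), and the returned html would be passed to BeautifulSoup in place of browser.page_source. Selenium also ships an explicit-wait facility, WebDriverWait in selenium.webdriver.support.ui, which may be a better fit than a hand-rolled loop.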
It would help if you could add the results/stack trace to the question as text rather than as images. –
@TeemuRisikko Done. – dontpanic