Getting the next page URL in Python

I'm trying to scrape all the URLs from a website. It has 5 categories, and each category has a different number of pages (10 articles per page).

For example:

Categories Pages 
Banana   5 
Apple   14 
Cherry   7 
Melon   6 
Berry   2

Code:

import requests 
from bs4 import BeautifulSoup 
import re 
from urllib.parse import urljoin 


res = requests.get('http://www.abcde.com/SearchParts') 
soup = BeautifulSoup(res.text,"lxml") 
href = [ a["href"] for a in soup.findAll("a", {"id" : re.compile("parts_img.*")})] 
b1 =[] 
for url in href: 
    b1.append("http://www.abcde.com"+url) 
print (b1)

From the main page "http://www.abcde.com/SearchParts" I can scrape the URL of the first page of each category. b1 is the list of those first-page URLs.
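The code above imports urljoin but never uses it. As a minimal sketch, this is how it could replace the string concatenation, assuming the scraped hrefs are root-relative paths such as "/A" (the actual href values on the site are not shown in the question, and the helper name absolutize is only illustrative):

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url and return absolute URLs."""
    # urljoin handles root-relative, relative, and already-absolute hrefs,
    # so no manual "http://host" + path concatenation is needed.
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical hrefs, standing in for the values scraped from the main page.
b1 = absolutize("http://www.abcde.com/SearchParts", ["/A", "/B"])
print(b1)
```

The same call would also work for the "next" links on later pages, using the URL of the page that was just fetched as base_url.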

Like this:

Categories Pages      url 
Banana   1  http://www.abcde.com/A 
Apple   1  http://www.abcde.com/B 
Cherry   1  http://www.abcde.com/C 
Melon   1  http://www.abcde.com/E 
Berry   1  http://www.abcde.com/F

Then I use the page source of each URL in b1 to scrape the URL of the next page. So b2 is the list of second-page URLs.

Code:

b2=[] 
for url in b1: 
    res2 = requests.get(url).text 
    soup2 = BeautifulSoup(res2,"lxml") 
    url_n=soup2.find('',rel = 'next')['href'] 
    b2.append("http://www.abcde.com"+url_n) 
print(b2)

Like this:

Categories Pages      url
Banana   1  http://www.abcde.com/A/s=1&page=2
Apple   1  http://www.abcde.com/B/s=9&page=2
Cherry   1  http://www.abcde.com/C/s=11&page=2
Melon   1  http://www.abcde.com/E/s=7&page=2
Berry   1  http://www.abcde.com/F/s=5&page=2

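Now, when I try to do the same for the third page, I get an error, because Berry's second page is its last page and has no "next" link in its source. How should I handle this, especially since each category has a different number of pages/URLs?

The whole code (up to the point where it hits the error):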

import requests 
from bs4 import BeautifulSoup 
import re 
from urllib.parse import urljoin 


res = requests.get('http://www.ca2-health.com/frontend/SearchParts') 
soup = BeautifulSoup(res.text,"lxml") 
href = [ a["href"] for a in soup.findAll("a", {"id" : re.compile("parts_img.*")})] 
b1 =[] 
for url in href: 
    b1.append("http://www.ca2-health.com"+url) 
print (b1) 
print("===================================================") 
b2=[] 
for url in b1: 
    res2 = requests.get(url).text 
    soup2 = BeautifulSoup(res2,"lxml") 
    url_n=soup2.find('',rel = 'next')['href'] 
    b2.append("http://www.ca2-health.com"+url_n) 
print(b2) 
print("===================================================") 
b3=[] 
for url in b2: 
    res3 = requests.get(url).text 
    soup3 = BeautifulSoup(res3,"lxml") 
    url_n=soup3.find('',rel = 'next')['href'] 
    b3.append("http://www.ca2-health.com"+url_n) 
print(b3)

After this, I will combine b1, b2, b3, and so on into one list, so that I have all the URLs from the site. Thanks in advance.


2017-10-16 Makiyo

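It would help if you posted the error traceback, but I guess you are getting a KeyError. Handle the exception and continue the loop. If you get a KeyError (the tag is found but has no href attribute), do the following:

try:
    url_n = soup3.find('', rel='next')['href']
except KeyError:
    continue

If the page has no rel="next" element at all, find() returns None. Subscripting None raises a TypeError rather than a KeyError, while calling .get() on it raises an AttributeError, so in that case do this instead:

try:
    url_n = soup3.find('', rel='next').get('href')
except AttributeError:
    continue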
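Going one step further, the separate b1, b2, b3 lists could be replaced by a loop that follows the "next" link until a page has none. The following is only a sketch under some assumptions: the helper names (fetch_html, next_page_url, collect_pages) and the max_pages limit are made up for illustration, "html.parser" is used instead of "lxml" to avoid the extra dependency, and the site is assumed to mark the link with rel="next" as in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML (needs the requests package)."""
    import requests  # imported lazily so the crawl logic can be tried offline

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def next_page_url(html, page_url):
    """Return the absolute URL of the rel="next" link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(rel="next")
    # find() returns None when there is no such element, so check before indexing.
    if tag is None or not tag.get("href"):
        return None
    return urljoin(page_url, tag["href"])


def collect_pages(first_url, fetch=fetch_html, max_pages=100):
    """Follow rel="next" links from first_url and return every page URL visited."""
    pages = []
    url = first_url
    # max_pages and the membership check guard against a site that links in a loop.
    while url and url not in pages and len(pages) < max_pages:
        pages.append(url)
        url = next_page_url(fetch(url), url)
    return pages
```

With the b1 list from the question, something like `[u for first in b1 for u in collect_pages(first)]` should then give one flat list of page URLs, regardless of how many pages each category has.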

Let me know if this helps.


2018-01-03 09:06:09
