Get the last page number - web scraping

I want to scrape a website that spans multiple pages. I want to build a function that returns the number of pages in a set of pages.

Below is an example start page.

That page links to 29 subpages, so ideally the function would return 29.

By subpages I mean page 1 of 29, 2 of 29, and so on.

This is the HTML from the link shown above that contains the last-page information.

<div id="paging-wrapper-btm" class="paging-wrapper"> 
     <ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>2</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=2&pgesize=36&sort=-1'>3</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=3&pgesize=36&sort=-1'>4</a></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=4&pgesize=36&sort=-1'>5</a></li><li #LIVALUES#>...</li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=28&pgesize=36&sort=-1'>29</a></li><li class="page-skip"><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</a></li></ol>

I have the code below, which finds all the ol tags, but I can't figure out how to access the content inside each 'a'.

a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
# < further processing >

Any help or suggestions would be greatly appreciated.

2016-04-14 MarcelKlockman

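By subpages, do you mean links to pages in the same domain? – trans1st0r

By subpages I mean page 1 of 29, 2 of 29, 3/29, 4/29, and so on. – MarcelKlockman

To get what's inside, you can use a.text, I think. – Whitefret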

Try this:

ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # Finds all a's inside each ol in the ols list
all_as = []
for a in list_of_as:  # Expand each sublist of a's and put all of them in one list
    all_as.extend(a)
print(all_as)

2016-04-14 14:22:02 trans1st0r

Ah, I found a simple solution.

for item in soup.select("ol a"): 
    x = item.text 
    print(x)

I can then sort the results and pick the largest number.
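
The "pick the largest number" step could look roughly like the sketch below. The names `largest_page_number` and `sample_texts` are hypothetical and not from the thread; the sample values are the link texts from the pagination HTML in the question. Non-numeric texts such as "Weiter »" are skipped, which is an assumption about how the links should be filtered.

```python
def largest_page_number(link_texts):
    """Return the largest integer found among the link texts, or None if there is none."""
    numbers = []
    for text in link_texts:
        text = text.strip()
        # Skip non-numeric entries such as "Weiter »"
        if text.isdecimal():
            numbers.append(int(text))
    return max(numbers) if numbers else None


# Link texts as they appear in the pagination HTML from the question
sample_texts = ["2", "3", "4", "5", "29", "Weiter »"]
print(largest_page_number(sample_texts))
```

With the soup from the loop above, it could be called as largest_page_number(item.text for item in soup.select("ol a")).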

2016-04-14 14:22:56 MarcelKlockman

Try a.text if you only want the numbers; I think it works (sorry, I can't test it with my setup). – Whitefret

Yes, you're right, that is better. – MarcelKlockman

The following will extract the last page number:

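from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning

ol = soup.find('ol', class_='page-nos')  # the pagination list
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]  # the final <li> is the "Weiter »" link, so the last page number is second to last

print(last_page)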

For your site, this would display 29, the page count given in the question.

2016-04-14 14:36:43
