2017-05-03 57 views

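Iterating over pages with requests and BeautifulSoup in Python

I am trying to extract links from a website. The site has multiple pages, so I use a loop to go through the different pages. The problem is that the contents of soup and new_links are duplicated. The URL passed to requests.get changes on each iteration, and I have double-checked the links to confirm that the content at each URL really does change.

new_links stays the same regardless of the loop iteration.

Can anyone explain how I can fix this?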

def get_links(root_url):

    list_of_links = []

    # how many pages should we scroll through ? currently set to 20
    for i in range(1,3):
        r = requests.get(root_url+"&page={}.".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')
        new_links = soup.find_all("li", {"class": "padding-all"})
        list_of_links.extend(new_links)

    print(list_of_links)

    return list_of_links

It would help to know the URL.


root_url = http://borsen.dk/soegning.html?query=iot

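Answer

You need to enumerate the links inside the li elements you are finding. It is best to add each one to a set() to remove duplicates. The set can then be converted to a sorted list on return: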

from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    set_of_links = set()

    # range(1, 3) fetches pages 1 and 2; widen the range to fetch more pages
    for i in range(1, 3):
        r = requests.get(root_url + "&page={}".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')

        # collect every href inside each matching <li>; the set drops duplicates
        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True):
                set_of_links.add(a['href'])

    return sorted(set_of_links)

for index, link in enumerate(get_links("http://borsen.dk/soegning.html?query=iot"), start=1):
    print(index, link)

This gives you:

1 http://borsen.dk/nyheder/avisen/artikel/11/102926/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
2 http://borsen.dk/nyheder/avisen/artikel/11/111767/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
3 http://borsen.dk/nyheder/avisen/artikel/11/111771/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
4 http://borsen.dk/nyheder/avisen/artikel/11/111776/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
5 http://borsen.dk/nyheder/avisen/artikel/11/111789/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
6 http://borsen.dk/nyheder/avisen/artikel/11/114652/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
7 http://borsen.dk/nyheder/avisen/artikel/11/114677/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
8 http://borsen.dk/nyheder/avisen/artikel/11/117729/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
9 http://borsen.dk/nyheder/avisen/artikel/11/122984/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
10 http://borsen.dk/nyheder/avisen/artikel/11/124160/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
11 http://borsen.dk/nyheder/avisen/artikel/11/130267/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
12 http://borsen.dk/nyheder/avisen/artikel/11/130268/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
13 http://borsen.dk/nyheder/avisen/artikel/11/130272/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
14 http://borsen.dk/nyheder/avisen/artikel/11/130882/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
15 http://borsen.dk/nyheder/avisen/artikel/11/132641/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
16 http://borsen.dk/nyheder/avisen/artikel/11/145430/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
17 http://borsen.dk/nyheder/avisen/artikel/11/149967/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
18 http://borsen.dk/nyheder/avisen/artikel/11/151618/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
19 http://borsen.dk/nyheder/avisen/artikel/11/158183/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
20 http://borsen.dk/nyheder/avisen/artikel/11/158769/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
21 http://borsen.dk/nyheder/avisen/artikel/11/44962/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
22 http://borsen.dk/nyheder/avisen/artikel/11/93884/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
23 http://borsen.dk/nyheder/avisen/artikel/11/93890/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
24 http://borsen.dk/nyheder/avisen/artikel/11/93896/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,, 
25 http://borsen.dk/nyheder/executive/artikel/11/161556/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
26 http://borsen.dk/nyheder/virksomheder/artikel/1/315489/rapport_digitale_tiltag_kan_transformere_danske_selskaber.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
27 http://borsen.dk/nyheder/virksomheder/artikel/1/337498/danske_virksomheder_overser_den_digitale_revolution.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
28 http://borsen.dk/opinion/blogs/view/17/3614/tingenes_internet__hvornaar_bliver_det_til_virkelighed.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
29 http://borsen.dk/opinion/blogs/view/17/4235/digitalisering_og_nye_forretningsmodeller.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
30 http://ledelse.borsen.dk/artikel/1/323424/burde_digitalisering_vaere_hoejere_paa_listen_over_foretrukne_ledelsesvaerktoejer.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
31 http://pleasure.borsen.dk/gadget/artikel/1/305849/digital_butler_styrer_din_kommende_bolig.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,, 
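
One thing worth noticing in the output above: sorted() compares the hrefs as plain strings, which is why .../11/44962/... ends up after .../11/158769/.... If the order in which links appear on the pages matters, one possible alternative (my assumption, not something the code above does) is to deduplicate with dict.fromkeys, which keeps the first-seen order. A minimal sketch, using shortened made-up hrefs rather than real scraped data:

```python
def dedupe_keep_order(links):
    """Drop duplicates while keeping the first occurrence of each link."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


# Made-up, shortened hrefs standing in for what two result pages might return
scraped = [
    "/artikel/11/93884/artikel.html",
    "/artikel/11/158769/artikel.html",
    "/artikel/11/93884/artikel.html",   # duplicate of the first entry
    "/artikel/11/44962/artikel.html",
]

in_page_order = dedupe_keep_order(scraped)   # order as encountered
as_sorted_set = sorted(set(scraped))         # lexicographic string order

print(in_page_order)
print(as_sorted_set)
```

Either approach removes the duplicates; the choice only affects the order of the returned list.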

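It might also make more sense to simply look for the link on the next page button rather than guessing how many pages to iterate over, for example: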

from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    links = []

    while True:
        print(root_url)
        r = requests.get(root_url)
        soup = BeautifulSoup(r.content, 'html.parser')

        # keep only the first link inside each matching <li>
        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True)[:1]:
                links.append(a['href'])

        # follow the "next page" link until there is none
        next_page = soup.find("div", {"class": "next-container"})

        if next_page:
            next_url = next_page.find("a", href=True)

            if next_url:
                root_url = next_url['href']
            else:
                break
        else:
            break

    return links
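
The loop above assigns the next link's href straight back to root_url, which works as long as the site emits absolute URLs. In case the href is relative, or the last page links back to itself, a slightly more defensive version of the same idea could look like the sketch below. This is only an illustration: follow_pages, get_next_href, max_pages and the example.com URLs are all hypothetical, and the page lookup is faked with a dict so the snippet runs without any network access.

```python
from urllib.parse import urljoin


def follow_pages(start_url, get_next_href, max_pages=50):
    """Return the page URLs visited by following "next" links.

    get_next_href is a hypothetical callback: given a page URL it returns
    the href of that page's next link, or None when there is no next link.
    """
    visited = []
    seen = set()
    url = start_url

    # stop on a missing link, a repeated URL, or the page cap
    while url and url not in seen and len(visited) < max_pages:
        seen.add(url)
        visited.append(url)
        href = get_next_href(url)
        # urljoin resolves relative hrefs against the current page URL
        url = urljoin(url, href) if href else None

    return visited


# Fake site: maps each page URL to the href of its "next" link
fake_site = {
    "http://example.com/s?q=x": "?q=x&page=2",            # query-only href
    "http://example.com/s?q=x&page=2": "/s?q=x&page=3",   # path-relative href
    "http://example.com/s?q=x&page=3": None,              # last page
}

pages = follow_pages("http://example.com/s?q=x", fake_site.get)
print(pages)
```

In a real scraper the callback would fetch the page with requests and pull the href out of the next-container div, as in the code above.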

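Hi Martin, thanks for your reply. I should be getting 31 links rather than 20. The problem is either that Beautiful Soup is only processing the first link in the for loop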


You have an extra '.' in the URL, which stops the second page from working.


God, I feel stupid. Thank you very much for pointing that out.
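
For anyone hitting the same thing: the stray '.' is easy to miss because string formatting will happily put it into the URL. One way to make this kind of typo less likely is to build the query string with urllib.parse instead of concatenating by hand. The sketch below is only a suggestion; page_url is a hypothetical helper, not part of the original code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(root_url, page):
    """Return root_url with a page=<page> query parameter appended."""
    parts = urlsplit(root_url)
    # keep the existing query parameters and add the page number
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


root_url = "http://borsen.dk/soegning.html?query=iot"

buggy = root_url + "&page={}.".format(2)   # the trailing '.' ends up in the URL
fixed = page_url(root_url, 2)

print(buggy)
print(fixed)
```

requests.get also accepts a params argument that handles query encoding, which may be a simpler option depending on how the URL is built.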