Pulling links from a website with Python

I want to write a program that pulls all the links out of a web page and puts them into a list.

import urllib.request as ur 

#user defined functions 
def findLinks(website):
    links = []
    line = website.readline()
    while 'href=' not in line:
        line = website.readline()
        p
    while '</a>' not in line:
        links.append(line)
        line = website.readline()



#connect to a URL 
website = ur.urlopen("https://www.cs.ualberta.ca/") 
findLinks(website)

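When I run this program, it hangs for a moment and then raises a TypeError: 'str' does not support the buffer interface.

Does anyone have any pointers?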

Source

2016-01-12 spaceinvaders101

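Which version of Python? – Logan

There are plenty of tools that make this easier. Your code assumes that the HTML contains line breaks and that no link is split across lines. Try Googling "find links Python"; that should lead you back here to some useful Q&As. – PyNEwbie

Possible duplicate of [How can I get href links from HTML code](http://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-code) – PyNEwbie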

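Python can't search for a str inside bytes. To make it "work" I had to change "href=" to b"href=" and "</a>" to b"</a>".
That still didn't extract the links, though. Using re, I was able to do this: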

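import re

def findthem(website):
    # Collect the value of every href="..." attribute, one line at a time.
    links = []
    line = website.readline()
    while len(line) != 0:  # readline() returns b"" at the end of the response
        # The response yields bytes, so decode before applying the str pattern.
        links.extend(re.findall('href="(.*?)"', line.decode()))
        line = website.readline()

    return links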
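
Here is a small sketch of why the original check fails and what decoding changes. The `io.BytesIO` object and the sample HTML are stand-ins I made up so it runs without a network connection; a real `urlopen()` response also yields bytes. It decodes the whole body at once rather than line by line, so an `href` that sits on a different line from its `<a` is still found. It is still a regex, so treat it as a quick hack rather than a real HTML parser.

```python
import io
import re

# Stand-in for the object returned by urllib.request.urlopen() (assumption:
# made-up sample HTML, used only so the sketch runs offline).
fake_response = io.BytesIO(b'<p><a href="/a">A</a>\n<a\n href="/b">B</a></p>\n')

first_line = fake_response.readline()
print(type(first_line))  # the response hands back bytes, not str

# Mixing str and bytes in an `in` test is what raises the TypeError.
error_raised = False
try:
    'href=' in first_line
except TypeError as exc:
    error_raised = True
    print("TypeError:", exc)

# Decode the whole body once, then search the resulting str.
fake_response.seek(0)
text = fake_response.read().decode("utf-8")
links = re.findall(r'href="(.*?)"', text)
print(links)
```

With a real page you would replace `fake_response` with the result of `ur.urlopen(...)` and skip the `seek(0)` call.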

Source

2016-01-12 16:52:11 Rolbrok

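By the way: http://stackoverflow.com/a/1732454/2308683 –

Oh, nice post. I was looking for a simple way to do it, but I didn't know of any other solution apart from reading other Stack Overflow posts. Thanks. – Rolbrok

Yes, that one is worth bookmarking. People here get very upset when you suggest using regular expressions to parse HTML. –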

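A better way to get all the links from a URL is to parse the HTML with a library such as BeautifulSoup.

Here is an example that grabs every link on the page at that URL and prints it.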

import requests 
from bs4 import BeautifulSoup 

html = requests.get("https://www.cs.ualberta.ca/").text 
soup = BeautifulSoup(html, "html.parser") 

for a in soup.find_all("a"):
    link = a.get("href")
    if link:
        print(link)
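
If installing third-party packages is not an option, something similar can be sketched with the standard library's `html.parser`. This is only a rough alternative, not a replacement for BeautifulSoup: the `LinkCollector` class, the `collect_links` helper, and the sample HTML are names and data I made up for illustration. It returns the links as a list, which is what the question asked for, and uses `urljoin` to turn relative links into absolute ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags into a list."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    """Parse an HTML string and return the list of absolute link targets."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample: one relative link, one anchor without href,
# and one tag split across two lines.
sample = (
    '<a href="/about">About</a> <a name="top">no link</a> '
    '<a\n href="https://example.org/x">X</a>'
)
links = collect_links(sample, "https://www.cs.ualberta.ca/")
print(links)
```

For a live page, pass the decoded response text (for example `requests.get(url).text`) as `html` and the same URL as `base_url`.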

Source

2016-01-12 17:43:16 GKBRK
