Get all links from a web page's HTML source (Python)

-2

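I want to get all the links in a web page. This function returns only one link, but I need to get all of them. I know which links I need, but I don't know how to use the function to get every link from the HTML source.

I need to get all the links.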

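def get_next_target(page):
    # Locate the first anchor tag; stop early if the page has none.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    # The URL sits between the first pair of double quotes after the tag.
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote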
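
One way to turn a single-link helper like this into one that collects every link is to call it in a loop, resuming each search just after the previous closing quote. The following is a minimal sketch under some assumptions: the `start` parameter and the `get_all_links` name are hypothetical additions, not part of the original function, and because it relies on plain substring search it only handles tags written exactly as `<a href="...">` with double quotes.

```python
def get_next_target(page, start=0):
    """Return (url, end_quote) for the first '<a href=' at or after start.

    Returns (None, -1) when no further link is found.
    """
    start_link = page.find('<a href=', start)
    if start_link == -1:
        return None, -1
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    if start_quote == -1 or end_quote == -1:
        return None, -1
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every link by repeatedly searching past the previous match."""
    links = []
    position = 0
    while True:
        url, end_quote = get_next_target(page, position)
        if url is None:
            break
        links.append(url)
        position = end_quote + 1  # continue after the closing quote
    return links


sample = '<p><a href="http://example.com/a">A</a> and <a href="/b">B</a></p>'
print(get_all_links(sample))
```

Relative URLs such as `/b` are returned unchanged here; resolving them against the page address would be a separate step.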

Source

2013-10-16 aliweb

Please state your requirements clearly. – ajkumar25

What do you mean by "one link"? – hexafraction

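If needed, use `HTMLParser` from the `html.parser` module together with `urllib.parse.urljoin`. Don't try to search for just a substring, or even a regular expression; that won't work (well, not in all cases). Of course, if you only have a single page to process you can write a quick and dirty program, but if you have many pages from different sources, that is not wise. – 2013-10-16 10:27:27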
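
A minimal sketch of what this comment suggests, using only the standard library. The names `LinkCollector` and `extract_links`, the sample HTML, and the base URL `http://example.com/docs/` are assumptions made up for illustration; the comment itself only names `HTMLParser` and `urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # urljoin turns relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<a href="/about">About</a> '
          "<a href='page.html'>Page</a> "
          '<a name="x">no href</a>')
print(extract_links(sample, 'http://example.com/docs/'))
```

Unlike a substring search, the parser copes with single-quoted attributes, extra attributes before `href`, and anchors that have no `href` at all.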

You can use lxml:

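import lxml.html

def get_all_links(page):
    # `page` is a filename, URL, or file-like object, not an HTML string.
    document = lxml.html.parse(page)
    # Returns the <a> elements; use element.get('href') to read each link.
    return document.xpath("//a")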

Source

2013-10-16 10:25:40

This is where an HTML parser comes in handy. I recommend BeautifulSoup:

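from bs4 import BeautifulSoup as BS

def get_next_target(page):
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BS(page, 'html.parser')
    # href=True keeps only the <a> tags that actually have an href.
    return soup.find_all('a', href=True)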

Source

2013-10-16 10:25:43 TerryA

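from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

site = urlopen('http://somehwere/over/the/rainbow.html')
site_data = site.read()
# Parse only the <a> tags, then print the href of each one that has it.
soup = BeautifulSoup(site_data, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])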

Source

2013-10-16 10:27:57 Torxed

Another way to do it with "BS". – Torxed
