Get relative links from an HTML page

I want to extract only the relative URLs from an HTML page. Someone suggested this:

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

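But it has two problems:

1/ It returns all URLs from the page, both absolute and relative.

2/ The URL may be quoted with either "" or '', at random.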
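The behavior described above can be reproduced with a small sketch. The sample HTML is made up for illustration; only the regex comes from the question. It suggests that the pattern matches every href value regardless of whether it is absolute, and that the captured group keeps whatever quotes surrounded the value.

```python
import re

# Regex from the question: captures the href value, including its quotes
find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

# Hypothetical sample mixing double-quoted, single-quoted and unquoted values
html = """<a href="http://google.com">a</a> <a href='test2'>b</a> <a href=here/we/go>c</a>"""

# With a single capture group, findall returns the group for each match
raw = find_re.findall(html)

# Strip the surrounding quotes so the values can be compared as plain URLs
cleaned = [value.strip('"\'') for value in raw]

print(raw)
print(cleaned)
```

Stripping the quotes normalizes the values, but the list still mixes absolute and relative URLs, so a separate filtering step is needed either way.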

Source

2014-06-29 esnadr

You could try something like this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Use the right tool for the job: an HTML parser, such as BeautifulSoup.

You can pass a function as an attribute value to find_all() and check whether the href starts with http:

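from bs4 import BeautifulSoup

data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')
# href is None for <a> tags without the attribute, hence the "x and" guard
print(soup.find_all('a', href=lambda x: x and not x.startswith('http')))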

Alternatively, use urlparse and check for the network location part:

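from urllib.parse import urlparse

def is_relative(url):
    # Treat a URL as relative when it has no network location (netloc)
    return bool(url) and not urlparse(url).netloc

print(soup.find_all('a', href=is_relative))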

Both solutions print:

[<a href="test2">test2</a>, 
<a href="here/we/go">test4</a>]
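One caveat worth sketching: checking only the network location would classify a URL such as mailto:someone@example.com as relative, because it has a scheme but no netloc. The variant below is a minimal sketch, not part of the original answer, that requires both the scheme and the netloc to be empty. The sample URLs are hypothetical, and it uses only the standard library so it runs without BeautifulSoup.

```python
from urllib.parse import urlparse

def is_relative(url):
    """Return True when url has neither a scheme nor a network location."""
    if not url:  # None (tag without href) or an empty string
        return False
    parts = urlparse(url)
    return not parts.scheme and not parts.netloc

# Hypothetical href values, including two edge cases
samples = [
    "http://google.com",           # absolute: scheme and netloc
    "test2",                       # relative
    "//cdn.example.com/lib.js",    # protocol-relative: netloc but no scheme
    "mailto:someone@example.com",  # scheme but no netloc
    "here/we/go",                  # relative
]

relative = [u for u in samples if is_relative(u)]
print(relative)
```

Since the function takes the attribute value and returns a boolean, it should also work as a filter in find_all('a', href=is_relative).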

Source

2014-06-29 03:43:02 alecxe
