Python BeautifulSoup: extracting specific URLs

Is it possible to fetch only specific URLs?

For example, given:

<a href="http://www.iwashere.com/washere.html">next</a> 
<span class="class">...</span> 
<a href="http://www.heelo.com/hello.html">next</a> 
<span class="class">...</span> 
<a href="http://www.iwashere.com/wasnot.html">next</a> 
<span class="class">...</span>

The output should contain only the URLs that start with http://www.iwashere.com/

That is, the output URLs should be:

http://www.iwashere.com/washere.html 
http://www.iwashere.com/wasnot.html

I managed it with string logic. Is there a direct way to do this with BeautifulSoup?

Source

2013-03-09 Zero

You can match on several aspects of a tag at once, including an attribute value tested against a regular expression:

import re
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for your example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

So it matches any <a> tag that has an href attribute whose value starts with the string http://www.iwashere.com/.

You can also loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
... 
http://www.iwashere.com/washere.html 
http://www.iwashere.com/wasnot.html
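
For anyone who wants to run this end to end, here is a minimal self-contained sketch built on the HTML from the question. The names `html_doc`, `prefix_pattern` and `matching_urls` are only illustrative. The pattern is anchored with `^` on the assumption that only href values beginning with that prefix are wanted, and Python's built-in "html.parser" backend is used so that no extra parser has to be installed.

```python
import re

from bs4 import BeautifulSoup

# Sample markup taken from the question.
html_doc = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Illustrative name: matches href values that begin with the wanted prefix.
prefix_pattern = re.compile(r"^http://www\.iwashere\.com/")

# Keep only the href strings of the matching <a> tags.
matching_urls = [a["href"] for a in soup.find_all("a", href=prefix_pattern)]

for url in matching_urls:
    print(url)
```

Collecting the href values into a list rather than printing them straight away makes the result easier to reuse, for example to fetch each page afterwards.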

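To match all relative paths instead, use a negative lookahead assertion that tests whether the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes this test must be a relative path: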

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
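
To see what that lookahead accepts and rejects, the pattern can be tried on plain strings with the `re` module alone. This is only a sketch: the href values in `sample_hrefs` are hypothetical and were chosen to cover a scheme, a double slash and a few relative forms.

```python
import re

# Same pattern as above: it succeeds only when the value does not begin
# with a URL scheme ("http:", "mailto:", ...) or with a double slash.
relative_pattern = re.compile(r"^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))")

# Hypothetical href values, chosen only for illustration.
sample_hrefs = [
    "http://www.iwashere.com/washere.html",
    "mailto:someone@example.com",
    "//hostname/path",
    "wasnot.html",
    "/docs/index.html",
    "../up.html",
]

# Keep the values the pattern accepts, i.e. the relative ones.
relative_hrefs = [h for h in sample_hrefs if relative_pattern.search(h)]
print(relative_hrefs)
```

Only the last three values are kept. The first three start with a scheme or a double slash, so the lookahead rejects them.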

Source

2013-03-09 16:54:37

It works perfectly. For people who are not familiar with the library: you need 'from bs4 import BeautifulSoup' and 'import re'. – Zero 2013-03-09 17:20:51

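I have one more question. If the links are in the http://www.iwashere.com/xyz ... abc.html format, we can extract them perfectly. But what if the links are local, say, like '[next, next]'? How do I extract the underlying links? Looking at the HTML code, the links do point to the appropriate locations. Is there any way to extract such links? – Zero 2013-03-09 20:39:47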

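@searcoding: You need to match anything that does not start with a scheme or a double slash; any 'href' value that does *not* start with one of those is a relative URL. Use 'href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')', which uses a negative lookahead to test for a scheme or a double slash; any value that does *not* start with either one matches. – 2013-03-09 23:05:45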

If you are using BeautifulSoup 4.0.0 or later:

soup.select('a[href^="http://www.iwashere.com/"]')
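
As a rough end-to-end sketch of the selector approach, again using the HTML from the question: `html_doc` and `selected_urls` are illustrative names, and the built-in "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html_doc = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The CSS attribute selector [href^="..."] keeps <a> tags whose href
# begins with the given prefix; no regular expression is involved.
selected_urls = [
    a["href"] for a in soup.select('a[href^="http://www.iwashere.com/"]')
]

for url in selected_urls:
    print(url)
```

Because the selector compares a literal prefix, the dots in the host name need no escaping, unlike in the regular expression version.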

Source

2013-03-10 15:12:57 Droogans
