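Finding an exact text match with BeautifulSoup

I want to extract the exact match for a piece of text from HTML using BeautifulSoup, but along with my exact text I also get text that only almost matches it. My code is: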

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://www.somesite.com"
page = urlopen(url)
soup = BeautifulSoup(page, "lxml")
# Print every text node that the pattern matches
for elem in soup(text=re.compile("exact text")):
    print(elem)

The output of the code above looks like this:

1.exact text 
2.almost exact text

How can I get only the exact match using BeautifulSoup? Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.

2017-05-22 karthi

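Use BeautifulSoup's find_all method with its string argument.

As an example, here I parse a small Wikipedia page about a place in Jamaica. I search for all strings whose text is 'Jamaica stubs', expecting to find just one. When I find it, I display the text and its parent.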

>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece' 
>>> from bs4 import BeautifulSoup 
>>> import requests 
>>> page = requests.get(url).text 
>>> soup = BeautifulSoup(page, 'lxml') 
>>> for item in soup.find_all(string="Jamaica stubs"): 
...  item 
...  item.findParent() 
... 
'Jamaica stubs' 
<a href="/wiki/Category:Jamaica_stubs" title="Category:Jamaica stubs">Jamaica stubs</a>
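To see the difference between a plain string and a regular expression without depending on a live page, here is a minimal sketch. The HTML snippet is made up for illustration, and I use the built-in "html.parser" so that lxml is not needed. A plain string passed as string= has to equal the whole text node, while a compiled pattern is applied with search(), so it also matches text that merely contains it.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration.
html = "<p>exact text</p><p>almost exact text</p>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the entire text node.
exact = soup.find_all(string="exact text")

# An unanchored regex is matched with search(), so substrings match too.
loose = soup.find_all(string=re.compile("exact text"))

print(exact)
print(loose)
```

Note that the plain-string comparison is strict: a text node with leading or trailing whitespace will not match, which may be one reason such a search returns nothing on a real page.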

Taking a step back, after reading the comments, a better way might be:

>>> url = 'https://en.wikipedia.org/wiki/Hockey' 
>>> from bs4 import BeautifulSoup 
>>> import requests 
>>> import re 
>>> page = requests.get(url).text 
>>> soup = BeautifulSoup(page, 'lxml') 
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))): 
...  i, item.findParent().text[:100] 
... 
(0, "Women's Bandy World Championships") 
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b") 
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)') 
(3, "women's")

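My regular expression uses IGNORECASE so that both 'Women' and 'women' are found in the Wikipedia article. I used enumerate in the for loop so that I could number the displayed items, which makes them easier to read.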
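For the exact-match case in the question, one option might be to anchor the regular expression so that it has to cover the whole string, while still tolerating surrounding whitespace. The sketch below assumes a made-up HTML snippet and the built-in "html.parser"; it also keeps only the matches that are Comment objects, since the question asks for that type.

```python
import re

from bs4 import BeautifulSoup, Comment

# Hypothetical HTML, invented for this illustration.
html = """
<div>
  <!--exact text-->
  <!-- almost exact text -->
  <p> exact text </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ^ and $ force the pattern to span the whole string;
# \s* allows optional whitespace on either side.
pattern = re.compile(r"^\s*exact text\s*$")

matches = soup.find_all(string=pattern)

# Keep only the matches that come from HTML comments.
comments = [m for m in matches if isinstance(m, Comment)]

for m in matches:
    print(type(m).__name__, repr(str(m)))
```

Dropping the \s* parts would make the match strict about whitespace as well, which is equivalent to passing the plain string to string=.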

2017-05-22 13:46:41

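Thanks for your help. The code above doesn't work for me: 'soup.find_all(string="Jamaica stubs")' returns nothing. – karthi

You would do better to provide a sample of the HTML you are trying to search, or some examples. –

I think I've improved on it in the second version. –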

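You can search the soup for the element you want by its tag and any of its attribute values.

For example, this code searches for all a elements whose id equals some_id_value.

It then loops over each element found and tests whether its text equals "exact text".

If it does, it prints the whole element.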

for elem in soup.find_all('a', {'id': 'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
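Here is a self-contained sketch of the same idea that can be run without fetching a page. The HTML is invented for illustration (the repeated id is only there to show the filtering), and get_text(strip=True) is my own addition so that surrounding whitespace does not defeat the comparison.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration.
html = """
<a id="some_id_value">exact text</a>
<a id="some_id_value">almost exact text</a>
<a id="other_id">exact text</a>
"""
soup = BeautifulSoup(html, "html.parser")

# First narrow by tag and attribute, then compare the element's text.
hits = [
    elem
    for elem in soup.find_all("a", {"id": "some_id_value"})
    if elem.get_text(strip=True) == "exact text"
]

for elem in hits:
    print(elem)
```

Only the first a element should pass both the attribute filter and the text comparison.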

2017-05-22 12:26:24

Thanks for the reply. I just want to search for occurrences of the text without using any tags. – karthi
