How to find HTML tags that contain specific text? - BeautifulSoup

<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span> 

<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span> 

<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>

I want to find every <span class="new"> that contains the text "do something at". Here is my code; I just can't figure out why it doesn't work:

import re
import bs4
soup = bs4.BeautifulSoup(html, "lxml")
all_tags = soup.findAll(name = "span", attrs = {"class": "new"}, text = re.compile('do something.*'))

Nothing is found. If I remove text = re.compile('.*do something.*'), all of the tags above are found, so I assume something is wrong with my regular expression. What is the correct form?

2012-10-25 Shane

You can always try a hybrid approach:

soup = bs4.BeautifulSoup(html, "lxml") 
spans = soup.findAll("span", attrs = {"class": "new"}) 
regex = re.compile('.*do something at.*') 
desired_tags = [span for span in spans if regex.match(span.text)]

2012-10-25 01:51:24

Thanks, that works. What I don't understand is why it doesn't work when you merge all of the above into one line: span = soup.findAll("span", attrs = {"class": "new"}, text = re.compile('.*do something at.*')) – Shane

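My guess is that 'text =' only applies to tags that contain text and no other tags. In your HTML, every <span> has <a> tags mixed in with the text. If the span had no child tags, I think it would work fine. –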
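
The guess above can be checked directly. As far as I understand bs4's behaviour, when a tag name is given together with a text filter, the filter is compared against the tag's .string, and .string is None as soon as a tag has more than one child. A minimal sketch, assuming "html.parser" instead of lxml (only to avoid the extra dependency) and adding a second, link-free span that is not part of the original HTML:

```python
import bs4

html = (
    '<span class="new"> <a class="blog" href="http://whatever1.com">whatever1</a>'
    ' do something at <a class="others" href="http://example1.com">example1</a></span>'
    '<span class="new">do something at home</span>'  # hypothetical span without child tags
)

soup = bs4.BeautifulSoup(html, "html.parser")
mixed, plain = soup.find_all("span", attrs={"class": "new"})

# The first span has several children (strings and <a> tags),
# so it has no single string of its own.
mixed_string = mixed.string

# The second span has exactly one child, a string.
plain_string = plain.string

# get_text() joins every string nested anywhere inside the tag.
mixed_text = mixed.get_text()

print(repr(mixed_string), repr(plain_string), repr(mixed_text))
```

If that reading is right, it would explain why filtering on the joined text in a separate step, as in the answer above, finds the spans while the one-line version does not.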

Iterate over the contents of the HTML file and print the matching lines. Here I use the list l in place of the file contents:

>>> l = ['<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>', 

'<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>', 

'<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>' ] 
>>> for i in range(len(l)):
    if re.search('<span class="new">.*do something.*', l[i]):
        print(l[i])


<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span> 
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span> 
>>>

2012-10-25 01:50:17 tetris555

That definitely works, but the problem is that I need to parse the selected tags to grab URLs and the like. BeautifulSoup would do a better job of that. – Shane
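
One way to stay inside BeautifulSoup for both steps is to pass a function to find_all and then read the links from each matching span. This is only a sketch: the helper name is_do_something_span is made up for illustration, and "html.parser" is assumed in place of lxml.

```python
import bs4

html = """
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>
"""


def is_do_something_span(tag):
    """Hypothetical helper: True for <span class="new"> whose joined text has the phrase."""
    return (
        tag.name == "span"
        and "new" in tag.get("class", [])
        and "do something at" in tag.get_text()
    )


soup = bs4.BeautifulSoup(html, "html.parser")

# find_all calls the function on every tag and keeps those returning True.
matches = soup.find_all(is_do_something_span)

# Collect the href of every link inside each matching span.
urls = [[a["href"] for a in span.find_all("a")] for span in matches]
print(urls)
```

Because the matches are still Tag objects, attributes such as href or class remain available without a second parse.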

This is how I usually find the text.

spans = soup.findAll("span", attrs = {"class": "new"})
for s in spans:
    if "do something" in str(s):
        print(s)  # s is a matching <span>; process it here
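
One caveat with testing str(s): the string form of a tag includes its markup, so a phrase that appears only in an attribute value also counts as a match. A small sketch of the difference, using a made-up span with a title attribute and assuming "html.parser":

```python
import bs4

# Hypothetical markup: the phrase occurs only inside the title attribute.
html = (
    '<span class="new"><a href="http://example.com" title="do something">link</a>'
    ' do other things</span>'
)

soup = bs4.BeautifulSoup(html, "html.parser")
span = soup.find("span", attrs={"class": "new"})

# str(span) is the serialized markup, attributes included.
in_markup = "do something" in str(span)

# get_text() contains only the text a reader would see.
in_text = "do something" in span.get_text()

print(in_markup, in_text)
```

If only the visible text should be searched, checking s.get_text() instead of str(s) is probably the safer test.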

2012-10-26 04:34:31 Seeya
