2012-10-25 77 views
0

How do I find HTML tags that contain specific text? - BeautifulSoup

Here is the source:

<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span> 

<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span> 

<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span> 

I want to find every <span class="new"> that contains "do something at". Here is my code; I just don't know why it doesn't work:

soup = bs4.BeautifulSoup(html, "lxml") 
all_tags = soup.findAll(name = "span", attrs = {"class": "new"}, text = re.compile('do something.*')) 

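Nothing is found. If I remove text = re.compile('.*do something.*'), all of the tags above are found. I know something must be wrong with my regular expression, so what is the correct form?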

Answers

1

You can always try a hybrid approach:

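import re
import bs4

soup = bs4.BeautifulSoup(html, "lxml")
# Select the spans by class first, then filter on their combined text.
spans = soup.findAll("span", attrs = {"class": "new"})
regex = re.compile('do something at')  # search() needs no leading or trailing .*
desired_tags = [span for span in spans if regex.search(span.text)]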
+0

Thanks. What I still don't understand is why it doesn't work when everything above is merged into one line: 'span = soup.findAll("span", attrs = {"class": "new"}, text = re.compile('.*do something at.*'))' – Shane

+1

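My guess is that 'text=' only works on tags that contain text and no other tags. In your HTML, every '<span>' has '<a>' tags mixed in with the text. If the span had no child tags, I think it would work fine. –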
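
The guess in the comment above can be checked with a small sketch. It assumes Beautiful Soup 4.4.0 or later, where the `text=` argument is also available under the name `string=`; the second, link-free span is a made-up example, and `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# First span mirrors the question (text mixed with <a> tags);
# the second span is hypothetical and contains only text.
html = (
    '<span class="new"><a href="http://whatever1.com">whatever1</a>'
    " do something at "
    '<a href="http://example1.com">example1</a></span>'
    '<span class="new">do something at home</span>'
)
soup = BeautifulSoup(html, "html.parser")
mixed, plain = soup.find_all("span", class_="new")

# .string is None when a tag has more than one child node,
# so the filter has nothing to test the regex against.
print(mixed.string)  # None
print(plain.string)  # do something at home

# Only the span without child tags passes the string filter.
matched = soup.find_all("span", class_="new", string=re.compile("do something"))
print(len(matched))  # 1
```

This is consistent with the behavior reported in the question: a span whose text is interleaved with `<a>` tags is skipped, while a span holding a single string is matched.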

0

Iterate over the contents of the HTML file and print the matching lines. Here I have replaced the file contents with a list l:

>>> l = ['<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>', 

'<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>', 

'<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>' ] 
>>> import re
>>> for line in l:
...     if re.search('<span class="new">.*do something.*', line):
...         print(line)


<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span> 
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span> 
>>> 
+0

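This definitely works, but the problem is that I need to parse the selected tags and grab the URLs and similar content. BeautifulSoup would do a better job of that. – Shane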
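
For the need described in this comment, one possible sketch is to filter the spans on their combined text and then read the links from each span that was kept. The variable names `matching` and `links` are illustrative only, and `html.parser` stands in for `lxml` so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<span class="new"> <a class="blog" href="http://whatever1.com" rel="nofollow">whatever1</a> do something at <a class="others" href="http://example1.com" rel="nofollow">example1</a></span>
<span class="new"> <a class="blog" href="http://whatever2.com" rel="nofollow">whatever2</a> do other things at <a class="others" href="http://example2.com" rel="nofollow">example2</a></span>
<span class="new"> <a class="blog" href="http://whatever3.com" rel="nofollow">whatever3</a> do something at <a class="others" href="http://example3.com" rel="nofollow">example3</a></span>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the spans whose combined text (including the text inside
# child tags) contains the phrase.
matching = [
    span
    for span in soup.find_all("span", class_="new")
    if "do something at" in span.get_text()
]

# Each kept span is still a Tag, so its links can be read directly.
links = [[a["href"] for a in span.find_all("a")] for span in matching]
print(links)
```

Because the filtering happens on Tag objects rather than on raw lines of text, the URLs, link text, and attributes remain available afterwards.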

0

This is how I usually find the text.

spans = soup.findAll("span", attrs = {"class": "new"}) 
for s in spans: 
    if "do something" in str(s):