Extracting the contents of a tag named "extract" with BeautifulSoup

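I have an XML corpus in which one of the tags is named <EXTRACT>. However, that name is a keyword in BeautifulSoup. How can I extract the contents of this tag? When I write entry.extract.text it raises an error, and when I use entry.extract the whole content is extracted.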

From what I understand of BeautifulSoup, it performs case folding on tag names. If there is a way to work around that, it might also help me.

Note: for now I have worked around the problem with the following approach.

extra = entry.find('extract') 
absts.write(str(extra.text))

But I would like to know whether there is a way to access it as we do with other tags, using something like entry.tagName.

Source

2014-03-01 Amrith Krishna

According to the BS source code, tag.tagname actually calls tag.find("tagname") under the hood. Here is what the Tag class's __getattr__() method looks like:

def __getattr__(self, tag):
    if len(tag) > 3 and tag.endswith('Tag'):
        # BS3: soup.aTag -> soup.find("a")
        tag_name = tag[:-3]
        warnings.warn(
            '.%sTag is deprecated, use .find("%s") instead.' % (
                tag_name, tag_name))
        return self.find(tag_name)
    # We special case contents to avoid recursion.
    elif not tag.startswith("__") and not tag == "contents":
        return self.find(tag)
    raise AttributeError(
        "'%s' object has no attribute '%s'" % (self.__class__, tag))

As you can see, it is based entirely on find(), so using tag.find("extract") is perfectly fine in your case:

from bs4 import BeautifulSoup 


data = """<test><EXTRACT>extract text</EXTRACT></test>""" 
soup = BeautifulSoup(data, 'html.parser') 
test = soup.find('test') 
print(test.find("extract").text)  # prints 'extract text'
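The reason entry.extract behaves differently from other tag names follows from the same method: Python only calls __getattr__() when normal attribute lookup fails, and extract is already a real method on the tag object, so the fallback to find() never runs. The sketch below illustrates that mechanism with a toy class; Node and its fields are hypothetical stand-ins written for this example, not BeautifulSoup's actual Tag implementation.

```python
class Node:
    """Toy stand-in for a parsed tag; NOT BeautifulSoup's real Tag class."""

    def __init__(self, name, text="", children=()):
        self.name = name
        self.text = text
        self.children = list(children)

    def find(self, name):
        # Return the first direct child with the given name, or None.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def extract(self):
        # A real method, playing the role of the library's own extract().
        return self

    def __getattr__(self, name):
        # Reached only when normal attribute lookup has already failed.
        if name.startswith("__"):
            raise AttributeError(name)
        return self.find(name)


entry = Node("entry", children=[Node("title", "a title"),
                                Node("extract", "extract text")])

title_text = entry.title.text            # falls through to __getattr__ -> find("title")
shadowed = entry.extract                 # the bound method, not the child node
found_text = entry.find("extract").text  # explicit lookup reaches the child

print(title_text, callable(shadowed), found_text)
```

In this sketch entry.extract is a bound method, which has no text attribute, so entry.extract.text would raise AttributeError, while the explicit find("extract") call reaches the child as intended.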

You can also use test.extractTag.text, but that form is deprecated and I would not recommend it.
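On the case-folding point raised in the question: the HTML parsers lowercase tag names, so the lookup has to use the lowercase spelling. A minimal sketch, assuming the bs4 package is installed and the built-in html.parser backend is used:

```python
from bs4 import BeautifulSoup

data = "<test><EXTRACT>extract text</EXTRACT></test>"
soup = BeautifulSoup(data, "html.parser")

tag = soup.find("extract")    # lowercase matches the folded tag name
upper = soup.find("EXTRACT")  # the original uppercase spelling is not matched

print(tag.name, upper)
```

The Beautiful Soup documentation says that parsing the document as XML (for example with the "xml" feature, which needs lxml) preserves the original case of tag names. That may be worth trying for an XML corpus, although it is not tested here.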

Hope that helps.

Source

2014-03-01 05:46:55 alecxe
