BeautifulSoup findAll

I have some XML:

<article> 
<uselesstag></uslesstag> 
<topic>oil, gas</topic> 
<body>body text</body> 
</article> 

<article> 
<uselesstag></uslesstag> 
<topic>food</topic> 
<body>body text</body> 
</article> 

<article> 
<uselesstag></uslesstag> 
<topic>cars</topic> 
<body>body text</body> 
</article>

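There are many, many useless tags. I want to use BeautifulSoup to collect the text of every body tag together with its associated topic text, so that I can create some new XML.

I am new to Python, but I suspect that some form of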

import arff
from xml.etree import ElementTree
import re
from StringIO import StringIO

from BeautifulSoup import BeautifulSoup

totstring = ""

# Strip unwanted characters from every line and join the cleaned lines.
# The with block closes the file, so no explicit close() is needed.
with open('reut2-000.sgm', 'r') as inF:
    for line in inF:
        string = re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line)
        totstring += string

soup = BeautifulSoup(totstring)

body = soup.find("body")

for anchor in soup.findAll('body'):
    # Stick body and its topics in an associated array?
    pass

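will work.

1) How do I go about this? 2) Should I add a root node to the XML? Otherwise it is not valid XML, is it?

Many thanks

Edit:

What I want to end up with is: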

<article> 
<topic>oil, gas</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>food</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>cars</topic> 
<body>body text</body> 
</article>

There are many, many useless tags.

Source

2012-05-09 RNs_Ghost

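So, do you want to get the contents of tags a, b, c, or get the contents of all tags while ignoring tags d, e, f?

Yes, I want two kinds of tags (body and topic) and want to ignore everything else (date, time, etc.).

OK. Here is the solution.

First, make sure you have "beautifulsoup4" installed: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Here is my code to get all the body and topic tags: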

from bs4 import BeautifulSoup 
html_doc= """ 
<article> 
<topic>oil, gas</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>food</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>cars</topic> 
<body>body text</body> 
</article> 
""" 
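# Name the parser explicitly: html.parser leaves the nested <body> tags in place
soup = BeautifulSoup(html_doc, "html.parser")

# One list with the text of every <body>, one with the text of every <topic>
bodies = [a.get_text() for a in soup.find_all('body')]
topics = [a.get_text() for a in soup.find_all('topic')]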
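
The two lists above are only linked by position. To keep each topic with its own body and to answer the root-node question (a well-formed XML document needs exactly one root element), here is a minimal sketch. It assumes every article holds one topic and one body; the function name `build_clean_xml` and the root element name `articles` are made up for illustration and are not part of the original answer.

```python
from xml.etree import ElementTree as ET

from bs4 import BeautifulSoup

source = """
<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<uselesstag></uselesstag>
<topic>food</topic>
<body>body text</body>
</article>
"""


def build_clean_xml(markup):
    """Copy only <topic> and <body> of each <article> under a single root."""
    soup = BeautifulSoup(markup, "html.parser")
    root = ET.Element("articles")  # hypothetical root name, pick your own
    for article in soup.find_all("article"):
        topic = article.find("topic")
        body = article.find("body")
        if topic is None or body is None:
            continue  # skip articles that lack either tag
        new_article = ET.SubElement(root, "article")
        ET.SubElement(new_article, "topic").text = topic.get_text()
        ET.SubElement(new_article, "body").text = body.get_text()
    return root


root = build_clean_xml(source)
print(ET.tostring(root, encoding="unicode"))
```

Looking up topic and body inside each article, rather than in the whole document, keeps the pairs aligned even if some article is missing one of the two tags.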

Source

2012-05-09 15:49:14

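Hey, thanks for your help, @Arthur Neves, but I get: Traceback (most recent call last): File "convert.py", line 23, in <module> bodies = [a.get_text() for a in soup.find_all('body')] TypeError: 'NoneType' object is not callable. Do I need to define a?

It works fine for me. Try this: curl https://raw.github.com/gist/2646540/129f95c11cffa159daeec184ba47a57217379060/convert.py > convert.py; python convert.py

Switching to from bs4 did it for me (d'oh).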
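
The TypeError in the comments above is most likely a side effect of how BeautifulSoup resolves unknown attribute names: it treats them as a tag lookup, and the lookup returns None when no such tag exists. BeautifulSoup 3 has findAll but no find_all, so soup.find_all would be looked up as a tag there. BeautifulSoup 3 does not run on current Python, so the sketch below reproduces the same mechanism in bs4 with a deliberately misspelled method name, findall. That the old import behaved the same way is an assumption, not something the thread confirms.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<article><body>body text</body></article>", "html.parser")

# "findall" is not a method, so bs4 treats it as a search for a <findall> tag.
# No such tag exists, so the attribute lookup yields None.
missing = soup.findall
print(missing)

message = None
try:
    soup.findall("body")  # calls None("body")
except TypeError as error:
    message = str(error)
    print(message)

# The correctly spelled bs4 method works as expected.
bodies = [tag.get_text() for tag in soup.find_all("body")]
print(bodies)
```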

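Another way to remove empty XML or HTML tags is to use a recursive function that searches for empty tags and removes them with .extract(). That way, you do not have to list the tags you want to keep by hand. It also cleans up nested empty tags.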

from bs4 import BeautifulSoup
from bs4.element import Tag
import re

nonwhite = re.compile(r'\S')

html_doc1 = """
<article>
<uselesstag2>
<uselesstag1>
</uselesstag1>
</uselesstag2>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<p>21.09.2009</p>
<p> </p>
<p1><img src="http://www.www.com/"></p1>
<p></p>

<!--- This article is about cars--->
<article>
<topic>cars</topic>
<body>body text</body>
</article>
"""

def nothing_inside(thing):
    # find_all() only passes tags to this filter; comments and strings are left alone
    # an <img> with a non-empty src counts as content although it has no children
    if thing.name == 'img' and thing.get('src'):
        return False
    for item in thing.contents:
        # a child tag or any non-whitespace string means the tag is not empty
        if isinstance(item, Tag) or nonwhite.search(item):
            return False
    return True

def scrub(thing):
    # remove the currently empty tags, then recurse:
    # removing a tag can leave its parent empty
    empty_tags = thing.find_all(nothing_inside)
    if empty_tags:
        for emptytag in empty_tags:
            emptytag.extract()
        scrub(thing)
    return thing

soup = BeautifulSoup(html_doc1, "html.parser")
print(scrub(soup))

Result:

<article> 

<topic>oil, gas</topic> 
<body>body text</body> 
</article> 
<p>21.09.2009</p> 

<p1><img src="http://www.www.com/"/></p1> 

<!--- This article is about cars---> 
<article> 
<topic>cars</topic> 
<body>body text</body> 
</article>

Source

2012-08-16 16:51:45 Kao
