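Stripping (XML?) tags using Python

I have a file that contains data in the following format: <scientist_names> <scientist>abc</scientist> </scientist_names>. I want to use Python to strip the markup and extract the scientists' names from the document. How should I go about it? I would like to use regular expressions but don't know how to use them. Please help.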

Source

2012-02-13 username_4567

This looks like XML. Take a look at [xml.dom.minidom](http://docs.python.org/library/xml.dom.minidom.html). – 2012-02-13 11:55:00

If I have consecutive lines like ' abc xzz', can anyone tell me the fastest way to extract the data? – 2012-02-13 18:47:09

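As noted above, this appears to be XML. In that case, you should use an XML parser to parse the document; I recommend lxml (http://lxml.de).

Given your requirements, you may find SAX-style parsing more convenient than DOM-style parsing. With SAX you simply register handlers that are called whenever the parser encounters a particular tag. That works well as long as the meaning of a tag does not depend on its context and you have more than one type of tag to process (which may not be the case here).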
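As a rough illustration of the SAX style, here is a minimal sketch that uses the standard library's xml.sax module rather than lxml. The class name ScientistHandler, the helper scientist_names, and the sample string are made up for this example and are not part of any library.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text found inside every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []             # finished names, in document order
        self._in_scientist = False  # True while inside a <scientist> element
        self._chunks = []           # text pieces of the element being read

    def startElement(self, name, attrs):
        if name == "scientist":
            self._in_scientist = True
            self._chunks = []

    def characters(self, content):
        # The parser may deliver one run of text in several pieces.
        if self._in_scientist:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "scientist":
            self.names.append("".join(self._chunks))
            self._in_scientist = False


def scientist_names(xml_text):
    """Hypothetical helper: parse a string and return the collected names."""
    handler = ScientistHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


sample = ("<scientist_names> <scientist>abc</scientist> "
          "<scientist>xyz</scientist> </scientist_names>")
print(scientist_names(sample))
```

The handler keeps only a small amount of state, so the whole document never has to be held in memory as a tree. For a file as small as the one in the question, that advantage is unlikely to matter much.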

If the input document may be malformed, you may want to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML

Source

2012-02-13 11:58:12 Marcin

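I'm not good with XML. How can I do this simply with string functions? – 2012-02-13 16:02:28

@user997704: You don't. Learn to use the right tool for the job. – Marcin 2012-02-13 16:05:45

I would like to use it, but I haven't found a quick-start guide for learning SAX. – 2012-02-13 16:10:27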

This is XML, and you should use an XML parser such as lxml rather than regular expressions (because XML is not a regular language).

Here is an example:

from lxml import etree
text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

# Parse the string and print the text of every <scientist> element.
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print(scientist.text)
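If lxml is not installed, a similar result can probably be had with the standard library's xml.etree.ElementTree, whose API resembles lxml's. The following is a sketch under that assumption and reuses the sample string from the question.

```python
import xml.etree.ElementTree as ET

text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

root = ET.fromstring(text)
# iter() walks the whole tree, so the nesting depth of <scientist> does not matter.
names = [scientist.text for scientist in root.iter("scientist")]
print(names)
```

For a document stored on disk, ET.parse(path).getroot() would take the place of ET.fromstring(text).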

Source

2012-02-13 11:54:10

Don't use regular expressions! (All the reasons are well explained [here])

Use an XML/HTML parser; take a look at BeautifulSoup.

Source

2012-02-13 11:55:42

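You might want to look at the BS source. You'd be surprised. – georg 2012-02-13 13:27:50

@thg435: You're comparing apples and oranges. Nobody should write their own ad hoc parser with regular expressions, because that approach is fragile. Beautiful Soup uses regular expressions to handle malformed markup, and it does so as part of the effort of building a well-tested, well-designed library. – Marcin 2012-02-13 16:07:21

Here is a simple example that should handle the XML tags for you: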

# import the library used to make HTTP requests (this was urllib2 in Python 2):
from urllib.request import urlopen

# import an easy-to-use XML parser called minidom:
from xml.dom.minidom import parseString
# all these imports are standard on most modern Python implementations

# download the file if it's not on the same machine; otherwise use open(path, 'rb') instead:
file = urlopen('http://www.somedomain.com/somexmlfile.xml')
# read the whole response:
data = file.read()
# close the file because we don't need it anymore:
file.close()
# parse the XML you downloaded:
dom = parseString(data)
# retrieve the first XML tag (<tag>data</tag>) that the parser finds with name tagName,
# in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
# strip off the tag (<tag>data</tag> ---> data):
xmlData = xmlTag.replace('<scientist>', '').replace('</scientist>', '')
# print out the XML tag and data in this format: <tag>data</tag>
print(xmlTag)
# just print the data:
print(xmlData)

If you find anything unclear, just let me know.

Source
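A possible variation, offered only as a sketch: when the XML is already in a string, no download is needed, and the text can be read from the DOM nodes directly instead of stripping tags with replace(), which would miss a start tag that carries attributes. The helper names element_text and scientist_names, and the sample string with its id attribute, are invented for this illustration.

```python
from xml.dom.minidom import parseString


def element_text(element):
    """Join the text nodes that are direct children of a DOM element."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
    )


def scientist_names(xml_text):
    """Hypothetical helper: return the text of every <scientist> element."""
    dom = parseString(xml_text)
    return [element_text(e) for e in dom.getElementsByTagName("scientist")]


sample = ('<scientist_names> <scientist id="1">abc</scientist> '
          '<scientist>xyz</scientist> </scientist_names>')
print(scientist_names(sample))
```

For a local file, xml.dom.minidom.parse(path) accepts a filename or an open file object, so the contents do not have to be read into a string first.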

2012-02-13 12:07:14

I get an error when executing 'data = file.read()': 'str' object has no attribute 'read'. – 2012-02-13 12:43:27
