Extracting text from specific elements of a large malformed XML file

I have a large (~50 MB) file of badly formed XML that describes documents and their properties, each placed between <item> </item> tags, and I want to extract text from it.

Python's standard XML parsing utilities (dom, sax, expat) choke on the bad formatting, while the more forgiving libraries (sgmllib, BeautifulSoup) parse the entire file and take too long.

<item> <title>some title</title> <author>john doe</author> <lang>en</lang> <document> .... </document> </item>

Does anyone know a way to extract the text between <document> </document>, only for items where lang=en, without parsing the entire document?

Additional information: why it is "malformed"

Some of the documents contain a <dc:link></dc:link> element, which causes problems for the parsers. Python's xml.dom.minidom complains:

ExpatError: unbound prefix: line 13, column 0
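
For illustration, the error can be reproduced with a few lines of Python. This is a minimal sketch that assumes the cause is a dc: prefix used without a matching xmlns:dc declaration; the snippet strings, the helper name parse_error, and the Dublin Core namespace URI are made up for the example and are not taken from the real file.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_error(xml_text):
    """Return the parser's error message, or None if the text parses cleanly."""
    try:
        minidom.parseString(xml_text)
    except ExpatError as exc:
        return str(exc)
    return None


# Hypothetical snippet: the dc: prefix is used but never declared.
undeclared = "<item><dc:link>http://example.com</dc:link></item>"

# Same snippet with the prefix bound to a namespace URI (assumed here to be Dublin Core).
declared = (
    '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:link>http://example.com</dc:link></item>"
)

print(parse_error(undeclared))  # message starts with "unbound prefix"
print(parse_error(declared))    # None
```

If this assumption holds, the markup is not well-formed with respect to namespaces, which is why every namespace-aware parser rejects it.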

Source

2009-11-10 trope

What do you mean by "malformed XML"? Is it invalid XML? If your XML file is invalid, every parser will choke on it and you will need to parse it by hand. – 2009-11-10 20:20:56

What kind of process emits malformed XML? – 2009-11-10 20:28:49

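gawk 'BEGIN {
    RS = "</item>"            # one record per item block (a multi-character RS needs gawk)
    startpat = "<document>"
    endpat = "</document>"
    lpat = length(startpat)
}
/<lang>en<\/lang>/ {          # only look at records whose language is English
    start = index($0, startpat)
    end = index($0, endpat)
    # print the text between the two tags, skipping records where either tag is missing
    if (start && end > start)
        print substr($0, start + lpat, end - (start + lpat))
}' file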

Output

$ more file 
Junk 
Junk 
<item> 
    <title>some title</title> 
    <author>john doe</author> 
    <lang>en</lang> 
    <document> text 
     i want blah ............ </document> 
</item> 
junk 
junk 
<item> 
    <title>some title</title> 
    <author>jane doe</author> 
    <lang>ch</lang> 
    <document> junk text 
      ..  ............ </document> 
</item> 
junk 
blahblah.. 
<item> 
    <title>some title</title> 
    <author>GI joe</author> 
    <lang>en</lang> 
    <document> text i want ..... in one line </document> 
</item> 
aksfh 
aslkfj 
dflkas 

$ ./shell.sh 
text 
     i want blah ............ 
    text i want ..... in one line
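
Since the question asked about Python, the same record-splitting idea can be sketched there as well. This is not part of the gawk answer, only a rough equivalent: it reads the input in chunks, cuts it at each </item>, and uses plain string searches instead of an XML parser. The function names and the chunk size are illustrative choices.

```python
import io

ITEM_END = "</item>"
LANG_EN = "<lang>en</lang>"
DOC_START = "<document>"
DOC_END = "</document>"


def iter_records(stream, chunk_size=64 * 1024):
    """Yield the text before each </item>, similar to RS="</item>" in gawk."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # anything left in the buffer has no closing </item>
        buffer += chunk
        while True:
            end = buffer.find(ITEM_END)
            if end == -1:
                break
            yield buffer[:end]
            buffer = buffer[end + len(ITEM_END):]


def english_documents(stream, chunk_size=64 * 1024):
    """Yield the text between <document> tags for records marked lang=en."""
    for record in iter_records(stream, chunk_size):
        if LANG_EN not in record:
            continue
        start = record.find(DOC_START)
        if start == -1:
            continue
        end = record.find(DOC_END, start + len(DOC_START))
        if end != -1:
            yield record[start + len(DOC_START):end]


# Small in-memory sample standing in for the real file.
sample = (
    "junk<item><lang>en</lang><document> first </document></item>"
    "junk<item><lang>ch</lang><document> skipped </document></item>"
    "<item><lang>en</lang><document> second </document></item>tail"
)
for text in english_documents(io.StringIO(sample)):
    print(text)
```

For a real file, the StringIO object would be replaced by something like open(path, encoding="utf-8", errors="replace"), where path and the encoding are assumptions about the input. Like the gawk version, this matches the lang tag anywhere in the record and takes only the first <document> block of each item.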

Source

2009-11-11 01:24:10 ghostdog74

Thanks a lot - this is exactly what I was looking for. Something that doesn't care about XML conventions. – trope 2009-11-11 02:28:29