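python text.strip() returns empty
Good morning everyone. I am trying to extract text from SGML files with the code below, but the output files come out empty. Here is my Python code: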
from os import listdir
from os import makedirs
from os.path import isfile, join
from re import sub
import ast
import numpy
import xml.etree.ElementTree as ElementTree
from lxml import etree
parser = etree.XMLParser(recover=True) # escaping malformed strings
pathCol="C:/Users/Desktop/FR"
pathExtr="C:/Users/Desktop/FRExt"
i=0
for f in listdir(pathCol):
    with open(join(pathCol,f), 'r') as f: # Reading file
        xml = f.read()
    xml = '<ROOT>' + xml + '</ROOT>' # Let's add a root tag
    root = etree.fromstring(xml, parser=parser)
    for doc in root:
        try :
            docNo=doc.find('DOCNO').text.strip()
        except :
            i+=1
            docNo="LATIMES"+str(i)
        try :
            text=doc.find('TEXT').text.strip()
        except :
            try :
                text=doc.find('HEADLINE').text.strip()
            except :
                try :
                    text=doc.find('GRAPHIC').text.strip()
                except :
                    text=" "
        fi=open(join(pathExtr,docNo),'w')
        fi.write(text)
        fi.close()
        print("%s OK" %(docNo))
    f.close()
This is the structure of a sample document:
<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<DOCID> 1 </DOCID>
<DATE>
<P>
January 1, 1989, Sunday, Home Edition
</P>
</DATE>
<SECTION>
<P>
Book Review; Page 1; Book Review Desk
</P>
</SECTION>
<LENGTH>
<P>
1206 words
</P>
</LENGTH>
<HEADLINE>
<P>
NEW FALLOUT FROM CHERNOBYL;
</P>
<P>
THE SOCIAL IMPACT OF THE ...
</P>
</HEADLINE>
<BYLINE>
<P>
By James E. ...
</P>
</BYLINE>
<TEXT>
<P>
The onset of the new Gorbachev policy of glasnost,...
</P>
...
</TEXT>
</DOC>
<DOC>
... etc
</DOC>
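For every document enclosed in <DOC> and </DOC>, I want the content between its <TEXT> tags, but instead I get empty files :( Can anyone help me, please? Thank you very much.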
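A likely cause, judging only from the sample above: an element's `.text` holds just the text that comes before its first child element. In `<TEXT>` that is the newline before `<P>`, so `.strip()` leaves an empty string, no exception is raised, and the `HEADLINE` fallback never runs. Below is a minimal sketch of one way around this, using `itertext()` to collect the text of all descendants. The helper name `element_text` and the shortened `SAMPLE` string are made up for illustration, and the sketch uses the standard library's `xml.etree.ElementTree` to stay self-contained; lxml elements provide `itertext()` as well, so the same helper should work with the `recover=True` parser from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document, wrapped in a root tag as in the question.
SAMPLE = """<ROOT>
<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<HEADLINE>
<P>
NEW FALLOUT FROM CHERNOBYL;
</P>
</HEADLINE>
<TEXT>
<P>
The onset of the new Gorbachev policy of glasnost,...
</P>
</TEXT>
</DOC>
</ROOT>"""


def element_text(doc, tag):
    """Hypothetical helper: all text nested under the first <tag> child, or None."""
    node = doc.find(tag)
    if node is None:
        return None
    # itertext() yields the text of the element and of every descendant,
    # so the content of the <P> children is included. split()/join()
    # collapses the surrounding newlines into single spaces.
    return " ".join("".join(node.itertext()).split())


root = ET.fromstring(SAMPLE)
doc = root.find("DOC")

# What the question's code reads: only the newline before <P>.
direct = doc.find("TEXT").text.strip()
# What the helper reads: the text inside the nested <P> elements.
nested = element_text(doc, "TEXT")

print(repr(direct))  # ''
print(repr(nested))
```

In the loop from the question, something like `text = element_text(doc, 'TEXT') or element_text(doc, 'HEADLINE') or element_text(doc, 'GRAPHIC') or " "` could then replace the nested try/except blocks, since the helper returns `None` for a missing tag and an empty string for a tag with no text.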
Thanks for your reply, I will try that.