2013-11-04

Parse XML with Python and save as txt

I have a folder of .xml files that look like this:

<PubmedArticleSet> 
    <PubmedArticle> 
    <MedlineCitation Owner="NLM" Status="MEDLINE"> 
     <PMID Version="1">23458631</PMID> 
     <DateCreated> 
     <Year>2013</Year> 
     <Month>04</Month> 
     <Day>08</Day> 
     </DateCreated> 
     <MeshHeadingList> 
     <MeshHeading> 
      <DescriptorName MajorTopicYN="N">Animals</DescriptorName> 
     </MeshHeading> 
     <MeshHeading> 
      <DescriptorName MajorTopicYN="N">Calcium</DescriptorName> 
      <QualifierName MajorTopicYN="Y">metabolism</QualifierName> 
     </MeshHeading> 
     <MeshHeading> 
      <DescriptorName MajorTopicYN="N">Calcium Chloride</DescriptorName> 
      <QualifierName MajorTopicYN="N">administration &amp; dosage</QualifierName> 
     </MeshHeading> 
     </MeshHeadingList> 
    </MedlineCitation> 
    </PubmedArticle> 
    <PubmedArticle> 
    <MedlineCitation Status="Publisher" Owner="NLM"> 
     <PMID Version="1">23458629</PMID> 
     <DateCreated> 
     <Year>2013</Year> 
     <Month>3</Month> 
     <Day>20</Day> 
     </DateCreated> 
     <MeshHeadingList> 
     <MeshHeading> 
      <DescriptorName MajorTopicYN="N">Adolescent</DescriptorName> 
     </MeshHeading> 
     <MeshHeading> 
      <DescriptorName MajorTopicYN="N">Adult</DescriptorName> 
     </MeshHeading> 
     <MeshHeading> 
      <DescriptorName MajorTopicYN="N">Anthropometry</DescriptorName> 
     </MeshHeading> 
     </MeshHeadingList> 
    </MedlineCitation> 
    </PubmedArticle> 
</PubmedArticleSet> 

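I want to use Python to parse the XML files and extract, for each article, the PMID, the DateCreated, and every DescriptorName together with its MajorTopicYN attribute. The result should then be saved as a txt file that looks like this: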

ArticleID|CreatedDate|MeSH|IsMajor 
23458631|20130408|Animals|N 
23458631|20130408|Calcium|N 
23458631|20130408|Calcium Chloride|N 
23458629|20130320|Adolescent|N 
23458629|20130320|Adult|N 
23458629|20130320|Anthropometry|N 

Take a look at http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python

Answers


Here you go.

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
# Open the output file once: write the header, then one line per MeSH heading.
with open('my_text_file.txt', 'w') as f:
    f.write('ArticleID|CreatedDate|MeSH|IsMajor\n')
    for pubmed_article in root.findall('PubmedArticle'):
        citation = pubmed_article.find('MedlineCitation')
        ArticleID = citation.find('PMID').text
        date = citation.find('DateCreated')
        # The parts are concatenated as-is, so a Month of "3" is not zero-padded.
        CreatedDate = date.find('Year').text + date.find('Month').text + date.find('Day').text
        for mesh_heading in citation.find('MeshHeadingList').findall('MeshHeading'):
            descriptor = mesh_heading.find('DescriptorName')
            MeSH = descriptor.text
            IsMajor = descriptor.get('MajorTopicYN')
            f.write(ArticleID + '|' + CreatedDate + '|' + MeSH + '|' + IsMajor + '\n')

Here is the output file:

ArticleID|CreatedDate|MeSH|IsMajor 
23458631|20130408|Animals|N 
23458631|20130408|Calcium|N 
23458631|20130408|Calcium Chloride|N 
23458629|20130320|Adolescent|N 
23458629|20130320|Adult|N 
23458629|20130320|Anthropometry|N 

Did you change the input file? As far as I can see, this code would write some dates as 2013320 rather than 20130320. – ChrisProsser
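
The padding issue raised in this comment can be reproduced with the Month and Day values of the second article in the question. This is only a minimal sketch: `created_date` is a hypothetical helper name used for illustration, not something taken from either answer.

```python
def created_date(year, month, day, pad=True):
    """Join date parts into one string; pad=False mimics plain concatenation."""
    if pad:
        # str.zfill left-pads with zeros up to the given width: "3" -> "03".
        month, day = month.zfill(2), day.zfill(2)
    return year + month + day


unpadded = created_date('2013', '3', '20', pad=False)
padded = created_date('2013', '3', '20')
print(unpadded)  # 2013320
print(padded)    # 20130320
```

Values that are already two digits wide, such as the "04" and "08" of the first article, pass through `zfill(2)` unchanged.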


Here is my version:

import xml.etree.ElementTree as ET

xml_path = r'Y:\Misc\stack_overflow\Python\xml_extract\data.xml'
output_file_path = 'output.txt'

tree = ET.parse(xml_path)
root = tree.getroot()

# Text mode ('w') so that str can be written on Python 3; 'wb' only accepts str on Python 2.
with open(output_file_path, 'w') as f:
    f.write('ArticleID|CreatedDate|MeSH|IsMajor\n')
    for pa in root.iter('PubmedArticle'):
        ArticleID = pa.find('MedlineCitation/PMID').text
        # zfill(2) pads single-digit months and days, so "3" becomes "03".
        CreatedDate = (pa.find('MedlineCitation/DateCreated/Year').text
                       + pa.find('MedlineCitation/DateCreated/Month').text.zfill(2)
                       + pa.find('MedlineCitation/DateCreated/Day').text.zfill(2))
        for mh in pa.iter('MeshHeading'):
            DescriptorName = mh.find('DescriptorName').text
            MajorTopicYN = mh.find('DescriptorName').attrib['MajorTopicYN']
            f.write(ArticleID + '|' + CreatedDate + '|' + DescriptorName + '|' + MajorTopicYN + '\n')

The output in the file is:

ArticleID|CreatedDate|MeSH|IsMajor 
23458631|20130408|Animals|N 
23458631|20130408|Calcium|N 
23458631|20130408|Calcium Chloride|N 
23458629|20130320|Adolescent|N 
23458629|20130320|Adult|N 
23458629|20130320|Anthropometry|N
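
The question mentions a folder of .xml files, while both answers read a single file. One possible way to cover the whole folder is sketched below for Python 3. The names `extract_rows` and `folder_to_txt` are hypothetical, and the sketch assumes every article contains PMID, DateCreated and DescriptorName elements laid out as in the sample; a missing element would raise an error here.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET


def extract_rows(root):
    """Yield (PMID, YYYYMMDD, descriptor, is_major) for each MeshHeading."""
    for article in root.iter('PubmedArticle'):
        pmid = article.findtext('MedlineCitation/PMID')
        date = article.find('MedlineCitation/DateCreated')
        # Pad month and day to two digits so "3" is written as "03".
        created = (date.findtext('Year')
                   + date.findtext('Month').zfill(2)
                   + date.findtext('Day').zfill(2))
        for heading in article.iter('MeshHeading'):
            descriptor = heading.find('DescriptorName')
            yield pmid, created, descriptor.text, descriptor.get('MajorTopicYN')


def folder_to_txt(folder, output_path):
    """Parse every .xml file in folder into one pipe-delimited txt file."""
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter='|', lineterminator='\n')
        writer.writerow(['ArticleID', 'CreatedDate', 'MeSH', 'IsMajor'])
        # Sort the paths so the output order does not depend on the file system.
        for path in sorted(glob.glob(os.path.join(folder, '*.xml'))):
            writer.writerows(extract_rows(ET.parse(path).getroot()))
```

A call such as `folder_to_txt('xml_folder', 'output.txt')` would then write all articles to a single file; both arguments are placeholders. The `csv` module is used instead of string concatenation so that a descriptor containing the `|` character gets quoted rather than breaking the column layout.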