2017-08-08 123 views

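Extracting the contents of XML tags with Beautiful Soup

I have several XML files like the one below. I want to use Beautiful Soup in Python to extract their contents into a data frame that matches the expected output shown below. Please help.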

Sample XML:

<Author AffiliationIDS="Aff1 Aff2" CorrespondingAffiliationID="Aff1" ORCID="http://orcid.org/0000-0003-4649-327X"> 
    <AuthorName DisplayOrder="Western"> 
     <GivenName>Anouk</GivenName> 
     <GivenName>van der</GivenName> 
     <FamilyName>Hoorn</FamilyName> 
    </AuthorName> 
    <Contact> 
     <Phone>+31-50-3612400</Phone> 
     <Fax>+31-50-3611707</Fax> 
     <Email>[email protected]</Email> 
    </Contact> 
</Author> 
<Author AffiliationIDS="Aff1"> 
<AuthorName DisplayOrder="Western"> 
    <GivenName>Kamal</GivenName> 
    <GivenName>M.</GivenName> 
    <FamilyName>Aden</FamilyName> 
</AuthorName> 
</Author> 
<Author AffiliationIDS="Aff1 Aff2"> 
<AuthorName DisplayOrder="Western"> 
    <GivenName>Peter</GivenName> 
    <GivenName>Jan</GivenName> 
    <FamilyName>van Laar</FamilyName> 
</AuthorName> 
</Author> 

Expected output:

Anouk van der Hoorn   AuthorName 
Kamal M. Aden    AuthorName 
Peter Jan van Laar   AuthorName 

Please help us by posting the code you have tried and describing the problem you ran into. – mhawke

Answer


Here is the code; it only takes a few lines:

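from bs4 import BeautifulSoup

# Read the whole XML file into a string
with open("sample.xml", "r") as f:
    content = f.read()

# "lxml" is an HTML parser and lowercases tag names, hence "author" and "authorname"
soup = BeautifulSoup(content, "lxml")

# Join the text pieces inside each <AuthorName>, stripping surrounding whitespace
authornames = [" ".join(author.find("authorname").stripped_strings) for author in soup.find_all("author")]
print(authornames)

Output:

['Anouk van der Hoorn', 'Kamal M. Aden', 'Peter Jan van Laar']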
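
The question asks for a data frame rather than a list, so here is a minimal sketch of one way to get there. It is not part of the original answer: the function names `extract_author_rows` and `rows_to_dataframe`, the column names `name` and `tag`, and the shortened `SAMPLE_XML` string are all made up for illustration. It assumes pandas is installed for the last step, and it uses Python's built-in "html.parser" backend, which (like "lxml") lowercases tag names, so the "AuthorName" label in the second column is written out by hand.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample from the question (assumption: two authors are enough to illustrate)
SAMPLE_XML = """
<Author AffiliationIDS="Aff1 Aff2">
    <AuthorName DisplayOrder="Western">
        <GivenName>Anouk</GivenName>
        <GivenName>van der</GivenName>
        <FamilyName>Hoorn</FamilyName>
    </AuthorName>
</Author>
<Author AffiliationIDS="Aff1">
    <AuthorName DisplayOrder="Western">
        <GivenName>Kamal</GivenName>
        <GivenName>M.</GivenName>
        <FamilyName>Aden</FamilyName>
    </AuthorName>
</Author>
"""


def extract_author_rows(xml_text):
    """Return one dict per <Author> that contains an <AuthorName> (hypothetical helper)."""
    # html.parser lowercases tag names, so we search for "author" / "authorname"
    soup = BeautifulSoup(xml_text, "html.parser")
    rows = []
    for author in soup.find_all("author"):
        name_tag = author.find("authorname")
        if name_tag is None:
            continue  # skip authors without a name block
        # stripped_strings yields each text piece with surrounding whitespace removed
        full_name = " ".join(name_tag.stripped_strings)
        rows.append({"name": full_name, "tag": "AuthorName"})
    return rows


def rows_to_dataframe(rows):
    """Build a pandas DataFrame from the rows (hypothetical helper; assumes pandas is installed)."""
    import pandas as pd  # imported here so the extraction step works without pandas
    return pd.DataFrame(rows, columns=["name", "tag"])


rows = extract_author_rows(SAMPLE_XML)
print(rows)
```

Calling `rows_to_dataframe(rows)` should then give a two-column frame with one row per author. Keeping the extraction separate from the pandas step means the parsing can be checked on its own before any data frame is built.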