Recursively searching XML by entry

I have an XML feed from Google with about 300 entries. It looks like this:

<?xml version="1.0"?> 
<ns0:feed ns1:etag="W/"LIESANDCRAPfyt7I2A9WhHERE."" xmlns:ns4="http://www.w3.org/2007/app" xmlns:ns3="http://schemas.google.com/contact/2008" xmlns:ns2="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:ns1="http://schemas.google.com/g/2005" xmlns:ns0="http://www.w3.org/2005/Atom"> 
    <ns0:updated>2012-01-25T14:52:12.867Z</ns0:updated> 
    <ns0:category term="http://schemas.google.com/contact/2008#profile" scheme="http://schemas.google.com/g/2005#kind"/> 
    <ns0:id>domain.com</ns0:id> 
    <ns0:generator version="1.0" uri="http://www.google.com/m8/feeds">Contacts</ns0:generator> 
    <ns0:author> 
     <ns0:name>domain.com</ns0:name> 
    </ns0:author> 
    <ns0:link type="text/html" rel="alternate" href="http://www.google.com/"/> 
    <ns0:link type="application/atom+xml" rel="http://schemas.google.com/g/2005#feed" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full"/> 
    <ns0:link type="application/atom+xml" rel="http://schemas.google.com/g/2005#batch" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/batch"/> 
    <ns0:link type="application/atom+xml" rel="self" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full?max-results=300"/> 
    <ns2:startIndex>1</ns2:startIndex> 
    <ns2:itemsPerPage>300</ns2:itemsPerPage> 
    <ns0:entry ns1:etag=""CRAPQR4KTit7I2A4""> 
     <ns0:category term="http://schemas.google.com/contact/2008#profile" scheme="http://schemas.google.com/g/2005#kind"/> 
     <ns0:id>http://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson</ns0:id> 
     <ns1:name> 
      <ns1:familyName>Person</ns1:familyName> 
      <ns1:fullName>Name Person</ns1:fullName> 
      <ns1:givenName>Name</ns1:givenName> 
     </ns1:name> 
     <ns0:updated>2012-01-25T14:52:13.081Z</ns0:updated> 
     <ns1:organization rel="http://schemas.google.com/g/2005#work" primary="true"> 
      <ns1:orgTitle>JobField</ns1:orgTitle> 
      <ns1:orgDepartment>DepartmentField</ns1:orgDepartment> 
      <ns1:orgName>CompanyField</ns1:orgName> 
     </ns1:organization> 
     <ns3:status indexed="true"/> 
     <ns0:title>Name Person</ns0:title> 
     <ns0:link type="image/*" rel="http://schemas.google.com/contacts/2008/rel#photo" href="https://www.google.com/m8/feeds/photos/profile/domain.com/nperson"/> 
     <ns0:link type="application/atom+xml" rel="self" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson"/> 
     <ns0:link type="application/atom+xml" rel="edit" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson"/> 
     <ns1:email rel="http://schemas.google.com/g/2005#other" address="[email protected]"/> 
     <ns1:email rel="http://schemas.google.com/g/2005#other" primary="true" address="[email protected]"/> 
     <ns4:edited>2012-01-25T14:52:13.081Z</ns4:edited> 
    </ns0:entry> 
    <ns0:title>domain.com's Profiles</ns0:title> 
</ns0:feed>

I am able to pull data from the name, organization, and email fields with BeautifulStoneSoup using this code:

profiles_feed = gd_client.GetProfilesFeed('https://www.google.com/m8/feeds/profiles/domain/domain.com/full?max-results=300') 

soup = BeautifulSoup(str(profiles_feed)) 


for tag in soup.findAll('ns1:name'): 
    print tag.find('ns1:familyname').text 
    print tag.find('ns1:fullname').text 
    print tag.find('ns1:givenname').text 

for tag in soup.findAll('ns1:organization'): 
    print tag.find('ns1:orgtitle').text 
    print tag.find('ns1:orgdepartment').text 
    print tag.find('ns1:orgname').text 

for tag in soup.findAll('ns1:email',address=True): 
    print tag['address']

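I would like to grab the data grouped together from each ns0:entry node, so that it outputs lines like: family name, given name, org title, org name, email

I have tried using: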

for tag in soup('ns0:entry'): 
    print tag.name.familyName.text

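but it treats `name` as an attribute.

I thought about using XPath, but I couldn't find any documentation covering BeautifulStoneSoup and XPath, so I don't know whether it supports XPath natively. So how do I search through each entry node and return all of the data specific to that entry, rather than grouped by tag?

From the documentation (http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing XML)

Source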

2012-01-26 Kevin

>>> from BeautifulSoup import BeautifulStoneSoup 
>>> xml = """<ns0:feed ns1:etag="W/"LIESANDCRAPfyt7I2A9WhHERE."" xmlns:ns4="http://www.w3.org/2007/app" xmlns:ns3="http://schemas.google.com/contact/2008" xmlns:ns2="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:ns1="http://schemas.google.com/g/2005" xmlns:ns0="http://www.w3.org/2005/Atom"> 
...    <ns0:updated>2012-01-25T14:52:12.867Z</ns0:updated> 
...    <ns0:category term="http://schemas.google.com/contact/2008#profile" scheme="http://schemas.google.com/g/2005#kind"/> 
...    <ns0:id>domain.com</ns0:id> 
...    <ns0:generator version="1.0" uri="http://www.google.com/m8/feeds">Contacts</ns0:generator> 
...    <ns0:author> 
...     <ns0:name>domain.com</ns0:name> 
...    </ns0:author> 
...    <ns0:link type="text/html" rel="alternate" href="http://www.google.com/"/> 
...    <ns0:link type="application/atom+xml" rel="http://schemas.google.com/g/2005#feed" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full"/> 
...    <ns0:link type="application/atom+xml" rel="http://schemas.google.com/g/2005#batch" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/batch"/> 
...    <ns0:link type="application/atom+xml" rel="self" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full?max-results=300"/> 
...    <ns2:startIndex>1</ns2:startIndex> 
...    <ns2:itemsPerPage>300</ns2:itemsPerPage> 
...    <ns0:entry ns1:etag=""CRAPQR4KTit7I2A4""> 
...     <ns0:category term="http://schemas.google.com/contact/2008#profile" scheme="http://schemas.google.com/g/2005#kind"/> 
...     <ns0:id>http://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson</ns0:id> 
...     <ns1:name> 
...      <ns1:familyName>Person</ns1:familyName> 
...      <ns1:fullName>Name Person</ns1:fullName> 
...      <ns1:givenName>Name</ns1:givenName> 
...     </ns1:name> 
...     <ns0:updated>2012-01-25T14:52:13.081Z</ns0:updated> 
...     <ns1:organization rel="http://schemas.google.com/g/2005#work" primary="true"> 
...      <ns1:orgTitle>JobField</ns1:orgTitle> 
...      <ns1:orgDepartment>DepartmentField</ns1:orgDepartment> 
...      <ns1:orgName>CompanyField</ns1:orgName> 
...     </ns1:organization> 
...     <ns3:status indexed="true"/> 
...     <ns0:title>Name Person</ns0:title> 
...     <ns0:link type="image/*" rel="http://schemas.google.com/contacts/2008/rel#photo" href="https://www.google.com/m8/feeds/photos/profile/domain.com/nperson"/> 
...     <ns0:link type="application/atom+xml" rel="self" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson"/> 
...     <ns0:link type="application/atom+xml" rel="edit" href="https://www.google.com/m8/feeds/profiles/domain/domain.com/full/nperson"/> 
...     <ns1:email rel="http://schemas.google.com/g/2005#other" address="[email protected]"/> 
...     <ns1:email rel="http://schemas.google.com/g/2005#other" primary="true" address="[email protected]"/> 
...     <ns4:edited>2012-01-25T14:52:13.081Z</ns4:edited> 
...    </ns0:entry> 
...    <ns0:title>domain.com's Profiles</ns0:title> 
...   </ns0:feed>"""

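Note:

The most common shortcoming of BeautifulStoneSoup is that it doesn't know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor: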

>>> soup = BeautifulStoneSoup(xml, selfClosingTags=['ns0:category','ns3:status', 'ns0:link','ns1:email']) 
>>> a = soup.findAll('ns0:entry') 
>>> a[0].find('ns1:familyname') 
<ns1:familyname>Person</ns1:familyname> 
>>> a[0].find('ns1:familyname').text 
u'Person' 
>>> a[0].find('ns1:givenname') 
<ns1:givenname>Name</ns1:givenname> 
>>> a[0].find('ns1:givenname').text 
u'Name' 
>>> for entry in a: 
...  print ', '.join([entry.find('ns1:familyname').text, entry.find('ns1:givenname').text, entry.find('ns1:orgtitle').text, entry.find('ns1:orgname').text, entry.find('ns1:email')['address']]) 
... 
Person, Name, JobField, CompanyField, [email protected]
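
The same per-entry grouping can also be sketched with the standard library's xml.etree.ElementTree (Python 3), as an alternative to BeautifulStoneSoup rather than the method used above. ElementTree supports a limited subset of XPath, is case-sensitive about tag names, and matches namespaces by URI. The trimmed feed, the `atom`/`gd` prefixes, the `entry_rows` helper, and the example.com address below are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the feed (etag attributes omitted;
# the email address is a made-up placeholder).
FEED = """<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom"
          xmlns:ns1="http://schemas.google.com/g/2005">
  <ns0:entry>
    <ns1:name>
      <ns1:familyName>Person</ns1:familyName>
      <ns1:fullName>Name Person</ns1:fullName>
      <ns1:givenName>Name</ns1:givenName>
    </ns1:name>
    <ns1:organization rel="http://schemas.google.com/g/2005#work" primary="true">
      <ns1:orgTitle>JobField</ns1:orgTitle>
      <ns1:orgDepartment>DepartmentField</ns1:orgDepartment>
      <ns1:orgName>CompanyField</ns1:orgName>
    </ns1:organization>
    <ns1:email rel="http://schemas.google.com/g/2005#other" primary="true" address="nperson@example.com"/>
  </ns0:entry>
</ns0:feed>"""

# The prefixes are local aliases; only the namespace URIs must match the feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "gd": "http://schemas.google.com/g/2005",
}


def entry_rows(xml_text):
    """Return one [family, given, title, org, email] list per atom:entry."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", NS):
        # Every lookup is relative to this entry, so fields stay grouped.
        email = entry.find("gd:email", NS)
        rows.append([
            entry.findtext("gd:name/gd:familyName", default="", namespaces=NS),
            entry.findtext("gd:name/gd:givenName", default="", namespaces=NS),
            entry.findtext("gd:organization/gd:orgTitle", default="", namespaces=NS),
            entry.findtext("gd:organization/gd:orgName", default="", namespaces=NS),
            email.get("address", "") if email is not None else "",
        ])
    return rows


for row in entry_rows(FEED):
    print(", ".join(row))
```

Because the parser reads self-closing tags directly from the XML syntax, no selfClosingTags list is needed in this variant.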

Hope this helps.

Source

2012-01-27 01:02:58 sgallen

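This works perfectly, thank you for helping me understand this. Just one question: the parser stops when it finds an empty attrib, and I couldn't find anything in the documentation. Do you know if there is a flag to ignore empty attribs and tags? – Kevin

`entry.find('ns1:email').get('address', 'No-address-found')` You can change `'No-address-found'` to whatever string you want. See: http://docs.python.org/library/stdtypes.html#dict.get – sgallen

Sorry, I should have been more specific: I meant the case where a text field is blank. The address attrib is unlikely to be blank in the XML I will be using, but text fields like orgname might be. – Kevin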
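
One way to handle blank text fields such as orgname is a small helper that falls back to a default when the element is missing or its text is empty. The sketch below uses the standard library's xml.etree.ElementTree (Python 3) rather than BeautifulStoneSoup; the `text_or` helper, the `"N/A"` default, and the entry fragment are hypothetical names and data chosen for illustration.

```python
import xml.etree.ElementTree as ET

GD = "http://schemas.google.com/g/2005"
NS = {"gd": GD}


def text_or(parent, path, default="N/A"):
    """Return the stripped text at `path`, or `default` if missing or blank."""
    # findtext returns "" for an element with no text, and the supplied
    # default ("" here) when no element matches the path.
    value = parent.findtext(path, default="", namespaces=NS)
    value = (value or "").strip()
    return value if value else default


# Hypothetical fragment: orgName is present but empty, orgTitle is absent.
entry = ET.fromstring(
    '<entry xmlns:gd="%s">'
    "<gd:organization>"
    "<gd:orgName></gd:orgName>"
    "<gd:orgDepartment>DepartmentField</gd:orgDepartment>"
    "</gd:organization>"
    "</entry>" % GD
)

print(text_or(entry, "gd:organization/gd:orgName"))
print(text_or(entry, "gd:organization/gd:orgTitle"))
print(text_or(entry, "gd:organization/gd:orgDepartment"))
```

Treating a missing element and an empty one the same way keeps the output columns aligned, so one blank field does not stop the loop.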
