Getting attribute-value pairs from content using BeautifulSoup or XPath

For the following XHTML fragment, I need to extract attribute-value pairs from structured HTML using BS4 or XPath. Each attribute name is in an h5 tag, and its value follows in a span tag or a p tag.

For the markup below, I should get the following dictionary as output:

Husbandary Management: 'Animal: Cow Farmer: Mr smith'

Milch Category: 'Milk supply'

Services: 'cow milk, ghee'

animal colors: 'green,red'

<div id="animalcontainer" class="container last fixed-height"> 

       <h5> 
        Husbandary Management 
       </h5> 
       <span> 
        Animal: Cow 
       </span> 
       <span> 
        Farmer: Mr smith 
       </span> 
       <h5> 
        Milch Category 
       </h5> 
       <p> 
        Milk supply 
       </p> 
       <h5> 
        Services 
       </h5> 
       <p> 
        cow milk, ghee 
       </p> 
       <h5> 
        animal colors 
       </h5> 
       <span> 
        green,red 
       </span> 


       </div>

htmlcode.findAll('h5') finds the h5 elements, but I want to get each h5 element together with the elements that follow it, up to the next 'h5'.

Source

2014-04-22 stackit

A solution using lxml.html and XPath follows:

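Select all h5 elements,
and for each h5 element:
1. select the following sibling elements - following-sibling::*
2. that are not h5 themselves - [not(self::h5)]
3. and that have as many preceding h5 siblings as the position of the current h5 - [count(preceding-sibling::h5) = 1], then 2, then 3, ...

(using a for loop with enumerate() starting at 1)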

Example code, which simply prints the text content of the elements (using .text_content() on lxml.html elements):

import lxml.html 
html = """<div id="animalcontainer" class="container last fixed-height"> 

       <h5> 
        Husbandary Management 
       </h5> 
       <span> 
        Animal: Cow 
       </span> 
       <span> 
        Farmer: Mr smith 
       </span> 
       <h5> 
        Milch Category 
       </h5> 
       <p> 
        Milk supply 
       </p> 
       <h5> 
        Services 
       </h5> 
       <p> 
        cow milk, ghee 
       </p> 
       <h5> 
        animal colors 
       </h5> 
       <span> 
        green,red 
       </span> 


       </div>""" 
doc = lxml.html.fromstring(html) 
headers = doc.xpath('//div/h5') 
for i, header in enumerate(headers, start=1):
    print("--------------------------------")
    print(header.text_content().strip())
    # siblings after this h5 that are not h5 themselves and have
    # exactly i preceding h5 siblings, i.e. that belong to this header
    for following in header.xpath("""following-sibling::*
            [not(self::h5)]
            [count(preceding-sibling::h5) = %d]""" % i):
        print("\t", following.text_content().strip())

This outputs:

-------------------------------- 
Husbandary Management 
    Animal: Cow 
    Farmer: Mr smith 
-------------------------------- 
Milch Category 
    Milk supply 
-------------------------------- 
Services 
    cow milk, ghee 
-------------------------------- 
animal colors 
    green,red
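
The question asks for a dictionary rather than printed text. As a rough sketch (not part of the original answer), the same XPath expression can feed a dict. The markup is a shortened copy of the question's snippet, the name result is only illustrative, and joining multiple values with a space is an assumption based on the output the question expects.

```python
import lxml.html

# Shortened copy of the question's markup so the sketch runs on its own.
html = """<div id="animalcontainer" class="container last fixed-height">
  <h5> Husbandary Management </h5>
  <span> Animal: Cow </span>
  <span> Farmer: Mr smith </span>
  <h5> Milch Category </h5>
  <p> Milk supply </p>
  <h5> Services </h5>
  <p> cow milk, ghee </p>
  <h5> animal colors </h5>
  <span> green,red </span>
</div>"""

doc = lxml.html.fromstring(html)
result = {}  # hypothetical name, not from the original answer
for i, header in enumerate(doc.xpath('//div/h5'), start=1):
    values = header.xpath(
        "following-sibling::*[not(self::h5)]"
        "[count(preceding-sibling::h5) = %d]" % i)
    # Join the text of every sibling that belongs to this header
    # (space-joined here; that choice is an assumption).
    result[header.text_content().strip()] = " ".join(
        v.text_content().strip() for v in values)

print(result)
```

If each value should stay separate, collecting a list per header instead of joining would work the same way.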

Source

2014-04-22 10:34:58

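I added '[not(self::h5)]' so that the following 'h5' elements are not included when selecting siblings –

But the solution seems hard to understand; could it be done more clearly with Beautiful Soup? – stackit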

Please explain: 'following-sibling::*[not(self::h5)][count(preceding-sibling::h5) = %d]' % i – stackit
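
One way to see what the two predicates do is to print, for every child of the div, how many h5 siblings come before it. This is only an illustrative sketch on made-up markup (the A/B headers and the names counts, without_filter and with_filter are not from the thread):

```python
import lxml.html

# Tiny made-up markup: two headers, each followed by its values.
html = """<div>
  <h5>A</h5><span>a1</span><span>a2</span>
  <h5>B</h5><p>b1</p>
</div>"""

doc = lxml.html.fromstring(html)
div = doc.xpath("//div")[0]

# For every child of the div, count the h5 siblings that precede it.
# XPath count() returns a float in lxml, hence the int().
counts = [(el.tag, el.text, int(el.xpath("count(preceding-sibling::h5)")))
          for el in div]
print(counts)

first_h5 = div.xpath("h5")[0]
# Without [not(self::h5)] the second h5 is selected too,
# because it also has exactly one preceding h5 sibling.
without_filter = [e.tag for e in first_h5.xpath(
    "following-sibling::*[count(preceding-sibling::h5) = 1]")]
with_filter = [e.tag for e in first_h5.xpath(
    "following-sibling::*[not(self::h5)][count(preceding-sibling::h5) = 1]")]
print(without_filter)
print(with_filter)
```

Elements after the first header have one preceding h5, elements after the second have two, and so on, which is why the loop index from enumerate() can be substituted into the count() test.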

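I finally did it with BS as well. It seems this could be done more efficiently, since the following solution regenerates the siblings each time: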

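# addinfo is the already parsed BeautifulSoup object (or the container tag)
h5s = addinfo.find_all('h5')
datad = {}
for h5el in h5s:
    txtcontents = []
    # walk the siblings after this h5 and stop at the next h5
    for con in h5el.next_siblings:
        name = getattr(con, 'name', None)  # text nodes have no tag name
        if name == 'h5':
            break
        if name is None:
            continue  # skip the whitespace text nodes between tags
        txtcontents.append(con.get_text(strip=True))
    datad[h5el.get_text(strip=True)] = txtcontents
print(datad)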

Source

2014-04-22 15:32:47 stackit
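
On the efficiency remark above: a possible alternative, sketched here under the assumption that the headers and their values are all direct children of the container div, is to walk the children once and group each value under the most recent h5. The names container, data and current are illustrative, and the markup is a shortened copy of the question's snippet.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup so the sketch runs on its own.
html = """<div id="animalcontainer" class="container last fixed-height">
  <h5> Husbandary Management </h5>
  <span> Animal: Cow </span>
  <span> Farmer: Mr smith </span>
  <h5> Milch Category </h5>
  <p> Milk supply </p>
  <h5> Services </h5>
  <p> cow milk, ghee </p>
  <h5> animal colors </h5>
  <span> green,red </span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="animalcontainer")

data = {}
current = None  # text of the most recent h5 seen so far
# find_all(True, recursive=False) yields only the direct child tags,
# in document order, so every element is visited exactly once.
for child in container.find_all(True, recursive=False):
    text = child.get_text(strip=True)
    if child.name == "h5":
        current = text
        data[current] = []
    elif current is not None:
        data[current].append(text)

print(data)
```

This keeps each value as a separate list item; joining the list with a space would give the single string per key that the question shows.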
