2014-04-22

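Get content attribute-value pairs using BeautifulSoup or XPath

For the XHTML fragment below, I need to use BS4 or XPath to extract attribute-value pairs from the structured HTML. Each attribute name is in an h5 tag, and its value follows in one or more span tags or a p tag.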

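For the markup below, I should get the following output as a dictionary:

Husbandary Management: 'Animal: Cow Farmer: Mr smith'

Milch Category: 'Milk supply'

Services: 'cow milk, ghee'

animal colors: 'green,red'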

<div id="animalcontainer" class="container last fixed-height"> 

       <h5> 
        Husbandary Management 
       </h5> 
       <span> 
        Animal: Cow 
       </span> 
       <span> 
        Farmer: Mr smith 
       </span> 
       <h5> 
        Milch Category 
       </h5> 
       <p> 
        Milk supply 
       </p> 
       <h5> 
        Services 
       </h5> 
       <p> 
        cow milk, ghee 
       </p> 
       <h5> 
        animal colors 
       </h5> 
       <span> 
        green,red 
       </span> 


       </div> 

htmlcode.findAll('h5') finds the h5 elements, but I want each h5 element together with the elements that follow it, up to the next 'h5'.

Answers

2

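An example solution using lxml.html and XPath follows:

  1. Select all h5 elements
  2. And for each h5 element,
    1. select the following sibling elements - following-sibling::*
    2. that are not h5 themselves - [not(self::h5)]
    3. and that have as many preceding h5 siblings as the position of the current h5 - [count(preceding-sibling::h5) = 1], then 2, then 3, ...

(using a for loop with enumerate() starting from 1)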
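The steps above can be sketched one predicate at a time. This is only an illustration: the shortened HTML fragment and the helper name `texts` are assumptions made for the example, not part of the original answer.

```python
import lxml.html

# Hypothetical, shortened fragment with the same shape as the question's markup.
html = """<div>
<h5>Husbandary Management</h5><span>Animal: Cow</span><span>Farmer: Mr smith</span>
<h5>Milch Category</h5><p>Milk supply</p>
<h5>Services</h5><p>cow milk, ghee</p>
</div>"""

doc = lxml.html.fromstring(html)
second = doc.xpath('//div/h5')[1]  # "Milch Category", i.e. position 2


def texts(nodes):
    """Return the stripped text content of each element (illustrative helper)."""
    return [n.text_content().strip() for n in nodes]


# Step 2.1: every element sibling after the current h5
step1 = texts(second.xpath('following-sibling::*'))
# Step 2.2: the same siblings, minus the h5 headings themselves
step2 = texts(second.xpath('following-sibling::*[not(self::h5)]'))
# Step 2.3: keep only siblings preceded by exactly 2 h5 elements,
# which are the ones sitting between the 2nd h5 and the 3rd
step3 = texts(second.xpath(
    'following-sibling::*[not(self::h5)][count(preceding-sibling::h5) = 2]'))

print(step1)
print(step2)
print(step3)
```

Each predicate narrows the previous selection, so only the last step isolates the values that belong to the current h5.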

Sample code that simply prints the text content of the elements (using .text_content() on lxml.html elements):

import lxml.html 
html = """<div id="animalcontainer" class="container last fixed-height"> 

       <h5> 
        Husbandary Management 
       </h5> 
       <span> 
        Animal: Cow 
       </span> 
       <span> 
        Farmer: Mr smith 
       </span> 
       <h5> 
        Milch Category 
       </h5> 
       <p> 
        Milk supply 
       </p> 
       <h5> 
        Services 
       </h5> 
       <p> 
        cow milk, ghee 
       </p> 
       <h5> 
        animal colors 
       </h5> 
       <span> 
        green,red 
       </span> 


       </div>""" 
doc = lxml.html.fromstring(html) 
headers = doc.xpath('//div/h5')  # step 1: all h5 elements
# i is the 1-based position of each h5 among its h5 siblings
for i, header in enumerate(headers, start=1): 
    print("--------------------------------")
    print(header.text_content().strip())
    # step 2: non-h5 siblings preceded by exactly i h5 elements
    for following in header.xpath("""following-sibling::* 
            [not(self::h5)] 
            [count(preceding-sibling::h5) = %d]""" % i): 
        print("\t" + following.text_content().strip())

This outputs:

-------------------------------- 
Husbandary Management 
    Animal: Cow 
    Farmer: Mr smith 
-------------------------------- 
Milch Category 
    Milk supply 
-------------------------------- 
Services 
    cow milk, ghee 
-------------------------------- 
animal colors 
    green,red 
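The question asked for a dictionary rather than printed lines. A minimal sketch of how the same XPath could collect one, assuming the values under each h5 are joined with a single space; the function name `pairs_from_h5`, its `container_xpath` parameter, and the shortened HTML are hypothetical and not part of the original answer:

```python
import lxml.html

# Hypothetical, shortened copy of the question's markup.
html = """<div id="animalcontainer">
<h5> Husbandary Management </h5>
<span> Animal: Cow </span>
<span> Farmer: Mr smith </span>
<h5> Milch Category </h5>
<p> Milk supply </p>
<h5> Services </h5>
<p> cow milk, ghee </p>
<h5> animal colors </h5>
<span> green,red </span>
</div>"""


def pairs_from_h5(source, container_xpath='//div'):
    """Map each h5 text to the joined text of the elements that follow it."""
    doc = lxml.html.fromstring(source)
    result = {}
    headers = doc.xpath(container_xpath + '/h5')
    for i, header in enumerate(headers, start=1):
        # Same expression as in the answer: non-h5 siblings that are
        # preceded by exactly i h5 elements.
        values = header.xpath(
            'following-sibling::*[not(self::h5)]'
            '[count(preceding-sibling::h5) = %d]' % i)
        key = header.text_content().strip()
        result[key] = ' '.join(v.text_content().strip() for v in values)
    return result


pairs = pairs_from_h5(html)
print(pairs)
```

If two h5 elements carried the same text, the later one would overwrite the earlier key, so this sketch assumes the headings are unique.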
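+0

I follow that '[not(self::h5)]' is there so that the following 'h5' elements are not included when selecting siblings, but the solution still seems hard to understand. Can it be done more clearly with Beautiful Soup? – stackit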

+0

Please explain: 'following-sibling::*[not(self::h5)][count(preceding-sibling::h5) = %d]' % i – stackit

0

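I finally did it with BS as well. It looks like this could be done more efficiently, since the solution below regenerates the siblings for every h5: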

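from bs4 import Tag

# addinfo is the already-parsed container (the div with id="animalcontainer")
h5s = addinfo.find_all('h5')
datad = {}
for h5el in h5s:
    # every sibling after this h5 (regenerated for each h5)
    hcontents = list(h5el.next_siblings)
    txtcontents = []
    for con in hcontents:
        if not isinstance(con, Tag):
            continue  # whitespace text nodes are not tags; skip them
        if con.name == 'h5':
            break  # reached the next heading
        txtcontents.append(con.get_text(strip=True))
    datad[h5el.get_text(strip=True)] = txtcontents
print(datad)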
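One possible way to avoid walking the siblings again for every h5 is a single pass over the container's children, remembering the most recent heading. This is a sketch under assumptions: it uses bs4 with the built-in "html.parser", a shortened copy of the question's markup, and illustrative variable names (`container`, `current`, `result`) that do not come from the original post.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened copy of the question's markup.
html = """<div id="animalcontainer" class="container last fixed-height">
<h5> Husbandary Management </h5>
<span> Animal: Cow </span>
<span> Farmer: Mr smith </span>
<h5> Milch Category </h5>
<p> Milk supply </p>
<h5> Services </h5>
<p> cow milk, ghee </p>
<h5> animal colors </h5>
<span> green,red </span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="animalcontainer")

datad = {}
current = None  # text of the most recent h5 seen so far
for child in container.children:
    if not isinstance(child, Tag):
        continue  # skip the whitespace text nodes between tags
    text = child.get_text(strip=True)
    if child.name == "h5":
        current = text        # a new heading starts a new entry
        datad[current] = []
    elif current is not None:
        datad[current].append(text)  # value belongs to the last heading

# Join multiple values under one heading with a space.
result = {key: " ".join(values) for key, values in datad.items()}
print(result)
```

Each child is visited once, so the work grows linearly with the number of elements in the container.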