How to extract text from HTML tags using BeautifulSoup in Python

I am trying to parse a simplified HTML page like the following:

<div class="anotherclass part">
    <a href="http://example.com" > 
    <div class="column abc"><strike>&#163;3.99</strike><br>&#163;3.59</div> 
    <div class="column def"></div> 
    <div class="column ghi">1 Feb 2013</div> 
    <div class="column jkl"> 
     <h4>A title</h4> 
     <p> 
     <img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p> 
    </div> 
    </a> 
</div>

I am a beginner at coding in Python, and I have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

I have this code:

import re

from BeautifulSoup import BeautifulSoup

with open("file.html") as fp: 
    html = fp.read() 

soup = BeautifulSoup(html) 

parts = soup.findAll('a', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts: 
    mypart={} 

    # ghi 
    mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')}).string 
    # def 
    mypart['def'] = part.find(attrs={"class": re.compile('def')}).string 
    # h4 
    mypart['title'] = part.find('h4').string 

    # jkl 
    mypart['other'] = part.find('p').string 

    # abc 
    pattern = re.compile(r'\&\#163\;(\d{1,}\.?\d{2}?)') 
    theprices = re.findall(pattern, str(part)) 
    if len(theprices) == 2:
        mypart['price'] = theprices[1]
        mypart['rrp'] = theprices[0]
    elif len(theprices) == 1:
        mypart['price'] = theprices[0]
        mypart['rrp'] = theprices[0]
    else:
        mypart['price'] = None
        mypart['rrp'] = None

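I want to extract any text in the classes def and ghi, which I think my script does correctly.

I also want to extract the two prices from abc, which my script currently does in a rather clumsy way. Sometimes there are two prices, sometimes one, and sometimes none.

Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl, which my script fails to do. I thought getting the string part of the p tag would work, but I don't understand why it doesn't. The date in this part always matches the date in the ghi class, so it should be easy to replace or remove it.

Any suggestions? Thanks!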

Source

2013-02-02 user1464409

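First, if you add convertEntities=bs.BeautifulSoup.HTML_ENTITIES to the constructor call:

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

then HTML entities such as &#163; will be converted to their corresponding Unicode characters, such as £. This allows you to use a simpler regular expression to identify the prices.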
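As a rough illustration of why decoding the entities first simplifies the pattern, here is a minimal sketch that uses the standard library's html.unescape as a stand-in for BeautifulSoup's entity conversion. The fragment and the names raw, decoded and POUND are assumptions made for this example only; it is not the BeautifulSoup code path itself.

```python
import html
import re

# Fragment taken from the question's "abc" div, entities still encoded.
raw = '<strike>&#163;3.99</strike><br>&#163;3.59'

# html.unescape (stdlib) stands in for BeautifulSoup's entity conversion here.
decoded = html.unescape(raw)

# Once decoded, the pattern can name the pound sign directly
# instead of escaping the "&#163;" entity character by character.
POUND = '\u00a3'
prices = re.findall(POUND + r'(\d+\.\d{2})', decoded)

print(prices)  # ['3.99', '3.59']
```

The captured values are still strings at this point; converting them to float is a separate step.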

Now, given part, you can find the text content of the <div> holding the prices by using its contents attribute:

In [37]: part.find(attrs={"class": re.compile('abc')}).contents 
Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']

All we need to do is extract the number from each item, or skip the item if it contains no number:

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

price = []
for item in part.find(attrs={"class": re.compile('abc')}).contents:
    item = parse_price(item.string)
    if item:
        price.append(item)

At this point price will be a list of 0, 1 or 2 floats. We would like to say

mypart['rrp'], mypart['price'] = price

but that would not work if price is [] or contains only one item.
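To make the failure concrete, here is a small sketch of what happens when a list of the wrong length is unpacked into two names. The helper try_unpack is hypothetical and exists only for this illustration.

```python
def try_unpack(price):
    """Attempt the two-name unpacking and report what happens."""
    try:
        rrp, current = price
    except ValueError as exc:
        # Raised when the list does not hold exactly two items.
        return 'ValueError: %s' % exc
    return (rrp, current)

print(try_unpack([3.99, 3.59]))  # unpacks cleanly
print(try_unpack([3.59]))        # too few values
print(try_unpack([]))            # too few values
```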

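Your method of handling the three cases with if..else is fine; it is the most straightforward and arguably the most readable way to proceed. But it is also a bit mundane. If you would like something a little more terse, you could do the following:

Since we want to repeat the same price when price contains only one item, you might be led to think of itertools.cycle.

When price is the empty list, [], we want itertools.cycle([None]); otherwise we can use itertools.cycle(price).

So, to combine both cases into one expression, we can use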

price = itertools.cycle(price or [None]) 
mypart['rrp'], mypart['price'] = next(price), next(price)

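The next function peels the values off the iterator price one by one. Since price is cycling through its values, it never ends; it keeps yielding the values in sequence and starts over again when necessary, which is exactly what we want.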
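The three cases can be checked in isolation with a short sketch. The wrapper split_prices is a hypothetical name introduced here; it only packages the two lines above so that each list length can be tried.

```python
import itertools

def split_prices(found):
    """Return (rrp, price) from a list holding 0, 1 or 2 prices."""
    # An empty list is falsy, so `found or [None]` substitutes [None].
    cycled = itertools.cycle(found or [None])
    # The tuple is built left to right, so rrp is taken first.
    return next(cycled), next(cycled)

print(split_prices([3.99, 3.59]))  # (3.99, 3.59)
print(split_prices([3.59]))        # (3.59, 3.59)
print(split_prices([]))            # (None, None)
```

With two prices the first is treated as the rrp and the second as the current price, matching the order in which they appear in the HTML.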

The text A, List, Of, Terms, To, Extract - 1 Feb 2013 can again be obtained by using the contents attribute:

# jkl 
mypart['other'] = [item for item in part.find('p').contents 
        if not isinstance(item, bs.Tag) and item.string.strip()]

So, the full runnable code would be:

import BeautifulSoup as bs
import os
import re
import itertools as IT

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

filename = os.path.expanduser("~/tmp/file.html")
with open(filename) as fp:
    html = fp.read()

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
    mypart = {}
    # abc
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        item = parse_price(item.string)
        if item:
            price.append(item)

    price = IT.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)

    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]

    print(mypart)

which yields

{'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}
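The question also asks for the date to be dropped from the terms, noting that it always matches the text of the ghi div. One possible way to do that with plain string handling is sketched below. The helper strip_date and the " - " separator are assumptions based on the sample HTML, not part of the code above.

```python
def strip_date(terms_text, date_text):
    """Drop a trailing ' - <date>' from terms_text if it is present."""
    suffix = ' - ' + date_text
    if terms_text.endswith(suffix):
        return terms_text[:-len(suffix)]
    # Leave the text untouched when the expected suffix is missing.
    return terms_text

other = 'A, List, Of, Terms, To, Extract - 1 Feb 2013'
ghi = '1 Feb 2013'

cleaned = strip_date(other, ghi)
print(cleaned)  # A, List, Of, Terms, To, Extract

# Optionally split the cleaned text into individual terms.
terms = [term.strip() for term in cleaned.split(',')]
print(terms)  # ['A', 'List', 'Of', 'Terms', 'To', 'Extract']
```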

Source

2013-02-02 14:03:33 unutbu

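This is a beautiful solution, unutbu :-) Thank you so much for your time and effort. I have a lot to digest, and I have learned a great deal from your answer. Many thanks. – user1464409

...just to add: a really impressive effort. Let me point out a small bug hidden in the code: inside 'parse_price(text)' you use 'item.string' instead of 'text' –

@TheodrosZelleke: Thanks very much for the catch! – unutbu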
