Python - Parsing an HTML class

I'm getting frustrated trying to parse the following representative HTML extract, using BeautifulSoup and lxml:

[<p class="fullDetails"> 
<strong>Abacus Trust Company Limited</strong> 
<br/>Sixty Circular Road 

      <br/>DOUGLAS 

      <br/>ISLE OF MAN 
      <br/>IM1 1SA 
      <br/> 
<br/>Tel: 01624 689600 
      <br/>Fax: 01624 689601 
     <br/> 
<br/> 
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail: </span> 
<a href="mailto:[email protected]" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">[email protected]</a> 
<br/> 
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span> 
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a> 
<br/> 
<br/><b>Partners(s) - ICAS members only:</b> S H Fleming, M J MacBain 
     </p>]

What I want to do:

Extract the 'strong' text as company_name
Extract the text after each 'br' tag as company_line_x
Extract the 'MainContent_Email' text as company_email
Extract the 'MainContent_Web' text as company_web

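The problems I'm having:

1) I can extract all of the text by using .findAll(text=True), but every line comes back with a lot of padding.

2) Non-ASCII characters are sometimes returned, which causes csv.writer to fail. I'm not 100% sure how to handle this correctly. (Previously I just used unicodecsv.writer.)

Any advice would be much appreciated!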

At the moment, my function just receives the page data and uses findAll() to isolate the 'p' class:

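def get_company_data(page_data):
    # Nothing to parse when no page data was supplied.
    if not page_data:
        return None
    # Isolate every <p class="fullDetails"> block.
    company_dets = page_data.findAll("p", {"class": "fullDetails"})
    print(company_dets)
    return company_dets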

Source

2014-09-02 Chris Finlayson

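How do you get the page data? – alecxe 2014-09-02 12:01:22

Thanks for your reply. I use the requests module to fetch the data and pass the page data to this function – 2014-09-02 12:25:42

OK, are you using the response's text or content attribute? – alecxe 2014-09-02 12:49:35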

Here is a complete solution:

from bs4 import BeautifulSoup, NavigableString

data = """
your html here
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, 'html.parser')
p = soup.find('p', class_='fullDetails')

company_name = p.strong.text

# Collect the non-empty text nodes that follow <strong> (the text after each <br/>).
company_lines = []
for element in p.strong.next_siblings:
    if isinstance(element, NavigableString):
        text = element.strip()
        if text:
            company_lines.append(text)

# Find each label span by its text, then take the link that follows it.
company_email = p.find('span', text=lambda x: x and x.startswith('E-mail:')).find_next_sibling('a').text
company_web = p.find('span', text=lambda x: x and x.startswith('Web:')).find_next_sibling('a').text

print(company_name)
print(company_lines)
print(company_email + ' ' + company_web)

Prints:

Abacus Trust Company Limited 
[u'Sixty Circular Road', u'DOUGLAS', u'ISLE OF MAN', u'IM1 1SA', u'Tel: 01624 689600', u'Fax: 01624 689601', u'S H Fleming, M J MacBain'] 
[email protected] http://www.abacusiom.com

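Note that, for the company lines, we have to iterate over the next siblings of the strong tag and collect all of the text nodes. company_email and company_web are retrieved by their labels, in other words, by the text of the span tags that precede them.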
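The second problem in the question (non-ASCII characters breaking csv.writer) is not covered above. The following is a minimal sketch, assuming Python 3, where the csv module works on text and the encoding is chosen when the file is opened; on Python 2 the unicodecsv package mentioned in the question plays the same role. The helper name `write_company_row`, the file name, the accented company name and the e-mail address are hypothetical and used only for illustration.

```python
import csv
import os
import tempfile


def write_company_row(path, name, lines, email, web):
    """Append one company record to a UTF-8 encoded CSV file."""
    # Collapse runs of whitespace left over from the HTML layout.
    cleaned = [' '.join(line.split()) for line in lines]
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', encoding='utf-8', newline='') as handle:
        csv.writer(handle).writerow([name, email, web] + cleaned)


# Made-up values: the accented name shows that non-ASCII text survives
# the round trip, and the double space shows the padding being removed.
path = os.path.join(tempfile.mkdtemp(), 'companies.csv')
write_company_row(path, 'Café Trust Limited',
                  ['Sixty  Circular Road', 'DOUGLAS'],
                  'info@example.com', 'http://www.abacusiom.com')

# Read the file back with the same encoding to check the stored row.
with open(path, encoding='utf-8', newline='') as handle:
    rows = list(csv.reader(handle))
print(rows)
```

Each address line ends up in its own column here; joining the lines into a single field first would work just as well if a fixed column count is needed.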

Source

2014-09-02 12:13:28 alecxe

Get the p data the same way you already do (I am using lxml for the sample code below).

To get the company name:

company_name = '' 
for strg in root.findall('strong'): 
    company_name = strg.text  # this will give you Abacus Trust Company Limited

To get the company lines/details:

company_line_x = ''
lines = []
for b in root.findall('br'):
    # The text that follows a <br/> is stored in its tail.
    if b.tail:
        addr_line = b.tail.strip()
        if addr_line:
            lines.append(addr_line)

company_line_x = ', '.join(lines)  # this will give you Sixty Circular Road, DOUGLAS, ISLE OF MAN, IM1 1SA, Tel: 01624 689600, Fax: 01624 689601

Source

2014-09-02 12:09:16 sk11

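The OP is using 'BeautifulSoup'. – alecxe 2014-09-02 12:11:48

The OP said _using BeautifulSoup and lxml_, so I based my suggestion on lxml. Either way, the idea stays much the same. – sk11 2014-09-02 12:14:05

You are right, I misread that part. Note that you are currently missing the 'email' and 'web' parts. Thanks. – alecxe 2014-09-02 12:15:45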
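The 'email' and 'web' parts that the last comment mentions could be filled in along the same lines. This is a rough sketch rather than sk11's own code: it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies, on the assumption that the fragment is well-formed; lxml.etree offers the same findall() and get() calls. The helper name `link_text` is hypothetical, and the fragment is a shortened copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment from the question.
FRAGMENT = """<p class="fullDetails">
<strong>Abacus Trust Company Limited</strong>
<br/>Sixty Circular Road
<br/>
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail: </span>
<a href="mailto:[email protected]" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">[email protected]</a>
<br/>
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span>
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a>
</p>"""


def link_text(root, id_suffix):
    """Return the stripped text of the first <a> child whose id ends with id_suffix."""
    for a in root.findall('a'):
        if a.get('id', '').endswith(id_suffix):
            return (a.text or '').strip()
    return ''  # no matching link found


root = ET.fromstring(FRAGMENT)
company_email = link_text(root, 'MainContent_linkToEmail')
company_web = link_text(root, 'MainContent_linkToSite')
print(company_email, company_web)
```

Matching on the end of the id avoids hard-coding the long ctl00_ctl00_bodycontent_ prefix, which looks generated and may differ between pages.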
