Extracting data - 優文庫

I want to download some HTML pages and extract information from them. Each HTML page has a table tag like this:

<table class="sobi2Details" style='background-image: url(http://www.imd.ir/components/com_sobi2/images/backgrounds/grey.gif);border-style: solid; border-color: #808080' > 
    <tr> 
     <td><h1>Dr Jhon Doe</h1></td> 
    </tr> 
    <tr> 
     <td></td> 
    </tr> 
    <tr> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
      <div id="sobi2outer"> 
      <br/> 
      <span id="sobi2Details_field_name" ><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/> 
      <span id="sobi2Details_field_family" ><span id="sobi2Listing_field_family_label">family:</span> Doe</span><br/> 
      <span id="sobi2Details_field_tel1" ><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/> 
      </div> 
     </td> 
    </tr> 
</table>

I want to get the name (Jhon), family (Doe), and tel (33727464) fields. I used Beautiful Soup to access these span tags by id:

name=soup.find(id="sobi2Details_field_name").__str__() 
family=soup.find(id="sobi2Details_field_family").__str__() 
tel=soup.find(id="sobi2Details_field_tel1").__str__()

But I don't know how to extract the data inside these tags. I tried the children and contents attributes, but when I use the result as a tag it turns out to be None:

name=soup.find(id="sobi2Details_field_name") 
for child in name.children: 
    pass  # process content inside

But I get this error:

'NoneType' object has no attribute 'children'

Yet when I call str() on it, it is not None! Any ideas?

Edit: my final solution

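import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, from_encoding="utf-8")
name_span = soup.find(id="sobi2Details_field_name").__str__()
name = name_span.split(':')[-1]  # keep everything after the last ':'
result = re.sub('</span>', '', name)  # drop the leftover closing tag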

Source

2012-07-28 Asma Gheisari

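What version of Beautiful Soup are you using? What does `type(name)` return? For me it returns . I just installed BS4 with easy_install on Python 2.7.2 on OS X 10.8. – 2012-07-28 13:56:54

I have installed BS4 on Python 2.6. I don't know what type(name) is, I haven't used it! – 2012-07-28 14:13:32

type(value) returns the type of a value, so you can use it to help track down the problem. If you add `print type(name)` on the line after `name = soup.find(...)`, you will see what type of result BS's `find` method returned. – 2012-07-28 14:21:12

I found a couple of ways to do this.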

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open(path_to_html_file)) 

name_span = soup.find(id="sobi2Details_field_name") 

# First way: split text over ':' 
# This only works because there's always a ':' before the target field 
name = name_span.text.split(':')[1] 

# Second way: iterate over the span strings 
# The element you look for is always the last one 
name = list(name_span.strings)[-1] 

# Third way: iterate over 'next' elements 
name = name_span.next_element.next_element.next_element # you can create a function to do that, it looks ugly :)

Tell me if that helps.

Source
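The first approach above can be wrapped in a small helper. This is only a minimal sketch in plain Python 3, not part of the original answer: `value_after_label` is a hypothetical name, and it assumes the span text always has the form `label: value`. It splits at the first colon only, so a value that itself contains `:` stays whole, and it strips surrounding whitespace.

```python
def value_after_label(span_text):
    """Return the part of 'label: value' after the first colon, stripped."""
    label, sep, value = span_text.partition(":")
    if not sep:
        # No colon found: the text does not have the expected shape.
        raise ValueError("no ':' in %r" % span_text)
    return value.strip()


print(value_after_label("name:Jhon"))
print(value_after_label("tel: 33727464"))
print(value_after_label("hours: 9:00-17:00"))  # made-up input with extra colons
```

With Beautiful Soup, the argument would be `name_span.text` from the code above.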

2012-07-28 15:13:42

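Thank you. Your first method sounds really good, and it works. But my HTML contains Unicode, and I get an error when I test the code. Do you have any suggestions? – 2012-07-28 17:33:00

Can you provide the traceback with the error? – 2012-07-28 23:36:11

If you are familiar with XPath, use lxml with etree instead: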
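The traceback was never posted, so the actual cause of the error is unknown. One common source of such errors is mixing raw bytes with text. The sketch below (Python 3, with a made-up sample value and an assumed UTF-8 encoding) shows the colon split working on a non-ASCII value once the bytes have been decoded.

```python
# Hypothetical sample: bytes as they might arrive from a UTF-8 encoded page.
raw = "name: Zoë".encode("utf-8")

# Decode once, at the boundary, then work with text only.
text = raw.decode("utf-8")
value = text.split(":", 1)[1].strip()

print(value)
```

If the page uses a different encoding, the `"utf-8"` argument would have to change accordingly.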

import urllib2 
from lxml import etree 

opener = urllib2.build_opener() 
root = etree.HTML(opener.open("myUrl").read()) 

print root.xpath("//span[@id='sobi2Details_field_name']/text()")[0]
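For comparison, here is a rough standard-library sketch of the same lookup in Python 3 with `xml.etree.ElementTree`, which supports a limited XPath subset. It assumes the markup is well-formed XML, which real pages often are not (lxml's HTML parser is more forgiving). `SNIPPET` is a trimmed copy of the question's markup and `field_value` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question. ElementTree only parses
# well-formed XML, so this is an assumption about the input.
SNIPPET = """
<div id="sobi2outer">
  <span id="sobi2Details_field_name"><span id="sobi2Listing_field_name_label">name:</span>Jhon</span><br/>
  <span id="sobi2Details_field_tel1"><span id="sobi2Listing_field_tel1_label">tel:</span> 33727464</span><br/>
</div>
"""


def field_value(root, field_id):
    """Return the text following the label span inside the span with field_id."""
    outer = root.find(".//span[@id='%s']" % field_id)
    if outer is None:
        return None
    label = outer.find("span")
    # Text that follows a child element is stored in that child's .tail.
    tail = label.tail if label is not None else outer.text
    return (tail or "").strip()


root = ET.fromstring(SNIPPET.strip())
print(field_value(root, "sobi2Details_field_name"))
print(field_value(root, "sobi2Details_field_tel1"))
```

The value sits in the label span's `.tail` because ElementTree attaches text that follows an element to that element rather than to its parent.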

Source

2012-07-28 21:22:01 Joey
