Parsing an HTML page using BeautifulSoup

I have started using BeautifulSoup to parse HTML.
For example, for the site "http://en.wikipedia.org/wiki/PLCB1":

import sys
import urllib2
from BeautifulSoup import BeautifulSoup

# Raise the recursion limit for this deeply nested page.
sys.setrecursionlimit(10000)

site = "http://en.wikipedia.org/wiki/PLCB1"
hdr = {'User-Agent': 'Mozilla/5.0'}  # send a browser-like User-Agent
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

# Print the text of every header cell in the infobox table.
table = soup.find('table', {'class': 'infobox'})
#print table
rows = table.findAll("th")
for x in rows:
    print "x - ", x.string

I get None as the output for some th cells that contain a URL. Why is that?

Output:

x - Phospholipase C, beta 1 (phosphoinositide-specific) 
x - Identifiers 
x - None 
x - External IDs 
x - None 
x - None 
x - Molecular function 
x - Cellular component 
x - Biological process 
x - RNA expression pattern 
x - Orthologs 
x - Species 
x - None 
x - None 
x - None 
x - RefSeq (mRNA) 
x - RefSeq (protein) 
x - Location (UCSC) 
x - None

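For example, after Location there is one more th, which contains "PubMed search", but it shows up as None. I want to know why this happens.

Second: is there a way to get each th and its respective td into a dictionary, so that it becomes easy to parse?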


2013-02-16 sam

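Element.string only contains a value if the text sits directly in the element. Nested elements are not included.

If you are using BeautifulSoup 4, use Element.stripped_strings instead:

print ''.join(x.stripped_strings)

For BeautifulSoup 3, you need to search for all text elements: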

print ''.join([unicode(t).strip() for t in x.findAll(text=True)])
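As an illustration of the difference, here is a minimal sketch assuming BeautifulSoup 4 on Python 3 (`from bs4 import BeautifulSoup`) and the standard-library `html.parser` backend. The HTML snippet is a hypothetical header cell made up for this example, not markup copied from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical header cell: plain text mixed with a nested <a> tag.
html = '<table><tr><th>Location (<a href="/wiki/UCSC">UCSC</a>)</th></tr></table>'
soup = BeautifulSoup(html, "html.parser")
th = soup.find("th")

# .string is None because the cell has several children (text, <a>, text).
direct = th.string

# stripped_strings walks every text node, including those in nested tags.
joined = "".join(th.stripped_strings)

print(direct)
print(joined)
```

Joining with an empty string concatenates the pieces exactly as they appear; if the pieces should be separated by spaces, `th.get_text(" ", strip=True)` is an alternative.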

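If you want to combine the <th> and <td> elements into a dictionary, loop over all <th> elements, use .findNextSibling() to find the corresponding <td> element, and combine that with the .findAll(text=True) trick above to build the dictionary: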

info = {}
rows = table.findAll("th")
for headercell in rows:
    valuecell = headercell.findNextSibling('td')
    if valuecell is None:
        continue
    header = ''.join([unicode(t).strip() for t in headercell.findAll(text=True)])
    value = ''.join([unicode(t).strip() for t in valuecell.findAll(text=True)])
    info[header] = value
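The same loop can be sketched for BeautifulSoup 4 on Python 3, where the method names are `find_all` and `find_next_sibling`. This is only a sketch under stated assumptions: the table below is a hypothetical, trimmed-down infobox written for this example, so it runs without fetching the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down infobox; the real page has many more rows.
html = """
<table class="infobox">
  <tr><th colspan="2">Identifiers</th></tr>
  <tr><th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th>
      <td><a href="#">PLCB1</a>; EIEE12; PI-PLC</td></tr>
  <tr><th>Species</th><td>Human</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "infobox"})

info = {}
for headercell in table.find_all("th"):
    valuecell = headercell.find_next_sibling("td")
    if valuecell is None:
        continue  # section headers such as "Identifiers" have no <td>
    header = "".join(headercell.stripped_strings)
    value = "".join(valuecell.stripped_strings)
    info[header] = value

print(info)
```

Header cells that span the whole row are skipped, because they have no sibling <td> to pair with.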


2013-02-16 14:46:35

This only works with bs4. @sam may instead be using an earlier version of BeautifulSoup. (Not my -1, by the way) – unutbu 2013-02-16 14:48:18

@unutbu: bugger.. updated to include a BS3 option – 2013-02-16 14:48:37

It gives a TypeError – sam 2013-02-16 14:49:54

If you inspect the HTML,

<th colspan="4" style="text-align:center; background-color: #ddd">Identifiers</th> 
</tr> 
<tr class=""> 
<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th> 
<td colspan="3" class="" style="background-color: #eee"><span class="plainlinks"><a rel="nofollow" class="external text" href="http://www.genenames.org/data/hgnc_data.php?hgnc_id=15917">PLCB1</a>; EIEE12; PI-PLC; PLC-154; PLC-I; PLC154; PLCB1A; PLCB1B</span></td> 
</tr> 
<tr class=""> 
<th style="background-color: #c3fdb8">External IDs</th>

you will see that between Identifiers and External IDs there is a <th> tag with no text of its own, only an <a> tag:

<th style="background-color: #c3fdb8"><a href="/wiki/Human_Genome_Organisation" title="Human Genome Organisation">Symbols</a></th>

This <th> has no text directly inside it, so x.string is None.
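To make the structural difference visible, the following sketch inspects the first child of two header cells. It assumes BeautifulSoup 4 on Python 3, and the HTML is a shortened, hypothetical version of the markup quoted above. Note one version difference: BeautifulSoup 3, as used in the question, returns None for the linked cell, whereas BeautifulSoup 4's `.string` follows a single child tag and returns its text.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened, hypothetical versions of two header cells from the infobox.
html = (
    '<table><tr>'
    '<th>External IDs</th>'
    '<th><a href="/wiki/Human_Genome_Organisation">Symbols</a></th>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
plain, linked = soup.find_all("th")

plain_child = plain.contents[0]    # a text node sitting directly in the <th>
linked_child = linked.contents[0]  # an <a> tag; the text is one level deeper

print(type(plain_child).__name__, type(linked_child).__name__)

# BeautifulSoup 4 only: .string recurses into a sole child tag.
bs4_string = linked.string
print(bs4_string)
```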


2013-02-16 14:50:56 unutbu

Sure, 'x.string' is None, but how do you solve this? :-P – 2013-02-16 14:52:13

@MartijnPieters: I was going to get to that, but you answered too quickly :) – unutbu 2013-02-16 14:53:14

What about the last case, which has tags as well? – sam 2013-02-16 14:53:53
