Beautiful Soup element access

I want to use BeautifulSoup to extract information from a web page. Here is my code:

from bs4 import BeautifulSoup 
import urllib2 
opener = urllib2.build_opener() 
opener.addheaders = [('User-agent', 'Mozilla/5.0')] 
infile = opener.open('http://en.wikipedia.org/wiki/American_films_of_1971') 
page = infile.read() 
soup = BeautifulSoup(page) 
soup.prettify().encode('utf8') 
print (soup.find_all("table", "wikitable"))

Output:

[<table class="wikitable"> 
<tr> 
<th style="width:25%;">Title</th> 
<th style="width:20%;">Director</th> 
<th style="width:30%;">Cast</th> 
<th style="width:10%;">Genre/Note</th> 
<th style="width:3%;"> 
<p><br/></p> 
</th> 
</tr> 
<tr> 
<td><i><a class="mw-redirect" href="/wiki/$" title="$">$</a> aka Dollars</i></td> 
<td><a href="/wiki/Richard_Brooks" title="Richard Brooks">Richard Brooks</a></td> 
<td><a href="/wiki/Warren_Beatty" title="Warren Beatty">Warren Beatty</a>, <a href="/wiki/Goldie_Hawn" title="Goldie Hawn">Goldie Hawn</a></td> 
<td><a href="/wiki/Comedy" title="Comedy">Comedy</a>, <a href="/wiki/Crime" title="Crime">Crime</a></td> 
<td></td> 
</tr> 
<tr> 
<td><i><a href="/wiki/200_Motels" title="200 Motels">200 Motels</a></i></td> 
<td><a href="/wiki/Tony_Palmer" title="Tony Palmer">Tony Palmer</a>, Charles Swenson</td> 
<td><a href="/wiki/Frank_Zappa" title="Frank Zappa">Frank Zappa</a>, <a href="/wiki/Ringo_Starr" title="Ringo Starr">Ringo Starr</a>, <a href="/wiki/Theodore_Bikel" title="Theodore Bikel">Theodore Bikel</a></td> 
<td><a href="/wiki/Comedy" title="Comedy">Comedy</a>, <a href="/wiki/Musical_film" title="Musical film">Musical</a></td> 
<td></td> 
</tr> 
</table>]

I want to extract every td element inside each tr element, to get something like:

aka Dollars | Richard Brooks | Warren Beatty | Crime 
200 Motels | Tony Palmer, Charles Swenson | Frank Zappa | Comedy

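I'm not sure how to look at the child tags once I have the part of the document I want.

I'd also like to know whether BeautifulSoup is the right tool for this, or whether I should be looking at something else.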

Source

2012-09-06 pogo

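Don't post code via external, non-persistent services. Please fix your question...

And what exactly is the problem? What isn't working? You can parse HTML with BeautifulSoup, lxml, or something else... so what is the question?

Apart from that: soup.find_all("table", "wikitable") makes no sense for searching for class="wikitable". Please read the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class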
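For reference, the Beautiful Soup documentation describes a string passed as the second positional argument of find_all() as a filter on the CSS class, so the call in the question should select the same tables as the keyword form. A minimal sketch, assuming bs4 is installed; the two-table snippet and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table that carries the "wikitable" class, one that does not.
html = (
    '<table class="wikitable sortable"><tr><td>a</td></tr></table>'
    '<table class="other"><tr><td>b</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# A string in the second position is matched against the class attribute.
positional = soup.find_all("table", "wikitable")
# The explicit keyword form; class_ is used because class is a Python keyword.
keyword = soup.find_all("table", class_="wikitable")

print(len(positional), len(keyword))
print(positional == keyword)
```

The keyword form is arguably easier to read, which may be the point behind the comment.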

Each result in the list returned by .find_all() is itself an element object, so you can run further searches on it:

for table in soup.find_all("table", "wikitable"):
    for row in table.find_all('tr'):
        cells = []
        for cell in row.find_all('td'):
            cells.append(cell.get_text())
        print(' | '.join(cells))

This gives me:

$ aka Dollars | Richard Brooks | Warren Beatty, Goldie Hawn | Comedy, Crime | 
200 Motels | Tony Palmer, Charles Swenson | Frank Zappa, Ringo Starr, Theodore Bikel | Comedy, Musical |
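The same nested search can be tried without fetching the page. The sketch below is only an illustration: it parses a trimmed copy of the table from the question, names the parser explicitly, skips the header row (which holds only th cells and so yields no td results), and drops empty cells so that no trailing separator is printed. The HTML constant and the extract_rows helper are names invented here, and dropping empty cells is an assumption about the desired output, not something the question specifies.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table shown in the question, so no network access is needed.
HTML = """
<table class="wikitable">
<tr><th>Title</th><th>Director</th><th>Cast</th><th>Genre/Note</th><th></th></tr>
<tr>
<td><i><a href="/wiki/200_Motels" title="200 Motels">200 Motels</a></i></td>
<td><a href="/wiki/Tony_Palmer" title="Tony Palmer">Tony Palmer</a>, Charles Swenson</td>
<td><a href="/wiki/Frank_Zappa" title="Frank Zappa">Frank Zappa</a>, <a href="/wiki/Ringo_Starr" title="Ringo Starr">Ringo Starr</a></td>
<td><a href="/wiki/Comedy" title="Comedy">Comedy</a></td>
<td></td>
</tr>
</table>
"""


def extract_rows(html):
    """Return one list of non-empty cell texts per row that contains td cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="wikitable"):
        for row in table.find_all("tr"):
            # get_text() concatenates the text of all descendants of the cell.
            cells = [cell.get_text().strip() for cell in row.find_all("td")]
            # Assumption: empty cells (the blank last column) are not wanted.
            cells = [text for text in cells if text]
            if cells:  # the header row has no td cells, so it is skipped here
                rows.append(cells)
    return rows


rows = extract_rows(HTML)
for cells in rows:
    print(" | ".join(cells))
```

Returning lists rather than printing inside the loop leaves the caller free to pick individual columns, for example cells[0] for the title.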

Source

2012-09-07 09:50:28
