Extracting text from a table using Python and lxml

I recently saw another user ask about extracting information from a web table, in the question "Extracting information from a webpage with python". ekhumoro's answer works very well on the page that user asked about. See below.

from urllib2 import urlopen 
from lxml import etree 

url = 'http://www.uscho.com/standings/division-i-men/2011-2012/' 

tree = etree.HTML(urlopen(url).read()) 

for section in tree.xpath('//section[starts-with(@id, "section_")]'): 
    print section.xpath('h3[1]/text()')[0] 
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td//text()')
        print ' ', cols[0].ljust(25), ' '.join(cols[1:])
    print

My problem is using this code as a guide to parse this page: http://www.uscho.com/rankings/d-i-mens-poll/ . With the following changes, I can only print the h1 and h3.

Input

url = 'http://www.uscho.com/rankings/d-i-mens-poll/' 
tree = etree.HTML(urlopen(url).read()) 

for section in tree.xpath('//section[starts-with(@id, "rankings")]'): 
    print section.xpath('h1[1]/text()')[0] 
    print section.xpath('h3[1]/text()')[0] 
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td/b/text()')
        print ' ', cols[0].ljust(25), ' '.join(cols[1:])
    print

Output

USCHO.com Division I Men's Poll 
December 12, 2011

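The structure of the table seems to be the same, so I'm at a loss as to why I can't use similar code. I'm just a mechanical engineer. Any help is appreciated.

Source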

2011-12-15 drivendaily

lxml is great, but if you're not familiar with XPath, I'd recommend BeautifulSoup:

from urllib2 import urlopen 
from BeautifulSoup import BeautifulSoup 

url = 'http://www.uscho.com/rankings/d-i-mens-poll/' 
soup = BeautifulSoup(urlopen(url).read()) 

section = soup.find('section', id='rankings') 
h1 = section.find('h1') 
print h1.text 
h3 = section.find('h3') 
print h3.text 
print 

rows = section.find('table').findAll('tr')[1:-1] 
for row in rows: 
    columns = [data.text for data in row.findAll('td')[1:]] 
    print '{0:20} {1:4} {2:>6} {3:>4}'.format(*columns)

The output of this script is:

USCHO.com Division I Men's Poll 
December 12, 2011 

Minnesota-Duluth     (49) 12-3-3  999
Minnesota                 14-5-1  901
Boston College            12-6-0  875
Ohio State           (1)  13-4-1  848
Merrimack                 10-2-2  844
Notre Dame                11-6-3  667
Colorado College           9-5-0  650
Western Michigan           9-4-5  647
Boston University         10-5-1  581
Ferris State              11-6-1  521
Union                      8-3-5  510
Colgate                   11-4-2  495
Cornell                    7-3-1  347
Denver                     7-6-3  329
Michigan State            10-6-2  306
Lake Superior             11-7-2  258
Massachusetts-Lowell      10-5-0  251
North Dakota               9-8-1   88
Yale                       6-5-1   69
Michigan                   9-8-3   62

Source

2011-12-15 06:32:36 jcollado

Thanks! I hadn't heard of Beautiful Soup before. It also seems more straightforward. – drivendaily 2011-12-17 15:11:49

Replace 'table/tbody/tr' with 'table/tr'.
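
One likely reason this change helps (an assumption; the answer itself gives none) is that browser developer tools usually display a tbody element inside every table even when the page source contains none, whereas a parser working on the raw source sees only what is actually there, so an XPath step through tbody matches nothing. The sketch below shows the effect with the standard library's xml.etree.ElementTree on a made-up, well-formed snippet, written for Python 3. It is not lxml and not the real USCHO page, and names such as SNIPPET are hypothetical.

```python
# Illustration only: stdlib ElementTree on an invented snippet (Python 3).
import xml.etree.ElementTree as ET

# Hypothetical markup: the table has rows but no <tbody> wrapper.
SNIPPET = """
<section id="rankings">
  <h1>Sample Poll</h1>
  <table>
    <tr class="odd"><td>1</td><td>Team A</td></tr>
    <tr class="even"><td>2</td><td>Team B</td></tr>
  </table>
</section>
""".strip()

section = ET.fromstring(SNIPPET)

# The path goes through tbody, which does not exist here: no matches.
with_tbody = section.findall('table/tbody/tr')

# Without the tbody step, both rows are found.
without_tbody = section.findall('table/tr')

# A descendant search finds the rows whether or not tbody is present.
any_depth = section.findall('.//tr')

print(len(with_tbody), len(without_tbody), len(any_depth))
```

A descendant path such as './/tr' works in both cases, at the cost of also picking up rows from any nested tables.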

Source

2011-12-15 06:43:28 jfs

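When I do this, I get a whole bunch of other stuff. Besides the data in question, I also get a lot of other table data. I'd like to know why; I'm happy with it, but it's a bit puzzling! – ThinkCode 2012-05-21 16:31:40

The structure of the table is slightly different, and there are blank entries in the columns.

A possible lxml solution: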

from urllib2 import urlopen 
from lxml import etree 

url = 'http://www.uscho.com/rankings/d-i-mens-poll/' 
tree = etree.HTML(urlopen(url).read()) 

for section in tree.xpath('//section[@id="rankings"]'): 
    print section.xpath('h1[1]/text()')[0], 
    print section.xpath('h3[1]/text()')[0] 
    print 
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print

Output:

USCHO.com Division I Men's Poll December 12, 2011 

1   Minnesota-Duluth           (49)     12-3-3        999          1
2   Minnesota                           14-5-1        901          2
3   Boston College                      12-6-0        875          3
4   Ohio State                  (1)     13-4-1        848          4
5   Merrimack                           10-2-2        844          5
6   Notre Dame                          11-6-3        667          7
7   Colorado College                     9-5-0        650          6
8   Western Michigan                     9-4-5        647          8
9   Boston University                   10-5-1        581         11
10  Ferris State                        11-6-1        521          9
11  Union                                8-3-5        510         10
12  Colgate                             11-4-2        495         12
13  Cornell                              7-3-1        347         16
14  Denver                               7-6-3        329         13
15  Michigan State                      10-6-2        306         14
16  Lake Superior                       11-7-2        258         15
17  Massachusetts-Lowell                10-5-0        251         18
18  North Dakota                         9-8-1         88         19
19  Yale                                 6-5-1         69         17
20  Michigan                             9-8-3         62         NR
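
The point about blank entries can be illustrated with a small sketch: collecting text nodes across a whole row skips empty cells, so the columns shift, while building one string per td keeps a placeholder for each cell. This uses the standard library's xml.etree.ElementTree on an invented row (Python 3), not lxml on the real page; the row contents and variable names are hypothetical.

```python
# Illustration only: stdlib ElementTree on an invented row (Python 3).
import xml.etree.ElementTree as ET

# Hypothetical row: the third cell (first-place votes) is empty.
row = ET.fromstring(
    '<tr><td>2</td><td><b>Team B</b></td><td></td><td>14-5-1</td></tr>'
)

# Text nodes gathered across the whole row: the empty cell contributes
# nothing, so only three values come back for four columns.
flat = list(row.itertext())

# One joined string per <td>: the empty cell becomes '', so every row
# yields the same number of values and a fixed format string still fits.
per_cell = [''.join(td.itertext()) for td in row.findall('td')]

line = '%-3s %-20s %6s %8s' % tuple(per_cell)
print(flat)
print(per_cell)
print(line)
```

In the lxml answer above, `''.join(col.xpath('.//text()'))` evaluated once per td plays the same role as the per-cell join here.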

Source

2011-12-15 21:27:08 ekhumoro
