Extracting data from an HTML table

I am looking for a way to get certain information from HTML in a Linux shell environment.

This is the bit I'm interested in:

<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
    <th>Tests</th> 
    <th>Failures</th> 
    <th>Success Rate</th> 
    <th>Average Time</th> 
    <th>Min Time</th> 
    <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>

I want to store these values in shell variables, or echo them as key-value pairs extracted from the HTML above. For example:

Tests   : 103 
Failures  : 24 
Success Rate : 76.70 % 
and so on..

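What I can do at the moment is write a Java program that uses a SAX parser or an HTML parser such as jsoup to extract this information.

But using Java here seems like overhead, since the runnable jar would have to be bundled with the "wrapper" script you want to execute.

I'm sure there must be "shell" languages that can do the same, i.e. Perl, Python, Bash, etc.

My problem is that I have zero experience with these. Can somebody help me solve this "fairly easy" problem?

Quick update:

I forgot to mention that my .html document contains more tables and more rows. Sorry about that (early morning).

Update #2:

I tried to install Bsoup like this, since I don't have root access: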

$ wget http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/beautifulsoup4-4.1.0.tar.gz 
$ tar -zxvf beautifulsoup4-4.1.0.tar.gz 
$ cp -r beautifulsoup4-4.1.0/bs4 . 
$ vi htmlParse.py # (paste code from) Tichodromas' answer, just in case this (http://pastebin.com/4Je11Y9q) is what I pasted 
$ run file (python htmlParse.py)

Error:

$ python htmlParse.py 
Traceback (most recent call last): 
    File "htmlParse.py", line 1, in ? 
    from bs4 import BeautifulSoup 
    File "/home/gdd/setup/py/bs4/__init__.py", line 29 
    from .builder import builder_registry 
     ^
SyntaxError: invalid syntax

Update #3:

Running Tichodromas' answer gives this error:

Traceback (most recent call last): 
    File "test.py", line 27, in ? 
    headings = [th.get_text() for th in table.find("tr").find_all("th")] 
TypeError: 'NoneType' object is not callable

Any ideas?

2012-08-03 Gandalf StormCrow

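There is a nice Python library that can help: BeautifulSoup -> http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – 2012-08-03 06:53:05

@Jakob S. Thanks for your comment. As I said, I'm a newbie, so I downloaded the tarball and tried to install it with 'python setup.py install', which gave this permission error: 'error: could not create '/usr/lib/python2.4/site-packages/bs4': Permission denied'. How can I specify the directory to install it into? Is there something similar to "-prefix", as when installing other software? – 2012-08-03 07:06:28

I have to admit that I don't know how to do this without root access, and I don't have a Linux machine at hand right now. In principle, it should be possible to simply copy the package into the right directory relative to your source .py file so that the interpreter can find it. – 2012-08-03 07:14:36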
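The copy-the-package idea from the last comment can be sketched as follows. This is a minimal illustration, assuming Python 3 and a hypothetical package called `localpkg` placed in a hypothetical `vendor` directory (neither name comes from the thread). It only shows that Python imports from any directory listed on `sys.path`, so a user-writable location works without root access.

```python
import os
import sys
import tempfile

# Hypothetical layout created for the demo: <base>/vendor/localpkg/__init__.py
base = tempfile.mkdtemp()
vendor_dir = os.path.join(base, "vendor")
os.makedirs(os.path.join(vendor_dir, "localpkg"))
with open(os.path.join(vendor_dir, "localpkg", "__init__.py"), "w") as f:
    f.write("GREETING = 'imported from a user-writable directory'\n")

# Put the directory that CONTAINS the package on the import path.
sys.path.insert(0, vendor_dir)

import localpkg  # found via sys.path; no site-packages install required

print(localpkg.GREETING)
```

The directory of the script being run is already on `sys.path`, which is why copying a package next to the script makes it importable. Setting the `PYTHONPATH` environment variable to the containing directory has the same effect as the `sys.path.insert` call.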

A Python solution using BeautifulSoup4 (Edit: with proper skipping. EDIT3: using class="details" to select the table):

from bs4 import BeautifulSoup 

html = """ 
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
     <th>Tests</th> 
     <th>Failures</th> 
     <th>Success Rate</th> 
     <th>Average Time</th> 
     <th>Min Time</th> 
     <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>""" 

soup = BeautifulSoup(html) 
table = soup.find("table", attrs={"class":"details"}) 

# The first tr contains the field names. 
headings = [th.get_text() for th in table.find("tr").find_all("th")] 

datasets = [] 
for row in table.find_all("tr")[1:]: 
    dataset = zip(headings, (td.get_text() for td in row.find_all("td"))) 
    datasets.append(dataset) 

print datasets

The result looks like this:

[[(u'Tests', u'103'), 
    (u'Failures', u'24'), 
    (u'Success Rate', u'76.70%'), 
    (u'Average Time', u'71 ms'), 
    (u'Min Time', u'0 ms'), 
    (u'Max Time', u'829 ms')]]

EDIT2: To produce the desired output, use something like this:

for dataset in datasets: 
    for field in dataset: 
        print "{0:<16}: {1}".format(field[0], field[1])

Result:

Tests           : 103
Failures        : 24
Success Rate    : 76.70%
Average Time    : 71 ms
Min Time        : 0 ms
Max Time        : 829 ms
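Since the question asks for shell variables, one option is to have the script print `NAME=value` lines that the shell can `eval`. The following is a minimal sketch, assuming Python 3 and a list of (heading, value) pairs shaped like the datasets above. The helper name `to_shell_assignments` and the rule for deriving variable names are my own assumptions, not part of the answer.

```python
import re
import shlex

def to_shell_assignments(dataset):
    """Turn (heading, value) pairs into shell-safe NAME=value lines."""
    lines = []
    for heading, value in dataset:
        # "Success Rate" -> "SUCCESS_RATE": shell names cannot contain spaces.
        name = re.sub(r"\W+", "_", heading.strip()).upper()
        # shlex.quote adds quotes only when the value needs them.
        lines.append("{0}={1}".format(name, shlex.quote(value)))
    return lines

dataset = [("Tests", "103"), ("Failures", "24"),
           ("Success Rate", "76.70%"), ("Average Time", "71 ms")]

for line in to_shell_assignments(dataset):
    print(line)
```

If the script were saved as htmlParse.py, a wrapper could then run something like `eval "$(python3 htmlParse.py)"` and read `$TESTS` or `$SUCCESS_RATE`. Quoting the values matters here because cells such as "71 ms" contain spaces.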

2012-08-03 07:15:55

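Thanks for your answer; I have replied to your comment above. Can I use the class as an identifier, since I don't have an id? The class would be 'details'. – 2012-08-03 07:41:00

@GandalfStormCrow Yes, this can be done. I have edited my answer. – 2012-08-03 07:46:26

Does this answer really work with Python 2.4? @Gandalf, you said in a comment that you had installed an "older version of bsoup" (BeautifulSoup 3, I assume). The line saying "I am using Python 2.4.3" has disappeared, so this is a bit confusing. – mzjn 2012-08-03 11:18:12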

undef $/; 
$text = <DATA>; 

@tabs = $text =~ m!<table.*?>(.*?)</table>!gms; 
for (@tabs) { 
    @th = m!<th>(.*?)</th>!gms; 
    @td = m!<td>(.*?)</td>!gms; 
} 
for $i (0..$#th) { 
    printf "%-16s\t: %s\n", $th[$i], $td[$i]; 
} 

__DATA__ 
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
<tr valign="top"> 
<th>Tests</th> 
<th>Failures</th> 
<th>Success Rate</th> 
<th>Average Time</th> 
<th>Min Time</th> 
<th>Max Time</th> 
</tr> 
<tr valign="top" class="Failure"> 
<td>103</td> 
<td>24</td> 
<td>76.70%</td> 
<td>71 ms</td> 
<td>0 ms</td> 
<td>829 ms</td> 
</tr> 
</table>

Output as follows:

Tests                   : 103
Failures                : 24
Success Rate            : 76.70%
Average Time            : 71 ms
Min Time                : 0 ms
Max Time                : 829 ms

2012-08-03 06:56:47 cdtits

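I suggest [using an XML parser](http://stackoverflow.com/a/1732454/647772). – 2012-08-03 06:57:29

@cdtits Thanks for your response. Will it work if my file contains multiple tables? – 2012-08-03 07:06:53

A Python solution that uses only the standard library (it takes advantage of the fact that the HTML happens to be well-formed XML). It can handle more than one row of data.

(Tested with Python 2.6 and 2.7. The question was updated to say that the OP uses Python 2.4, so this answer may not be very useful in that case. ElementTree was added in Python 2.5.)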

from xml.etree.ElementTree import fromstring 

HTML = """ 
<table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
    <th>Tests</th> 
    <th>Failures</th> 
    <th>Success Rate</th> 
    <th>Average Time</th> 
    <th>Min Time</th> 
    <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
    <tr valign="top" class="whatever"> 
    <td>A</td> 
    <td>B</td> 
    <td>C</td> 
    <td>D</td> 
    <td>E</td> 
    <td>F</td> 
    </tr> 
</table>""" 

tree = fromstring(HTML) 
rows = tree.findall("tr") 
headrow = rows[0] 
datarows = rows[1:] 

for num, h in enumerate(headrow): 
    data = ", ".join([row[num].text for row in datarows]) 
    print "{0:<16}: {1}".format(h.text, data)

Output:

Tests           : 103, A
Failures        : 24, B
Success Rate    : 76.70%, C
Average Time    : 71 ms, D
Min Time        : 0 ms, E
Max Time        : 829 ms, F

2012-08-03 07:39:27 mzjn

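Thank you for your answer. Instead of reading from a specific HTML string, can I specify something like: from this HTML file, get the table that has 'class="details"' and do what you just did? – 2012-08-03 07:42:30

This only works for *one* row containing 'td's. – 2012-08-03 07:49:26

It can now handle several data rows. I have tested it with Python 2.6 and 2.7, but now I see that you use 2.4.3 (which I don't have), so it may not help you. In any case, I wanted to show that this kind of thing can be done without an additional library. – mzjn 2012-08-03 08:56:13

Assuming your HTML code is stored in a file named mycode.html, here is a bash way: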
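The question in the first comment (picking the table with class="details" out of a document that holds several tables) can be sketched with the same standard-library module. This is only an illustration, assuming Python 3 and markup that is well-formed XML; the sample document and the variable names are made up for the demo.

```python
import xml.etree.ElementTree as ET

# Assumed input: well-formed markup with several tables under one root element.
DOC = """<html><body>
<table class="summary"><tr><th>Ignored</th></tr><tr><td>x</td></tr></table>
<table class="details">
  <tr><th>Tests</th><th>Failures</th></tr>
  <tr><td>103</td><td>24</td></tr>
  <tr><td>A</td><td>B</td></tr>
</table>
</body></html>"""

# For a file on disk, ET.parse("mycode.html").getroot() would replace this call.
root = ET.fromstring(DOC)

# ElementTree's limited XPath support includes attribute predicates.
table = root.find(".//table[@class='details']")

rows = table.findall("tr")
headings = [th.text for th in rows[0].findall("th")]
records = [dict(zip(headings, (td.text for td in row.findall("td"))))
           for row in rows[1:]]

for record in records:
    print(record)
```

Note that `.//table` searches below the root element, so it would not match if the `<table>` were itself the root, as in the answer's original string. The predicate also compares the whole attribute value, so a table with class="details wide" would need a different test.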

paste -d: <(grep '<th>' mycode.html | sed -e 's,</*th>,,g') <(grep '<td>' mycode.html | sed -e 's,</*td>,,g')

Note: the output is not perfectly aligned.

2012-08-03 07:53:37

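Thanks for your answer. I need to get one specific table; there are multiple tables. – 2012-08-03 07:59:46

I have heard that parsing HTML or XML with regular expressions is broken by definition. – ychaouche 2014-01-12 14:36:44

Here is the top answer, adapted for Python 3 compatibility and improved by stripping whitespace from the cells: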
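For the two concerns raised in these comments (several tables in one file, and the fragility of regular expressions), a parser from the Python standard library is one alternative to grep and sed. The sketch below assumes Python 3; the class name `TableExtractor` and the sample markup are my own, and nested tables are deliberately not handled.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect cell text for every <table> that carries the wanted class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.tables = []       # each table is a list of rows (lists of str)
        self._in_table = False
        self._cell = None      # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self._in_table = True
                self.tables.append([])
        elif self._in_table and tag == "tr":
            self.tables[-1].append([])
        elif self._in_table and tag in ("td", "th") and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None

HTML_DOC = """
<table class="other"><tr><td>skip me</td></tr></table>
<table class="details" border="0">
  <tr valign="top"><th>Tests</th><th>Failures</th><th>Success Rate</th></tr>
  <tr valign="top" class="Failure"><td>103</td><td>24</td><td>76.70%</td></tr>
</table>
"""

parser = TableExtractor("details")
parser.feed(HTML_DOC)

for table in parser.tables:
    headings, data_rows = table[0], table[1:]
    for row in data_rows:
        for heading, value in zip(headings, row):
            print("{0:<16}: {1}".format(heading, value))
```

Unlike the line-based pipeline, this does not depend on each cell sitting on its own line, and it ignores tables whose class does not match.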

from bs4 import BeautifulSoup 

html = """ 
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
     <th>Tests</th> 
     <th>Failures</th> 
     <th>Success Rate</th> 
     <th>Average Time</th> 
     <th>Min Time</th> 
     <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>""" 

soup = BeautifulSoup(html, 'html.parser') 
table = soup.find("table") 

# The first tr contains the field names. 
headings = [th.get_text().strip() for th in table.find("tr").find_all("th")] 

print(headings) 

datasets = [] 
for row in table.find_all("tr")[1:]: 
    dataset = dict(zip(headings, (td.get_text() for td in row.find_all("td")))) 
    datasets.append(dataset) 

print(datasets)

2017-05-31 04:07:55

Below is a Python regex-based solution that I have tested on Python 2.7. It does not rely on the xml module, so it also works when the markup is not well-formed XML.

import re 
# input args: html string 
# output: tables as a list, column max length 
def extract_html_tables(html): 
    tables = [] 
    maxlen = 0 
    rex1 = r'<table.*?/table>' 
    rex2 = r'<tr.*?/tr>' 
    rex3 = r'<(td|th).*?/(td|th)>' 
    s = re.search(rex1, html, re.DOTALL) 
    while s: 
        t = s.group()  # the table 
        s2 = re.search(rex2, t, re.DOTALL) 
        table = [] 
        while s2: 
            r = s2.group()  # the row 
            s3 = re.search(rex3, r, re.DOTALL) 
            row = [] 
            while s3: 
                d = s3.group()  # the cell, tags included 
                #row.append(strip_tags(d).strip()) 
                row.append(d.strip()) 

                # drop the cell just consumed, then look for the next one 
                r = re.sub(rex3, '', r, count=1, flags=re.DOTALL) 
                s3 = re.search(rex3, r, re.DOTALL) 

            table.append(row) 
            if maxlen < len(row): 
                maxlen = len(row) 

            # drop the row just consumed, then look for the next one 
            t = re.sub(rex2, '', t, count=1, flags=re.DOTALL) 
            s2 = re.search(rex2, t, re.DOTALL) 

        # drop the table just consumed, then look for the next one 
        html = re.sub(rex1, '', html, count=1, flags=re.DOTALL) 
        tables.append(table) 
        s = re.search(rex1, html, re.DOTALL) 
    return tables, maxlen 

html = """ 
    <table class="details" border="0" cellpadding="5" cellspacing="2" width="95%"> 
    <tr valign="top"> 
     <th>Tests</th> 
     <th>Failures</th> 
     <th>Success Rate</th> 
     <th>Average Time</th> 
     <th>Min Time</th> 
     <th>Max Time</th> 
    </tr> 
    <tr valign="top" class="Failure"> 
    <td>103</td> 
    <td>24</td> 
    <td>76.70%</td> 
    <td>71 ms</td> 
    <td>0 ms</td> 
    <td>829 ms</td> 
    </tr> 
</table>""" 
print extract_html_tables(html)
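The function above returns each cell with its tags still attached (for example `<th>Tests</th>`), and the commented-out line refers to a `strip_tags` helper that the answer does not define. A hypothetical version of such a helper could look like the sketch below, written for Python 3. It is deliberately naive: it would break on an attribute value containing `>` and it does not decode entities such as `&amp;`.

```python
import re

def strip_tags(fragment):
    """Remove anything that looks like a tag and trim the remaining text."""
    return re.sub(r"<[^>]*>", "", fragment).strip()

# Sample cells in the shape produced by the regex-based extractor.
cells = ["<th>Tests</th>", '<td class="x">103</td>', "<td>76.70%</td>"]
print([strip_tags(c) for c in cells])
```

With a helper like this in place, the commented-out `row.append(strip_tags(d).strip())` line could replace the plain `row.append(d.strip())` call.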

2017-10-05 03:35:53 paolov
