2009-04-09 32 views
5

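Screen scraping the ugliest HTML you have ever seen in your life

I am using PHP and libtidy to try to screen scrape what may be the most horrible and malformed use of HTML tables in history. The site leaves quite a few table, tr, td, font, and bold tags unclosed, and it consistently nests many levels of tables inside other tables.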

Example snippet:

<center> 
<table border="1" bordercolor="#000000" cellspacing="0" cellpadding="0"> 
<tr> 
<td width="50%"> 
<center> 
Home Team - <b>Wildcats<td> 
<center> 
Away Team - <b>Polar Bears<tr> 
<td colspan="2"> 
<center> 
<b><font size="+1">Rosters<tr> 
<td valign="top"> 
<center> 
<table border="0" cellspacing="0"> 
<tr> 
<td> 
<font size="2">1&nbsp;<td> 
<font size="2">Baird, T<tr> 
<td> 
<font size="2">2&nbsp;<td> 
<font size="2">Knight, P<tr> 
<td> 
<font size="2">8&nbsp;<td> 
<font size="2">Miller, B<tr> 
<td> 
<font size="2">9&nbsp;<td> 
<font size="2">Huebsch, B<tr> 
<td> 
<font size="2">11&nbsp;<td> 
<font size="2">Buschmann, C<tr> 
<td> 
<font size="2">12&nbsp;<td> 
<font size="2">Reding, J<tr> 
<td> 
<font size="2">14&nbsp;<td> 
<font size="2">Simpson, S<tr> 
<td> 
<font size="2">27&nbsp;<td> 
<font size="2">Kupferschmidt, M<tr> 
<td> 
<font size="2">28&nbsp;<td> 
<font size="2">Anderson, D<tr> 
<td> 
<font size="2">31&nbsp;<td> 
<font size="2">Gehrts, J<tr> 
<td> 
<font size="2">39&nbsp;<td> 
<font size="2">McGinnis, G<tr> 
<td> 
<font size="2">42&nbsp;<td> 
<font size="2">Temple, B<tr> 
<td> 
<font size="2">44&nbsp;<td> 
<font size="2">Kemplin, A<tr> 
<td> 
<font size="2">77&nbsp;<td> 
<font size="2">Weiner, B<tr> 
<td> 
<font size="2">95&nbsp;<td> 
<font size="2"> 
Zytkoskie, D</table> 
<td valign="top"> 
<center> 
<table border="0" cellspacing="0"> 
<tr> 
<td> 
<font size="2">5&nbsp;<td> 
<font size="2">Mack, A<tr> 
<td> 
<font size="2">8&nbsp;<td> 
<font size="2">Foucault, R<tr> 
<td> 
<font size="2">11&nbsp;<td> 
<font size="2">Oberpriller, D *<tr> 
<td> 
<font size="2">12&nbsp;<td> 
<font size="2">Underwood, J<tr> 
<td> 
<font size="2">15&nbsp;<td> 
<font size="2">Oberpriller, M<tr> 
<td> 
<font size="2">19&nbsp;<td> 
<font size="2">Langfus, B<tr> 
<td> 
<font size="2">25&nbsp;<td> 
<font size="2">Carroll, R<tr> 
<td> 
<font size="2">30&nbsp;<td> 
<font size="2">Hirdler, T<tr> 
<td> 
<font size="2">33&nbsp;<td> 
<font size="2">Gibson, S<tr> 
<td> 
<font size="2">35&nbsp;<td> 
<font size="2">Marthaler, C<tr> 
<td> 
<font size="2">44&nbsp;<td> 
<font size="2">Yurik, J<tr> 
<td> 
<font size="2">58&nbsp;<td> 
<font size="2"> 
Gronemeyer, S</table> 
<tr> 
<td colspan="2"> 
<center> 
<b><font size="+1">Goals<tr> 
<td valign="top"> 
<center> 
<table border="1" cellspacing="0" width="100%"> 
<td> 
<b><font size="2">Player<td> 
<b><font size="2">Period<td> 
<b><font size="2">Time<td> 
<b><font size="2">Assist 1<td> 
<b><font size="2">Assist 2<td> 
<b><font size="2">SH<td> 
<b><font size="2">PP<tr> 
<td nowrap> 
<font size="2">Kupferschmidt,&nbsp;M<td> 
<font size="2">1<td> 
<font size="2">12:51<td nowrap> 
<font size="2">Kemplin,&nbsp;A<td nowrap> 
<font size="2">None<td> 
<font size="2"> 
<center> 
<td> 
<font size="2"> 
<center> 
<tr> 
<td nowrap> 
<font size="2">McGinnis,&nbsp;G<td> 
<font size="2">1<td> 
<font size="2">12:33<td nowrap> 
<font size="2">Huebsch,&nbsp;B<td nowrap> 
<font size="2">None<td> 
<font size="2"> 
<center> 
<td> 
<font size="2"> 
<center> 
<tr> 
<td nowrap> 
<font size="2">Kupferschmidt,&nbsp;M<td> 
<font size="2">2<td> 
<font size="2">16:01<td nowrap> 
<font size="2">None<td nowrap> 
<font size="2">None<td> 
<font size="2"> 
<center> 
<td> 
<font size="2"> 
<center> 
<tr> 
<td nowrap> 
<font size="2">Buschmann,&nbsp;C<td> 
<font size="2">3<td> 
<font size="2">00:38<td nowrap> 
<font size="2">None<td nowrap> 
<font size="2">None<td> 
<font size="2"> 
<center> 
<td> 
<font size="2"> 
<center> 
</table> 
<td valign="top"> 
<center> 
<table border="1" cellspacing="0" width="100%"> 
<td> 
<b><font size="2">Player<td> 
<b><font size="2">Period<td> 
<b><font size="2">Time<td> 
<b><font size="2">Assist 1<td> 
<b><font size="2">Assist 2<td> 
<b><font size="2">SH<td> 
<b><font size="2">PP<tr> 
<td nowrap> 
<font size="2">Oberpriller,&nbsp;D *<td> 
<font size="2">3<td> 
<font size="2">12:31<td nowrap> 
<font size="2">Gronemeyer,&nbsp;S<td nowrap> 
<font size="2">None<td> 
<font size="2"> 
<center> 
<td> 
<font size="2"> 
<center> 
</table> 
<tr> 
<td colspan="2"> 
<center> 
<b><font size="+1">Penalties<tr> 
<td valign="top"> 
<center> 
<table border="1" cellspacing="0" width="100%"> 
<td> 
<b><font size="2">Player<td> 
<font size="2"><b>Period<td> 
<font size="2"><b>Minutes<td> 
<font size="2"><b>Offense<td> 
<font size="2"><b>Start<td> 
<font size="2"><b>Expired<tr> 
<td nowrap> 
<font size="2">Buschmann,&nbsp;C<td> 
<font size="2"> 
<center> 
3<td> 
<font size="2"> 
<center> 
2<td> 
<font size="2">Interference<td> 
<font size="2">11:11<td> 
<font size="2">09:11<tr> 
<td nowrap> 
<font size="2">Buschmann,&nbsp;C<td> 
<font size="2"> 
<center> 
3<td> 
<font size="2"> 
<center> 
2<td> 
<font size="2">Unsportmanlike Conduct<td> 
<font size="2">03:26<td> 
<font size="2">01:26<tr> 
<td nowrap> 
<font size="2">Bench<td> 
<font size="2"> 
<center> 
3<td> 
<font size="2"> 
<center> 
2<td> 
<font size="2">Too Many Men<td> 
<font size="2">01:46<td> 
<font size="2"> 
00:00</table> 
<td valign="top"> 
<center> 
<table border="1" cellspacing="0" width="100%"> 
<td> 
<b><font size="2">Player<td> 
<font size="2"><b>Period<td> 
<font size="2"><b>Minutes<td> 
<font size="2"><b>Offense<td> 
<font size="2"><b>Start<td> 
<font size="2"><b>Expired<tr> 
<td nowrap> 
<font size="2">Marthaler,&nbsp;C<td> 
<font size="2"> 
<center> 
1<td> 
<font size="2"> 
<center> 
2<td> 
<font size="2">Interference<td> 
<font size="2">01:19<td> 
<font size="2">16:19<tr> 
<td nowrap> 
<font size="2">Underwood,&nbsp;J<td> 
<font size="2"> 
<center> 
2<td> 
<font size="2"> 
<center> 
2<td> 
<font size="2">Interference<td> 
<font size="2">12:32<td> 
<font size="2">10:32<tr> 
<td nowrap> 
<font size="2">Marthaler,&nbsp;C<td> 
<font size="2"> 
<center> 
3<td> 
<font size="2"> 
<center> 
2<td> 
<font size="2">Interference<td> 
<font size="2">11:39<td> 
<font size="2"> 
09:39</table> 
<tr> 
<td colspan="2"> 
<center> 
<font size="+1"><b>Goalies<tr> 
<td> 
<center> 
<table border="1" cellspacing="0" width="100%"> 
<td> 
<b><font size="2">Name<td> 
<font size="2"><b>Shots<td> 
<font size="2"><b>Goals<tr> 
<td> 
<font size="2">Baird,&nbsp;T<td> 
<font size="2">20<td> 
<font size="2">1<tr> 
<td> 
<font size="2"><b>Open Net<td> 
<td> 
<font size="2"> 
0</table> 
<td> 
<center> 
<table border="1" cellspacing="0" width="100%"> 
<td> 
<b><font size="2">Name<td> 
<font size="2"><b>Shots<td> 
<font size="2"><b>Goals<tr> 
<td> 
<font size="2">Hirdler,&nbsp;T<td> 
<font size="2">42<td> 
<font size="2"> 

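The amazing thing is that every browser seems to render this just fine. PHP's tidy extension manages to make sense of it all quite well, but the tables are nested so deeply, and almost at random, that traversing the result with DOM XPath is painful.

Does anyone have suggestions for a different approach?

Post-mortem: after too many Belgian wheat beers and a lot of mucking around with my code, I got great results by stripping every tag except table, tr, and td with strip_tags() and then running the result through libtidy. The output is now nicely formatted and easy to traverse. It seems the markup just needed a little massaging before being sent to the parser.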
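The stripping step from the post-mortem can be sketched in Python as well. This is only an analogue of calling PHP's strip_tags() with an allow-list, not the code actually used; the names `KEEP`, `TAG_RE`, and `keep_table_tags` are made up for illustration, and the regular expression assumes tags without a literal `>` inside attribute values.

```python
import re

# Tags that survive the cleanup; everything else is dropped.
KEEP = {"table", "tr", "td"}

# Matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def keep_table_tags(html):
    """Remove every tag except table/tr/td, leaving the text content in place."""
    def replace(match):
        name = match.group(1).lower()
        # Kept tags pass through unchanged, attributes included.
        return match.group(0) if name in KEEP else ""
    return TAG_RE.sub(replace, html)


cleaned = keep_table_tags(
    '<center><table border="1"><tr><td width="50%"><b>Wildcats'
    '<td><font size="2">Polar Bears</table>'
)
print(cleaned)
```

The result still has unclosed cells and rows, so it would then be handed to a repair tool such as HTML Tidy, as described above.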

+0

The really sad thing is, I've seen worse! – SirDemon 2009-04-09 02:03:27

+0

Seconded. It gets even worse when you see markup like this mixed in with PHP. – epochwolf 2009-04-09 02:17:13

Answers

3

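There are a few tricks you can use to clean up highly predictable structures such as tables. Before running HTML Tidy, you can use a regular expression or some other method to find a <tr><td> that is followed by another <tr><td> and insert the appropriate closing tags before the second one. Tables nested inside a <td> need some extra care, but nothing that is impossible to handle. Locate the innermost structures first and work outward from there.

The really hard cases are things like unclosed <div>s and <p>s, which are much more difficult to match up with their corresponding (or missing) closing tags.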
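A minimal sketch of the "insert the missing closers" idea, written in Python rather than PHP. It uses a small state machine over the table tags instead of a single regular expression, and it assumes the fragment contains no nested tables, which is why the innermost tables would have to be handled first. `TOKEN_RE` and `close_cells_and_rows` are hypothetical names.

```python
import re

# Splits the input into text and table/tr/td tags (the tags are kept).
TOKEN_RE = re.compile(r"(</?(?:table|tr|td)\b[^>]*>)", re.IGNORECASE)


def close_cells_and_rows(fragment):
    """Add missing </td> and </tr> tags to a table that has no nested tables."""
    out = []
    row_open = cell_open = False
    for i, token in enumerate(TOKEN_RE.split(fragment)):
        if i % 2 == 0:  # even slots hold the text between tags
            out.append(token)
            continue
        name = re.match(r"</?(\w+)", token).group(1).lower()
        closing = token.startswith("</")
        if name == "td" and not closing:
            if cell_open:
                out.append("</td>")  # previous cell was never closed
            if not row_open:
                out.append("<tr>")  # cell appeared without a row
                row_open = True
            cell_open = True
        elif name == "td":
            cell_open = False
        else:
            # <tr>, </tr>, <table> and </table> all end an open cell.
            if cell_open:
                out.append("</td>")
                cell_open = False
            # Everything except an explicit </tr> also ends an open row.
            if row_open and not (name == "tr" and closing):
                out.append("</tr>")
            row_open = name == "tr" and not closing
        out.append(token)
    return "".join(out)


fixed = close_cells_and_rows(
    "<table><tr><td>1<td>Baird, T<tr><td>2<td>Knight, P</table>"
)
print(fixed)
```

Running this on each innermost table and then moving outward mirrors the order suggested above; the nested case itself is deliberately left out of the sketch.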

0

You might have better luck using regular expressions to pull out the results you need, rather than trying to parse the page as XML.
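As a rough illustration of the regex route, the roster cells in the question's snippet follow a fixed pattern: a number followed by `&nbsp;`, then a new cell holding the player's name. The sketch below relies on exactly that layout and nothing more; `ROSTER_RE` and `roster_entries` are made-up names, and the pattern would need adjusting for the goals, penalties, and goalies tables.

```python
import re

ROSTER_RE = re.compile(
    r'<font size="2">(\d+)&nbsp;<td>\s*'  # jersey number cell
    r'<font size="2">\s*([^<]+?)\s*<'     # player name, up to the next tag
)


def roster_entries(html):
    """Return (number, name) pairs for every roster row found in the markup."""
    return [(int(number), name) for number, name in ROSTER_RE.findall(html)]


sample = (
    '<td> \n<font size="2">1&nbsp;<td> \n<font size="2">Baird, T<tr> \n'
    '<td> \n<font size="2">95&nbsp;<td> \n<font size="2"> \nZytkoskie, D</table>'
)
print(roster_entries(sample))
```

Note that this returns home and away players in one list; telling the two rosters apart would still require locating the table boundaries.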

2

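If you are open to other languages such as Python, Beautiful Soup is very good at repairing poorly written HTML. I just ran your HTML through the following snippet, and the result is now quite readable.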

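#!/usr/bin/env python3

# Beautiful Soup 4 (the bs4 package) replaces the older BeautifulSoup module.
from bs4 import BeautifulSoup

html = "long string of html"
# html.parser is the parser from the Python standard library.
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())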
2

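If it is just the data you are after, I would simply strip out all of the HTML and process what remains line by line as raw input. You can use the strip_tags function.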

$clean = strip_tags($input); 

// example: <p>Test paragraph.</p> <a href="#fragment">Other text</a> 
// returns: Test paragraph. Other text 
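
For comparison, a similar "drop all tags, then read line by line" pass can be sketched with Python's standard-library HTMLParser. This is an analogue of strip_tags(), not a reimplementation of it; `TextOnly` and `text_lines` are hypothetical names, and turning non-breaking spaces into plain spaces is an extra step added here because the sample markup is full of `&nbsp;`.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text content and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_lines(html):
    """Return the non-empty lines of text left after removing all tags."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.chunks).replace("\xa0", " ")
    return [line.strip() for line in text.splitlines() if line.strip()]


lines = text_lines(
    '<td> \n<font size="2">1&nbsp;<td> \n<font size="2">Baird, T<tr> \n'
    '<td> \n<font size="2">2&nbsp;<td> \n<font size="2">Knight, P<tr>'
)
print(lines)
```

Because the source puts each cell on its own line, the roster comes out as alternating number and name lines, which is what makes line-by-line processing workable here.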
0

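I used XPath with Python's lxml library to parse the IMDB Top 250 page. View the source to see for yourself how bad it is.

The code below parses a saved copy of the IMDB Top 250 page (top250.html) and stores the extracted information in an SQLite database (top250.db):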

import sqlite3
from lxml import html

# Parse the saved page once; every TopMovie instance queries this tree.
tree = html.parse('top250.html')


class TopMovie(object):
    # Absolute path to one row of the ranking table. It is tied to the exact
    # page layout and will break if the markup changes.
    base_xpath = "/html/body/div/div[2]/layer/div[3]/table/tr/td[3]/div/table/tr/td/table/tr[%d]"

    def __init__(self, num):
        self.rank = num
        # Rank n is read from row n + 1 (the offset presumably skips a header row).
        self.xpath = self.base_xpath % (self.rank + 1)

    def rating(self):
        return tree.xpath(self.xpath + '/td[2]/font')[0].text

    def link(self):
        # First attribute value of the <a> element, expected to be its href.
        return tree.xpath(self.xpath + '/td[3]/font/a')[0].values()[0]

    def title(self):
        return tree.xpath(self.xpath + '/td[3]/font')[0].text_content()

    def votes(self):
        return tree.xpath(self.xpath + '/td[4]/font')[0].text


def main():
    conn = sqlite3.connect('top250.db')
    conn.execute("DROP TABLE IF EXISTS movies")
    conn.execute("""
        CREATE TABLE movies (
            id INTEGER PRIMARY KEY,
            title TEXT,
            link TEXT,
            rating TEXT,
            votes INTEGER
        )""")

    for n in range(1, 251):
        m = TopMovie(n)
        # Bind values as parameters so quotes in a title cannot break the SQL.
        conn.execute(
            'INSERT INTO movies VALUES (?, ?, ?, ?, ?)',
            (n, m.title(), m.link(), m.rating(), m.votes().replace(',', '')),
        )

    conn.commit()
    conn.close()


if __name__ == "__main__":
    main()
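
Scraped titles often contain quote characters, which is a common trap when writing them into SQLite. The sketch below shows parameter binding against an in-memory database so it can be tried without the IMDB page; the table layout is copied from the script above, while `store_movies` and the sample row are placeholders invented for the example, not real IMDB data.

```python
import sqlite3


def store_movies(conn, rows):
    """Create the movies table if needed and insert rows using bound parameters."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS movies (
            id INTEGER PRIMARY KEY,
            title TEXT,
            link TEXT,
            rating TEXT,
            votes INTEGER
        )""")
    # The ? placeholders let sqlite3 handle quoting of every value.
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
# Placeholder row: the title deliberately contains double quotes.
store_movies(conn, [(1, 'A "Quoted" Title', "/title/tt0000001/", "9.1", "123456")])
title, votes = conn.execute("SELECT title, votes FROM movies").fetchone()
print(title, votes)
```

The votes value is passed as a string with the commas already removed; the INTEGER column affinity stores it as a number.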