我正在使用PHP和libtidy試圖屏蔽刮擦什麼可能是歷史上最可怕和格式不正確的HTML表使用。該網站關閉了幾個表格,tr,td,字體或粗體標籤,並始終在表格內嵌入許多不同層次的表格。屏幕抓取您生活中見過的最醜陋的HTML
例片段:
<center>
<table border="1" bordercolor="#000000" cellspacing="0" cellpadding="0">
<tr>
<td width="50%">
<center>
Home Team - <b>Wildcats<td>
<center>
Away Team - <b>Polar Bears<tr>
<td colspan="2">
<center>
<b><font size="+1">Rosters<tr>
<td valign="top">
<center>
<table border="0" cellspacing="0">
<tr>
<td>
<font size="2">1 <td>
<font size="2">Baird, T<tr>
<td>
<font size="2">2 <td>
<font size="2">Knight, P<tr>
<td>
<font size="2">8 <td>
<font size="2">Miller, B<tr>
<td>
<font size="2">9 <td>
<font size="2">Huebsch, B<tr>
<td>
<font size="2">11 <td>
<font size="2">Buschmann, C<tr>
<td>
<font size="2">12 <td>
<font size="2">Reding, J<tr>
<td>
<font size="2">14 <td>
<font size="2">Simpson, S<tr>
<td>
<font size="2">27 <td>
<font size="2">Kupferschmidt, M<tr>
<td>
<font size="2">28 <td>
<font size="2">Anderson, D<tr>
<td>
<font size="2">31 <td>
<font size="2">Gehrts, J<tr>
<td>
<font size="2">39 <td>
<font size="2">McGinnis, G<tr>
<td>
<font size="2">42 <td>
<font size="2">Temple, B<tr>
<td>
<font size="2">44 <td>
<font size="2">Kemplin, A<tr>
<td>
<font size="2">77 <td>
<font size="2">Weiner, B<tr>
<td>
<font size="2">95 <td>
<font size="2">
Zytkoskie, D</table>
<td valign="top">
<center>
<table border="0" cellspacing="0">
<tr>
<td>
<font size="2">5 <td>
<font size="2">Mack, A<tr>
<td>
<font size="2">8 <td>
<font size="2">Foucault, R<tr>
<td>
<font size="2">11 <td>
<font size="2">Oberpriller, D *<tr>
<td>
<font size="2">12 <td>
<font size="2">Underwood, J<tr>
<td>
<font size="2">15 <td>
<font size="2">Oberpriller, M<tr>
<td>
<font size="2">19 <td>
<font size="2">Langfus, B<tr>
<td>
<font size="2">25 <td>
<font size="2">Carroll, R<tr>
<td>
<font size="2">30 <td>
<font size="2">Hirdler, T<tr>
<td>
<font size="2">33 <td>
<font size="2">Gibson, S<tr>
<td>
<font size="2">35 <td>
<font size="2">Marthaler, C<tr>
<td>
<font size="2">44 <td>
<font size="2">Yurik, J<tr>
<td>
<font size="2">58 <td>
<font size="2">
Gronemeyer, S</table>
<tr>
<td colspan="2">
<center>
<b><font size="+1">Goals<tr>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<b><font size="2">Period<td>
<b><font size="2">Time<td>
<b><font size="2">Assist 1<td>
<b><font size="2">Assist 2<td>
<b><font size="2">SH<td>
<b><font size="2">PP<tr>
<td nowrap>
<font size="2">Kupferschmidt, M<td>
<font size="2">1<td>
<font size="2">12:51<td nowrap>
<font size="2">Kemplin, A<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">McGinnis, G<td>
<font size="2">1<td>
<font size="2">12:33<td nowrap>
<font size="2">Huebsch, B<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">Kupferschmidt, M<td>
<font size="2">2<td>
<font size="2">16:01<td nowrap>
<font size="2">None<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
<tr>
<td nowrap>
<font size="2">Buschmann, C<td>
<font size="2">3<td>
<font size="2">00:38<td nowrap>
<font size="2">None<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
</table>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<b><font size="2">Period<td>
<b><font size="2">Time<td>
<b><font size="2">Assist 1<td>
<b><font size="2">Assist 2<td>
<b><font size="2">SH<td>
<b><font size="2">PP<tr>
<td nowrap>
<font size="2">Oberpriller, D *<td>
<font size="2">3<td>
<font size="2">12:31<td nowrap>
<font size="2">Gronemeyer, S<td nowrap>
<font size="2">None<td>
<font size="2">
<center>
<td>
<font size="2">
<center>
</table>
<tr>
<td colspan="2">
<center>
<b><font size="+1">Penalties<tr>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<font size="2"><b>Period<td>
<font size="2"><b>Minutes<td>
<font size="2"><b>Offense<td>
<font size="2"><b>Start<td>
<font size="2"><b>Expired<tr>
<td nowrap>
<font size="2">Buschmann, C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">11:11<td>
<font size="2">09:11<tr>
<td nowrap>
<font size="2">Buschmann, C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Unsportmanlike Conduct<td>
<font size="2">03:26<td>
<font size="2">01:26<tr>
<td nowrap>
<font size="2">Bench<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Too Many Men<td>
<font size="2">01:46<td>
<font size="2">
00:00</table>
<td valign="top">
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Player<td>
<font size="2"><b>Period<td>
<font size="2"><b>Minutes<td>
<font size="2"><b>Offense<td>
<font size="2"><b>Start<td>
<font size="2"><b>Expired<tr>
<td nowrap>
<font size="2">Marthaler, C<td>
<font size="2">
<center>
1<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">01:19<td>
<font size="2">16:19<tr>
<td nowrap>
<font size="2">Underwood, J<td>
<font size="2">
<center>
2<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">12:32<td>
<font size="2">10:32<tr>
<td nowrap>
<font size="2">Marthaler, C<td>
<font size="2">
<center>
3<td>
<font size="2">
<center>
2<td>
<font size="2">Interference<td>
<font size="2">11:39<td>
<font size="2">
09:39</table>
<tr>
<td colspan="2">
<center>
<font size="+1"><b>Goalies<tr>
<td>
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Name<td>
<font size="2"><b>Shots<td>
<font size="2"><b>Goals<tr>
<td>
<font size="2">Baird, T<td>
<font size="2">20<td>
<font size="2">1<tr>
<td>
<font size="2"><b>Open Net<td>
<td>
<font size="2">
0</table>
<td>
<center>
<table border="1" cellspacing="0" width="100%">
<td>
<b><font size="2">Name<td>
<font size="2"><b>Shots<td>
<font size="2"><b>Goals<tr>
<td>
<font size="2">Hirdler, T<td>
<font size="2">42<td>
<font size="2">
奇妙的是,所有的瀏覽器似乎使這個就好了。 PHPTidy設法很好地理解了這一切,但這些表格嵌套得太深而且幾乎是隨機的,所以使用DOM XPath很難遍歷它。
有沒有人有任何建議採取其他方法?
驗屍:經過了太多比利時小麥啤酒和弄髒了我的代碼真正的好我通過通過用strip_tags去除所有標籤()除了表,TR和TD,然後通過libtidy運行它得到了很大的成績。它現在格式化得很漂亮,而且很容易遍歷。似乎它只是需要一點點按摩,然後將其發送到解析器。
真正悲哀的是,我見過更糟! – SirDemon 2009-04-09 02:03:27
借調。當你看到這樣的代碼與PHP混合在一起時,情況會更糟糕。 – epochwolf 2009-04-09 02:17:13