2009-08-03

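Help with a PHP regex using negative lookbehind

I am trying to write a simple function that closes missing HTML tags using PHP's preg_replace.

I thought this would be relatively straightforward, but for some reason it hasn't been.

What I'm basically trying to do is close the missing tag in the following row: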

<tr> 
<th class="ProfileIndent0"> 
<p>Global pharmaceuticals</p> 
<td>197.2</td> 
<td>94</td> 
</tr> 

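The approach I have been taking is to use a negative lookbehind to find an opening td tag that comes after an opening th tag but is not preceded by the matching closing th tag.

For example: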

$text = preg_replace('!<th(\s\S*){0,1}?>(.*)((?<!<\/th>)[\s]*<td>)!U','<th$1>$2</th>',$text); 

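I have written the regex pattern in countless different ways, to no avail. The problem is that I can't seem to match just the one opening td that is preceded by a missing </th>; instead it seems to match several opening td tags.

Here is the complete input text: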

<CO_TEXT text_type_id="6"> 
     <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr> 

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr> 

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr> 

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr> 

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li 
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release, pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT> 

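Is there something about PHP's negative lookbehind that I'm not aware of, or have I just not hit on the right pattern?

Any help would be appreciated.

Thanks, John

Hi! (Sorry, this is not an answer; just a thought; maybe it will help you consider that there might be other ways to do this.) Looking at your regex, only one thing comes to mind: a regex might not be the "right tool" for what you are trying to do... This one is already hard to read, and I don't think it will get any better once it has to handle whatever kind of messed-up pseudo-HTML might be fed to it... – 2009-08-03 22:44:52

Pascal, yes - I know what you mean. After banging my head against the wall for the past few days, I think there are better ways to tackle this. In particular, catching the bad HTML at the source rather than at the display end. – John 2009-08-04 14:59:28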

Answers

0

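"The problem is that I can't seem to match just the one opening td that is preceded by a missing </th>; instead it seems to match several opening td tags."

It sounds like you want a "non-greedy" or "lazy" matching expression. Use '*?' or '+?' instead of '*' or '+', and it will grab as few characters as possible to get a match, rather than as many as possible.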
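To make the difference concrete, here is a minimal sketch comparing greedy and lazy quantifiers on a shortened row. The sample string and variable names are made up for illustration. It also shows one detail that may matter here: the pattern in the question already uses the U modifier, which swaps the defaults, so a plain '*' is lazy and '*?' becomes greedy.

```php
<?php
// Two cells on one line; we want the contents of the first <td> only.
$row = '<td>197.2</td><td>94</td>';

// Greedy: .* grabs as much as possible, so the match runs to the last </td>.
preg_match('!<td>(.*)</td>!', $row, $greedy);

// Lazy: .*? grabs as little as possible, so the match stops at the first </td>.
preg_match('!<td>(.*?)</td>!', $row, $lazy);

// With the U modifier the defaults are inverted: a plain .* is already lazy,
// and appending ? makes it greedy again.
preg_match('!<td>(.*)</td>!U', $row, $ungreedy);
preg_match('!<td>(.*?)</td>!U', $row, $ungreedyWithQ);

echo $greedy[1], "\n";         // 197.2</td><td>94
echo $lazy[1], "\n";           // 197.2
echo $ungreedy[1], "\n";       // 197.2
echo $ungreedyWithQ[1], "\n";  // 197.2</td><td>94
```

If a pattern already ends in U, adding '?' to its quantifiers therefore works against the intent, so it is probably simpler to pick one style (U or explicit '?') and stick to it.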

Thanks Alan. I tried adding a ? in the appropriate places, but it didn't seem to make any difference. – John 2009-08-04 14:43:19

3

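While writing my comment on your question, I was thinking "there definitely has to be another solution, one that doesn't involve some kind of regex that will become impossible to maintain"...

Maybe I have found a way; take a look at DOMDocument::loadHTML and DOMDocument::saveHTML.

The manual for the first one states (quoting):

Unlike loading XML, HTML does not have to be well-formed to load.

And the manual for the second one says:

Creates an HTML document from the DOM representation.


Trying those on the non-valid HTML string you provided gives this example: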

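// The malformed fragment from the question: <th> is never closed.
$str = <<<STRING
<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>
STRING;

// loadHTML() accepts HTML that is not well-formed and builds a DOM from it.
$doc = new DOMDocument();
$doc->loadHTML($str);
// saveHTML() serializes the repaired tree back to an HTML string.
echo $doc->saveHTML();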

And when running it (from the command line, to avoid any trouble with escaping the HTML so that it displays properly), I get:

$ php ./temp.php 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
<html><body><tr> 
<th class="ProfileIndent0"> 
<p>Global pharmaceuticals</p> 
</th> 
<td>197.2</td> 
<td>94</td> 
</tr></body></html> 

Which, reformatted, gives:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
    "http://www.w3.org/TR/REC-html40/loose.dtd"> 
<html> 
    <body> 
     <tr> 
      <th class="ProfileIndent0"> 
       <p>Global pharmaceuticals</p> 
      </th> 
      <td>197.2</td> 
      <td>94</td> 
     </tr> 
    </body> 
</html> 

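It's still not perfect, I admit (it didn't add any <table> tag, for example), but at least the tags are now closed as they should be...

There might be some problems with the DOCTYPE and the <html> tags; you might not want those in the output... Take a look at some of the comments under the manual pages: they might help you ;-)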
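One possible way to drop the DOCTYPE and the <html>/<body> wrapper is to serialize only the children of <body>. This is a sketch rather than part of the original answer; it assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument, and the $fragment variable name is made up for illustration.

```php
<?php
$str = "<tr>\n<th class=\"ProfileIndent0\">\n<p>Global pharmaceuticals</p>\n"
     . "<td>197.2</td>\n<td>94</td>\n</tr>";

// Keep parser complaints out of the output while loading the fragment.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($str);
libxml_clear_errors();

// The parser wraps the fragment in <html><body>; serialize only what is
// inside <body> so the DOCTYPE and wrapper tags are left out.
$fragment = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $fragment .= $doc->saveHTML($child);
    }
}

echo $fragment, "\n";
```

The result should be the repaired row on its own, which is easier to splice back into the surrounding document than a full HTML page.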



Edit, after a bit more thought:

Your "full" example generates a few warnings; maybe you could tidy up your "HTML" a bit before feeding it to loadHTML...

Warning: DOMDocument::loadHTML(): Tag co_text invalid in Entity, 
    line: 1 in /home/squale/developpement/tests/temp/temp.php on line 18 
Warning: DOMDocument::loadHTML(): Tag text_data invalid in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18 
Warning: DOMDocument::loadHTML(): htmlParseStartTag: invalid element name in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18 
Warning: DOMDocument::loadHTML(): Unexpected end tag : table in Entity, 
    line: 10 in /home/squale/developpement/tests/temp/temp.php on line 18 

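At worst, you could mask these errors, either by using the error_reporting function before and after calling loadHTML, or by using the @ operator...
I would generally not recommend those, though: they should only be used in extreme cases - maybe this is one ^^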
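A third option, not mentioned above, is libxml_use_internal_errors(), which tells libxml to collect parse errors instead of raising PHP warnings, so they can be inspected or discarded explicitly. The sketch below uses a shortened, made-up input; which errors are reported (if any) depends on the libxml version installed.

```php
<?php
// Shortened stand-in for the full input: unknown wrapper tag, unclosed <th>.
$html = '<co_text text_type_id="6"><tr><th class="ProfileIndent0">'
      . '<p>Global pharmaceuticals</p><td>197.2</td></tr></co_text>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError object with line, code and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

// Empty the error buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $doc->saveHTML();
```

Compared with @ or error_reporting, this hides only the parser's complaints and leaves every other PHP error visible.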

Still, the result doesn't look too bad, actually:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
    "http://www.w3.org/TR/REC-html40/loose.dtd"> 
<html> 
<body> 
    <co_text text_type_id="6"> 
     <text_data> 
      <tr> 
       <th class="TableHead" colspan="21">2008 Sales</th> 
      </tr> 
      <tr> 
       <th class="ProfileIndent0"></th> 
       <th class="ProfileHead">$ mil.</th> 
       <th class="ProfileHead">% of total</th> 
      </tr> 
      <tr> 
       <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> </th> 
       <td>197.2</td> 
       <td>94</td> 
      </tr> 
      <tr> 
       <th class="ProfileIndent0">Impax pharmaceuticals</th> 
       <td>12.9</td> 
       <td>6</td> 
      </tr> 
      <tr> 
       <th class="ProfileTotal">Total</th> 
       <td class="ProfileDataTotal">210.1</td> 
       <td class="ProfileDataTotal">100</td> 
      </tr> 
      <h3>Selected Generic Products</h3> 
      <ul class="prodoplist"> 
       <li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li> 
       <li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li> 
       <li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li> 
       <li>Dantrolene sodium (generic Dantrium, spasticity)</li> 
       <li>Metformin Hcl (generic Glucophage XR, diabetes)</li> 
       <li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li> 
       <li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li> 
       <li>Oxycodone hydrochloride (generic OxyContin controlled release, pain)</li> 
       <li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li> 
      </ul> 
     ]]&gt; 
     </text_data> 
    </co_text> 
</body> 
</html> 


Anyway, as others have already suggested, a real HTML tidier/purifier might help ;-)

0

This regex worked for me:

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)[email protected]','<th$1>$2</th>',$text); 

Note that it only works on a single line. I mean, it works for:

<tr><th><td>some</td></tr> 

But it does not work for:

<tr><th> 
<td>some</td> 
</tr> 

I really don't know how to make it work with the "s" modifier. If anyone can explain it to me, I would be grateful.
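On the "s" modifier: by default "." does not match a newline, so ".*" cannot cross from one line to the next; adding "s" (PCRE_DOTALL) after the closing delimiter lifts that restriction. The sketch below illustrates this with a hypothetical pattern of its own, not the one from this answer: it closes a <th> only when a <td starts before any </th> is seen.

```php
<?php
// Hypothetical pattern, written for this illustration:
//   <th\b([^>]*)>            an opening <th ...> tag, attributes in group 1
//   ((?:(?!</th>|<td\b).)*)  any characters up to the next </th> or <td
//   (?=<td\b)                succeed only if a <td comes first (th left open)
$base = '@<th\b([^>]*)>((?:(?!</th>|<td\b).)*)(?=<td\b)@';

$multiLine = "<tr><th>\n<td>some</td>\n</tr>";
$closedRow = "<tr><th>Total</th>\n<td>100</td></tr>";

// With "s", "." also matches newlines, so the multi-line row is repaired.
$fixed = preg_replace($base . 's', '<th$1>$2</th>', $multiLine);

// A row whose <th> is already closed is left alone.
$untouched = preg_replace($base . 's', '<th$1>$2</th>', $closedRow);

// Without "s", "." stops at the newline and nothing matches.
$withoutS = preg_replace($base, '<th$1>$2</th>', $multiLine);

echo $fixed, "\n";
```

This covers only the single case of a <th> left open before a <td>; for arbitrary broken markup, the DOM-based approach in the other answer is likely to be more robust.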

Here is my example:

<?php 
$html = '<CO_TEXT text_type_id="6"> 
     <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr> 

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr> 

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr> 

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr> 

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li 
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release, pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>'; 

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)[email protected]','<th$1>$2</th>',$html); 
echo $text; 
?> 

Output:

<CO_TEXT text_type_id="6"> 
     <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr> 

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr> 

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr> 

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr> 

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td></th> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li 
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release, pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>