2014-02-10 32 views
Keep only certain tags in an HTML string using PHP

I am using simple_html_dom to scrape a website, and I need a result somewhere between ->innertext and ->plaintext.

For example, here is the source string:

<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>

I need to get rid of the span tags but not their contents (unless a span contains only &nbsp; entities), while keeping the <i>, <u>, and <b> tags.

So the result I am trying to achieve here would be this string:

[28] The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:

Answers

You can try the following lines of code:

<?php 

$string = '<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>';

// Remove attributes within the <span> tag, just for clarity's sake. 
$string = preg_replace('/(<span ([^\>]+)>)/i', '<span>', $string); 

// Remove any spans that only contain &nbsp; 
$string = preg_replace('/<span>([ ]|&nbsp;)*<\/span>/i', '', $string); 

// Replace any consecutive span (opening or closing) tags with a space, to make 
// clear the separation between one span and the next. 
$string = preg_replace('/<(\/)?span><(\/)?span>/i', ' ', $string); 

// Remove any remaining instances of opening or closing span tags.
$string = preg_replace('/<(\/)?span>/i', '', $string); 

print $string; 

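Note that I added an i after the closing slash of each regular expression, which makes the search case-insensitive. This covers the case where some of your markup contains <SPAN>, <span>, or even <SpaN>.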
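As a small illustration of what the i modifier changes, here is a sketch run against a made-up input string (the sample markup is an assumption, not taken from the question). It compares the same pattern with and without the modifier.

```php
<?php
// Hypothetical input mixing upper-, lower- and mixed-case span tags.
$html = '<SPAN>a</SPAN><span>b</span><SpaN>c</sPAn>';

// Without the i modifier, only the all-lowercase tags match.
$caseSensitive = preg_replace('/<\/?span>/', '', $html);

// With the i modifier, every casing variant matches.
$caseInsensitive = preg_replace('/<\/?span>/i', '', $html);

print $caseSensitive . "\n";   // <SPAN>a</SPAN>b<SpaN>c</sPAn>
print $caseInsensitive . "\n"; // abc
```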

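Of course, this is not a tightly compressed one-line regular expression. I wrote it this way on purpose so that you can see each step along the way. You can move the print $string; line up and down to watch the string change after each replacement. I hope that presenting the code like this helps you understand regular expressions and preg_replace better in the long run.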
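If you later want the four steps above in reusable form, one possible sketch is to pass the same patterns to preg_replace as arrays, which applies them in order. The function name clean_spans and the shortened sample string are assumptions made for illustration; they do not come from simple_html_dom or from the question.

```php
<?php
// clean_spans() is a hypothetical helper name used only for this sketch.
function clean_spans($html)
{
    $patterns = array(
        '/(<span ([^\>]+)>)/i',           // 1. drop attributes from opening span tags
        '/<span>([ ]|&nbsp;)*<\/span>/i', // 2. remove spans holding only spaces or &nbsp;
        '/<(\/)?span><(\/)?span>/i',      // 3. two adjacent span tags become one space
        '/<(\/)?span>/i',                 // 4. strip whatever span tags remain
    );
    $replacements = array('<span>', '', ' ', '');

    // With array arguments, preg_replace runs each pattern in order
    // on the result of the previous replacement.
    $result = preg_replace($patterns, $replacements, $html);

    // preg_replace returns null on a regex error; fall back to the input.
    return $result === null ? $html : $result;
}

// Shortened, made-up sample in the same shape as the question's string.
$sample = '<span lang="EN-CA">[28]<span style="x">&nbsp;&nbsp; </span></span>'
        . '<span lang="EN-CA">Section 19 of the <i>Trade-marks Act</i> provides:</span>';

print clean_spans($sample) . "\n";
// [28] Section 19 of the <i>Trade-marks Act</i> provides:
```

Keeping the steps as separate array entries preserves the readability of the step-by-step version while avoiding the repeated assignments.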

Thanks. This is exactly what I needed!

You can try this.

echo stripcslashes('<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>'); 

This is what strip_tags is for:

echo strip_tags('<span>strip me</span> <i>leave me alone</i>', '<i>'); 
//=> strip me <i>leave me alone</i> 
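
On its own, strip_tags leaves the &nbsp; entities in place, because they are text and not tags. A possible sketch that gets closer to the output the question asks for is to first replace spans containing only whitespace or &nbsp; with a single space, then let strip_tags keep <i>, <u>, and <b>. The shortened sample string below is an assumption made for illustration.

```php
<?php
// Shortened, made-up sample in the same shape as the question's string.
$html = '<span lang="EN-CA">[28]<span style="x">&nbsp;&nbsp; </span></span>'
      . '<span lang="EN-CA">Section 19 of the <i>Trade-marks Act</i> provides:</span>';

// Replace spans whose only content is whitespace or &nbsp; with one space.
$html = preg_replace('/<span\b[^>]*>(?:\s|&nbsp;)*<\/span>/i', ' ', $html);

// Strip every remaining tag except <i>, <u> and <b>.
$result = strip_tags($html, '<i><u><b>');

print $result . "\n";
// [28] Section 19 of the <i>Trade-marks Act</i> provides:
```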