2014-02-15 108 views

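Remove all HTML tags + content from text

OK, simple as it looks, I still can't get it right. I've tried regex, I've even tried DOM parsing, but I still can't solve it properly.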

Based on the answer to a previous question of mine (Trying to remove HTML tags (+ content) from String), this is what I've ended up with:

public static function removeHtmlTags($str) { 
     $dom = new DOMDocument(); 
     $errorState = libxml_use_internal_errors(true); 
     $dom->loadHTML($str); 

     $xpath = new DOMXPath($dom); 
     $node = $xpath->query('//body/p/text()')->item(0); 

     if (isset($node->textContent)) $ret = $node->textContent; 
     else $ret=""; 

     libxml_use_internal_errors($errorState); 

     return $ret; 
    } 

This seems to do the trick most of the time, but here's the catch...

This (in case you don't recognize it, it's a Wikipedia infobox):

|conventional_long_name = Italian Republic 
|native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}} 
|common_name = Italy 
|nickname(s) = Il Belpaese 
|image_flag = Flag of Italy.svg 
|image_coat = Italy-Emblem.svg 
|symbol_type = Emblem 
|image_map = EU-Italy.svg 
|map_caption = {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}} 
|national_anthem = {{native name|it|[[Il Canto degli Italiani]]}}<br/>{{small|''The Song of the Italians''}} [[File:Inno di Mameli instrumental.ogg|center]] 
|official_languages = [[Italian language|Italian]]<sup>a</sup> 
|Religion= [[Roman Catholic]] 
|capital = {{Coat of arms|Rome}} 
|latd=41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E 
|largest_city = capital 
|largest_metropolitan area = {{hlist |[[Milan]] |[[Naples]]}} 
|demonym = [[Italians|Italian]] 
|government_type = [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]] 
|leader_title1 = [[President of Italy|President]] 
|leader_name1 = [[Giorgio Napolitano]] 
|leader_title2 = [[Prime Minister of Italy|Prime Minister]] 
|leader_name2 = [[Enrico Letta]] 
|leader_title3 = [[List of Presidents of the Senate of Italy|President of the Senate]] 
|leader_name3 = [[Pietro Grasso]] 
|leader_title4 = [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]] 
|leader_name4 = [[Laura Boldrini]] 
|legislature = [[Parliament of Italy|Parliament]] 
|upper_house = [[Italian Senate|Senate of the Republic]] 
|lower_house = [[Italian Chamber of Deputies|Chamber of Deputies]] 
|accessionEUdate = 25 March 1957 (founding member) 
|EUseats = 78 
|area_rank = 72nd 
|area_magnitude = 1 E11 
|area_km2 = 301,338 
|area_sq_mi = 116,347 <!--Do not remove per [[WP:MOSNUM]]--> 
|percent_water = 2.4 
|population_census = 59,433,744<ref name="Istat">{{cite web |url=http://www.istat.it/it/files/2012/12/volume_popolazione-legale_XV_censimento_popolazione.pdf|title=Census 2011 - final results |publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=19 December 2012}}</ref> 
|population_census_year = 2011 
|population_census_rank = 23rd 
|population_estimate = 59,685,227<ref>{{cite web |url=http://www.istat.it/en/archive/94537|title=Resident population and population change|publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=25 June 2013}}</ref> 
|population_estimate_year = 2012 
|population_estimate_rank = 23rd 
|population_density_rank = 63rd 
|population_density_km2 = 197.7 
|population_density_sq_mi = 511.6 <!--Do not remove per [[WP:MOSNUM]]--> 
|GDP_PPP = $1.848 trillion<ref name=autogenerated1 >{{cite web |url=http://www.imf.org/external/pubs/ft/weo/2013/02/weodata/weorept.aspx?pr.x=25&pr.y=1&sy=2013&ey=2013&scsm=1&ssd=1&sort=country&ds=.&br=1&c=136&s=NGDPD%2CNGDPDPC%2CPPPGDP%2CPPPPC&grp=0&a= |title=Italy |publisher=International Monetary Fund |accessdate=17 October 2013}}</ref> 
|GDP_PPP_rank = 11th 
|GDP_PPP_year = 2014 
|GDP_PPP_per_capita = $30,218<ref name=autogenerated1/> 
|GDP_PPP_per_capita_rank = 34th 
|GDP_nominal = $2.148 trillion<ref name=autogenerated1/> 
|GDP_nominal_rank = 9th 
|GDP_nominal_year = 2014 
|GDP_nominal_per_capita = $35,123<ref name=autogenerated1/> 
|GDP_nominal_per_capita_rank = 27th 
|sovereignty_type = [[History of Italy|Formation]] 
|established_event1 = [[Italian unification|Unification]] 
|established_date1 = 17 March 1861 
|established_event2 = [[Italian constitutional referendum, 1946|Republic]] 
|established_date2 = 2 June 1946 
|Gini_year = 2011 
|Gini_change = <!--increase/decrease/steady--> 
|Gini = 31.9 <!--number only--> 
|Gini_ref = <ref name=eurogini>{{cite web|title=Gini coefficient of equivalised disposable income (source: SILC)|url=http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ilc_di12|publisher=Eurostat Data Explorer|accessdate=13 August 2013}}</ref> 
|Gini_rank = 
|HDI_year = 2013 
|HDI_change = increase <!--increase/decrease/steady--> 
|HDI = 0.881 <!--number only--> 
|HDI_ref = <ref name="HDI">{{cite web |url=http://hdr.undp.org/en/media/HDR_2011_EN_Table1.pdf |title=Human Development Report 2011 |year=2011 |publisher=United Nations |accessdate=5 November 2011}}</ref> 
|HDI_rank = 25th 
|currency = Euro ([[Euro sign|€]])<sup>b</sup> 
|currency_code = EUR 
|country_code = 
|time_zone = [[Central European Time|CET]] 
|utc_offset = +1 
|time_zone_DST = [[Central European Summer Time|CEST]] 
|utc_offset_DST = +2 
|drives_on = right 
|calling_code = [[Telephone numbers in Italy|39]]<sup>c</sup> 
|cctld = [[.it]]<sup>d</sup> 
|footnote_a = <span style="font-size:100%;">French is co-official in the [[Aosta Valley]]; [[Slovene language|Slovene]] is co-official in the [[province of Trieste]] and the [[province of Gorizia]]; German and [[Ladin language|Ladin]] are co-official in [[South Tyrol]].</span> 

|footnote_b = <span style="font-size:100%;">Before 2002, the [[Italian lira|Italian Lira]]. The euro is accepted in [[Campione d'Italia]], but the official currency there is the [[Swiss Franc]].<ref>{{cite web |url=http://www.comune.campione-d-italia.co.it/ |title=Comune di Campione d'Italia |publisher=Comune.campione-d-italia.co.it |date=14 July 2010 |accessdate=30 October 2010}}</ref></span> 
|footnote_c = <span style="font-size:100%;">To call [[Campione d'Italia]], it is necessary to use the Swiss code [[+41]].</span> 
|footnote_d = <span style="font-size:100%;">The [[.eu]] domain is also used, as it is shared with other [[European Union]] member states.</span> 

(after also explode-ing on newlines) becomes:

Array 
(
    [conventional_long_name] => Italian Republic 
    [native_name] => {{lang|it|''Repubblica italiana 
    [common_name] => Italy 
    [nickname(s)] => Il Belpaese 
    [image_flag] => Flag of Italy.svg 
    [image_coat] => Italy-Emblem.svg 
    [symbol_type] => Emblem 
    [image_map] => EU-Italy.svg 
    [map_caption] => {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}} 
    [national_anthem] => {{native name|it|[[Il Canto degli Italiani]]}} 
    [official_languages] => [[Italian language|Italian]] 
    [Religion] => [[Roman Catholic]] 
    [capital] => {{Coat of arms|Rome}} 
    [latd] => 41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E 
    [largest_city] => capital 
    [largest_metropolitan area] => {{hlist |[[Milan]] |[[Naples]]}} 
    [demonym] => [[Italians|Italian]] 
    [government_type] => [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]] 
    [leader_title1] => [[President of Italy|President]] 
    [leader_name1] => [[Giorgio Napolitano]] 
    [leader_title2] => [[Prime Minister of Italy|Prime Minister]] 
    [leader_name2] => [[Enrico Letta]] 
    [leader_title3] => [[List of Presidents of the Senate of Italy|President of the Senate]] 
    [leader_name3] => [[Pietro Grasso]] 
    [leader_title4] => [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]] 
    [leader_name4] => [[Laura Boldrini]] 
    [legislature] => [[Parliament of Italy|Parliament]] 
    [upper_house] => [[Italian Senate|Senate of the Republic]] 
    [lower_house] => [[Italian Chamber of Deputies|Chamber of Deputies]] 
    [accessionEUdate] => 25 March 1957 (founding member) 
    [EUseats] => 78 
    [area_rank] => 72nd 
    [area_magnitude] => 1 E11 
    [area_km2] => 301,338 
    [area_sq_mi] => 116,347 
    [percent_water] => 2.4 
    [population_census] => 59,433,744 
    [population_census_year] => 2011 
    [population_census_rank] => 23rd 
    [population_estimate] => 59,685,227 
    [population_estimate_year] => 2012 
    [population_estimate_rank] => 23rd 
    [population_density_rank] => 63rd 
    [population_density_km2] => 197.7 
    [population_density_sq_mi] => 511.6 
    [GDP_PPP] => $1.848 trillion 
    [GDP_PPP_rank] => 11th 
    [GDP_PPP_year] => 2014 
    [GDP_PPP_per_capita] => $30,218 
    [GDP_PPP_per_capita_rank] => 34th 
    [GDP_nominal] => $2.148 trillion 
    [GDP_nominal_rank] => 9th 
    [GDP_nominal_year] => 2014 
    [GDP_nominal_per_capita] => $35,123 
    [GDP_nominal_per_capita_rank] => 27th 
    [sovereignty_type] => [[History of Italy|Formation]] 
    [established_event1] => [[Italian unification|Unification]] 
    [established_date1] => 17 March 1861 
    [established_event2] => [[Italian constitutional referendum, 1946|Republic]] 
    [established_date2] => 2 June 1946 
    [Gini_year] => 2011 
    [Gini_change] => 
    [Gini] => 31.9 
    [Gini_ref] => 
    [HDI_year] => 2013 
    [HDI_change] => increase 
    [HDI] => 0.881 
    [HDI_ref] => 
    [HDI_rank] => 25th 
    [currency] => Euro ([[Euro sign|â¬]]) 
    [currency_code] => EUR 
    [time_zone] => [[Central European Time|CET]] 
    [utc_offset] => +1 
    [time_zone_DST] => [[Central European Summer Time|CEST]] 
    [utc_offset_DST] => +2 
    [drives_on] => right 
    [calling_code] => [[Telephone numbers in Italy|39]] 
    [cctld] => [[.it]] 
    [footnote_a] => 
    [footnote_b] => 
    [footnote_c] => 
    [footnote_d] => 
) 

What I'd like to know is what happened to:

|native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}

Shouldn't it come out as:

|native_name = {{lang|it|''Repubblica italiana''}}

Instead, it seems to drop the text that follows the HTML comment.

Any ideas?

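Don't you want to remove the HTML comments? –

@AmalMurali Well, I *do*. But if you look closely at the result above, it doesn't remove *just* the comment, it also removes the content that follows it. That seems strange... why would that be? –

Possible duplicate: http://stackoverflow.com/questions/2630159/strip-tags-and-everything-in-between – Niels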

Answers

The way from hell:

$str = substr($str, 1); 
$lines = explode("\n|", $str); 

$result = array(); 

$pattern = '~ 
# subpattern definitions 
(?(DEFINE) 
    (?<c> <!--.*?-->)  # html comment 
    (?<tag>     # tag (possible nested tags with the same name) 
     ( <(\w++) 
      (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2))* 
      </\g{-1}>) 
    ) 
    (?<sctag> <\w++[^>]*>) # self closing tag 
) 
# main pattern 
\g<c> | \g<tag> | \g<sctag> | \s+$ 
~x'; 

foreach($lines as $line) { 
    $kv = explode(' = ', $line, 2); 

    $kv[1] = (isset($kv[1])) ? preg_replace($pattern, '', $kv[1]) : null; 

    $result[$kv[0]] = $kv[1]; 
} 
unset($kv, $pattern, $lines, $str); 
echo '<pre>' . htmlspecialchars(print_r($result, true)) . '</pre>'; 

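Note 1: Since the string contains unusual tags (i.e. tags that are not HTML tags), the same tag may appear sometimes as a self-closing tag and sometimes as a paired one. In other words, you can find <ref>....</ref><ref/> (or <ref> used as a self-closing tag) in the same document. To handle this particular case, you can change the middle line of the tag subpattern definition to: (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) | <\g{-1}\b[^>]*?/?>)*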
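A small sketch of the situation Note 1 describes, assuming PHP 7 or later. The helper name stripWikiTags and the sample values are made up for illustration. The pattern is the one above, with the self-closing subpattern written as <\w++[^>]*> and the middle line of the tag subpattern swapped for the variant from Note 1.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of the answer).
function stripWikiTags(string $value): string
{
    $pattern = '~
    (?(DEFINE)
        (?<c> <!--.*?-->)        # html comment
        (?<tag>                  # paired tag, possibly nested
         ( <(\w++)
          (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) | <\g{-1}\b[^>]*?/?>)*
          </\g{-1}>)
        )
        (?<sctag> <\w++[^>]*>)   # self closing tag
    )
    \g<c> | \g<tag> | \g<sctag> | \s+$
    ~x';

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $value) ?? $value;
}

// Paired <ref> followed by a self-closing <ref/> in the same value.
echo stripWikiTags('59,433,744<ref name="Istat">{{cite web |title=Census 2011}}</ref><ref name=autogenerated1/> '), "\n";
// prints: 59,433,744

// Self-closing <ref/> followed by a paired <ref>.
echo stripWikiTags('$30,218<ref name=autogenerated1/><ref>note</ref>'), "\n";
// prints: $30,218

// HTML comment in the middle of a value: only the comment is removed.
echo stripWikiTags("{{lang|it|''Repubblica italiana<!--italiana is without uppercase-->''}}"), "\n";
// prints: {{lang|it|''Repubblica italiana''}}
```

These values only place paired and self-closing <ref> tags side by side; a self-closing <ref/> nested inside a paired <ref> is not exercised here.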

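Note 2: If you don't want to use regex, the way to go is DOM, but since the <ref> tag doesn't exist in HTML, you must write your own DTD that describes this tag (and all other HTML tags), add it to your string, and use the loadXML method of the DOMDocument class.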
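A lighter DOM variant is also possible. This is a different technique from the DTD and loadXML route in Note 2: it keeps the question's loadHTML() approach and changes only how the text is collected. The question's function drops the text after the comment because the HTML comment splits the surrounding text into two text nodes, and item(0) returns only the first one. The sketch below concatenates every direct text node instead. The function name removeHtmlTagsAllText is hypothetical, and the XPath assumes libxml places bare text either directly in <body> or in an implied <p>, which may vary between libxml versions.

```php
<?php
// Hypothetical variant of the question's removeHtmlTags(); the name is an assumption.
function removeHtmlTagsAllText(string $str): string
{
    if ($str === '') {
        return ''; // loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    // Silence parser warnings about unknown tags such as <ref>.
    $errorState = libxml_use_internal_errors(true);
    $dom->loadHTML($str);
    libxml_clear_errors();
    libxml_use_internal_errors($errorState);

    $xpath = new DOMXPath($dom);
    $ret = '';
    // Direct text children only: text inside nested elements is left out,
    // and comments are skipped because they are not text nodes.
    foreach ($xpath->query('//body/text() | //body/p/text()') as $node) {
        $ret .= $node->textContent;
    }

    return $ret;
}

echo removeHtmlTagsAllText("{{lang|it|''Repubblica italiana<!--italiana is without uppercase-->''}}"), "\n";
echo removeHtmlTagsAllText('Euro<sup>b</sup>'), "\n";
```

Note that loadHTML() assumes ISO-8859-1 when the input declares no encoding, which is a likely reason the € sign in the output above appears garbled.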
