Minifying final HTML output using regular expressions with CodeIgniter

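Google's page guidelines suggest minifying HTML, that is, removing all unnecessary whitespace. CodeIgniter can gzip its output, or that can be done via .htaccess. But I would still like to remove unnecessary whitespace from the final HTML output as well.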

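I played around with this piece of code, and it seems to work. It does produce HTML without excess whitespace, and it also strips out the remaining tab formatting.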

class Welcome extends CI_Controller
{
    // CodeIgniter passes the finalized output to _output() as its first argument.
    function _output($output)
    {
        echo preg_replace('!\s+!', ' ', $output);
    }

    function index()
    {
        ...
    }
}

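The problem is that there may be tags such as <pre>, <textarea>, etc. whose contents contain significant whitespace, and the regular expression above would collapse it too. So how do I remove excess whitespace from the final HTML with a regular expression, without affecting the spacing or formatting inside these particular tags?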
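To make the problem concrete, here is a small sketch showing the naive pattern flattening the contents of a `<pre>` block. The sample markup and variable names are made up for illustration.

```php
<?php
// Illustrative input only: a <pre> block whose line break and indentation matter.
$html = "<div>\n    <pre>line 1\n    line 2</pre>\n</div>";

// The naive approach from the question: collapse every whitespace run to one space.
$minified = preg_replace('!\s+!', ' ', $html);

echo $minified, "\n";
```

The line break and indentation between "line 1" and "line 2" are reduced to a single space, which changes how the `<pre>` block renders.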

Thanks to @Alan Moore I got the answer; this worked for me:

echo preg_replace('#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#', ' ', $output);
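As a quick check of what this pattern does, the sketch below wraps it in a function and runs it on a small sample. The function name, the fallback on error, and the sample markup are assumptions added for illustration; only the pattern itself comes from the thread.

```php
<?php
// Hypothetical wrapper around the pattern quoted above; the name is ours.
function collapse_whitespace_outside_pre(string $output): string
{
    $re = '#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#';
    $result = preg_replace($re, ' ', $output);

    // preg_replace() returns null on a PCRE error; fall back to the untouched input.
    return $result === null ? $output : $result;
}

// Illustrative input: whitespace outside <pre> is noise, inside it is significant.
$html = "<div>\n    <p>text</p>\n<pre>line 1\n    line 2</pre>\n</div>";

echo collapse_whitespace_outside_pre($html), "\n";
```

Whitespace runs outside the `<pre>` element become single spaces, while the line break and indentation inside it are left alone.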

ridgerunner did a very good job of analyzing this regular expression. I ended up using his solution. Cheers to ridgerunner.


2011-03-15 Aman

+12

Don't use regular expressions to parse HTML. – SLaks 2011-03-15 13:23:53

Infinite upvotes to you, SLaks. – 2011-03-15 13:24:26

OK, then what would be a good way to reformat the final HTML output? – Aman 2011-03-15 13:30:54

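For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so that it can be read by mere mortals: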

function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        ) # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}

I'm new around here, but I can see right away that Alan is very good at regex. I would only add the following suggestions.

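There is an unnecessary group which can be removed.
Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
On a more serious note, this same alternation group, i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings, which can result in a stack overflow that makes the Apache/PHP executable silently seg-fault and crash with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible to this because it has only a 256KB stack, whereas the *nix executables are typically built with a stack of 8MB or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in that document (making the first alternative possessive), there will still be a lot of recursion if the HTML file is large and has many non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script crashes on a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit far too high, at 100000. (According to the PCRE docs, it should be set to a value equal to the stack size divided by 500.)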
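The "stack size divided by 500" rule of thumb quoted above can be worked through as a short sketch. The helper name is hypothetical, and the stack sizes are the two examples mentioned in the text (256KB and 8MB); the real stack size of your PHP binary is something you would have to determine yourself.

```php
<?php
// Hypothetical helper: derive a pcre.recursion_limit from a stack size in bytes,
// using the "stack size divided by 500" rule of thumb from the PCRE docs.
function recursion_limit_for_stack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

$win32Limit = recursion_limit_for_stack(256 * 1024);      // 256KB stack
$nixLimit   = recursion_limit_for_stack(8 * 1024 * 1024); // 8MB stack

// Assumption: this process runs with an 8MB stack. This sketch does not detect it.
ini_set('pcre.recursion_limit', (string) $nixLimit);

echo $win32Limit, ' ', $nixLimit, "\n"; // prints "524 16777"
```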

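Here is an improved version which is faster than the original, handles larger inputs, and fails gracefully with a message if the input string is too large to handle: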

// Set PCRE recursion limit to sane value = STACKSIZE/500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777"); // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        ) # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}
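If aborting the request with exit() is too drastic, one alternative is to serve the unminified markup whenever PCRE gives up. The following is a minimal sketch of that idea, not part of the answer above: the function name and the by-reference `$errorCode` parameter are assumptions made for illustration, while `preg_last_error()` and the `PREG_*` constants are standard PHP.

```php
<?php
// Hypothetical wrapper: apply a whitespace-collapsing pattern, but return the
// original markup unchanged if PCRE reports an error instead of aborting.
function minify_or_passthrough(string $pattern, string $html, ?int &$errorCode = null): string
{
    $result = preg_replace($pattern, ' ', $html);

    // preg_last_error() is PREG_NO_ERROR on success, or e.g.
    // PREG_RECURSION_LIMIT_ERROR / PREG_BACKTRACK_LIMIT_ERROR on failure.
    $errorCode = preg_last_error();

    if ($result === null) {
        return $html; // serve the page unminified rather than fail the request
    }
    return $result;
}

// Usage with a deliberately simple pattern, for illustration only.
echo minify_or_passthrough('/\s{2,}/', "<p>a   b</p>", $code), "\n";
```

The caller can inspect `$code` to log why minification was skipped, for example when a recursion or backtrack limit was hit on a large page.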

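P.S. I'm intimately familiar with this PHP/Apache seg-fault problem, because I was involved in helping the Drupal community while they were wrestling with it. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also ran into it with the BBCode parser in the FluxBB forum software project.

Hope this helps.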


2011-03-16 10:31:38 ridgerunner

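Wow, that's quite an in-depth analysis; I didn't know all of these details. Thanks a lot, I'll try your regex. – Aman 2011-03-17 04:41:08

Can I have the test file you used? – Aman 2011-03-17 10:28:27

@Aman Yes, but it will be a while before I post it (the file is an article in progress (in HTML)...) – ridgerunner 2011-03-18 05:02:38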

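I implemented @ridgerunner's answer in two projects, and ended up hitting some severe slowdowns (10-30 second request times) in one of them. I found that I had to set both pcre.recursion_limit and pcre.backtrack_limit quite low for it to work at all, but even then it would give up after about 2 seconds of processing and return false. For that reason I replaced it with this solution (with an easier-to-grasp regex), which is inspired by the outputfilter.trimwhitespace function from Smarty 2. It does no backtracking or recursion, and works every time (instead of failing catastrophically once in a blue moon):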

function filterHtml($input) {
    // Remove HTML comments, but not SSI
    $input = preg_replace('/<!--[^#](.*?)-->/s', '', $input);

    // The content inside these tags will be spared:
    $doNotCompressTags = ['script', 'pre', 'textarea'];
    $matches = [];

    foreach ($doNotCompressTags as $tag) {
        $regex = "!<{$tag}[^>]*?>.*?</{$tag}>!is";

        // It is assumed that this placeholder could not appear organically in your
        // output. If it can, you may have an XSS problem.
        $placeholder = "@@<'-placeholder-$tag'>@@";

        // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
        $input = preg_replace_callback(
            $regex,
            function ($match) use ($tag, &$matches, $placeholder) {
                $matches[$tag][] = $match[0];
                return $placeholder;
            },
            $input
        );
    }

    // Remove whitespace (spaces, newlines and tabs)
    $input = trim(preg_replace('/[ \n\t]+/m', ' ', $input));

    // Iterate the blocks we replaced with placeholders beforehand, and replace the placeholders
    // with the original content.
    foreach ($matches as $tag => $blocks) {
        $placeholder = "@@<'-placeholder-$tag'>@@";
        $placeholderLength = strlen($placeholder);
        $position = 0;

        foreach ($blocks as $block) {
            $position = strpos($input, $placeholder, $position);
            if ($position === false) {
                throw new \RuntimeException("Missing placeholder of type $tag in input string");
            }
            $input = substr_replace($input, $block, $position, $placeholderLength);
            // Continue searching after the restored block, not inside it.
            $position += strlen($block);
        }
    }

    return $input;
}
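The same "protect first, collapse the rest" idea can also be sketched without placeholders by splitting the markup on the protected blocks. This is a variation, not the answer's code: the function name and sample markup are hypothetical, and unlike the function above it does not strip HTML comments.

```php
<?php
// Hypothetical helper: collapse whitespace only outside <script>, <pre>, <textarea>.
function collapseOutsideProtected(string $html): string
{
    // A single capture group: with PREG_SPLIT_DELIM_CAPTURE, preg_split() returns
    // outside text at even indexes and each protected block at odd indexes.
    $pattern = '~(<script\b[^>]*>.*?</script>'
             . '|<pre\b[^>]*>.*?</pre>'
             . '|<textarea\b[^>]*>.*?</textarea>)~is';

    $parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // PCRE error: leave the markup untouched
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Outside a protected block: collapse each whitespace run to one space.
            $parts[$i] = preg_replace('/\s+/', ' ', $part) ?? $part;
        }
    }

    return implode('', $parts);
}

// Illustrative input only.
echo collapseOutsideProtected("<div>\n  <pre>a\n  b</pre>\n\n<p>x</p>\n</div>"), "\n";
```

Because the protected blocks never leave the array, there is no placeholder string that could collide with real page content.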


2016-08-10 22:07:16 olemartinorg
