2016-05-17

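PHP preg_replace: find matches in HTML, but not when they are inside an HTML attribute value

I have two regular expressions, one that matches [value] and another that matches HTML attributes, but I need to combine them into a single regular expression.

Here is my working regex that finds [value]: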

$tagregexp = '[a-zA-Z_\-][0-9a-zA-Z_\-\+]{2,}'; 

    $pattern = 
      '\\['        // Opening bracket 
     . '(\\[?)'       // 1: Optional second opening bracket for escaping shortcodes: [[tag]] 
     . "($tagregexp)"      // 2: Shortcode name 
     . '(?![\\w-])'      // Not followed by word character or hyphen 
     . '('        // 3: Unroll the loop: Inside the opening shortcode tag 
     .  '[^\\]\\/]*'     // Not a closing bracket or forward slash 
     .  '(?:' 
     .   '\\/(?!\\])'    // A forward slash not followed by a closing bracket 
     .   '[^\\]\\/]*'    // Not a closing bracket or forward slash 
     .  ')*?' 
     . ')' 
     . '(?:' 
     .  '(\\/)'      // 4: Self closing tag ... 
     .  '\\]'       // ... and closing bracket 
     . '|' 
     .  '\\]'       // Closing bracket 
     .  '(?:' 
     .   '('      // 5: Unroll the loop: Optionally, anything between the opening and closing shortcode tags 
     .    '[^\\[]*+'    // Not an opening bracket 
     .    '(?:' 
     .     '\\[(?!\\/\\2\\])' // An opening bracket not followed by the closing shortcode tag 
     .     '[^\\[]*+'   // Not an opening bracket 
     .    ')*+' 
     .   ')' 
     .   '\\[\\/\\2\\]'    // Closing shortcode tag 
     .  ')?' 
     . ')' 
     . '(\\]?)';       // 6: Optional second closing bracket for escaping shortcodes: [[tag]] 

example here

This regex, (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?, matches attributes and their values. example here

I want the regex to find a match in the following examples:

  • <div [value] ></div>
  • <div>[value]</div>

but not to find a match for [value] in this example:

  • <input attr="attribute[value]"/>

I just need to turn this into a single regex that I can use in my preg_replace_callback:

preg_replace_callback("/$pattern/", 'replace_matches', $html); // pattern wrapped in delimiters; callback name passed as a string
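Have you considered using a parser instead? – chris85

This is a PHP string, not a Java string; you don't need to escape everything. Use the x modifier (with a nowdoc string if you can) instead of concatenation. If you want to process HTML (or XML), forget regex and use DOMDocument (and eventually DOMXPath).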
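As a rough sketch of the parser route suggested in the comment above: the function name replace_in_text_nodes is hypothetical, and the sketch assumes the DOM extension is available. It only rewrites [value] inside text nodes. Attribute values are never text nodes, so they are left alone, but a bare [value] placed in the attribute area of a tag (the first case in the question) is not handled here.

```php
<?php
// Hypothetical helper: replace [value] only where it appears in text nodes.
function replace_in_text_nodes(string $html, callable $replace): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Select every text node that contains an opening square bracket.
    foreach ($xpath->query('//text()[contains(., "[")]') as $node) {
        $node->nodeValue = preg_replace_callback(
            '/\[([^]]*)]/',
            function (array $m) use ($replace) {
                return $replace($m[1]); // $m[1] is the text between the brackets
            },
            $node->nodeValue
        );
    }
    // Returns a full document, including the wrapper elements loadHTML adds.
    return $doc->saveHTML();
}

echo replace_in_text_nodes(
    '<div>[value]</div><input attr="attribute[value]">',
    'strtoupper'
);
```

Note that loadHTML wraps the fragment in html and body elements, so the returned markup is a complete document rather than the original fragment.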

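Another thing: the closing square bracket is not a special character, so you don't need to escape it. An opening square bracket has no special meaning inside a character class either, so you can write '[^[]' instead of '[^\\[]'. *(You can even write '[^]]' and '[]]', because in the first position the closing bracket is treated as a literal character.)*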
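A small check of the escaping advice in the comment above, using an arbitrary sample string that is not from the question. It compares the heavily escaped forms with the lighter ones:

```php
<?php
$subject = 'before [value] after';

// Fully escaped form versus the lighter form from the comment.
$escaped   = '/\\[([^\\]]*)\\]/';
$unescaped = '/\[([^]]*)]/';   // "]" is literal first in a class and outside a class

preg_match($escaped, $subject, $a);
preg_match($unescaped, $subject, $b);
var_dump($a[1], $b[1]);        // both capture "value"

// "[" needs no escape inside a character class.
preg_match('/[^\\[]+/', $subject, $c);
preg_match('/[^[]+/', $subject, $d);
var_dump($c[0] === $d[0]);     // bool(true), both match "before "
```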

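Answer

Foreword

On the surface it looks like you are trying to parse HTML with a regular expression. I feel obliged to point out that using regular expressions to parse HTML is inadvisable because of all the obscure edge cases that can crop up, but it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.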

Description

<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\]) 
| 
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*> 
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\] 


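This regular expression will do the following:

  • Capture substrings wrapped in square brackets, such as [some value], provided the [value]

    • is in the attribute area of a tag, or
    • is not inside a tag at all, and
    • is not nested inside the value of another attribute, as in <input attrib=" [value] ">
  • The captured substring does not include the wrapping square brackets
  • Allows any tag name; replace \w+ with the desired tag name to restrict it
  • Allows value to be any string that does not contain a closing square bracket
  • Avoids difficult edge cases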

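Note: this expression is best used with the following flags:

  • global
  • dot matches new line
  • ignore whitespace in the expression
  • allow duplicate named capture groups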
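A minimal sketch of how those flags might map onto PHP, assuming PHP 7.2 or later (for the J modifier and PREG_UNMATCHED_AS_NULL). The helper name bracket_values and the short test string are illustrative only. Both capture groups share one name, so the sketch reads them by number and takes whichever branch took part in the match.

```php
<?php
// global             -> preg_match_all() visits every match
// dot matches \n     -> modifier "s"
// ignore whitespace  -> modifier "x"
// duplicate names    -> modifier "J"
$pattern = <<<'REGEX'
~
<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\])
|
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\]
~sxJ
REGEX;

// Hypothetical helper: return the text found between the square brackets.
function bracket_values(string $pattern, string $html): array
{
    $values = [];
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);
    foreach ($matches as $m) {
        // Group 1 belongs to the first branch, group 2 to the second.
        $values[] = $m[1] ?? $m[2];
    }
    return $values;
}

$html = '<div [value] ></div>' . "\n"
      . '<div>[other]</div>' . "\n"
      . '<input attr="attribute[hidden]"/>';

print_r(bracket_values($pattern, $html)); // value, other
```

The nowdoc keeps the expression free of PHP-level escaping, and ~ is used as the delimiter so the \/ already present in the expression needs no changes.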

Examples

Live Demo

https://regex101.com/r/tT0bN5/1

Sample text

<div [value 1] ></div> 
<div>[value 2]</div> 
but not find a match in this example 

<div attr="attribute[value 3]"/> 
<img [value 4]> 
<a href="http://[value 5]">[value 6]</a> 

Sample matches

MATCH 1 
DesiredValue [6-13] `value 1` 
MATCH 2 
DesiredValue [29-36] `value 2` 
MATCH 3 
DesiredValue [121-128] `value 4` 
MATCH 4 
DesiredValue [159-166] `value 6` 
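
The matches above report capture offsets, and in the first branch the capture sits inside a lookahead, so it lies outside the text of the overall match. One possible way to turn the matches into replacements is therefore to collect the offsets and splice the string from right to left, rather than rebuilding match 0 in a preg_replace_callback callback. This is only a sketch under that assumption; replace_bracket_values is a hypothetical name and strtoupper stands in for the real callback.

```php
<?php
$pattern = <<<'REGEX'
~
<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\])
|
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\]
~sxJ
REGEX;

$sample = <<<'HTML'
<div [value 1] ></div>
<div>[value 2]</div>
but not find a match in this example

<div attr="attribute[value 3]"/>
<img [value 4]>
<a href="http://[value 5]">[value 6]</a>
HTML;

// Hypothetical helper: replace each captured [value], brackets included.
function replace_bracket_values(string $pattern, string $html, callable $replace): string
{
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $spans = []; // byte offset of the captured text => captured text
    foreach ($matches as $m) {
        // Use group 2 when the second branch matched, otherwise group 1.
        $hit = (isset($m[2]) && $m[2][1] >= 0) ? $m[2] : $m[1];
        $spans[$hit[1]] = $hit[0];
    }

    krsort($spans); // splice from the end so earlier offsets stay valid
    foreach ($spans as $offset => $value) {
        $html = substr_replace($html, $replace($value), $offset - 1, strlen($value) + 2);
    }
    return $html;
}

echo replace_bracket_values($pattern, $sample, 'strtoupper');
```

Keying the spans by offset also drops duplicates if two matches happen to capture the same bracketed text.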

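Explanation (shown for a variant of the expression with the tag name fixed to div and numbered rather than named capture groups)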

NODE      EXPLANATION 
---------------------------------------------------------------------- 
    <div      '<div' 
---------------------------------------------------------------------- 
    \s      whitespace (\n, \r, \t, \f, and " ") 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more 
          times (matching the least amount 
          possible)): 
---------------------------------------------------------------------- 
     [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     ='      '=\'' 
---------------------------------------------------------------------- 
     [^']*     any character except: ''' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     '      '\'' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     ="      '="' 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     "      '"' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     =      '=' 
---------------------------------------------------------------------- 
     [^'"]     any character except: ''', '"' 
---------------------------------------------------------------------- 
     [^\s>]*     any character except: whitespace (\n, 
           \r, \t, \f, and " "), '>' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
    )*?      end of grouping 
---------------------------------------------------------------------- 
    \[      '[' 
---------------------------------------------------------------------- 
    (      group and capture to \1: 
---------------------------------------------------------------------- 
     [^\]]*     any character except: '\]' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
    )      end of \1 
---------------------------------------------------------------------- 
    \]      ']' 
---------------------------------------------------------------------- 
)      end of look-ahead 
---------------------------------------------------------------------- 
|      OR 
---------------------------------------------------------------------- 
    <div      '<div' 
---------------------------------------------------------------------- 
    \s?      whitespace (\n, \r, \t, \f, and " ") 
          (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more times 
          (matching the most amount possible)): 
---------------------------------------------------------------------- 
    [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    ='      '=\'' 
---------------------------------------------------------------------- 
    [^']*     any character except: ''' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    '      '\'' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    ="      '="' 
---------------------------------------------------------------------- 
    [^"]*     any character except: '"' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    "      '"' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    =      '=' 
---------------------------------------------------------------------- 
    [^'"]     any character except: ''', '"' 
---------------------------------------------------------------------- 
    [^\s>]*     any character except: whitespace (\n, 
          \r, \t, \f, and " "), '>' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
)*      end of grouping 
---------------------------------------------------------------------- 
    >      '>' 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more times 
          (matching the most amount possible)): 
---------------------------------------------------------------------- 
    (?!      look ahead to see if there is not: 
---------------------------------------------------------------------- 
     <      '<' 
---------------------------------------------------------------------- 
     \/      '/' 
---------------------------------------------------------------------- 
     div>      'div>' 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    (?!      look ahead to see if there is not: 
---------------------------------------------------------------------- 
     \[      '[' 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    .      any character 
---------------------------------------------------------------------- 
)*      end of grouping 
---------------------------------------------------------------------- 
    \[      '[' 
---------------------------------------------------------------------- 
    (      group and capture to \2: 
---------------------------------------------------------------------- 
    [^\]]*     any character except: '\]' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
)      end of \2 
---------------------------------------------------------------------- 
    \]      ']' 
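Incredible answer, I appreciate the time and effort that went into it. I still haven't fully solved it, but this should help a lot. – TarranJones

Let me know what this answer is missing, or if I can help.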