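Depending on the user input and where your users come from, you may need to be more tolerant. Much as I dislike unquoted tag attributes, you do run into them. You will also run into stray ("lazy") brackets inside tags, such as value="4 > 3".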
(?<=<)([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*(?=>)
or
<([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*>
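As a quick check, here is a minimal Python sketch that runs both patterns against a tag with an unquoted attribute and a `>` inside a quoted value. The names `TAG_WITH_BRACKETS`, `TAG_INSIDE_BRACKETS`, `find_tags`, and the sample string are illustrative assumptions, not part of the original answer; the naive `<.*?>` pattern is included only for contrast.

```python
import re

# Second pattern from above: consumes the angle brackets.
TAG_WITH_BRACKETS = re.compile(
    r'''<([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*>'''
)

# First pattern from above: only asserts the brackets with lookarounds.
TAG_INSIDE_BRACKETS = re.compile(
    r'''(?<=<)([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*(?=>)'''
)


def find_tags(text):
    """Return (tag_name, raw_attribute_text) for each opening tag found."""
    return [(m.group(1), m.group(2)) for m in TAG_WITH_BRACKETS.finditer(text)]


# Hypothetical input: unquoted attribute, '>' inside a quoted value,
# and a valueless attribute.
sample = 'x <input type=text value="4 > 3" disabled> y'

print(find_tags(sample))
# [('input', ' type=text value="4 > 3" disabled')]

print(TAG_INSIDE_BRACKETS.search(sample).group(0))
# input type=text value="4 > 3" disabled

# A naive lazy match stops at the '>' inside the quoted value.
print(re.search(r'<.*?>', sample).group(0))
# <input type=text value="4 >
```

Note that neither pattern matches closing tags such as `</p>`, because `/` is not in `[\w-]`.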
Explanation of the first regex (the second is similar, but it actually consumes the brackets instead of asserting them with lookarounds).
(?<= # Opens LB
< # Literal <
) # Closes LB
( # Opens CG1
[\w-]+ # Character class (any of the characters within)
# Token: \w (a-z, A-Z, 0-9, _)
# Any of: -
# + repeats one or more times
) # Closes CG1
( # Opens CG2
(?: # Opens NCG
\s+ # Token: \s (white space)
[\w-]+ # Character class (any of the characters within)
# Token: \w (a-z, A-Z, 0-9, _)
# Any of: -
\s* # Token: \s (white space)
# * repeats zero or more times
(?: # Opens NCG
= # Literal =
\s* # Token: \s (white space)
(?: # Opens NCG
[^"'>\s]+ # Negated Character class (excludes the characters within)
# None of: "'>
# Token: \s (white space)
| # Alternation (NCG)
( # Opens CG3
" # Literal "
| # Alternation (CG3)
' # Literal '
) # Closes CG3
.*? # . denotes any single character, except for newline
# * repeats zero or more times
# ? as few times as possible
\3 # A backreference to CG3
# This is a repeat of the match, not the pattern.
# If this is an Octal Escape try padding with 0s like \003.
) # Closes NCG
)? # Closes NCG
# ? repeats zero or one times
)* # Closes NCG
) # Closes CG2
\s* # Token: \s (white space)
(?= # Opens LA
> # Literal >
) # Closes LA
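The breakdown above maps almost directly onto a free-spacing regex. The following is a sketch in Python using `re.VERBOSE`, assuming Python's `re` engine; `TAG_VERBOSE`, `ATTR`, and `parse_attributes` are hypothetical names introduced here, and the attribute-splitting helper is an addition for illustration rather than something the answer itself provides.

```python
import re

# The first pattern, written in free-spacing mode so it can carry comments.
TAG_VERBOSE = re.compile(r'''
    (?<=<)                  # lookbehind: a literal < must precede the match
    ([\w-]+)                # group 1: tag name
    (                       # group 2: all attributes, as raw text
      (?:
        \s+ [\w-]+ \s*      # attribute name
        (?:
          = \s*
          (?:
            [^"'>\s]+       # unquoted value
            |
            ("|') .*? \3    # group 3: opening quote, closed by the same quote
          )
        )?                  # the =value part is optional
      )*                    # zero or more attributes
    )
    \s*
    (?=>)                   # lookahead: a literal > must follow
''', re.VERBOSE)

# Hypothetical helper pattern: one attribute, with the value captured
# separately for the unquoted (group 2) and quoted (group 4) cases.
ATTR = re.compile(r'''([\w-]+)\s*(?:=\s*(?:([^"'>\s]+)|("|')(.*?)\3))?''')


def parse_attributes(raw):
    """Split the raw attribute text (group 2 above) into a dict.

    Valueless attributes such as `disabled` map to None.
    """
    attrs = {}
    for m in ATTR.finditer(raw):
        name, unquoted, quoted = m.group(1), m.group(2), m.group(4)
        attrs[name] = unquoted if unquoted is not None else quoted
    return attrs


m = TAG_VERBOSE.search('<input type=text value="4 > 3" disabled>')
print(m.group(1))                    # input
print(parse_attributes(m.group(2)))
# {'type': 'text', 'value': '4 > 3', 'disabled': None}
```

Because `.` does not match newlines by default, a quoted value that spans several lines will not match unless `re.DOTALL` is added.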
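You could try '(?<=<)[^<]+(?=>)'. Or, to match the brackets along with the tag: '<[^<]+>'. –
I would try the simple '<.*?>' to match everything enclosed in angle brackets. It will fail, though, if there is an angle bracket in your non-HTML content. – Cullub
@stribizhev, your solution works. Thanks. :) –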