Regular expression for HTML tags

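I am working with a lexical analyzer. I have an HTML file, and I want to convert every letter in the file to a capital letter, except for the letters written inside HTML tags. For example: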

<html> 
    <body> 
     StackOverFlow 
    </body> 
</html>

This would be converted to the following.

<html> 
    <body> 
     STACKOVERFLOW 
    </body> 
</html>

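I just want to know the regular expression that selects everything inside an HTML tag, so that I can leave it untouched.

Consider only simple HTML tags enclosed in < and >.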

2015-05-05 Surajeet Bharati

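You can try '(?<=<)[^<]+(?=>)'. Or, to match the angle brackets together with the tag: '<[^<]+>'. –

I would try the simple '<.*?>' to match everything enclosed in angle brackets. However, this will fail if you have an angle bracket in your non-HTML text. – Cullub

@stribizhev, your solution works. Thank you. :) –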

Use either of the following regular expressions:

(?<=<)[^<]+(?=>)

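Explanation: (?<=<) is a lookbehind that checks that a < comes immediately before the match (without consuming it). [^<]+ then matches one or more characters other than < (these are consumed), and the lookahead (?=>) checks that a > follows (again without consuming it). Since only the middle part is consumed, the match is p rather than <p>.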
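As a quick check of that explanation, here is a minimal Python sketch. Python is my choice for illustration, since the answer does not name a language, and the names inner and names are hypothetical.

```python
import re

# Lookaround version from the answer: only the text between < and > is consumed.
inner = re.compile(r"(?<=<)[^<]+(?=>)")

# findall returns the consumed part of each match, without the angle brackets.
names = inner.findall("<p>Hello</p>")
print(names)  # ['p', '/p']
```

The angle brackets are checked by the lookarounds but are not part of the match.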

Or, to match the angle brackets together with the tag:

<[^<]+>

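Explanation: < matches a single literal <, then [^<]+ matches one or more characters other than <, followed by a single literal >. All of these characters are consumed, so the match looks like <p>.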
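One possible way to apply this pattern to the task in the question is sketched below in Python. This is an assumption on my part rather than anything the question specifies, and TAG and upper_outside_tags are hypothetical names. The idea is to split the input on tags and uppercase only the pieces between them.

```python
import re

# The bracket-inclusive pattern from the answer, wrapped in a capturing group
# so that re.split keeps the tags in the result list.
TAG = re.compile(r"(<[^<]+>)")

def upper_outside_tags(html: str) -> str:
    """Uppercase all text that is not part of a matched tag."""
    parts = TAG.split(html)
    # With one capturing group, split() alternates: text, tag, text, tag, ...
    # so odd indices hold tags (kept as-is) and even indices hold text.
    return "".join(p if i % 2 else p.upper() for i, p in enumerate(parts))

source = "<html>\n    <body>\n     StackOverFlow\n    </body>\n</html>"
print(upper_outside_tags(source))
```

Note that [^<]+ is greedy and may also match >, so a literal > in the text right after a tag (as in <p>a > b</p>) would be pulled into the match. For the simple tags described in the question this should not matter.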

2015-05-05 22:24:52

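Depending on your user input and where it comes from, you may need to be more tolerant. Much as I hate tag attributes without quotes, you do run into them. You will also encounter unescaped brackets inside tags, such as value="4 > 3".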

(?<=<)([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*(?=>)

or

<([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*>

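Explanation of the first regular expression (the second is similar, but actually matches the angle brackets instead of checking for them with lookarounds):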

(?<=        # Opens LB 
    <        # Literal < 
)         # Closes LB 
(        # Opens CG1 
    [\w-]+      # Character class (any of the characters within) 
            # Token: \w (a-z, A-Z, 0-9, _) 
            # Any of: - 
            # + repeats one or more times 
)         # Closes CG1 
(        # Opens CG2 
    (?:       # Opens NCG 
     \s+      # Token: \s (white space) 
     [\w-]+     # Character class (any of the characters within) 
            # Token: \w (a-z, A-Z, 0-9, _) 
            # Any of: - 
     \s*      # Token: \s (white space) 
            # * repeats zero or more times 
     (?:      # Opens NCG 
      =      # Literal = 
      \s*     # Token: \s (white space) 
      (?:     # Opens NCG 
       [^"'>\s]+   # Negated Character class (excludes the characters within) 
            # None of: "'> 
            # Token: \s (white space) 
      |      # Alternation (NCG) 
       (    # Opens CG3 
        "    # Literal " 
       |     # Alternation (CG3) 
        '    # Literal ' 
       )     # Closes CG3 
       .*?    # . denotes any single character, except for newline 
            # * repeats zero or more times 
            # ? as few times as possible 
       \3    # A backreference to CG3 
            # This is a repeat of the match, not the pattern. 
            # If this is an Octal Escape try padding with 0s like \003. 
      )      # Closes NCG 
     )?      # Closes NCG 
            # ? repeats zero or one times 
    )*       # Closes NCG 
)         # Closes CG2 
\s*        # Token: \s (white space) 
(?=        # Opens LA 
    >        # Literal > 
)         # Closes LA
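
To see what the extra tolerance buys, here is a small Python sketch (again my choice of language) that runs the second, bracket-inclusive pattern against a tag containing value="4 > 3" and compares it with the lazy '<.*?>' pattern from the comments. The names TOLERANT_TAG, LAZY_TAG and sample are hypothetical.

```python
import re

# Second pattern from the answer, copied unchanged.
TOLERANT_TAG = re.compile(
    r"""<([\w-]+)((?:\s+[\w-]+\s*(?:=\s*(?:[^"'>\s]+|("|').*?\3))?)*)\s*>"""
)
# Lazy pattern suggested in the comments, for comparison.
LAZY_TAG = re.compile(r"<.*?>")

sample = '<input value="4 > 3" disabled>'

m = TOLERANT_TAG.search(sample)
tolerant_match = m.group(0)                    # the whole tag, quoted > included
tag_name = m.group(1)                          # CG1: the tag name
lazy_match = LAZY_TAG.search(sample).group(0)  # stops at the first >

print(tolerant_match)  # <input value="4 > 3" disabled>
print(tag_name)        # input
print(lazy_match)      # <input value="4 >

# As written, the pattern only matches opening tags: the / in </p>
# is not accepted by [\w-]+, so there is no match here.
closing = TOLERANT_TAG.search("</p>")
print(closing)         # None
```

If closing tags also need to be skipped, the pattern would have to be extended, for example with an optional / after the opening <. That extension is not part of the answer above.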

2015-05-05 22:50:31
