HTML tag replacement regex not working correctly

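This is a follow-up to another question of mine. The solution I found worked great for every test case I threw at it, until the first case it couldn't handle showed up. My goal is to reformat improperly formatted tag attributes using regex (I know, probably not a fool-proof method, as I'm finding out, but bear with me).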

My functions:

Public Function ConvertMarkupAttributeQuoteType(ByVal html As String) As String 
    Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>" 
    Return Regex.Replace(html, findTags, AddressOf EvaluateTag) 
End Function 

Private Function EvaluateTag(ByVal match As Match) As String 
    Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))" 
    Return Regex.Replace(match.Value, attributes, "='$2'") 
End Function

The regex in the EvaluateTag function correctly converts HTML like

<table border=2 cellpadding='2' cellspacing="1">

to

<table border='2' cellpadding='2' cellspacing='1'>

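You'll notice I'm forcing the attribute values to be surrounded by single quotes; don't worry about that. The case where it breaks is when the last attribute value has no quotes around it.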

<table width=100 border=0>

comes out of the regex replacement as

<table width='100' border='0>'

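with the last single quote incorrectly placed outside the tag. I've admitted before that I'm not good at regex at all; I just haven't taken the time to understand everything it can do. So I'm asking for help adjusting the EvaluateTag regex so that it can handle this final case.

Thanks!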

Source

2009-09-14 Cᴏʀʏ

I think you'd be better off using an HTML cleanup tool (such as da8, as suggested in your other question) or parsing the HTML with a forgiving DOM and re-exporting it. – TrueWill 2009-09-14 17:26:01

[Can you provide some examples of why it is hard to parse XML and HTML with a regex?](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege) – 2011-07-09 20:54:05

[RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – 2011-09-15 14:15:52

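richardtallent's explanation of why the regex wasn't working pointed me in the right direction. After playing around with it for a while, the following replacement for the EvaluateTag function seems to be working.

Can anyone see any problems with it? The change I made is in the last group, after the pipe: the unquoted-value alternative is now [^>\s]+ instead of \S+, so it stops at the closing > as well as at whitespace. Maybe it could be simplified further?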

Private Function EvaluateTag(ByVal match As Match) As String 
    Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))" 
    Return Regex.Replace(match.Value, attributes, "='$2'") 
End Function
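
The revised pattern can be checked against both tags from the question with a small console sketch. The module name EvaluateTagCheck and the Main wrapper are assumptions added for illustration; they are not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EvaluateTagCheck
    ' Revised pattern: the unquoted alternative stops at ">" as well as at whitespace.
    Private Const Attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"

    Sub Main()
        Dim samples As String() = {
            "<table border=2 cellpadding='2' cellspacing=""1"">",
            "<table width=100 border=0>"
        }

        For Each tag As String In samples
            ' $2 refers to g1: .NET numbers named groups after the unnamed ones,
            ' so the quote group is 1 and g1 is 2.
            Console.WriteLine(Regex.Replace(tag, Attributes, "='$2'"))
        Next
    End Sub
End Module
```

Both lines should come out with every value single-quoted and the closing > left outside the last quote, i.e. `<table border='2' cellpadding='2' cellspacing='1'>` and `<table width='100' border='0'>`.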

If nobody replies, I'll probably accept this as the answer. Thanks again!

Source

2009-09-15 01:27:15

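The first regex function passes EvaluateTag the entire match, which is the entire HTML tag.

But EvaluateTag doesn't ignore the final greater-than character...

I'm afraid I haven't had enough caffeine yet to work through the whole expression, but this tweak might work (it adds a greater-than to the character list):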
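
One way to see this is to run the attribute pattern from the question against the failing tag and print what the g1 group captures for each match. This is only a diagnostic sketch; the module name AttributeMatchDemo and the Main wrapper are assumptions for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AttributeMatchDemo
    Sub Main()
        ' Attribute pattern exactly as used in the question's EvaluateTag.
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
        Dim tag As String = "<table width=100 border=0>"

        For Each m As Match In Regex.Matches(tag, attributes)
            ' g1 holds the attribute value from whichever alternative matched.
            Console.WriteLine("[" & m.Groups("g1").Value & "]")
        Next
    End Sub
End Module
```

This prints `[100]` and then `[0>]`: for the last, unquoted value the \S+ alternative runs up to the next whitespace, and since there is none before the end of the tag, it takes the closing > along with it.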

Private Function EvaluateTag(ByVal match As Match) As String 
    Dim attributes As String = "\s*=\s*(?:(['"">])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))" 
    Return Regex.Replace(match.Value, attributes, "='$2'") 
End Function

Source

2009-09-14 16:00:06 richardtallent

This didn't work very well. In fact, it had no effect compared with the original regex. – 2009-09-15 01:23:32
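
A likely reason the tweak changes nothing: the character class it extends only decides which character may open a quoted value. In `border=0>` the value starts with `0`, so the quoted alternative still fails and \S+ still swallows the `>`. The comparison below is a sketch under that reading; the module and variable names are assumptions for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuoteClassTweakCheck
    Sub Main()
        ' Pattern from the question and the tweaked pattern from the answer.
        Dim original As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
        Dim tweaked As String = "\s*=\s*(?:(['"">])(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
        Dim tag As String = "<table width=100 border=0>"

        Dim fromOriginal As String = Regex.Replace(tag, original, "='$2'")
        Dim fromTweaked As String = Regex.Replace(tag, tweaked, "='$2'")

        Console.WriteLine(fromOriginal)
        Console.WriteLine(fromTweaked)
        ' Compare the two results for this input.
        Console.WriteLine(fromOriginal = fromTweaked)
    End Sub
End Module
```

For this input both patterns give `<table width='100' border='0>'`, so the final line prints `True`.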
