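Strip HTML from a string
I've tried a few things, but nothing seems to work properly. I have an Access database and am writing code in VBA. I have a string of HTML source code, and I want to strip out all the HTML code and tags so that I'm left with just the plain text, with no HTML or tags. What is the best way to do this?
Thanks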
' Note: the Range parameter is Excel-specific. From Access, declare the
' argument As String instead and assign it to sInput directly.
Function StripHTML(cell As Range) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")
    Dim sInput As String
    Dim sOut As String
    sInput = cell.Text
    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" 'Regular expression for HTML tags.
    End With
    ' Replace every tag match with an empty string.
    sOut = RegEx.Replace(sInput, "")
    StripHTML = sOut
    Set RegEx = Nothing
End Function
This may help you. Good luck.
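It depends on how complex the HTML structure is and how much data you want to get out of it.
For simple markup you might get away with regular expressions, but for complex markup, trying to parse data from HTML with regex is like trying to eat soup with a fork.
You can use the htmlfile object to turn flat HTML text into an object you can interact with, for example: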
Function ParseATable(url As String) As Variant
    Dim htm As Object, table As Object
    Dim data() As String, x As Long, y As Long
    Set htm = CreateObject("HTMLfile")
    ' Download the page synchronously and load it into the HTML document.
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        htm.body.innerHTML = .responseText
    End With
    With htm
        ' Take the first table on the page.
        Set table = .getElementsByTagName("table")(0)
        ' Assumes no row has more than 10 cells; raise this bound for wider tables.
        ReDim data(1 To table.Rows.Length, 1 To 10)
        For x = 0 To table.Rows.Length - 1
            For y = 0 To table.Rows(x).Cells.Length - 1
                data(x + 1, y + 1) = table.Rows(x).Cells(y).innerText
            Next y
        Next x
        ParseATable = data
    End With
End Function
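Since the question is about an HTML string that is already in memory, the same htmlfile object can be used without the HTTP request. The following is only a minimal sketch using late binding; the name HtmlToText is hypothetical and not from the answer above. It loads the string into the document body the same way ParseATable does and reads back the body's innerText.

```vba
' Hypothetical helper: returns the text content of an HTML string.
Public Function HtmlToText(ByVal html As String) As String
    Dim doc As Object
    ' Late-bound MSHTML document, as in ParseATable above.
    Set doc = CreateObject("htmlfile")
    ' Let the HTML parser build the DOM from the string.
    doc.body.innerHTML = html
    ' innerText returns the rendered text without tags.
    HtmlToText = doc.body.innerText
    Set doc = Nothing
End Function
```

It could be called from the Immediate window with something like `Debug.Print HtmlToText("<p>foo <i>bar</i></p>")`. Because a real parser handles the markup, nested or malformed tags are less of a concern than with a regex.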
+1 for this line alone: *parsing data from HTML with regex is like trying to eat soup with a fork* – brettdj
Using early binding (this requires a reference to the Microsoft HTML Object Library):
Public Function GetText(inputHtml As String) As String
    With New HTMLDocument
        .Open
        ' Example input: "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
        .write inputHtml
        .Close
        ' Assign to the function's own name so the result is returned.
        GetText = .body.outerText
    End With
End Function
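An improved version of the regex function above. It finds quotes and line breaks and replaces them with non-HTML equivalents. In addition, the original function had a problem with embedded UNC references (i.e. <\\server\share\folder\file.ext>): because of the < at the start and the > at the end, it removed the entire UNC string. This function fixes that so the UNC path is kept in the string correctly: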
Function StripHTML(strString As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")
    Dim sInput As String
    Dim sOut As String
    ' Temporarily drop the "<" in front of UNC paths so the regex does not treat them as tags.
    sInput = Replace(strString, "<\\", "\\")
    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" 'Regular expression for HTML tags.
    End With
    sOut = RegEx.Replace(sInput, "")
    ' Replace &nbsp; with a line break and &quot; with a single quote, then restore the "<" before UNC paths.
    StripHTML = Replace(Replace(Replace(sOut, "&nbsp;", vbCrLf, 1, -1), "&quot;", "'", 1, -1), "\\", "<\\", 1, -1)
    Set RegEx = Nothing
End Function
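I found a very simple solution. Because of system restrictions and shared drive permissions, I currently run an Access database and use Excel forms to update the system. When I pull the data from Access, I use: PlainText(YourStringHere). This removes all the HTML parts and keeps only the text.
Hope this works.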
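As a rough illustration of the PlainText approach, here is a small sketch meant to run inside Access VBA, where PlainText is a method of the Access Application object. The procedure name DemoPlainText and the sample string are made up for the example.

```vba
' Hypothetical demo: strips rich-text/HTML formatting using Access's built-in PlainText.
Public Sub DemoPlainText()
    Dim html As String
    html = "<div>Hello <b>world</b></div>"
    ' Prints the text with the formatting tags removed.
    Debug.Print Application.PlainText(html)
End Sub
```

PlainText is intended for the HTML subset that Access rich-text fields produce, so results on arbitrary web page markup may differ from what a full HTML parser would return.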
Thanks, this worked for me! – porlicus