Strip HTML from a string

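I have tried a few things, but nothing seems to work properly. I have an Access database and am writing code in VBA. I have a string of HTML source, and I want to strip out all of the HTML code and tags so that I am left with only a plain-text string, with no HTML or tags. What is the best way to do this?

Thanks

Source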

2012-10-09 Ann Sanderson

Function StripHTML(cell As Range) As String
    ' cell is an Excel Range; in Access, pass the HTML in as a String instead.
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    sInput = cell.Text

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" ' Regular expression matching HTML tags.
    End With

    ' Replace every tag with an empty string.
    sOut = RegEx.Replace(sInput, "")
    StripHTML = sOut
    Set RegEx = Nothing
End Function

This may help you. Good luck.

Source

2012-10-09 16:10:28 Lior

Thanks, this worked for me! – porlicus

One way that is as resilient as possible to bad markup:

with createobject("htmlfile") 
    .open 
    .write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>" 
    .close 
    msgbox "text=" & .body.outerText 
end with
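
The snippet above hard-codes its markup. As a minimal sketch, the same approach can be wrapped in a function that takes the HTML as a parameter, which is closer to what the question asks for. The name HtmlToText is hypothetical, and behaviour for empty or unusual input has not been verified here.

```vba
' Hypothetical helper: returns the text content of an HTML string
' by letting the late-bound "htmlfile" object parse the markup.
Public Function HtmlToText(ByVal html As String) As String
    With CreateObject("htmlfile")
        .Open
        .write html
        .Close
        ' outerText is the same property the snippet above reads.
        HtmlToText = .body.outerText
    End With
End Function
```

It could then be called as, for example, `Debug.Print HtmlToText("<p>foo <i>bar</i></p>")`.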

Source

2012-10-09 16:14:18

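+1 Nice idea, you would just need to remove the special characters then – SWa

It should also translate entities: &amp; -> &

Oh yes, could have sworn it didn't ;) – SWa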

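It depends on how complex the HTML structure is and how much data you want to get out of it.

Depending on that complexity you might get away with a regular expression, but for complex markup, trying to parse data out of HTML with a regex is like trying to eat soup with a fork.

You can use the htmlfile object to turn a flat file into an object you can interact with, for example: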

Function ParseATable(url As String) As Variant

    Dim htm As Object, table As Object
    Dim data() As String, x As Long, y As Long
    Set htm = CreateObject("HTMLfile")

    ' Fetch the page synchronously and load it into the HTML document.
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False
        .send
        htm.body.innerhtml = .responsetext
    End With

    With htm
        ' Read the first table on the page into a 1-based array.
        ' Note: the array is sized for at most 10 columns.
        Set table = .getelementsbytagname("table")(0)
        ReDim data(1 To table.Rows.Length, 1 To 10)
        For x = 0 To table.Rows.Length - 1
            For y = 0 To table.Rows(x).Cells.Length - 1
                data(x + 1, y + 1) = table.Rows(x).Cells(y).InnerText
            Next y
        Next x

        ParseATable = data

    End With
End Function

Source

2012-10-09 16:14:40 SWa

+1 for this line alone: *trying to parse data out of HTML with a regex is like trying to eat soup with a fork* – brettdj

Using early binding:

' Requires a reference to the Microsoft HTML Object Library (MSHTML).
Public Function GetText(inputHtml As String) As String
    With New HTMLDocument
        .Open
        .write inputHtml
        .Close
        GetText = .body.outerText
    End With
End Function

Source

2012-10-10 08:50:45

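An improved version of the one above. It finds quotes and line breaks and replaces them with their non-HTML equivalents. Also, the original function had a problem with embedded UNC references (i.e. <\\server\share\folder\file.ext>): because of the < at the start and the > at the end, the entire UNC string was removed. This function fixes that so the UNC is inserted into the string correctly: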

Function StripHTML(strString As String) As String
    Dim RegEx As Object
    Set RegEx = CreateObject("vbscript.regexp")

    Dim sInput As String
    Dim sOut As String
    ' Drop the "<" in front of UNC paths so the regex does not treat them as tags.
    sInput = Replace(strString, "<\\", "\\")

    With RegEx
        .Global = True
        .IgnoreCase = True
        .MultiLine = True
        .Pattern = "<[^>]+>" ' Regular expression matching HTML tags.
    End With

    sOut = RegEx.Replace(sInput, "")
    ' Turn &nbsp; into a line break and &quot; into a single quote, then restore the "<" before UNC paths.
    StripHTML = Replace(Replace(Replace(sOut, "&nbsp;", vbCrLf, 1, -1), "&quot;", "'", 1, -1), "\\", "<\\", 1, -1)
    Set RegEx = Nothing
End Function

Source

2015-05-27 03:03:01

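I found a very simple solution to this. Because of system restrictions and shared-drive permissions, I currently run an Access database and use Excel forms to update the system. When I pull data from Access, I use PlainText(YourStringHere). This removes all of the HTML parts and keeps only the text.

Hope this works.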
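
As a rough illustration of the call described above, here is a minimal sketch. It assumes the code runs inside Access VBA, where PlainText is available as a method of the Access Application object; the procedure name DemoPlainText and the sample markup are made up for the example.

```vba
' Sketch only: runs inside Access, where PlainText is available.
Public Sub DemoPlainText()
    Dim html As String
    html = "<div><b>Hello</b> world</div>"   ' sample markup, not from the original post

    ' PlainText strips the rich-text/HTML formatting and returns the text.
    Debug.Print PlainText(html)
End Sub
```

The same function can also be used in an Access query expression on a field, which fits the answer's case of pulling the data from Access.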

Source

2016-06-06 05:37:37
