2010-07-03

Find a string and replace it more efficiently

Situation: I have an HTML file and need to remove certain sections from it.

For example, the file contains this HTML: <div style="padding:10px;">First Name:</div><div style="padding:10px; background-color: gray">random information here</div><div style="padding:10px;">First Name:</div><div style="padding:10px; background-color: gray">random information here</div>

I need to remove all text that starts with '<div style="padding:10px; background-color: gray">' and ends with '</div>', so that the result is:

<div style="padding:10px;">First Name:</div><div style="padding:10px;">First Name:</div> 

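I wrote two functions that do this, but they are not efficient. I have a 40 MB file, and the program takes about 2 hours to finish. Is there a more efficient way to do this? Is there a way to use regular expressions?

See my code below: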

Public Shared Function String_RemoveText(ByVal startAt As String, ByVal endAt As String, ByVal SourceString As String) As String 
    Dim TotalCount As Integer = String_CountCharacters(SourceString, startAt) 
    Dim CurrentCount As Integer = 0 

RemoveNextString: 

    Dim LeftRemoved As String = Mid(SourceString, InStr(SourceString, startAt) + 1, Len(SourceString) - Len(endAt)) 
    Dim RemoveCore As String = Left(LeftRemoved, InStr(LeftRemoved, endAt) - 1) 
    Dim RemoveString As String = startAt & RemoveCore & endAt 


    Do 
     ' Application.DoEvents() 
     SourceString = Replace(SourceString, RemoveString, "") 
     If InStr(SourceString, startAt) < 1 Then Exit Do 
     GoTo RemoveNextString 
    Loop 

    Return Replace(SourceString, RemoveString, "") 

End Function 

Public Shared Sub Files_ReplaceText(ByVal DirectoryPath As String, ByVal SourceFile As String, ByVal DestinationFile As String, ByVal sFind As String, ByVal sReplace As String, ByVal TrimContents As Boolean, ByVal RemoveCharacters As Boolean, ByVal rStart As String, ByVal rEnd As String) 

    'CREATE NEW FILENAME 
    Dim DateFileName As String = Date.Now.ToString.Replace(":", "_") 
    DateFileName = DateFileName.Replace(" ", "_") 
    DateFileName = DateFileName.Replace("/", "_") 
    Dim FileExtension As String = ".txt" 
    Dim NewFileName As String = DirectoryPath & DateFileName & FileExtension 
    'CHECK IF FILENAME ALREADY EXISTS 
    Dim counter As Integer = 0 
    If IO.File.Exists(NewFileName) = True Then 
     'CREATE NEW FILE NAME 
     Do 
      'Application.DoEvents() 
      counter = counter + 1 
      If IO.File.Exists(DirectoryPath & DateFileName & "_" & counter & FileExtension) = False Then 
       NewFileName = DirectoryPath & DateFileName & "_" & counter & FileExtension 
       Exit Do 
      End If 
     Loop 
    End If 
    'END NEW FILENAME 

    'READ SOURCE FILE 
    Dim sr As New StreamReader(DirectoryPath & SourceFile) 
    Dim content As String = sr.ReadToEnd() 
    sr.Close() 

    'WRITE NEW FILE 
    Dim sw As New StreamWriter(NewFileName) 

    'REPLACE VALUES 
    content = content.Replace(sFind, sReplace) 

    'REMOVE STRINGS 
    If RemoveCharacters = True Then content = String_RemoveText(rStart, rEnd, content) 


    'TRIM 
    If TrimContents = True Then content = Regex.Replace(content, "[\t]", "") 

    'WRITE FILE 
    sw.Write(content) 

    'CLOSE FILE 
    sw.Close() 
End Sub 

Example call (this also removes Chr(13) & Chr(10)): Files_ReplaceText(tPath.Text, tSource.Text, "", Chr(13) & Chr(10), "", True, True, tStart.Text, tEnd.Text)

Answers

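Don't use regular expressions to parse HTML; it is not a regular language. For some compelling demonstrations, see here.

Use the HTML Agility Pack to parse the HTML and replace the data.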
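
As a rough illustration of that suggestion, here is a minimal sketch of removing the gray divs with the HTML Agility Pack. It assumes the HtmlAgilityPack assembly is referenced in the project; the module name HtmlCleanup, the function name RemoveGrayDivs, and the XPath test on the style attribute are hypothetical choices made for this example, not something taken from the question.

```vbnet
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module HtmlCleanup

    ' Hypothetical helper: removes every <div> whose style attribute
    ' contains "background-color: gray" and returns the remaining HTML.
    Function RemoveGrayDivs(ByVal html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath.
        Dim matches As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[contains(@style, 'background-color: gray')]")

        If matches IsNot Nothing Then
            ' Copy the matches to a list so that removing nodes
            ' does not disturb the collection being enumerated.
            For Each node As HtmlNode In matches.ToList()
                node.Remove()
            Next
        End If

        Return doc.DocumentNode.OuterHtml
    End Function

    Sub Main()
        ' Sample input taken from the question.
        Dim sample As String =
            "<div style=""padding:10px;"">First Name:</div>" &
            "<div style=""padding:10px; background-color: gray"">random information here</div>" &
            "<div style=""padding:10px;"">First Name:</div>" &
            "<div style=""padding:10px; background-color: gray"">random information here</div>"

        Console.WriteLine(RemoveGrayDivs(sample))
    End Sub

End Module
```

Because the document is parsed once and the matching nodes are removed from the tree, the file is not rescanned for every match, and a gray div that contains nested divs is removed as a whole rather than cut off at the first '</div>'. For a file on disk, doc.Load(path) and doc.Save(path) can be used in place of LoadHtml and OuterHtml.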

+1 for not using RegEx to parse HTML. Better solutions already exist. I have never tried the HTML Agility Pack, so I can't speak to that. – 2010-07-03 08:48:44

+1 for the compelling demonstration. Thanks :) – sarnold 2010-07-03 09:06:59