2012-09-16 76 views

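Find and remove duplicates from a text file using VB.NET

I have a huge text file that contains a large number of duplicates. The duplicates look like this: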

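Total Posts 16

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Box in London
DATE = 12-09-2012
Tracking Key # 85265E712050-15207427406854753

Total Posts 16

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Box in London
DATE = 12-09-2012
Tracking Key # 85265E712050-15207427406854753

Total Posts 2894

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Box in London
DATE = 15-09-2012
Tracking Key # 85265E712050-152797637654753

Total Posts 2894

Pin Code = GFDHG
TITLE = Shop Signs/Projecting Signs/Industrial Signs/Restaurant Signs/Menu Boards & Box in London
DATE = 15-09-2012
Tracking Key # 85265E712050-152797637654753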

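And so on; the text file contains up to 4000 posts. I want my program to match Total Posts 6 against every Total Posts entry in the file, find the duplicates, and then programmatically remove each duplicate together with the 7 lines that follow it. Thanks.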

So where exactly is your problem? – sloth

I want my program to match Total Posts 6 to the next Total Posts 6 and delete the second one along with its 5 lines –

Answer

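Assuming the format is consistent (i.e. every record in the file takes up exactly 6 lines of text), then all you need to do to remove the duplicates from the file is this: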

Sub DupClean(ByVal fpath As String) 'fpath is the FULL file path, i.e. C:\Users\username\Documents\filename.txt
    Const LinesPerRecord As Integer = 6 'assumes every record is exactly 6 lines; adjust if blank lines count as part of a record
    Dim TxtLines As String() = Nothing
    Dim Seen As New HashSet(Of String) 'records already kept, for fast duplicate checks
    Dim CleanRecords As New List(Of String) 'unique records in their original order
    Dim i As Integer = 0
    'Writes to <name>_clean.txt next to the input; to overwrite the original file, use fpath here instead
    Dim OutPath As String = System.IO.Path.ChangeExtension(fpath, Nothing) & "_clean.txt"

    Try
        'Read in the text; ReadAllLines splits on CR, LF and CRLF, so no stray Chr(13) stays on the lines
        TxtLines = System.IO.File.ReadAllLines(fpath, System.Text.Encoding.UTF8)
    Catch ex As Exception
        MsgBox("Encountered an error while reading in the text file contents. Details: " & ex.Message, vbOKOnly, "Read Error")
        Return 'Return instead of End, so only this routine stops and not the whole application
    End Try

    'Now we iterate through complete blocks of LinesPerRecord lines
    Do While i + LinesPerRecord <= TxtLines.Length
        'Set CText to the next block of lines
        Dim CText As String = String.Join(Environment.NewLine, TxtLines, i, LinesPerRecord)

        'Seen.Add returns False when CText is already present, so duplicates are skipped
        If Seen.Add(CText) Then
            CleanRecords.Add(CText)
        End If

        i += LinesPerRecord
    Loop

    'Keep any trailing lines that do not form a complete block rather than dropping them
    If i < TxtLines.Length Then
        CleanRecords.Add(String.Join(Environment.NewLine, TxtLines, i, TxtLines.Length - i))
    End If

    Try
        'Write out the clean text; WriteAllText closes the file, so the output is flushed to disk
        System.IO.File.WriteAllText(OutPath, String.Join(Environment.NewLine, CleanRecords.ToArray()))
    Catch ex As Exception
        MsgBox("Encountered an error writing the cleaned text. Details: " & ex.Message, vbOKOnly, "Write Error")
    End Try
End Sub

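If the format is not consistent, you will need to get fancier and define rules that tell the loop which lines to add to CText on any given pass, but without more context I can't give you any idea of what those rules might be.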
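As one possible set of such rules, here is a minimal sketch. It assumes that every record begins with a line starting with "Total Posts" (taken from the sample in the question), that blank lines carry no information, and that two records count as duplicates only when all of their non-blank lines match exactly. The names RecordDedup, RemoveDuplicateRecords, FlushRecord and the marker parameter are hypothetical and used only for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module RecordDedup
    'Hypothetical helper: groups lines into records that start at each marker line,
    'then keeps only the first occurrence of every record.
    Function RemoveDuplicateRecords(ByVal lines As IEnumerable(Of String), ByVal marker As String) As List(Of String)
        Dim result As New List(Of String)
        Dim seen As New HashSet(Of String)
        Dim current As New List(Of String)

        For Each rawLine As String In lines
            Dim line As String = rawLine.Trim()
            If line.Length = 0 Then Continue For 'assumption: blank lines are only separators

            'A marker line closes the record collected so far and opens a new one
            If line.StartsWith(marker, StringComparison.OrdinalIgnoreCase) AndAlso current.Count > 0 Then
                FlushRecord(current, seen, result)
            End If
            current.Add(line)
        Next

        FlushRecord(current, seen, result) 'flush the last record in the file
        Return result
    End Function

    'Appends the current record to result unless an identical record was already kept
    Private Sub FlushRecord(ByVal current As List(Of String), ByVal seen As HashSet(Of String), ByVal result As List(Of String))
        If current.Count = 0 Then Return
        Dim key As String = String.Join(vbLf, current.ToArray())
        If seen.Add(key) Then
            result.AddRange(current)
            result.Add("") 'blank separator line after each kept record
        End If
        current.Clear()
    End Sub
End Module
```

It could be called as RemoveDuplicateRecords(System.IO.File.ReadAllLines(fpath), "Total Posts"), with the returned list written back out via System.IO.File.WriteAllLines. Because records are delimited by the marker line rather than by a fixed line count, a record with a missing or extra line does not shift every record after it.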