Changing the delimiter in a large CSV file with PowerShell

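I need a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 MB to several GB), using Import-CSV and/or Get-Content is not an option. What I am using (and what works, albeit slowly) is the following code: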

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source 
$reader.SetDelimiters(",") 

While(!$reader.EndOfData) 
{ 
    $line = $reader.ReadFields() 
    $details = [ordered]@{ 
          "Plugin ID" = $line[0] 
          CVE = $line[1] 
          CVSS = $line[2] 
          Risk = $line[3]  
         }       
    $export = New-Object PSObject -Property $details 
    $export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"  
}

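This small loop took nearly 2 minutes to process a 20 MB file. Scaling at that rate would mean more than an hour for the smallest CSV file I am currently working with.

I have also tried this: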

While(!$reader.EndOfData) 
{ 
    $line = $reader.ReadFields() 

    $details = [ordered]@{ 
          # Same data as before 
         } 

    $export.Add($details) | Out-Null   
} 

$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"

This is faster, but it does not put the correct information into the new CSV. Instead, I get row after row of this:

"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized" 
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False" 
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"

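So, two questions:

1) Can the first code block be made faster?
2) How can I unwrap the ArrayList in the second example to get at the actual data?

Sample data can be found here: http://pastebin.com/6L98jGNg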
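On the second question: the Count/IsReadOnly/Keys rows shown above are what Export-Csv emits when it is handed dictionaries, because it serializes the dictionary object's own properties rather than its entries. A minimal sketch of one way around that, assuming the same four columns as the first example, is to cast each ordered hashtable to [pscustomobject] before adding it to the list. The sample rows here are made-up stand-ins for $reader.ReadFields(), and ConvertTo-Csv is used only so the result can be inspected without writing a file.

```powershell
# Hypothetical rows standing in for what $reader.ReadFields() would return.
$rows = @(
    @('id-1', 'cve-a', '1.0', 'Low'),
    @('id-2', 'cve-b', '2.0', 'High')
)

$export = [System.Collections.Generic.List[object]]::new()

foreach ($line in $rows) {
    $details = [ordered]@{
        'Plugin ID' = $line[0]
        CVE         = $line[1]
        CVSS        = $line[2]
        Risk        = $line[3]
    }
    # The cast turns the dictionary into an object whose properties are the keys,
    # which is the shape Export-Csv / ConvertTo-Csv expect.
    $export.Add([pscustomobject]$details)
}

$csv = $export | ConvertTo-Csv -Delimiter '|' -NoTypeInformation
$csv
```

This only fixes the shape of the output. Collecting every row of a multi-gigabyte file in memory before exporting is unlikely to be practical, so it does not address the speed question.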

Source

2016-09-16 Tchotchke

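Does the CSV file contain commas within the data? If not, reading the file line by line and replacing the commas with pipes would probably be much faster.

Does *removed data to keep the post small* mean that you are processing the CSV as well as using the pipe?

@AndrewMorton, yes. Commas and newlines. I've added a few more lines to show what is going on. I'm not piping anything, just adding the data from the CSV to the $details variable. – Tchotchke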

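This is simple text processing, so the bottleneck should be disk read speed: 1 second per 100 MB, or 10 seconds per 1 GB, for the OP's sample (repeated to reach those sizes), as measured on an i7. The results would be worse for files with many or all small quoted fields.

The algorithm is simple: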

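1) Read the file in large string chunks, e.g. 1 MB. This is much faster than reading millions of lines separated by CR/LF because:
- fewer checks are performed, since we mostly look only for double quotes;
- fewer iterations of our code are executed by the interpreter.
2) Find the next double quote.
3) Depending on the current $inQuotedField flag, decide whether the double quote just found starts a quoted field (it should be preceded by , plus, optionally, some spaces) or ends the current quoted field (it should be followed by any even number of double quotes, optionally some spaces, then ,).
4) Replace the delimiters in the preceding span, or up to the end of the 1 MB chunk if no quote was found.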
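To make the quote-tracking idea concrete, here is a deliberately simplified sketch that works on a single in-memory string one character at a time. It is not the chunked, IndexOf-based approach described in the steps above and would be far slower on large files. The function name Convert-DelimiterInText is hypothetical, and the sketch assumes double quotes appear only as field quoting, with embedded quotes doubled. Under that assumption, toggling the flag on every double quote is enough, because a doubled quote toggles it twice and leaves the state unchanged.

```powershell
function Convert-DelimiterInText {
    param(
        [string]$Text,
        [char]$Delimiter = ',',
        [char]$NewDelimiter = '|'
    )

    $inQuotedField = $false
    $out = [Text.StringBuilder]::new($Text.Length)

    foreach ($ch in $Text.ToCharArray()) {
        if ($ch -eq '"') {
            # Every double quote flips the state; "" inside a field flips it twice.
            $inQuotedField = -not $inQuotedField
        } elseif ($ch -eq $Delimiter -and -not $inQuotedField) {
            # Only delimiters outside quoted fields are replaced.
            $ch = $NewDelimiter
        }
        $out.Append($ch) > $null
    }

    $out.ToString()
}

Convert-DelimiterInText 'a,"b,c","say ""hi, there""",d'
# → a|"b,c"|"say ""hi, there"""|d
```

The sketch does not check whether the new delimiter already occurs in unquoted data, so such a file would still need extra handling.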

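The code below makes some reasonable assumptions, but it may fail to detect an escaped field if its double quote is separated from the field delimiter by more than 3 spaces. The checks would not be too hard to add, and I may have missed some other edge cases, but I'm not that interested.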

$sourcePath = 'c:\path\file.csv' 
$targetPath = 'd:\path\file2.csv' 
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM 

$delim = [char]',' 
$newDelim = [char]'|' 

$buf = [char[]]::new(1MB) 
$sourceBase = [IO.FileStream]::new(
    $sourcePath, 
    [IO.FileMode]::open, 
    [IO.FileAccess]::read, 
    [IO.FileShare]::read, 
    $buf.length, # let OS prefetch the next chunk in background 
    [IO.FileOptions]::SequentialScan) 
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding 
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length) 

$bufStart = 0 
$bufPadding = 4 
$inQuotedField = $false 
$fieldBreak = [char[]]@($delim, "`r", "`n") 
$out = [Text.StringBuilder]::new($buf.length) 

while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) { 
    $s = [string]::new($buf, 0, $nRead+$bufStart) 
    $len = $s.length 
    $pos = 0 
    $out.Clear() >$null 

    do { 
     $iQuote = $s.IndexOf([char]'"', $pos) 
     if ($inQuotedField) { 
      $iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) } 
      if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) { 
       # no closing quote in buffer safezone 
       $out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null 
       break 
      } 
      if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") { 
       # even number of quotes are just quoted quotes 
       $inQuotedField = $matches[1].length % 2 -eq 0 
      } 
      $out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null 
      $pos = $iDelim + 1 
      continue 
     } 
     if ($iQuote -ge 0) { 
      $iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote) 
      if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) { 
       $inQuotedField = $true 
      } 
      $replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim) 
     } elseif ($pos -gt 0) { 
      $replaced = $s.Substring($pos).Replace($delim, $newDelim) 
     } else { 
      $replaced = $s.Replace($delim, $newDelim) 
     } 
     $out.Append($replaced) >$null 
     $pos = $iQuote + 1 
    } while ($iQuote -ge 0) 

    $target.Write($out) 

    $bufStart = 0 
    for ($i = $out.length; $i -lt $s.length; $i++) { 
     $buf[$bufStart++] = $buf[$i] 
    } 
} 
if ($bufStart) { $target.Write($buf, 0, $bufStart) } 
$source.Close() 
$target.Close()

Source

2016-09-17 17:52:29 wOxxOm

Thanks for the example. I was able to use it, with slight modifications, and get through the largest file in a few seconds. :D – Tchotchke

Still not what I would call fast, but this is considerably faster than what you have listed, by using the -join operator:


Add-Type -AssemblyName Microsoft.VisualBasic # makes the TextFieldParser type available
$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source 
$reader.SetDelimiters(",") 

While(!$reader.EndOfData){ 
    $line = $reader.ReadFields() 
    $line -join '|' | Add-Content C:\Temp\TestOutput.csv 
}

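This took 32 seconds to process a 20 MB file. At that rate, your 750 MB file would be done in about 20 minutes, and bigger files should take about 26 minutes per gigabyte.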
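Two things in the loop above may be worth adjusting. Add-Content is invoked once per record, so the output file is reopened on every iteration. Also, ReadFields() returns the values without their quotes, so a field that itself contains a pipe, a double quote, or a newline would no longer be distinguishable after a plain -join. The following is a minimal sketch, not a measured improvement, that keeps TextFieldParser but writes through a single StreamWriter and re-quotes only the fields that need it. The function name Convert-CsvDelimiter and its parameter names are hypothetical. Pass full paths, since .NET resolves relative paths against the process working directory rather than the current PowerShell location.

```powershell
Add-Type -AssemblyName Microsoft.VisualBasic

function Convert-CsvDelimiter {
    param(
        [Parameter(Mandatory)][string]$Path,         # full path of the source CSV
        [Parameter(Mandatory)][string]$Destination,  # full path of the file to create
        [string]$Delimiter = ',',
        [string]$NewDelimiter = '|'
    )

    # Characters that force a field to be quoted in the output.
    $special = [char[]]@($NewDelimiter[0], '"', "`r", "`n")

    $reader = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new($Path)
    $writer = [IO.StreamWriter]::new($Destination, $false, [Text.UTF8Encoding]::new($false))
    try {
        $reader.SetDelimiters($Delimiter)
        $reader.HasFieldsEnclosedInQuotes = $true
        $reader.TrimWhiteSpace = $false   # keep field contents as they are

        while (-not $reader.EndOfData) {
            $fields = foreach ($field in $reader.ReadFields()) {
                if ($field.IndexOfAny($special) -ge 0) {
                    # Re-quote, doubling any embedded double quotes.
                    '"' + $field.Replace('"', '""') + '"'
                } else {
                    $field
                }
            }
            $writer.WriteLine($fields -join $NewDelimiter)
        }
    } finally {
        $reader.Close()
        $writer.Close()
    }
}
```

A call would look like Convert-CsvDelimiter -Path 'C:\Temp\in.csv' -Destination 'C:\Temp\out.csv', where both paths are placeholders. TextFieldParser skips blank lines and throws on malformed quoting, so the output is not guaranteed to match the input line for line.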

Source

2016-09-16 21:04:56 TheMadTechnician
