2017-04-07

I am trying to load a 160 GB CSV file into SQL Server using a PowerShell script I got from GitHub, and I get this error:

Exception calling "Add" with "1" argument(s): "Input array is longer than the number of columns in this table." 
At C:\b.ps1:54 char:26 
+ [void]$datatable.Rows.Add <<<< ($line.Split($delimiter)) 
    + CategoryInfo   : NotSpecified: (:) [], MethodInvocationException 
    + FullyQualifiedErrorId : DotNetMethodException 

I checked the same code with a small 3-line CSV: all the columns match, the first row has the headers, and there are no extra delimiters. I don't know why I'm getting this error.

The code is below:

<# 8-faster-runspaces.ps1 #> 
# Set CSV attributes 
$csv = "M:\d\s.txt" 
$delimiter = "`t" 

# Set connstring 
$connstring = "Data Source=.;Integrated Security=true;Initial Catalog=PresentationOptimized;PACKET SIZE=32767;" 

# Set batchsize to 2000 
$batchsize = 2000 

# Create the datatable 
$datatable = New-Object System.Data.DataTable 

# Add generic columns 
$columns = (Get-Content $csv -First 1).Split($delimiter) 
foreach ($column in $columns) { 
[void]$datatable.Columns.Add() 
} 

# Setup runspace pool and the scriptblock that runs inside each runspace 
$pool = [RunspaceFactory]::CreateRunspacePool(1,5) 
$pool.ApartmentState = "MTA" 
$pool.Open() 
$runspaces = @() 

# Setup scriptblock. This is the workhorse. Think of it as a function. 
$scriptblock = { 
    Param (
        [string]$connstring, 
        [object]$dtbatch, 
        [int]$batchsize 
    ) 

    $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connstring, "TableLock") 
    $bulkcopy.DestinationTableName = "abc" 
    $bulkcopy.BatchSize = $batchsize 
    $bulkcopy.WriteToServer($dtbatch) 
    $bulkcopy.Close() 
    $dtbatch.Clear() 
    $bulkcopy.Dispose() 
    $dtbatch.Dispose() 
} 

# Start timer 
$time = [System.Diagnostics.Stopwatch]::StartNew() 

# Open the text file from disk and process. 
$reader = New-Object System.IO.StreamReader($csv) 

Write-Output "Starting insert.." 
while (($line = $reader.ReadLine()) -ne $null) 
{ 
    [void]$datatable.Rows.Add($line.Split($delimiter)) 

    if ($datatable.rows.count % $batchsize -eq 0) 
    { 
        $runspace = [PowerShell]::Create() 
        [void]$runspace.AddScript($scriptblock) 
        [void]$runspace.AddArgument($connstring) 
        [void]$runspace.AddArgument($datatable) # <-- Send datatable 
        [void]$runspace.AddArgument($batchsize) 
        $runspace.RunspacePool = $pool 
        $runspaces += [PSCustomObject]@{ Pipe = $runspace; Status = $runspace.BeginInvoke() } 

        # Overwrite object with a shell of itself 
        $datatable = $datatable.Clone() # <-- Create new datatable object 
    } 
} 

# Close the file 
$reader.Close() 

# Wait for runspaces to complete 
while ($runspaces.Status.IsCompleted -contains $false) {} # wait until every runspace has finished

# End timer 
$secs = $time.Elapsed.TotalSeconds 

# Cleanup runspaces 
foreach ($runspace in $runspaces) { 
[void]$runspace.Pipe.EndInvoke($runspace.Status) # EndInvoke method retrieves the results of the asynchronous call 
$runspace.Pipe.Dispose() 
} 

# Cleanup runspace pool 
$pool.Close() 
$pool.Dispose() 

# Cleanup SQL Connections 
[System.Data.SqlClient.SqlConnection]::ClearAllPools() 

# Done! Format output then display 
$totalrows = 1000000 
$rs = "{0:N0}" -f [int]($totalrows/$secs) 
$rm = "{0:N0}" -f [int]($totalrows/$secs * 60) 
$mill = "{0:N0}" -f $totalrows 

Write-Output "$mill rows imported in $([math]::round($secs,2)) seconds ($rs rows/sec and $rm rows/min)" 
Usually in this situation, this error means that some rows have unexpected embedded delimiters, in this case tabs. That's just dirty input data. You could try reading the lines, replacing the tabs with empty strings, and comparing the original and shrunken sizes to see which rows have more tabs than columns. If you have four columns, you would expect a line to shrink by three characters. –
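That check can be sketched in PowerShell. This is only a sketch of the idea in the comment above: the path and delimiter are taken from the script in the question, and it counts fields per line (equivalent to comparing sizes) so the offending line numbers are printed directly:

```powershell
# Sketch: report every line whose field count differs from the header's.
# Path and delimiter are the ones from the question; streaming, so the
# 160 GB file is never loaded into memory at once.
$csv = "M:\d\s.txt"
$delimiter = "`t"

$reader = New-Object System.IO.StreamReader($csv)
$expected = $reader.ReadLine().Split($delimiter).Count  # header row
$lineNo = 1
while (($line = $reader.ReadLine()) -ne $null) {
    $lineNo++
    $fields = $line.Split($delimiter).Count
    if ($fields -ne $expected) {
        Write-Output "Line ${lineNo}: $fields fields (expected $expected)"
    }
}
$reader.Close()
```

Any line this prints has either extra embedded delimiters or too few fields, which is exactly what makes `$datatable.Rows.Add()` fail.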

@LaughingVergil Thanks for the response. The same thing happens with a comma-delimited file; I get the same error. – Zack

For some data there may also be line-terminator problems: UNIX-style newlines in a Windows-style text file can end up putting two rows into one read operation. Embedded newlines or CR/LF pairs can also confuse the processing. –
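One way to test for the mixed line endings mentioned above is to count CRLF pairs versus bare LF bytes. A minimal sketch (path assumed from the question; note that a byte-by-byte loop in PowerShell will be slow on a 160 GB file, so you may want to stop after the first mismatch):

```powershell
# Sketch: stream the file as bytes and count CRLF vs bare-LF line endings.
$path = "M:\d\s.txt"
$fs = [System.IO.File]::OpenRead($path)
$buffer = New-Object byte[] 1MB
$crlf = 0; $bareLf = 0; $prev = 0
while (($read = $fs.Read($buffer, 0, $buffer.Length)) -gt 0) {
    for ($i = 0; $i -lt $read; $i++) {
        if ($buffer[$i] -eq 0x0A) {
            if ($prev -eq 0x0D) { $crlf++ } else { $bareLf++ }
        }
        $prev = $buffer[$i]
    }
}
$fs.Close()
Write-Output "CRLF endings: $crlf, bare LF endings: $bareLf"
```

If both counters are non-zero, the file mixes Windows and UNIX line endings and rows can get glued together or split apart when read.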

Answer

Working with a 160 GB input file is going to be painful. You can't really load it into any kind of editor, or at least you can't analyze that volume of data without some serious automation.

Based on the comments, it seems the data has some quality issues. To locate the problem data, you could try a binary search. This approach shrinks the data quickly. Like so:

1) Split the file into about two equal chunks. 
2) Try to load the first chunk. 
3) If successful, process the second chunk. If not, see 6). 
4) Try to load the second chunk. 
5) If successful, both files are valid and you have some other data quality issue. Start looking into other causes. If not, see 6). 
6) If either load failed, start from the beginning and use the failed chunk as the input file. 
7) Repeat until you have narrowed down the offending row(s). 
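The splitting step above can be sketched in PowerShell. This is illustrative only: the half-file names are hypothetical, and since the loader builds its columns from the first line, you would also want to copy the header row to the front of the second half before testing it:

```powershell
# Sketch: two-pass split of a large text file into two halves by line count.
$source = "M:\d\s.txt"

# Pass 1: count the lines without loading the file into memory.
$total = 0
$reader = New-Object System.IO.StreamReader($source)
while ($reader.ReadLine() -ne $null) { $total++ }
$reader.Close()

# Pass 2: stream the first half to one file and the rest to another.
$half = [math]::Ceiling($total / 2)
$reader = New-Object System.IO.StreamReader($source)
$w1 = New-Object System.IO.StreamWriter("M:\d\s_half1.txt")
$w2 = New-Object System.IO.StreamWriter("M:\d\s_half2.txt")
$n = 0
while (($line = $reader.ReadLine()) -ne $null) {
    $n++
    if ($n -le $half) { $w1.WriteLine($line) } else { $w2.WriteLine($line) }
}
$reader.Close(); $w1.Close(); $w2.Close()
```

Each round halves the amount of data you have to inspect, so even a 160 GB file narrows down to the bad rows in a few dozen iterations.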

Another approach is to use an ETL tool like SSIS. Configure the package to redirect invalid rows into an error log to see exactly which data isn't working.

+1 for the SSIS option. – Bruce