2012-08-23 34 views
14

PowerShell: get the number of lines in a large (big) file

One way to get the number of lines in a file in PowerShell is this:

PS C:\Users\Pranav\Desktop\PS_Test_Scripts> $a=Get-Content .\sub.ps1 
PS C:\Users\Pranav\Desktop\PS_Test_Scripts> $a.count 
34 
PS C:\Users\Pranav\Desktop\PS_Test_Scripts> 

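However, when I have a large 800 MB text file, how do I get its line count without reading the whole file into memory?

The method above consumes too much RAM, so the script either crashes or takes far too long to complete.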

Answers

13

Use Get-Content -Read $nLinesAtTime to read your file part by part:

$nlines = 0; 
#read file by 1000 lines at a time 
gc $YOURFILE -read 1000 | % { $nlines += $_.Length }; 
[string]::Format("{0} has {1} lines", $YOURFILE, $nlines) 

And here is a simple but slow script you can use to validate the result on a small file:

gc $YOURFILE | Measure-Object -Line 
+1

It's worth pointing out that your second method only counts lines that contain text. Empty lines are not counted. – Vladislav

8

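The first thing to try is to stream Get-Content and build up the line count one line at a time, rather than storing all the lines in an array at once. I think this will give proper streaming behavior: the entire file is never held in memory at once, only the current line.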

$lines = 0 
Get-Content .\File.txt |%{ $lines++ } 

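As the other answer suggests, adding -ReadCount can speed this up. Note that with a -ReadCount greater than 1, each pipeline object is an array of lines, so add $_.Count per chunk instead of incrementing by one.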

If that doesn't work for you (too slow or too much memory), you could go directly to a StreamReader:

$count = 0 
$reader = New-Object IO.StreamReader 'c:\logs\MyLog.txt' 
while($reader.ReadLine() -ne $null){ $count++ } 
$reader.Close() # don't forget to do this. Ideally put this in a try/finally block to make sure it happens 
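The try/finally idea mentioned in the comment above could look roughly like this. It is a minimal sketch, not the answerer's code: the function name Get-LineCount and its -Path parameter are hypothetical, and the path is resolved first on the assumption that the process working directory may differ from the current PowerShell location.

```powershell
# Hypothetical helper (not from the original answer): counts lines with a
# StreamReader and always releases the file handle.
function Get-LineCount {
    param([Parameter(Mandatory = $true)][string] $Path)

    # .NET resolves relative paths against the process working directory,
    # which can differ from the PowerShell location, so resolve first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $reader = New-Object System.IO.StreamReader -ArgumentList $fullPath
    try {
        $count = 0
        # ReadLine() returns $null only at end of file; empty lines come
        # back as empty strings, so they are counted too.
        while ($null -ne $reader.ReadLine()) { $count++ }
        return $count
    }
    finally {
        # Runs even if ReadLine() throws, so the file is not left open.
        $reader.Dispose()
    }
}
```

It would be called as Get-LineCount -Path 'c:\logs\MyLog.txt'. Putting $null on the left of -ne avoids PowerShell's collection-filtering behavior if the left operand were ever an array.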
+0

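Using the IO.StreamReader code above fixed the out-of-memory error I was getting with the gc method below. I can confirm that it consumes much less memory (using PowerShell 5.0.10514.6) – Fares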

1

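Here is something I wrote to try to reduce memory usage while stripping the whitespace out of my txt file. Memory usage still gets fairly high, but the process takes less time to run. To give you some background: my file has over 2 million records, and every line has leading and trailing whitespace. I believe the total time was 5+ minutes. If there is a way to improve this, please let me know your thoughts. Thanks.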

$testing = 'C:\Users\something\something\test3.txt' 

$filecleanup = gci $testing 

    foreach ($file in $filecleanup) 
    { $file1 = gc $file -readcount 1000 |foreach{ $_.Trim()} 
    $file1 > $filecleanup} 
9

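Here's a PowerShell script I cobbled together that demonstrates a few different ways of counting the lines in a text file, along with the time and memory each method requires. The results (below) show clear differences in time and memory requirements. In my tests, the sweet spot looks like Get-Content with a ReadCount setting of 100. The other tests required significantly more time and/or memory.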

#$testFile = 'C:\test_small.csv' # 245 lines, 150 KB 
#$testFile = 'C:\test_medium.csv' # 95,365 lines, 104 MB 
$testFile = 'C:\test_large.csv' # 285,776 lines, 308 MB 

# Using ArrayList just because they are faster than Powershell arrays, for some operations with large arrays. 
$results = New-Object System.Collections.ArrayList 

function AddResult { 
param([string] $sMethod, [string] $iCount) 
    $result = New-Object -TypeName PSObject -Property @{ 
     "Method" = $sMethod 
     "Count" = $iCount 
     "Elapsed Time" = ((Get-Date) - $dtStart) 
     "Memory Total" = [System.Math]::Round((GetMemoryUsage)/1mb, 1) 
     "Memory Delta" = [System.Math]::Round(((GetMemoryUsage) - $dMemStart)/1mb, 1) 
    } 
    [void]$results.Add($result) 
    Write-Output "$sMethod : $iCount" 
    [System.GC]::Collect() 
} 

function GetMemoryUsage { 
    # return ((Get-Process -Id $pid).PrivateMemorySize) 
    return ([System.GC]::GetTotalMemory($false)) 
} 

# Get-Content -ReadCount 1 
[System.GC]::Collect() 
$dMemStart = GetMemoryUsage 
$dtStart = Get-Date 
$count = 0 
Get-Content -Path $testFile -ReadCount 1 |% { $count++ } 
AddResult "Get-Content -ReadCount 1" $count 

# Get-Content -ReadCount 10,100,1000,0 
# Note: ReadCount = 1 returns a string. Any other value returns an array of strings. 
# Thus, the Count property only applies when ReadCount is not 1. 
@(10,100,1000,0) |% { 
    $dMemStart = GetMemoryUsage 
    $dtStart = Get-Date 
    $count = 0 
    Get-Content -Path $testFile -ReadCount $_ |% { $count += $_.Count } 
    AddResult "Get-Content -ReadCount $_" $count 
} 

# Get-Content | Measure-Object 
$dMemStart = GetMemoryUsage 
$dtStart = Get-Date 
$count = (Get-Content -Path $testFile -ReadCount 1 | Measure-Object -line).Lines 
AddResult "Get-Content -ReadCount 1 | Measure-Object" $count 

# Get-Content.Count 
$dMemStart = GetMemoryUsage 
$dtStart = Get-Date 
$count = (Get-Content -Path $testFile -ReadCount 1).Count 
AddResult "Get-Content.Count" $count 

# StreamReader.ReadLine 
$dMemStart = GetMemoryUsage 
$dtStart = Get-Date 
$count = 0 
# Use this constructor to avoid file access errors, like Get-Content does. 
$stream = New-Object -TypeName System.IO.FileStream(
    $testFile, 
    [System.IO.FileMode]::Open, 
    [System.IO.FileAccess]::Read, 
    [System.IO.FileShare]::ReadWrite) 
if ($stream) { 
    $reader = New-Object IO.StreamReader $stream 
    if ($reader) { 
     while(-not ($reader.EndOfStream)) { [void]$reader.ReadLine(); $count++ } 
     $reader.Close() 
    } 
    $stream.Close() 
} 

AddResult "StreamReader.ReadLine" $count 

$results | Select Method, Count, "Elapsed Time", "Memory Total", "Memory Delta" | ft -auto | Write-Output 

Here are the results for a text file containing ~95k lines, 104 MB (memory columns are in MB):

Method                                      Count  Elapsed Time      Memory Total  Memory Delta
------                                      -----  ------------      ------------  ------------
Get-Content -ReadCount 1                    95365  00:00:11.1451841          45.8           0.2
Get-Content -ReadCount 10                   95365  00:00:02.9015023          47.3           1.7
Get-Content -ReadCount 100                  95365  00:00:01.4522507          59.9          14.3
Get-Content -ReadCount 1000                 95365  00:00:01.1539634          75.4          29.7
Get-Content -ReadCount 0                    95365  00:00:01.3888746           346         300.4
Get-Content -ReadCount 1 | Measure-Object   95365  00:00:08.6867159          46.2           0.6
Get-Content.Count                           95365  00:00:03.0574433         465.8         420.1
StreamReader.ReadLine                       95365  00:00:02.5740262          46.2           0.6

Here are the results for a larger file (containing ~285k lines, 308 MB; memory columns are in MB):

Method                                      Count  Elapsed Time      Memory Total  Memory Delta
------                                      -----  ------------      ------------  ------------
Get-Content -ReadCount 1                   285776  00:00:36.2280995          46.3           0.8
Get-Content -ReadCount 10                  285776  00:00:06.3486006          46.3           0.7
Get-Content -ReadCount 100                 285776  00:00:03.1590055          55.1           9.5
Get-Content -ReadCount 1000                285776  00:00:02.8381262          88.1          42.4
Get-Content -ReadCount 0                   285776  00:00:29.4240734         894.5         848.8
Get-Content -ReadCount 1 | Measure-Object  285776  00:00:32.7905971          46.5           0.9
Get-Content.Count                          285776  00:00:28.4504388        1219.8        1174.2
StreamReader.ReadLine                      285776  00:00:20.4495721            46           0.4
4

Here is a one-liner based on Pseudothink's post. For one specific file:

"the_name_of_your_file.txt" |% {$n = $_; $c = 0; Get-Content -Path $_ -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"} 

For all files in the current directory:

Get-ChildItem "." |% {$n = $_; $c = 0; Get-Content -Path $_ -ReadCount 1000 |% { $c += $_.Count }; "$n; $c"} 
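Spelled out, the one-liners above do roughly the following. This is only an illustrative expansion: the function name Get-FileLineCount is hypothetical, and -LiteralPath is swapped in for -Path on the assumption that file names should not be treated as wildcard patterns.

```powershell
# Hypothetical expanded form of the one-liner above (names are illustrative).
function Get-FileLineCount {
    param([Parameter(Mandatory = $true)][string] $Path)

    $total = 0
    # -ReadCount 1000 sends arrays of up to 1000 lines down the pipeline,
    # so each chunk's Count is added instead of incrementing by one.
    Get-Content -LiteralPath $Path -ReadCount 1000 | ForEach-Object {
        $total += $_.Count
    }
    # Same "name; count" output format as the one-liner.
    "$Path; $total"
}
```

For every file in the current directory, this could be used as Get-ChildItem -File | ForEach-Object { Get-FileLineCount -Path $_.FullName } (the -File switch requires PowerShell 3.0 or later). Reading in chunks keeps memory bounded to about one chunk at a time while avoiding the per-line pipeline overhead of -ReadCount 1.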
+0

Please explain it in more detail. –

+0

Perfect solution. –