2010-05-13 116 views
1

PowerShell - convert a log file to CSV

I have a log file with records that look like this...

2009-12-18T08:25:22.983Z  1   174 dns:0-apr-credit-cards-uk.pedez.co.uk P http://0-apr-credit-cards-uk.pedez.co.uk/ text/dns #170 20091218082522021+89 sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF - - 
2009-12-18T08:25:22.984Z  1   5 dns:0-60racing.co.uk P http://0-60racing.co.uk/ text/dns #116 20091218082522037+52 sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG - - 
2009-12-18T08:25:23.066Z  1   79 dns:0-addiction.metapress.com.wam.leeds.ac.uk P http://0-addiction.metapress.com.wam.leeds.ac.uk/ text/dns #042 20091218082522076+20 sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R - - 
...plus millions of other records 

I need to convert these into a CSV file...

"2009-12-18T08:25:22.983Z","1","174","dns:0-apr-credit-cards-uk.pedez.co.uk","P","http://0-apr-credit-cards-uk.pedez.co.uk/","text/dns","#170","20091218082522021+89","sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF","-","-" 
"2009-12-18T08:25:22.984Z","1","5","dns:0-60racing.co.uk","P","http://0-60racing.co.uk/","text/dns","#116","20091218082522037+52","sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG","-","-" 
"2009-12-18T08:25:23.066Z","1","79","dns:0-addiction.metapress.com.wam.leeds.ac.uk","P","http://0-addiction.metapress.com.wam.leeds.ac.uk/","text/dns","#042","20091218082522076+20","sha1:NSUQN6TBIECAP5VG6TZJ5AVY34ANIC7R","-","-" 

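The field separator can be one or more space characters, and the file mixes fixed-width and variable-width fields. This tends to confuse most of the CSV parsers I have found.

Ultimately I want to load these files into SQL Server, but it only lets you specify a single character as the field delimiter (i.e. ' '), and that breaks the fixed-length fields.

So far I am using PowerShell: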

gc -ReadCount 10 -TotalCount 200 .\crawl_sample.log | foreach { ([regex]'([\S]*)\s+').matches($_) } | foreach {$_.Groups[1].Value} 

This returns a stream of fields:

2009-12-18T08:25:22.983Z 
1 
74 
dns:0-apr-credit-cards-uk.pedez.co.uk 
P 
http://0-apr-credit-cards-uk.pedez.co.uk/ 
text/dns 
#170 
20091218082522021+89 
sha1:AIDBQOKOYI7OPLVSWEBTIAFVV7SRMLMF 
- 
- 
2009-12-18T08:25:22.984Z 
1 
55 
dns:0-60racing.co.uk 
P 
http://0-60racing.co.uk/ 
text/dns 
#116 
20091218082522037+52 
sha1:WMII7OOKYQ42G6XPITMHJSMLQFLGCGMG 
- 

But how do I convert that output into CSV format?

+0

You might want to look at my FOSS CSV munging tool http://code.google.com/p/csvfix, which can do what you want, but only as a multi-stage process. – 2010-05-13 12:30:09

Answers

2

Answering my own question again...

measure-command { 
    $q = [regex]" +" 
    $q.Replace(([string]::join([environment]::newline, (Get-Content -ReadCount 1 .\crawl_sample2.log))), ",") > crawl_sample2.csv 
} 

...and it's fast!

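Observations:

  • I had been using \s+ as the regex delimiter, and that was breaking the newlines, so I switched to " +"
  • Get-Content -ReadCount 1 streams single-line arrays to the regex
  • The output string is then redirected to a new file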
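
To illustrate the first observation, here is a minimal sketch showing why \s+ merges records once the lines have been joined, while " +" leaves the line breaks alone. The two sample records are made up for the example.

```powershell
# Two hypothetical records joined into one string, much as [string]::Join does above.
$text = "a  b c" + "`n" + "d e   f"

# \s also matches newlines, so the record boundary itself is replaced by a comma.
$withS = [regex]::Replace($text, '\s+', ',')

# ' +' matches only spaces, so the newline between the records survives.
$withSpace = [regex]::Replace($text, ' +', ',')

$withS       # a,b,c,d,e,f  (one merged record)
$withSpace   # a,b,c and d,e,f on separate lines
```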

UPDATE

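This script works, but it uses a huge amount of memory when processing large files. So how do I do the same thing without needing 8GB of RAM and swap space?

I think this is caused by the join buffering all of the data again... any ideas?

UPDATE 2

OK - there is a better solution...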

Get-Content -readcount 100 -totalcount 100000 .\crawl.log | 
    ForEach-Object { $_ } | 
     foreach { $_ -replace " +", "," } > .\crawl.csv 

A very handy guide for PowerShell - Powershell regular expressions
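
The pipeline above produces unquoted fields, and because each record ends with a space it also leaves a trailing comma on every line. If the quoted layout from the question is wanted, one possible sketch is to split each line on whitespace and quote the fields individually. ConvertTo-QuotedCsvLine is a hypothetical helper name, and the file paths in the usage comment are assumptions.

```powershell
# Hypothetical helper: turns one space-delimited log line into a quoted CSV row.
function ConvertTo-QuotedCsvLine {
    param([string]$Line)
    # Trim first so the trailing space on each record does not create an empty last field.
    $fields = $Line.Trim() -split '\s+'
    # Double any embedded quotes, wrap each field in quotes, then join with commas.
    $quoted = $fields | ForEach-Object { '"' + ($_ -replace '"', '""') + '"' }
    $quoted -join ','
}

# Usage (paths assumed); lines are processed one at a time rather than joined in memory:
# Get-Content .\crawl.log | ForEach-Object { ConvertTo-QuotedCsvLine $_ } | Set-Content .\crawl.csv
ConvertTo-QuotedCsvLine '2009-12-18T08:25:22.984Z  1   5 dns:0-60racing.co.uk P '
```

The final call should print "2009-12-18T08:25:22.984Z","1","5","dns:0-60racing.co.uk","P". Splitting on \s+ is safe here because it is applied to a single line at a time, so there are no newlines for it to swallow.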

+0

...any better solutions or improvements to the script are welcome! – Guy 2010-05-13 12:44:29

+1

You can simplify this by getting rid of the intermediate Foreach-Object, because -replace operates on string arrays, e.g. `'a b','c d','e f' -replace ' +',','`. Try this: `gc crawl.log -read 100 -total 100000 | %{$_ -replace ' +',','} > crawl.csv` – 2010-05-14 00:05:21

+0

Consider `-replace`; it can be even simpler: `(gc crawl.log ...) -replace ' +',',' > crawl.csv` (see my post on *chain of operators*: http://www.leporelo.eu/blog.aspx?id=powershell-tips-and-tricks-3-chain-of-operators) – stej 2010-05-14 08:07:34