How to extract data from a text file using R or PowerShell?

I have a text file containing data like this:

This is just text 
------------------------------- 
Username:   SOMETHI   C:     [Text] 
Account:   DFAG    Finish time:  1-JAN-2011 00:31:58.91 
Process ID:  2028aaB   Start time:  31-DEC-2010 20:27:15.30 

This is just text 
------------------------------- 
Username:   SOMEGG   C:     [Text] 
Account:   DFAG    Finish time:  1-JAN-2011 00:31:58.91 
Process ID:  20dd33DB   Start time:  12-DEC-2010 20:27:15.30 

This is just text 
------------------------------- 
Username:   SOMEYY   C:     [Text] 
Account:   DFAG    Finish time:  1-JAN-2011 00:31:58.91 
Process ID:  202223DB   Start time:  15-DEC-2010 20:27:15.30

Is there a way to extract the Username, Finish time, and Start time from this kind of data? I am looking for a starting point using R or PowerShell.

Source

2012-01-24 jrara

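R may not be the best tool for processing text files, but you can proceed as follows: identify the two columns by reading the file as a fixed-width file, separate the fields from their values by splitting the strings on the colons, add an "id" column, and put everything back in order.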

# Read the file 
d <- read.fwf("A.txt", c(37,100), stringsAsFactors=FALSE) 

# Separate fields and values 
d <- d[grep(":", d$V1),] 
d <- cbind( 
    do.call(rbind, strsplit(d$V1, ":\\s+")), 
    do.call(rbind, strsplit(d$V2, ":\\s+")) 
) 

# Add an id column 
d <- cbind(d, cumsum(d[,1] == "Username")) 

# Stack the left and right parts 
d <- rbind(d[,c(5,1,2)], d[,c(5,3,4)]) 
colnames(d) <- c("id", "field", "value") 
d <- as.data.frame(d) 
d$value <- gsub("\\s+$", "", d$value) 

# Convert to a wide data.frame (one row per id, one column per field)
library(reshape2)
d <- dcast(d, id ~ field, value.var = "value")
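The values extracted this way, including "Finish time" and "Start time", are still plain strings. If they need to be compared or subtracted, they can be converted to date-times. The following is only a sketch: `parse_stamp` is a hypothetical helper, not part of the answer, and it assumes every string has exactly the `1-JAN-2011 00:31:58.91` layout and that UTC is an acceptable time zone. Month names are mapped through the built-in `month.abb` constant so the result does not depend on the session locale.

```r
# Hypothetical helper: convert strings like "1-JAN-2011 00:31:58.91" to POSIXct.
# Assumes every element of x matches the pattern; malformed input is not handled.
parse_stamp <- function(x, tz = "UTC") {
  x <- as.character(x)  # in case the column was stored as a factor
  pattern <- "^([0-9]+)-([A-Za-z]{3})-([0-9]{4}) ([0-9]+):([0-9]+):([0-9.]+)$"
  # One row per input string: full match, day, month, year, hour, minute, second
  m <- do.call(rbind, regmatches(x, regexec(pattern, x)))
  # Map the month abbreviation to 1..12, ignoring case
  month <- match(toupper(m[, 3]), toupper(month.abb))
  ISOdatetime(
    year = as.integer(m[, 4]), month = month, day = as.integer(m[, 2]),
    hour = as.integer(m[, 5]), min = as.integer(m[, 6]),
    sec = as.numeric(m[, 7]), tz = tz
  )
}

start  <- parse_stamp(c("31-DEC-2010 20:27:15.30", "12-DEC-2010 20:27:15.30"))
finish <- parse_stamp(c("1-JAN-2011 00:31:58.91", "1-JAN-2011 00:31:58.91"))

# Run time of each process, in hours
elapsed <- difftime(finish, start, units = "hours")
print(elapsed)
```

Once the columns are date-times, ordinary comparisons such as `finish > start` work as expected.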

Source

2012-01-24 14:07:33

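Thanks, works like a charm! – jrara

What is your tool of choice for text files? Perl, Ruby perhaps? –

@RomanLuštrik: Personally I use Perl, because I am familiar with it, but Python or Ruby should prove equally good solutions. I usually prefer to do all the preprocessing separately, so that R only has to read a csv file or a table in a database. –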

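Do you have the file in a data frame, with column names such as Username, Process ID, Start time...? If so, you can extract the columns easily:

df$Username   # df is your data frame; this lists all usernames
df$FinishTime

If you want everything about a user with a specific name, use this:

df[df$Username == "SOMETHI",]

If you want to know the users along with their finish times..

Hope this can be a starting point. Let me know if anything is unclear.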
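To make the subsetting concrete, here is a small sketch on a hand-built data frame. The data frame and its column names (`Username`, `FinishTime`, `StartTime`) are assumptions for illustration only; the real file would first have to be parsed into this shape.

```r
# Hypothetical data frame mirroring the three records in the question.
df <- data.frame(
  Username   = c("SOMETHI", "SOMEGG", "SOMEYY"),
  FinishTime = c("1-JAN-2011 00:31:58.91", "1-JAN-2011 00:31:58.91",
                 "1-JAN-2011 00:31:58.91"),
  StartTime  = c("31-DEC-2010 20:27:15.30", "12-DEC-2010 20:27:15.30",
                 "15-DEC-2010 20:27:15.30"),
  stringsAsFactors = FALSE
)

# All usernames
df$Username

# Every column for one particular user
one_user <- df[df$Username == "SOMETHI", ]

# Usernames alongside their finish times
user_finish <- df[, c("Username", "FinishTime")]

print(one_user)
print(user_finish)
```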

Source

2012-01-24 13:43:12 Chris

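I think he is trying to extract the data so that he can put it in a data.frame. –

These are just guidelines on how I would approach the problem. I'm sure there is a fancier way of doing it, possibly involving plyr. :)

rara <- readLines("test.txt") # you could also use readLines(textConnection("text"))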

# find usernames 
usn <- rara[grepl("Username:", rara)] 
# you can find a fancy way to split or weed out spaces 
# I crudely do it like this: 
unlist(lapply(strsplit(usn, "  "), "[", 2)) # 2 means "extract the second element" 

# and accounts 
acc <- rara[grepl("Account:", rara)] 
unlist(lapply(strsplit(acc, "  "), "[", 2))

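You can use str_trim() (from the stringr package) to remove whitespace before/after the words. Hopefully that's enough pointers to get you going.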
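Putting these pointers together, one possible base-R version looks like the sketch below. It swaps the two-space `strsplit` for a regular expression and uses `trimws()` (base R, available from R 3.2.0) in place of `str_trim()`. The helper `grab` and the output column names are hypothetical, and the sketch assumes every record contains all three fields in the same order.

```r
# Sample input inlined so the sketch runs without a file;
# with a real file, use rara <- readLines("test.txt") instead.
rara <- c(
  "This is just text",
  "-------------------------------",
  "Username:   SOMETHI   C:     [Text]",
  "Account:   DFAG    Finish time:  1-JAN-2011 00:31:58.91",
  "Process ID:  2028aaB   Start time:  31-DEC-2010 20:27:15.30",
  "",
  "This is just text",
  "-------------------------------",
  "Username:   SOMEGG   C:     [Text]",
  "Account:   DFAG    Finish time:  1-JAN-2011 00:31:58.91",
  "Process ID:  20dd33DB   Start time:  12-DEC-2010 20:27:15.30"
)

# Hypothetical helper: keep the lines containing `label`, then return the
# captured group of `pattern`, with surrounding whitespace trimmed.
grab <- function(lines, label, pattern) {
  hits <- lines[grepl(label, lines, fixed = TRUE)]
  trimws(sub(pattern, "\\1", hits))
}

# One row per record; relies on each record having all three fields.
records <- data.frame(
  Username   = grab(rara, "Username:",    "^Username:\\s*(\\S+).*$"),
  FinishTime = grab(rara, "Finish time:", "^.*Finish time:\\s*(.*)$"),
  StartTime  = grab(rara, "Start time:",  "^.*Start time:\\s*(.*)$"),
  stringsAsFactors = FALSE
)
print(records)
```

If a record can be missing a field, the three vectors will have different lengths and `data.frame()` will fail, so a record-by-record loop would be the safer design in that case.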

Source

2012-01-24 13:59:49

Here is a PowerShell solution:

$result = @() 

get-content c:\somedir\somefile.txt | 
foreach { 
    if ($_ -match '^Username:\s+(\S+)'){ 
     $rec = ""|select UserName,FinishTime,StartTime 
     $rec.UserName = $matches[1] 
     } 
    elseif ($_ -match '^Account.+Finish\stime:\s+(.+)'){ 
     $rec.FinishTime = $matches[1] 
     } 
    elseif ($_ -match '^Process\sID:\s+\S+\s+Start\stime:\s+(.+)'){ 
     $rec.StartTime = $matches[1] 
     $result += $rec 
     } 
} 
$result

Source

2012-01-24 14:46:15 mjolinor
