Split a variable name and spread its parts into separate columns of data in R

I have some perfmon (Windows performance log) data that I would like to parse.

A typical set of column names looks like this:

> colnames(p) 
[1] "Time"               
[2] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length"  
[3] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length" 
[4] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length" 
[5] "\\\\testdb1\\Processor(_Total)\\% Processor Time"    
[6] "\\\\testdb1\\System\\Processes"        
[7] "\\\\testdb1\\System\\Processor Queue Length"

The way I read this data into R is:

p <- read.csv("r-perfmon.csv",stringsAsFactors = FALSE, check.names = FALSE)

Here is some sample data:

> head(p) 
        Time \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length 
1 04/15/2013 00:00:19.279            0.040037563 
2 04/15/2013 00:00:34.279            0.009740260 
3 04/15/2013 00:00:49.275            0.011009828 
4 04/15/2013 00:01:04.284            0.006016244 
5 04/15/2013 00:01:19.279            0.015125328 
6 04/15/2013 00:01:34.275            0.002814141 
    \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length 
1             0.001421333 
2             0.000000000 
3             0.000206726 
4             0.000000000 
5             0.001894000 
6             0.000000000 
    \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length 
1             0.038616230 
2             0.009740260 
3             0.010803102 
4             0.006016244 
5             0.013231327 
6             0.002814141 
    \\\\testdb1\\Processor(_Total)\\% Processor Time \\\\testdb1\\System\\Processes 
1          29.569339        86 
2          10.856994        86 
3           7.733924        81 
4           1.910202        81 
5           6.164864        81 
6           1.351883        81 
    \\\\testdb1\\System\\Processor Queue Length 
1           0 
2           0 
3           0 
4           0 
5           0 
6           0

I would like to parse the column names and then melt the data.

So, if we take one column of data as an example:

> example <- p[2] 
> head(example) 
    \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length 
1            0.040037563 
2            0.009740260 
3            0.011009828 
4            0.006016244 
5            0.015125328 
6            0.002814141

I would like it to look like this:

Time, MachineName, Object, Counter, InstanceName, Value 
04/15/2013 00:00:19.279, testdb1, PhysicalDisk, Avg. Disk Queue Length, 0 C:, 0.040037563 
04/15/2013 00:00:34.279, testdb1, PhysicalDisk, Avg. Disk Queue Length, 0 C:, 0.009740260 
04/15/2013 00:00:49.275, testdb1, PhysicalDisk, Avg. Disk Queue Length, 0 C:, 0.011009828

Edit: as requested, here is the dput of the head of my data:

structure(list(`(PDH-CSV 4.0) (GMT Daylight Time)(-60)` = c("04/15/2013 00:00:19.279", 
"04/15/2013 00:00:34.279", "04/15/2013 00:00:49.275", "04/15/2013 00:01:04.284", 
"04/15/2013 00:01:19.279", "04/15/2013 00:01:34.275"), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length` = c(0.040037563, 
0.00974026, 0.011009828, 0.006016244, 0.015125328, 0.002814141 
), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length` = c(0.001421333, 
0, 0.000206726, 0, 0.001894, 0), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length` = c(0.03861623, 
0.00974026, 0.010803102, 0.006016244, 0.013231327, 0.002814141 
), `\\\\testdb1\\Processor(_Total)\\% Processor Time` = c(29.56933862, 
10.85699395, 7.733924001, 1.910202013, 6.164864178, 1.351882837 
), `\\\\testdb1\\System\\Processes` = c(86L, 86L, 81L, 81L, 81L, 
81L), `\\\\testdb1\\System\\Processor Queue Length` = c(0L, 0L, 0L, 
0L, 0L, 0L)), .Names = c("(PDH-CSV 4.0) (GMT Daylight Time)(-60)", 
"\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length", "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length", 
"\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length", 
"\\\\testdb1\\Processor(_Total)\\% Processor Time", "\\\\testdb1\\System\\Processes", 
"\\\\testdb1\\System\\Processor Queue Length"), row.names = c(NA, 
6L), class = "data.frame")

Source

2015-06-28 Gauss

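First use 'reshape' in R to reshape the data into long format, then use 'strsplit' on the column that holds the original column names. If you want others to be able to reproduce your data, you also need to 'dput' it. – user227710

I got it into long format with 'p <- melt(p, id = c("time"))' but I am struggling with the rest. – Gauss

In wide format you could change each column one at a time... but I'm not sure what the final dataset should look like. For your example, though: 's <- strsplit(colnames(example), "\\\\|\\)|\\(")[[1]]; data.frame(t(s[nzchar(s)]), example[[1]])' – user20650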

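It is a bit hard to know what your final data should look like: if each column name is split on backslashes or parentheses, the number of columns in the result differs depending on the input column.

So I split each column into a separate list element. Assuming the data.frame you posted with dput is called d: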

# Look at second column - then all you need to do is tweak the names 
s <- strsplit(colnames(d)[2], "\\\\|\\)|\\(")[[1]] 
data.frame(time = d[[1]], t(s[nzchar(s)]), value=d[[2]]) 

        time  X1   X2 X3      X4  value 
1 04/15/2013 00:00:19.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.040037563 
2 04/15/2013 00:00:34.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.009740260 
3 04/15/2013 00:00:49.275 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.011009828 
4 04/15/2013 00:01:04.284 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.006016244 
5 04/15/2013 00:01:19.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.015125328 
6 04/15/2013 00:01:34.275 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.002814141

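strsplit splits each string at \ or ( or ); note that in R these characters need to be escaped with a leading \\. This produces some empty strings, which are dropped using the nzchar function (it returns FALSE for a zero-length string).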
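To make the splitting step concrete, here is a minimal sketch run on a single column name. The variable names nm, s and parts are only for illustration and do not come from the question.

```r
# One perfmon-style column name (each backslash is doubled inside an R string literal)
nm <- "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length"

# Split on a backslash, a closing parenthesis, or an opening parenthesis
s <- strsplit(nm, "\\\\|\\)|\\(")[[1]]
print(s)       # includes empty strings wherever two delimiters sit next to each other

# nzchar() is TRUE for non-empty strings, so this keeps only the real pieces
parts <- s[nzchar(s)]
print(parts)
```

For this name the four remaining pieces are the machine, the object, the instance and the counter, in that order. A name without an instance in parentheses would give only three pieces.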

# Apply it over all variables 
lapply(seq_along(colnames(d))[-1], function(i) { 
       s <- strsplit(colnames(d)[[i]], "\\\\|\\)|\\(")[[1]] 
       data.frame(time = d[[1]], t(s[nzchar(s)]), value=d[[i]]) 
})

Again, you will need to rename the columns.

Source
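If a single long table with a fixed set of columns is what is wanted, one possible approach (a sketch, not necessarily what the asker ended up using) is to match each name against one pattern in which the "(instance)" part is optional, and to fill in NA when it is absent. The helper parse_counter and the toy data frame below are hypothetical. The output column names are taken from the desired layout in the question.

```r
# Toy data shaped like the question's dput (fewer rows and columns)
d <- data.frame(
  Time = c("04/15/2013 00:00:19.279", "04/15/2013 00:00:34.279"),
  a = c(0.040037563, 0.009740260),
  b = c(29.569339, 10.856994),
  c = c(86L, 86L),
  stringsAsFactors = FALSE
)
names(d)[-1] <- c(
  "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length",
  "\\\\testdb1\\Processor(_Total)\\% Processor Time",
  "\\\\testdb1\\System\\Processes"
)

# Hypothetical helper: split \\machine\Object(instance)\Counter into four
# fields; the "(instance)" part may be missing, as in the System counters.
parse_counter <- function(path) {
  pattern <- "^\\\\\\\\([^\\\\]+)\\\\([^\\\\(]+)(\\(([^)]*)\\))?\\\\(.+)$"
  m <- regmatches(path, regexec(pattern, path))[[1]]
  if (length(m) == 0) stop("unexpected counter path: ", path)
  inst <- m[5]
  if (is.na(inst) || !nzchar(inst)) inst <- NA_character_
  list(MachineName = m[2], Object = m[3], InstanceName = inst, Counter = m[6])
}

# Build one block of rows per value column, then stack the blocks
long <- do.call(rbind, lapply(seq_along(d)[-1], function(i) {
  f <- parse_counter(names(d)[i])
  data.frame(Time = d[[1]],
             MachineName = f$MachineName,
             Object = f$Object,
             Counter = f$Counter,
             InstanceName = f$InstanceName,
             Value = d[[i]],
             stringsAsFactors = FALSE)
}))
print(long)
```

Because every block has the same six columns, rbind can stack them directly, and the System counters simply carry NA in InstanceName.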

2015-06-28 21:18:48 user20650

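Thank you, this is a very good starting point. What seems to happen with this code is that I end up with a data frame with lots of columns: time, X1, X2, time.1, X1.1, X2.1, time.2, X1.2 and so on. I would like the data to have only the time, X1, X2 columns. Does that make sense? – Gauss

Hi Gauss, I'm not quite sure what the expected output should look like; when I did just the second column, it matched the expected result in your question. Since the column names split into different numbers of parts, I'm not sure how, or if, you want to combine them. Can you indicate how you would like to combine the multi-column output? – user20650

Perhaps create three data frames, one each for PhysicalDisk, Processor and System, and bind the common columns together? – user20650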