2017-08-31 114 views

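R - loop through data.frames in a list - modify characters of a column (list element)

I have a few thousand *.csv files (all with unique names), but the header columns are the same in every file, e.g. "Timestamp", "System_Name", "CPU_ID", and so on.
My question is: how can I replace "System_Name" (which is a system name such as "as12535.org.at", or any other combination of characters) and anonymize it? I'd appreciate any hints or a pointer in the right direction.
The structure of the CSV files is shown below.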

"Timestamp","System_Name","CPU_ID","User_CPU","User_Nice_CPU","System_CPU","Idle_CPU","Busy_CPU","Wait_IO_CPU","User_Sys_Pct" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 

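I tried the R package anonymizer, which works fine at the vector level, but I ran into problems because I am reading thousands of CSV files into R. What I tried is the following: building a list that contains every CSV file as a data frame.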

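# initialize a list
# note: setwd() returns the previous working directory, not "mypath"
r.path <- setwd("mypath")
ldf <- list()

# creates the list of all the csv files in my directory - but filter for
# files with UnixM in the filename for testing.
listcsv <- dir(pattern = ".UnixM.")

# seq_along() is safe even when listcsv is empty, unlike 1:length(listcsv)
for (i in seq_along(listcsv)) {
  ldf[[i]] <- read.csv(file = listcsv[i])
}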

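I'm racking my brain because I can't manage to anonymize the System_Name column, or even just replace some of its characters (pseudo-anonymization), by looping over the list (ldf) and the data frames it contains.

My list ldf (one data frame per CSV file) looks like this: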

summary(ldf) 
Length Class  Mode 
[1,] 5  data.frame list 
[2,] 5  data.frame list 
[3,] 5  data.frame list 

This shows the structure of my list, which holds the contents of each file as a data frame.

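How can I now read all the CSV files, change or anonymize all or even just part of the "System_Name" column, and do this for every CSV in my directory, using a loop in R? It doesn't need to be super elegant; I'm happy when it works :-)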


Use 'lapply' to apply the function you want to each element of the list. I don't know how anonymizer works; assuming the function is called like 'anonymizer(column)': 'lapply(list, function(x) anonymizer(x$System_Name))'
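To illustrate the comment above: a minimal sketch showing that the 'lapply' call as written returns only the transformed vectors, while returning the whole data frame from the function keeps the list of data frames intact. The function mask_name is a hypothetical base-R stand-in for an anonymizing function (it is not part of the anonymizer package), and the example data is made up.

```r
# Hypothetical stand-in for an anonymizing function (not from any package):
# replaces everything before the first "." with a fixed placeholder.
mask_name <- function(x) sub("^[^.]+", "host", x)

# Small made-up list shaped like ldf (one data frame per CSV file).
ldf <- list(
  data.frame(System_Name = c("as06240.org.xyz:LZ", "as06240.org.xyz:LZ"),
             CPU_ID = c(-1, -1), stringsAsFactors = FALSE),
  data.frame(System_Name = "as12535.org.at:LZ",
             CPU_ID = 0, stringsAsFactors = FALSE)
)

# Returns only the masked vectors; the data frames in ldf are unchanged.
masked_only <- lapply(ldf, function(x) mask_name(x$System_Name))

# Returns the full data frames with the column replaced.
ldf <- lapply(ldf, function(x) {
  x$System_Name <- mask_name(x$System_Name)
  x
})
```

The second form is usually what is wanted when the other columns have to be kept alongside the anonymized one.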

Answer


A common pattern for this kind of job is:

df <- do.call(
    rbind, 
    lapply(dir(pattern = "UnixM"),
           read.csv, stringsAsFactors = FALSE)
) 
df$System_Name <- anonymizer::anonymize(df$System_Name) 

This differs from what you tried in that it binds all the data frames together first and only then anonymizes.
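One situation where binding first may matter is when the pseudonym depends on the whole set of names. The sketch below assumes a hypothetical counter-based stand-in, pseudonymize (not part of the anonymizer package), and two made-up files written to a temporary directory; the same system name then gets the same pseudonym regardless of which file it came from.

```r
# Write two small made-up files to a temporary directory.
tmp <- file.path(tempdir(), "bind_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(System_Name = c("a.org:LZ", "b.org:LZ"), CPU_ID = -1),
          file.path(tmp, "one_UnixM_1.csv"), row.names = FALSE)
write.csv(data.frame(System_Name = c("b.org:LZ", "c.org:LZ"), CPU_ID = -1),
          file.path(tmp, "two_UnixM_2.csv"), row.names = FALSE)

# Read every matching file and stack the rows into one data frame.
files <- sort(dir(tmp, pattern = "UnixM", full.names = TRUE))
df <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

# Hypothetical stand-in for an anonymizer: number the distinct names in
# order of first appearance across the combined data.
pseudonymize <- function(x) sprintf("sys_%03d", match(x, unique(x)))
df$System_Name <- pseudonymize(df$System_Name)
```

Here "b.org:LZ" appears in both files and ends up with one pseudonym; applying the same counter to each file separately would restart the numbering per file.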

Of course, you can also keep everything in a list, as @S Rivero suggested. It would look like this:

listdf <- lapply(
    dir(pattern = "UnixM"), 
    function(filename) {
        df <- read.csv(filename, stringsAsFactors = FALSE)
        df$System_Name <- anonymizer::anonymize(df$System_Name)
        df
    }
)
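If the anonymized data should end up back on disk as one file per input, a plain loop is enough. This is a rough sketch under several assumptions: the directories in_dir and out_dir and the example file are made up, mask_name is a hypothetical stand-in for anonymizer::anonymize, and colClasses = "character" is used so that numeric-looking fields such as Timestamp are kept as text rather than being converted and possibly reformatted on output.

```r
# Made-up input and output directories under tempdir().
in_dir  <- file.path(tempdir(), "write_demo_in")
out_dir <- file.path(tempdir(), "write_demo_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(System_Name = "as06240.org.xyz:LZ", CPU_ID = -1),
          file.path(in_dir, "a_UnixM_1.csv"), row.names = FALSE)

# Hypothetical stand-in for anonymizer::anonymize().
mask_name <- function(x) sub("^[^.]+", "host", x)

for (f in dir(in_dir, pattern = "UnixM")) {
  # Read every column as text so values are written back unchanged.
  df <- read.csv(file.path(in_dir, f), colClasses = "character")
  df$System_Name <- mask_name(df$System_Name)
  # Write under the same file name into the separate output directory.
  write.csv(df, file.path(out_dir, f), row.names = FALSE)
}
```

Writing to a separate directory leaves the original files untouched, which makes it easy to rerun the loop if the anonymization rule changes.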