2015-12-07 94 views
-1

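Merging two datasets on the basis of two columns

There are n files. Each file has multiple columns, of which I can use only the first two. I have to merge these n files on those two columns and add one more column. Its value is a string whose length equals the number of files. For example, suppose there are 4 files.

File 1: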

cat dog 
lion ele 
mice hello 
new lion 
ele that 

File 2:

cat lion 
mice hello 
cub pet 
old lion 

File 3:

new lion 
cub pet 
cat dog 
hello cat 

File 4:

ele that 
hello cat 
new old 

I want to generate a new file:

cat dog  PAPA 
lion ele  PAAA 
mice hello PPAA 
new lion PAPA 
ele that PAAP 
cat lion APAA 
cub pet  APPA 
old lion APAA 
new lion AAPA 
hello cat  AAPP 
new old  AAAP 

If a row is not present in the i-th file, the character at position i should be 'A'; otherwise it should be 'P'. That is how the string is formed.
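The rule above can be sketched in base R on the four sample files. This is only an illustration, not the asker's or any answerer's code: the data frames are typed in by hand, `row_keys` is a hypothetical helper, and joining the two columns with a space assumes the values themselves contain no spaces.

```r
# The four sample files from the question, as in-memory data frames
# (first two columns only).
files <- list(
  data.frame(a = c("cat", "lion", "mice", "new", "ele"),
             b = c("dog", "ele", "hello", "lion", "that")),
  data.frame(a = c("cat", "mice", "cub", "old"),
             b = c("lion", "hello", "pet", "lion")),
  data.frame(a = c("new", "cub", "cat", "hello"),
             b = c("lion", "pet", "dog", "cat")),
  data.frame(a = c("ele", "hello", "new"),
             b = c("that", "cat", "old"))
)

# Hypothetical helper: one key string per row, e.g. "cat dog".
row_keys <- function(df) paste(df[[1]], df[[2]])

# All distinct pairs, in order of first appearance across the files.
all_keys <- unique(unlist(lapply(files, row_keys)))

# One column of "P"/"A" per file, then paste the columns together row-wise.
flags <- sapply(files, function(df) ifelse(all_keys %in% row_keys(df), "P", "A"))
presence <- apply(flags, 1, paste, collapse = "")

result <- data.frame(key = all_keys, presence = presence)
print(result)
```

Each distinct pair appears once in `result`, so a pair that occurs in several files (such as `new lion`, found in files 1 and 3) gets a single row with a single string.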

+0

Please include the code you used when trying to solve this problem yourself. –

+0

@RichardScriven I have tried using the merge function, but I can't figure out how to assign 'A' or 'P' and form the string. – birk

Answers

0

If you have a small dataset, you can reshape it with the tidyr package:

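library(dplyr) 
library(tidyr) 

# fill in the paths of the input files 
list_of_file_names = c(...) 

# first_file_name, last_file_name, column1 and column2 below are placeholders 
# for the actual file names and the names of the first two columns 
tibble(file = list_of_file_names) %>% 
    group_by(file) %>% 
    do(read.csv(.$file)) %>%               # read each file, keeping its name in `file` 
    ungroup() %>% 
    distinct() %>%                         # drop repeated rows within a file 
    mutate(present = "P") %>% 
    spread(file, present, fill = "A") %>%  # one column per file, "A" where absent 
    gather(file, present_absent, first_file_name:last_file_name) %>% 
    group_by(column1, column2) %>% 
    summarize(present_absent_string = 
       present_absent %>% 
       paste(collapse = "")) 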
+0

Thank you for your reply. I'm having trouble installing the tidyr package. Is there any other way? – birk

0
"I'm having trouble doing this. Is there any other way?"

Here is one that needs no additional libraries.

#!/usr/bin/Rscript --vanilla 
# data input - filenames are to be provided as command line arguments: 
t = lapply(commandArgs(T), read.table, col.names=1:2, flush=T) # only 2 columns 
t = mapply('[<-', t, 3, value="P", SIMPLIFY=F) # mark the values as "present" 
t = Reduce(function(x, y) merge(x, y, 1:2, all=T, suffixes=ncol(x)), t) # merge 
t[is.na(t)] = "A"   # mark the not present values as "absent" 
t[3] = Reduce(function(...) paste(..., sep=''), t[-(1:2)]) # concatenate P&A 
# data output - write the desired output format 
write.table(format(t[1:3], justify="l"), quote=F, row.names=F, col.names=F)
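To make the merge and NA-replacement steps of the script above easier to follow, here is a minimal two-file sketch. It is an illustration under assumptions, not the answerer's code: the data frames are typed in by hand, and the flag columns get the explicit, made-up names `f1` and `f2` instead of relying on `suffixes`.

```r
# Two small inputs, each already marked as "present" in its own flag column.
# f1 and f2 are illustrative names, not something the script above produces.
x <- data.frame(a = c("cat", "lion"), b = c("dog", "ele"), f1 = "P",
                stringsAsFactors = FALSE)
y <- data.frame(a = c("cat", "cub"), b = c("dog", "pet"), f2 = "P",
                stringsAsFactors = FALSE)

# Full outer join on the two key columns; a row missing from one side gets NA there.
m <- merge(x, y, by = c("a", "b"), all = TRUE)

# NA means "not in that file", so it becomes "A".
m[is.na(m)] <- "A"

# Concatenate the flag columns into the final string.
m$presence <- paste0(m$f1, m$f2)
print(m)
```

With more than two files the same join is applied repeatedly, which is what the `Reduce` call in the script does. Note that `merge` sorts the result by the key columns, so the rows do not come out in order of first appearance.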