2016-07-06 42 views
1

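Manipulating two data frames based on strings of different length

I asked a question here, Finding the index based on two data frames of strings, and got a perfect answer. Now I am facing another problem that I cannot solve. If my second data set has multiple columns, I can handle it based on that answer with: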

setDT(strs)[, c('colids1','colids2') := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][] 

This is fine as long as all columns of my second data set (strs) have the same length, but if the lengths vary, it does not work and gives me an error.
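
For readers unfamiliar with the expression inside lapply above, here is a minimal sketch of what it computes for a single string. The table toy_lut and the helper find_cols are hypothetical names made up for this illustration and are not part of the question's data.

```r
# Toy lookup table (hypothetical values); character columns padded with NA.
toy_lut <- data.frame(
  V1 = c("A", "B", NA),
  V2 = c("B", "C", NA),
  V3 = c("D", NA, NA),
  stringsAsFactors = FALSE
)

# Indices of the columns of `tbl` that contain the string `x`,
# collapsed into one comma-separated string.
find_cols <- function(x, tbl) {
  hits <- colSums(tbl == x, na.rm = TRUE)  # matches per column, NAs ignored
  toString(which(hits > 0))
}

find_cols("B", toy_lut)  # "1, 2"
find_cols("D", toy_lut)  # "3"
find_cols("",  toy_lut)  # "" because no column contains an empty string
```

Grouping by = 1:nrow(strs) makes data.table call this kind of function once per row and per column, so every cell of strs gets its own list of matching column indices.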

So let's say my first data set is

lut <- structure(list(V1 = c("O75663", "O95400", "O95433", NA, NA), 
    V2 = c("O95456", "O95670", NA, NA, NA), V3 = c("O75663", 
    "O95400", "O95433", "O95456", "O95670"), V4 = c("O95456", 
    "O95670", "O95801", "P00352", NA), V1 = c("O75663", "O95400", 
    "O95433", NA, NA), V2 = c("O95456", "O95670", NA, NA, NA), 
    V3 = c("O75663", "O95400", "O95433", "O95456", "O95670"), 
    V4 = c("O95456", "O95670", "O95801", "P00352", NA)), .Names = c("V1", 
"V2", "V3", "V4", "V1", "V2", "V3", "V4"), row.names = c(NA, 
-5L), class = "data.frame") 

and my second data set is

strs <- structure(list(strings = structure(c(2L, 3L, 4L, 5L, 6L, 7L, 
1L, 1L), .Label = c("", "O75663", "O95400", "O95433", "O95456", 
"O95670", "O95801"), class = "factor"), strings2 = structure(c(4L, 
2L, 6L, 5L, 3L, 1L, 1L, 1L), .Label = c("", "O75663", "O95433", 
"O95456", "P00352", "P00492"), class = "factor"), strings3 = structure(c(4L, 
6L, 7L, 8L, 2L, 3L, 5L, 1L), .Label = c("", "O75663", "O95400", 
"O95456", "O95670", "O95801", "P00352", "P00492"), class = "factor"), 
    strings4 = structure(c(2L, 5L, 3L, 4L, 1L, 1L, 1L, 1L), .Label = c("", 
    "O95400", "O95456", "O95801", "P00492"), class = "factor"), 
    strings5 = structure(c(8L, 2L, 7L, 1L, 3L, 6L, 5L, 4L), .Label = c("O75663", 
    "O95400", "O95433", "O95456", "O95670", "O95801", "P00352", 
    "P00492"), class = "factor")), .Names = c("strings", "strings2", 
"strings3", "strings4", "strings5"), class = "data.frame", row.names = c(NA, 
-8L)) 

This is what I tried:

df<- setDT(strs)[, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][] 

It works if the columns of strs all have the same length, but it does not work when the lengths vary, as in the example I give here.

+0

The error is obvious. Try this: 'strs[c(1:3,5)] <- lapply(strs[c(1:3,5)], as.character)' and then run your 'data.table' statement. Does the resulting 'df' match what you expect? – Sumedh

+0

@Sumedh Thanks for your message, but it does not solve the problem. I did what you said and then ran df <- setDT(strs)[, paste0('colids_', seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm = TRUE) > 0))), by = 1:nrow(strs)][] and got the same error. – nik

+0

@Sumedh I have been trying every suggestion I could find on the web, but I don't know why it is not working! – nik

Answers

1

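Converting the factor variables in strs to character variables can also be done easily with data.table. Assuming your strs data set is already a data.table, you should do: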

strs[, names(strs) := lapply(.SD, as.character)] 

If strs is not a data.table yet, you should use:

setDT(strs)[, names(strs) := lapply(.SD, as.character)] 

After that, you can perform the operation you asked for. Chained all together, it looks like:

setDT(strs)[, lapply(.SD, as.character) 
      ][, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), 
       by = 1:nrow(strs)][] 
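
The two conversion styles used in this answer behave differently with respect to the original object, which a small sketch may make clearer. It assumes the data.table package is installed; toy, by_ref and converted are hypothetical names with made-up values, not the question's data.

```r
library(data.table)

# Toy stand-in for strs: two factor columns with different level sets.
toy <- data.table(
  s1 = factor(c("O75663", "O95400", "")),
  s2 = factor(c("O95456", "", ""))
)

# Style 1: `:=` replaces each column in place (by reference).
by_ref <- copy(toy)
by_ref[, names(by_ref) := lapply(.SD, as.character)]

# Style 2: lapply(.SD, ...) without `:=` builds a new data.table;
# `toy` itself keeps its factor columns.
converted <- toy[, lapply(.SD, as.character)]

sapply(toy, class)        # both columns are still "factor"
sapply(by_ref, class)     # both columns are "character"
sapply(converted, class)  # both columns are "character"
```

Because the second style returns a new object, the chained statement has to be assigned to a variable (for example df <- ...) if the result is needed later.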
+0

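Thank you very much for your valuable comments. I have already upvoted your answer, it's great! Would it be possible to have a look at my real data? Once you have looked, I can remove it from the web. Thanks – nik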

+1

@nik I am already looking ;-) – Jaap

+0

Great, thanks, brother. I have also accepted your answer because it is very informative and I learned a lot from it. Thanks again – nik

2

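I learned this one from @scentoni. rapply is a recursive version of lapply; here it converts all factor vectors to character. With how = "replace", each element of the list that is not itself a list and whose class is included in classes is replaced by the result of applying the function (as.character here) to that element.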

strs <- rapply(strs, as.character, classes="factor", how="replace") 
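
As a rough illustration of the classes argument, the sketch below runs the same call on a small mixed data frame. The data frame mixed and its values are hypothetical; only base R is used.

```r
# Toy data frame: one factor, one numeric and one character column.
mixed <- data.frame(
  id    = factor(c("O75663", "O95400")),
  score = c(1.5, 2.5),
  note  = c("a", "b"),
  stringsAsFactors = FALSE
)

# Only elements whose class is listed in `classes` are passed to as.character;
# with how = "replace" everything else is left as it is.
out <- rapply(mixed, as.character, classes = "factor", how = "replace")

sapply(mixed, class)  # id is "factor"
sapply(out, class)    # id is "character"; score and note are unchanged
```

Restricting the conversion to classes = "factor" is what keeps numeric columns from being turned into strings as a side effect.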

Then run

df<- setDT(strs)[, paste0('colids_',seq_along(strs)) := lapply(.SD, function(x) toString(which(colSums(lut == x, na.rm=TRUE) > 0))), by = 1:nrow(strs)][] 
+0

This one works too! Could you comment on this function? – nik

+0

Thanks, it works – nik