2014-01-22 15 views

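Remove rows whose column values contain more than 2 of 4 unique characters

Hopefully the wording of the title makes sense. I have a data frame composed of the values "A", "B", "C", "D", "", and "A/B". I want to identify which rows contain only 2 of "A", "B", "C", or "D". The frequency of each of these letters does not matter. I just want to know whether more than 2 of these 4 letters occur in the row.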

Here is a sample data frame:

    df.sample <- as.data.frame(rbind(
        c("A", "B", "A", "A/B", "B", "B", "B", "B", "", "B"),
        c("A", "B", "C", "A", "B", "", "", "B", "", "B"),
        c("A", "B", "D", "D", "B", "B", "B", "B", "", "B"),
        c("A", "B", "A", "A", "B", "B", "B", "B", "B", "B")))
    df.sample

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    2  A  B  C   A  B        B      B
    3  A  B  D   D  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

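I want to apply a function that determines which of the 4 letters ("A", "B", "C", or "D") are present in each row. I do not need the frequency of each letter, just a 0 or 1 value for each of "A", "B", "C", and "D". If the sum of these 4 values is greater than 2, I would like to assign the index of that row to a new vector, which will then be used to remove those rows from the data frame.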

    myfun <- function(x) {
        # Which rows contain more than 2 different letters out of A, B, C, and D?
        # The number of times each letter occurs in a given row does not matter.
        # What matters is whether a row contains more than 2 of the 4 letters.
        # Each row should contain only 2 of them; the combination does not matter.

        out <- which(something > 2)  # 'something' is a placeholder for the logic I am missing
    }

    # Return a vector of row indexes that contain more than 2 of the 4 letters.
    row.indexes <- apply(df.sample, 1, function(x) myfun(x))

    # Create a new data frame excluding rows containing more than 2 of the 4 letters.
    new.df.sample <- df.sample[-row.indexes, ]

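In df.sample above, rows 2 and 3 contain more than 2 of those 4 letters and should therefore be indexed for removal. After running df.sample through the function and removing the rows in row.indexes, my new.df.sample data frame should look like this: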

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

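I tried to think of this as a logical statement for each of the 4 letters: assign a 0 or 1 to each letter, sum them, and then identify which sums are greater than 2. For example, I thought I could try 'grep()', convert the result into a logical value for each letter, turn that into 0 or 1, and add them up. That seemed too lengthy, and it did not work the way I tried it. Any ideas?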
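The idea described above (one presence flag per letter, summed per row) can be sketched with 'grepl()' as follows. This is only one possible way to write it: the helper name count.letters and the variables letters.of.interest, letter.counts, and drop.rows are made up for illustration, and stringsAsFactors = FALSE is an added assumption so that the cells are plain strings.

```r
# Sample data from the question; stringsAsFactors = FALSE is an assumption
# added here so every cell is a plain character string.
df.sample <- as.data.frame(rbind(
  c("A", "B", "A", "A/B", "B", "B", "B", "B", "", "B"),
  c("A", "B", "C", "A", "B", "", "", "B", "", "B"),
  c("A", "B", "D", "D", "B", "B", "B", "B", "", "B"),
  c("A", "B", "A", "A", "B", "B", "B", "B", "B", "B")
), stringsAsFactors = FALSE)

letters.of.interest <- c("A", "B", "C", "D")

# Hypothetical helper: how many of the four letters occur anywhere in a row?
count.letters <- function(x) {
  # For each letter, TRUE if at least one cell of the row contains it.
  # A cell such as "A/B" counts for both "A" and "B".
  present <- sapply(letters.of.interest,
                    function(l) any(grepl(l, x, fixed = TRUE)))
  sum(present)  # TRUE/FALSE are summed as 1/0
}

letter.counts <- apply(df.sample, 1, count.letters)  # 2 3 3 2
drop.rows <- letter.counts > 2                       # logical flag per row
new.df.sample <- df.sample[!drop.rows, ]             # keeps rows 1 and 4
```

Because the match is a substring test, this treats "A/B" the same way as a cell holding "A" plus a cell holding "B", without any special handling of the slash.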

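How should 'A/B' be handled?

For A/B, ignore the fact that it is "A/B" and only check whether the value contains A, B, C, or D. The value within a cell does not have to be an exact match; it only has to contain the value I am looking for. For example, if the A/B in row 1 were actually A/C, that row would be indexed for removal, but because it is A/B, the row stays. – SC2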

Answer


Here is a function for this task. It returns a logical value for each row; TRUE marks a row that contains more than two distinct letters:

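myfun <- function(x) {
    # Split combined cells such as "A/B" into their individual letters
    sp <- unlist(strsplit(x, "/"))
    # TRUE if the row holds more than 2 distinct letters among A, B, C, and D
    length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
}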

row.indexes <- apply(df.sample, 1, myfun) 
# [1] FALSE TRUE TRUE FALSE 

new.df.sample <- df.sample[!row.indexes, ] # negate the index with '!' 

#   V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
# 1  A  B  A A/B  B  B  B  B      B
# 4  A  B  A   A  B  B  B  B  B   B
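
As a quick check of the edge case raised in the comments, the sketch below reuses myfun on a variant of the sample data in which the "A/B" cell in row 1 is replaced by "A/C". The data frame df.variant and the variables flags and kept are hypothetical names introduced only for this illustration, and stringsAsFactors = FALSE is an added assumption.

```r
# Same function as in the answer above.
myfun <- function(x) {
  sp <- unlist(strsplit(x, "/"))
  length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
}

# Hypothetical variant of the sample data: "A/C" instead of "A/B" in row 1.
df.variant <- as.data.frame(rbind(
  c("A", "B", "A", "A/C", "B", "B", "B", "B", "", "B"),
  c("A", "B", "C", "A", "B", "", "", "B", "", "B"),
  c("A", "B", "D", "D", "B", "B", "B", "B", "", "B"),
  c("A", "B", "A", "A", "B", "B", "B", "B", "B", "B")
), stringsAsFactors = FALSE)

flags <- apply(df.variant, 1, myfun)
# Row 1 now holds A, B, and C, so it is flagged along with rows 2 and 3.
kept <- df.variant[!flags, ]   # only row 4 remains
```

Indexing with the negated logical vector also behaves sensibly when no row is flagged. With numeric indexes, something like df[-which(cond), ] returns zero rows in that situation, because negating an empty index vector still yields an empty index.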

See, I knew it would be much simpler. Perfect, thank you! – SC2

@SC2 I updated the function. It now also works for the 'A/B' case.
