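Remove rows whose column values contain more than 2 of 4 unique characters
Hopefully the wording of the title makes sense. I have a data frame composed of the values "A", "B", "C", "D", "" and "A/B". I want to determine which rows contain only 2 of "A", "B", "C" or "D". The frequency of each letter does not matter. I just want to know whether more than 2 of these 4 letters are present in the row.
Below is a sample data frame: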
df.sample = as.data.frame(rbind(c("A","B","A","A/B","B","B","B","B","","B"),c("A","B","C","A","B","","","B","","B"),c("A","B","D","D","B","B","B","B","","B"),c("A","B","A","A","B","B","B","B","B","B")))
df.sample
  V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
1  A  B  A A/B  B  B  B  B      B
2  A  B  C   A  B        B      B
3  A  B  D   D  B  B  B  B      B
4  A  B  A   A  B  B  B  B  B   B
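I want a function that I can apply to each row to determine how many of the 4 letters ("A", "B", "C" or "D") occur in it. I do not need the frequency of each letter, just a 0 or 1 value for each of "A", "B", "C" and "D". If the sum of these 4 values is greater than 2, I want to assign the index of that row to a new vector, which will then be used to remove those rows from the data frame.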
myfun <- function(x) {
  # Should return TRUE when row x contains more than 2 different letters out of A, B, C and D.
  # The number of times each letter occurs in a given row does not matter.
  # Each row should contain at most 2 of the 4 letters; which 2 does not matter.
  n.letters <- something  # placeholder: number of distinct letters found in x (the part I am missing)
  n.letters > 2
}
row.indexes = which(apply(df.sample, 1, myfun))  # vector of row indexes that contain more than 2 of the 4 letters
new.df.sample = df.sample[-row.indexes, ]  # new data frame excluding rows containing more than 2 of the 4 letters
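In df.sample above, rows 2 and 3 contain more than 2 of those 4 letters and should therefore be indexed for removal. After running df.sample through the function and removing the rows in row.indexes, my new.df.sample data frame should look like this: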
  V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
1  A  B  A A/B  B  B  B  B      B
4  A  B  A   A  B  B  B  B  B   B
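I have tried to think of this as a logical statement for each of the 4 letters: assign a 0 or 1 to each letter, sum them up, and then identify which rows sum to more than 2. For example, I thought I could try 'grep()', convert the result to a logical for each letter, convert that to 0 or 1, and sum. This seemed too lengthy and did not work the way I tried it. Any ideas?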
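The idea described above (one logical flag per letter, summed per row) can be sketched as follows. This is only a sketch, assuming that a cell counts as containing a letter whenever the letter appears anywhere in it. The names target.letters and count.letters are made up for illustration. Logical indexing is used instead of negative indexes so that the result is still correct when no row needs removing.

```r
# Rebuild the sample data from the question as plain character columns.
df.sample <- as.data.frame(rbind(
  c("A", "B", "A", "A/B", "B", "B", "B", "B", "",  "B"),
  c("A", "B", "C", "A",   "B", "",  "",  "B", "",  "B"),
  c("A", "B", "D", "D",   "B", "B", "B", "B", "",  "B"),
  c("A", "B", "A", "A",   "B", "B", "B", "B", "B", "B")
), stringsAsFactors = FALSE)

# Assumption: these are the only letters of interest.
target.letters <- c("A", "B", "C", "D")

# Hypothetical helper: number of distinct target letters found anywhere in one row.
count.letters <- function(x) {
  # One TRUE/FALSE per letter: does any cell of the row contain it?
  present <- vapply(target.letters,
                    function(l) any(grepl(l, x, fixed = TRUE)),
                    logical(1))
  sum(present)  # TRUE counts as 1, FALSE as 0
}

n.letters <- apply(df.sample, 1, count.letters)  # one count per row
keep <- n.letters <= 2                           # rows with at most 2 of the 4 letters
new.df.sample <- df.sample[keep, , drop = FALSE]
new.df.sample
```

On the sample data this keeps rows 1 and 4. Note that df.sample[-row.indexes, ] returns zero rows when row.indexes is empty, which is why the sketch filters with a logical vector.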
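How do you want to handle 'A/B'? –
For A/B, ignore the fact that it is "A/B" and only check whether the value contains A, B, C or D. The values within the cells do not have to match exactly; they only have to contain the value I am looking for. For example, if the A/B in row 1 were actually A/C, that row would be indexed for removal, but because it is A/B, it stays. – SC2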
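To illustrate the A/B versus A/C case from this comment, here is a small sketch that splits every cell into single characters and intersects them with the four letters. It assumes each letter of interest is exactly one character long; letters.in.row, row.ab and row.ac are hypothetical names used only for this example.

```r
# Assumption: these are the only letters of interest, each a single character.
target.letters <- c("A", "B", "C", "D")

# Hypothetical helper: distinct target letters in a row,
# found by splitting every cell into single characters.
letters.in.row <- function(x) {
  chars <- unlist(strsplit(x, "", fixed = TRUE))
  intersect(target.letters, chars)
}

row.ab <- c("A", "B", "A", "A/B", "B", "B", "B", "B", "", "B")  # row 1 from the question
row.ac <- replace(row.ab, 4, "A/C")                              # same row with A/B swapped for A/C

letters.in.row(row.ab)              # "A" "B"
letters.in.row(row.ac)              # "A" "B" "C"
length(letters.in.row(row.ab)) > 2  # FALSE, so the row stays
length(letters.in.row(row.ac)) > 2  # TRUE, so the row would be removed
```

The "/" separator and empty cells are simply ignored because they never match any of the four letters.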