2017-04-17

How do I remove an NA pattern from a string?

I have an R data frame column with the following text:

ClientID   Recom 
ABC    1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA 
DEF    1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA 
WER    1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA 

I want to remove the NA entries from the pattern above. The desired data frame would be:

ClientID   Recom 
ABC    1:Teck|Scrip:ABC|Call:Buy|| 
DEF    1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy|| 
WER    1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy|| 

I am using the following gsub in R, but it does not seem to work.

df$Recom <- gsub("\\s*[|]+\\NA\\s+.*", "", df$Recom) 

What should I do?

Answers


The way your strings are set up, everything after the first NA entry also appears to be NA. If that is the case, then:

gsub('[0-9]+:NA.*', '', df$Recom) 
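
As a quick check, the sketch below applies that pattern to two strings and compares it with a stricter variant. The vector `recom` is made up for illustration, its second element (an NA entry in the middle of the string) is hypothetical, and the stricter pattern is only an assumption about what you might want, not part of the suggestion above.

```r
# Sample strings; the second one is hypothetical and has an NA entry in the middle.
recom <- c(
  "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
  "1:CG|Scrip:WERT|Call:Buy||2:NA|Scrip:NA|Call:NA||3:IT|Scrip:QAS|Call:Buy"
)

# Pattern from the answer: drops everything from the first "<digits>:NA" onwards.
trimmed <- gsub("[0-9]+:NA.*", "", recom)

# Stricter variant (assumption): remove only entries that are NA in all three
# fields, together with the "||" separator that follows them, if there is one.
strict <- gsub("[0-9]+:NA\\|Scrip:NA\\|Call:NA(\\|\\|)?", "", recom)

print(trimmed)
print(strict)
```

With the first pattern the hypothetical second string loses its valid third entry, so it is only safe when the NA entries always come last, as they do in the question's data.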

You can also use strsplit and grepl:

sapply(strsplit(df$Recom, '\\|\\|'), function(i)paste(i[!grepl('NA', i)], collapse = '||')) 
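
To see what the strsplit/grepl version returns, here is a small sketch on two of the sample strings. The helper name `drop_na_entries` is hypothetical, and re-appending the trailing `||` is an assumption based on the desired output shown in the question.

```r
recom <- c(
  "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
  "1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA"
)

# Hypothetical helper: split on the literal "||", drop the pieces that
# contain "NA", and glue the remaining pieces back together.
drop_na_entries <- function(x) {
  pieces <- strsplit(x, "||", fixed = TRUE)
  vapply(pieces,
         function(p) paste(p[!grepl("NA", p, fixed = TRUE)], collapse = "||"),
         character(1))
}

cleaned <- drop_na_entries(recom)

# The cleaned strings have no trailing "||"; append it if the exact format
# from the question is needed (assumption).
with_suffix <- paste0(cleaned, "||")

print(cleaned)
print(with_suffix)
```

Note that matching on the bare string "NA" drops any piece that contains those two letters anywhere, so a Scrip code that happens to include "NA" would be removed as well.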
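# Alternative: split each string on the literal '||' and keep only the pieces
# that do not contain 'NA'. Recom becomes a list column (one character vector per row).
df$Recom <- lapply(strsplit(df$Recom, split = '||', fixed = TRUE),
                   grep,
                   pattern = 'NA',
                   invert = TRUE,
                   value = TRUE)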

df 
# ClientID Recom 
# 1  ABC 1:Teck|Scrip:ABC|Call:Buy 
# 2  DEF 1:CG|Scrip:WERT|Call:Buy, 2:CDGS|Scrip:QWS|Call:Buy, 3:IT|Scrip:QAS|Call:Buy 
# 3  WER 1:CDGS|Scrip:WERT|Call:Sell, 2:IT|Scrip:QWS|Call:Buy, 3:Industrials|Scrip:QAS|Call:Buy 
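
One thing to keep in mind with the lapply approach is that `Recom` ends up as a list column rather than a character column. A minimal sketch, assuming a cut-down two-row data frame (called `df2` here so that it does not clash with `df`), of how to inspect the result and collapse it back to plain strings:

```r
# Cut-down, illustrative copy of the data (two rows only).
df2 <- data.frame(
  ClientID = c("ABC", "DEF"),
  Recom = c("1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
            "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:NA|Scrip:NA|Call:NA"),
  stringsAsFactors = FALSE
)

# Same idea as above: keep only the pieces that do not contain "NA".
df2$Recom <- lapply(strsplit(df2$Recom, split = "||", fixed = TRUE),
                    grep, pattern = "NA", invert = TRUE, value = TRUE)

# Recom is now a list; each element holds the entries kept for that row.
was_list <- is.list(df2$Recom)
kept <- lengths(df2$Recom)

# Collapse each element back into a single "||"-separated string.
df2$Recom <- vapply(df2$Recom, paste, character(1), collapse = "||")
print(df2)
```

Whether to keep the list column or collapse it depends on what you do next: the list is convenient for further per-entry processing, while plain strings are easier to write out to a file.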

Data:

df <- structure(list(ClientID = c("ABC", "DEF", "WER"), 
        Recom = c("1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA", 
           "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA", 
           "1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA" 
        )), 
       .Names = c("ClientID", "Recom"), 
       row.names = c(NA, -3L), 
       class = "data.frame") 

It looks like you have several kinds of information embedded in the Recom column. To clean up your data, you could also do this:

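library(splitstackshape) # will automatically also load the 'data.table' package 
library(data.table)      # attached explicitly in case your version does not, so := and dcast() are available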
dt <- cSplit(
     cSplit(
      cSplit(df, 'Recom', sep = '||', 'long'), 
      'Recom', sep = '|', 'long' 
     ), 
     'Recom', sep = ':', 'wide' 
    )[Recom_2 != 'NA' 
     ][, num := cumsum(grepl('\\d+', Recom_1)), ClientID 
      ][grepl('\\d+', Recom_1), Recom_1 := 'kind'] 

dcast(dt, ClientID + num ~ Recom_1, value.var = 'Recom_2') 

This gives:

   ClientID num Call Scrip        kind
1:      ABC   1  Buy   ABC        Teck
2:      DEF   1  Buy  WERT          CG
3:      DEF   2  Buy   QWS        CDGS
4:      DEF   3  Buy   QAS          IT
5:      WER   1 Sell  WERT        CDGS
6:      WER   2  Buy   QWS          IT
7:      WER   3  Buy   QAS Industrials
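
If you would rather not depend on extra packages, roughly the same tidy layout can be sketched in base R. This is not the splitstackshape approach itself, only an illustration: the names `df_demo` and `tidy_recom` are hypothetical, the code assumes every entry has exactly three `|`-separated fields in the order shown in the question, and `num` is read from the number at the start of each entry rather than computed with `cumsum`.

```r
df_demo <- data.frame(
  ClientID = c("ABC", "DEF", "WER"),
  Recom = c(
    "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
    "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA",
    "1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one row per non-NA recommendation.
tidy_recom <- function(d) {
  # Split each string into entries, remembering which client each entry belongs to.
  entries <- strsplit(d$Recom, "||", fixed = TRUE)
  client  <- rep(d$ClientID, lengths(entries))

  # Split each entry into its three "key:value" fields.
  fields <- strsplit(unlist(entries), "|", fixed = TRUE)
  field  <- function(i) vapply(fields, `[`, character(1), i)
  value  <- function(x) sub("^[^:]*:", "", x)   # text after the first ":"

  out <- data.frame(
    ClientID = client,
    num      = as.integer(sub(":.*$", "", field(1))),  # leading entry number
    kind     = value(field(1)),
    Scrip    = value(field(2)),
    Call     = value(field(3)),
    stringsAsFactors = FALSE
  )

  # Drop the placeholder entries whose kind is the literal string "NA".
  out <- out[out$kind != "NA", ]
  rownames(out) <- NULL
  out
}

tidy <- tidy_recom(df_demo)
print(tidy)
```

The columns come out in a different order than in the dcast result, and the sketch does no validation of malformed entries, so treat it as a starting point rather than a drop-in replacement.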