2016-03-29 47 views
0

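I have a data frame with a factor variable representing vital signs. It has 50 levels, but many of them are duplicates. For example, "respiratory rate" may be coded as "Resp Rate" or "RR", and so on. I want to group all the respiratory rates into a single level, and do the same for the other vital signs. I tried the approach below. Is there a better way?

Bin character variables in R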

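Sign_desc <- c("Resp rate:", "Respiratory rate", "Blood pressure panel",
               "Systolic blood pressure", "Systolic blood pressure:",
               "Diastolic blood pressure", "Diastolic blood pressure:", "resp rate")

# Mixing numbers with "80/120" coerces the whole vector to character
Sign_Value <- c(10, 12, "80/120", 120, 120, 80, 80, 15)
# stringsAsFactors = TRUE keeps Sign_desc a factor on R >= 4.0 as well
Vital_Sign <- data.frame(Sign_desc, Sign_Value, stringsAsFactors = TRUE)

# Attempt: overwrite the values one spelling at a time. On a factor this
# yields NA with a warning, because "RR" is not an existing level.
Vital_Sign$Sign_desc[Vital_Sign$Sign_desc == "Respiratory rate"] <- "RR"
Vital_Sign$Sign_desc[Vital_Sign$Sign_desc == "Resp rate:"] <- "RR"
Vital_Sign$Sign_desc[Vital_Sign$Sign_desc == "resp rate"] <- "RR"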
+3

There is no magic function for this; please make your example reproducible – rawr

+0

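`grep`/`grepl`, possibly. Assigning directly to the factor levels rather than to the values is probably faster, but be careful with the order or you will mess up your data. – alistaire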

+0

@rawr Made the example reproducible. – user3897

Answers

2

You can use levels to access the factor levels directly, instead of modifying the data itself:

levels(Vital_Sign$Sign_desc)[levels(Vital_Sign$Sign_desc) == "Respiratory rate"] <- "RR"
levels(Vital_Sign$Sign_desc)[levels(Vital_Sign$Sign_desc) == "Resp rate:"] <- "RR"
levels(Vital_Sign$Sign_desc)[levels(Vital_Sign$Sign_desc) == "resp rate"] <- "RR"

To do it all at once:

levels(Vital_Sign$Sign_desc)[levels(Vital_Sign$Sign_desc) %in% c("Respiratory rate", "Resp rate:", "resp rate")] <- "RR"
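
With 50 levels, listing every spelling by hand gets tedious. As a rough sketch of the `grepl` idea from the comments, the levels can be matched against one regular expression per target label. The `patterns` vector below is hypothetical and tuned only to this sample data; real data would need its own patterns, and it is worth checking which levels each one catches before assigning.

```r
# Rebuild the example as a factor (stringsAsFactors = TRUE is needed on R >= 4.0)
Sign_desc <- c("Resp rate:", "Respiratory rate", "Blood pressure panel",
               "Systolic blood pressure", "Systolic blood pressure:",
               "Diastolic blood pressure", "Diastolic blood pressure:", "resp rate")
Vital_Sign <- data.frame(Sign_desc, stringsAsFactors = TRUE)

# Hypothetical patterns: one regular expression per target label
patterns <- c(RR = "^resp", BP = "blood pressure")

lv <- levels(Vital_Sign$Sign_desc)
for (new_name in names(patterns)) {
  # Case-insensitive match against the level names, not the data values
  lv[grepl(patterns[[new_name]], lv, ignore.case = TRUE)] <- new_name
}

# Assigning a vector with repeated names merges those levels
levels(Vital_Sign$Sign_desc) <- lv
table(Vital_Sign$Sign_desc)
```

Because only the levels are rewritten, the underlying integer codes are remapped consistently and the row order of the data is never touched.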
1

A more automated, if less obvious, approach is to use string distances.

Sign_desc <- c("Resp rate:","Respiratory rate","Blood pressure panel", 
       "Systolic blood pressure", "Systolic blood pressure:", 
       "Diastolic blood pressure","Diastolic blood pressure:","resp rate") 

ad <- adist(Sign_desc) 
rownames(ad) <- Sign_desc 

hc <- hclust(as.dist(ad)) 
plot(hc) 
rect.hclust(hc, 3) 


Based on the dendrogram, 3 groups looks appropriate, so you can then use cutree to see which strings fall into which group:

(ct <- cutree(hc, 3)) 
#                Resp rate:          Respiratory rate      Blood pressure panel
#                         1                         1                         2
#   Systolic blood pressure  Systolic blood pressure:  Diastolic blood pressure
#                         3                         3                         3
# Diastolic blood pressure:                 resp rate
#                         3                         1

You can also use these groups to assign the new names in order. From the output above, RR corresponds to the 1s, and BP to the 2s and 3s:

## new names corresponding to the groups above 
nn <- c('RR', 'BP', 'BP') 

cbind(old = Sign_desc, new = nn[ct]) 
#      old                         new
# [1,] "Resp rate:"                "RR"
# [2,] "Respiratory rate"          "RR"
# [3,] "Blood pressure panel"      "BP"
# [4,] "Systolic blood pressure"   "BP"
# [5,] "Systolic blood pressure:"  "BP"
# [6,] "Diastolic blood pressure"  "BP"
# [7,] "Diastolic blood pressure:" "BP"
# [8,] "resp rate"                 "RR"

Here is all of the code used:

Sign_desc <- c("Resp rate:","Respiratory rate","Blood pressure panel","Systolic blood pressure", "Systolic blood pressure:","Diastolic blood pressure","Diastolic blood pressure:","resp rate") 
ad <- adist(Sign_desc) 
rownames(ad) <- Sign_desc 
hc <- hclust(as.dist(ad)) 
plot(hc) 
rect.hclust(hc, 3) 
(ct <- cutree(hc, 3)) 
nn <- c('RR', 'BP', 'BP') 
cbind(old = Sign_desc, new = nn[ct])
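
The code above stops at a lookup table. One possible way to carry the result back into the factor from the question is sketched below. It assumes the data frame is built with `stringsAsFactors = TRUE`; the `lookup` name is only illustrative, and the group-to-name vector `nn` still has to be chosen by inspecting the clusters.

```r
Sign_desc <- c("Resp rate:", "Respiratory rate", "Blood pressure panel",
               "Systolic blood pressure", "Systolic blood pressure:",
               "Diastolic blood pressure", "Diastolic blood pressure:", "resp rate")
Vital_Sign <- data.frame(Sign_desc, stringsAsFactors = TRUE)

# Cluster the distinct labels by edit distance, as in the answer above
hc <- hclust(as.dist(adist(Sign_desc)))
ct <- cutree(hc, 3)

# New names per cluster number, chosen after looking at the cutree output
nn <- c('RR', 'BP', 'BP')

# Illustrative lookup: old label -> new label
lookup <- setNames(nn[ct], Sign_desc)

# Rename via the levels so the data values themselves are never edited;
# repeated new names collapse into a single level
levels(Vital_Sign$Sign_desc) <- unname(lookup[levels(Vital_Sign$Sign_desc)])
table(Vital_Sign$Sign_desc)
```

Indexing `lookup` by the current level names avoids relying on the order in which the levels happen to be sorted, which is the pitfall mentioned in the comments.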