2017-07-24 154 views

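How to replace misspelled words in a table with the correct values using R

I have a table of strings in which some values are misspelled. As an example: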

table$Status returns these values:

"alive" "sic" "alive" "sick" "alive" "si" "alive" "ali" "alv" 
"dead" "alive" "alive" "alive" "al" "dead" "dead" "de" "dead" 
"dead" "dea" "dead" "al" "dead" "de" "al" "de" "sick" 
"dead" "alive" 

I would like to end up with only alive, sick and dead, as in the following example:

"alive" "sick" "alive" "sick" "alive" "sick" "alive" "alive" "alive" 
"dead" "alive" "alive" "alive" "alive" "dead" "dead" "dead" "dead" 
"dead" "dead" "dead" "alive" "dead" "dead" "alive" "dead" "sick" 
"dead" "alive" 

I know the RecordLinkage package has this function, which scores how similar two strings are:

levenshteinSim("al", "alive") 

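So I could compare each value against all the others and keep the best similarity. I also know that table(table$Status) gives me the counts of each value, and the most frequent values should be the correct ones.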

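But here is my problem: how can I compare all the values with each other and replace them in my table? If someone knows a simple way to do this, it would be very helpful.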

Answers

library(data.table)

table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))
# Prefix match: "al..." becomes "alive", "si..." becomes "sick", anything else becomes "dead"
table[, Status2 := ifelse(Status %like% "^al", "alive",
                   ifelse(Status %like% "^si", "sick", "dead"))]

UPDATE

A more general solution:

library(data.table)

table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))

correct_values <- c("alive", "sick", "dead")
for (i in seq_len(nrow(table))) {
    string <- table[i, Status]
    best <- 0
    to_replace <- string  # keep the original value if no characters are shared
    for (j in correct_values) {
        # number of distinct characters shared by string and candidate j
        similarity <- length(Reduce(intersect, strsplit(c(string, j), split = "")))
        if (similarity > best) {
            best <- similarity
            to_replace <- j
        }
    }
    set(table, i, "Status", to_replace)  # update the row in place
}

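Here I assume that you know which values are the correct ones (you enter them by hand in correct_values). Each value in the Status column is replaced by the value in correct_values with which it shares the largest number of distinct characters.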
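The shared-character count ignores character order and repeated letters. As an alternative sketch (not the method used in this answer), base R's adist() from the utils package computes the Levenshtein edit distance, which can be turned into a similarity by dividing by the length of the longer string. The helper name closest_value is hypothetical, and correct_values is still assumed to be supplied by hand.

```r
# Sketch: map each value to the most similar entry of correct_values.
# closest_value is a hypothetical helper name, not part of any package.
closest_value <- function(x, correct_values) {
  # Levenshtein distances: one row per element of x, one column per candidate
  d <- utils::adist(x, correct_values)
  # Length of the longer string for every (x, candidate) pair
  longest <- outer(nchar(x), nchar(correct_values), pmax)
  # Normalised similarity: 1 means identical, 0 means nothing in common
  sim <- 1 - d / longest
  # Pick the best candidate per row (ties resolve to the first candidate)
  correct_values[apply(sim, 1, which.max)]
}

status <- c("alive", "sic", "alive", "sick", "alive", "si", "de", "al")
correct_values <- c("alive", "sick", "dead")
fixed <- closest_value(status, correct_values)
print(fixed)
```

Normalising matters here: by raw edit distance "al" is equally far from "alive" and "dead" (three edits each), while the normalised score prefers "alive".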

I hope it helps!


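This works, but it is very specific to my example. What happens when I have a table with 10000 values? How would I know which words are the misspelled ones?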
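One possible way to approach a larger table, sketched under assumptions: tabulate the distinct values first, choose the correct spellings by inspection, and then compute similarities only for the distinct misspellings rather than for every row. Frequency alone may not identify the correct spellings: in the question's data "al" and "de" each occur three times while "sick" occurs only twice. The data below is made up for illustration, and the names status, correct_values and lookup are hypothetical.

```r
# Illustrative data: a long column with only a handful of distinct values.
set.seed(1)
status <- sample(c("alive", "dead", "sick", "al", "de", "sic"),
                 10000, replace = TRUE)

# Step 1: inspect the distinct values and their counts to choose the
# correct spellings by hand (frequency is a hint, not a guarantee).
counts <- sort(table(status), decreasing = TRUE)
print(counts)
correct_values <- c("alive", "sick", "dead")  # assumed to be chosen manually

# Step 2: every distinct value not in correct_values is treated as misspelled.
misspelled <- setdiff(names(counts), correct_values)

# Step 3: build a lookup for the distinct misspellings only, using adist()
# (Levenshtein distance, base R) normalised by the longer string's length.
d <- utils::adist(misspelled, correct_values)
sim <- 1 - d / outer(nchar(misspelled), nchar(correct_values), pmax)
lookup <- setNames(correct_values[apply(sim, 1, which.max)], misspelled)
print(lookup)

# Step 4: replace through the lookup; correct values are left untouched.
fixed <- unname(ifelse(status %in% misspelled, lookup[status], status))
```

The cost of the similarity step then depends on the number of distinct values, not on the number of rows, and the printed lookup can be reviewed before it is applied.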


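@quant I would suggest using 'dplyr::case_when' instead of nested 'ifelse'. @ProgrammerMan If it is not that specific, there is no way to determine what 'al' means. "alive" or "all"? Maybe 'ale'? Of course you should use fuzzy matching on the leading characters, but you still have to provide the full words as patterns to compare against.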
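The ambiguity described in this comment can be made visible rather than silently resolved. A minimal sketch, assuming base R's adist() (Levenshtein distance) as the fuzzy-matching measure: when two or more candidates are equally close, the value is returned as NA so it can be reviewed by hand. The function name match_or_na is hypothetical.

```r
# Sketch: return the closest candidate, or NA when the best match is tied.
# match_or_na is a hypothetical helper name.
match_or_na <- function(x, candidates) {
  d <- utils::adist(x, candidates)  # rows: x, columns: candidates
  vapply(seq_along(x), function(i) {
    best <- which(d[i, ] == min(d[i, ]))
    # Exactly one closest candidate: accept it; otherwise flag as ambiguous
    if (length(best) == 1) candidates[best] else NA_character_
  }, character(1))
}

candidates <- c("alive", "all", "ale", "sick")
result <- match_or_na(c("al", "sic"), candidates)
print(result)
```

Here "al" is one edit away from both "all" and "ale", so it comes back as NA, while "sic" has a single closest candidate, "sick".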


@quant Thank you very much, this works!