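I need to compare two names to see whether one is a nickname of the other. The data frame has two columns of names. I want to do this in R without a for loop (which relies on temporary variables).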
Names <- data.frame(In_Name = c("Gary",'John','James','William','Bill','Paul','Tom','Annie','Bella','Sue'),
Match_Name = c('Garry','Jon','Jimmy','Paul','William','Pablo','Thomas','Anne','Belle','Susan'),stringsAsFactors = F)
Names[] <- lapply(Names, toupper)
Names$Match <- 0
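I also have a nickname table in which each row holds one set of equivalent nicknames. A name can appear in more than one row of the table (as "Bella" does below).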
NickName_Table <- data.frame(Names = c('Garrett,Garret,Gary,Garry'
,'Ian,John,Johnie,Johnnie,Johnny,Jon'
,'Jae,James,Jamey,Jay,Jaymes,Jem,Jemmy,Jim,Jimi,Jimmie,Jimmy'
,'Bill,Billie,Billy,Wil,Will,William,Willie,Willy'
,'Paul,Pauly,Paulie'
,'Maas,Thom,Thomas,Tom,Tomas,Tommie,Tommy'
,'Ann,Anna,Anne,Annette,Annie,Nan,Nancy,Nanette,Nannie,Nanny'
,'Bella,Belle,Ibbie,Issy,Izzy,Sabella'
,'Isabella,Isabelle,Bella,Belle'
,'Sue,Sukie,Susan,Susann,Susanna,Suzie'))
NickName_Table[] <- lapply(NickName_Table, toupper)
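Because a name can sit in several rows, it may help to reshape the table into a long form with one row per (group, name) pair, where the group is simply the row index of the original table. The following is only a sketch on a shortened, hypothetical copy of the table; `nick`, `parts`, and `nick_long` are illustrative names that do not appear in my code.

```r
# Shortened, hypothetical copy of the nickname table (already upper-cased)
nick <- c("GARRETT,GARRET,GARY,GARRY",
          "BELLA,BELLE,IBBIE,ISSY,IZZY,SABELLA",
          "ISABELLA,ISABELLE,BELLA,BELLE")

# Split every row into its individual names
parts <- strsplit(nick, ",", fixed = TRUE)

# One row per (group, name) pair; 'group' is the row index in 'nick'
nick_long <- data.frame(
  group = rep(seq_along(parts), lengths(parts)),
  name  = unlist(parts),
  stringsAsFactors = FALSE
)

# All groups that contain BELLA (it appears in rows 2 and 3)
bella_groups <- nick_long$group[nick_long$name == "BELLA"]
```

In this form, a name that belongs to several nickname sets just has several rows, so no temporary per-row variable is needed to remember which sets it was found in.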
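I want to avoid a for loop, but I can't work out how to do this with a function call, because I need to store the rows that were found in a temporary variable so that I can search those same rows for the second name. I need to do this for more than a million pairs of names, and the for loop is too slow. My current loop is: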
library(sqldf)
i=1
for (i in 1:nrow(Names))
{
  first_name <- Names[i,1]
  match_name <- Names[i,2]
  if(!is.na(first_name) & !is.na(match_name) & first_name != match_name)
  {
    if (nrow(subset(NickName_Table,grepl(first_name,NickName_Table$Names)))>= 1)
    {
      possibleMatch <- subset(NickName_Table,grepl(first_name,NickName_Table$Names))
      temp1 <- unique(as.data.frame(strsplit(gsub(" ", ",",Reduce(paste,unlist(possibleMatch))),","), stringsAsFactors = F))
      colnames(temp1) <- "Names"
      temp2 <- data.frame(match_name, stringsAsFactors = F)
      colnames(temp2) <- "Names_1"
      if(nrow(sqldf("Select a.* from temp1 a left join temp2 b on a.Names=b.Names_1 where b.Names_1 is not NULL"))>= 1)
      {
        Names[i,3] <- 1
      }
      else
        Names[i,3] <- 0
    }
    else
      Names[i,3] <- 0
  }
  else
    Names[i,3] <- 0
}
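For comparison, here is a rough sketch of how the same check might be expressed without a loop, using only base R: reshape the nickname table to long form, join it to the first name, and then test whether the second name occurs in the same group. It is written against shortened, hypothetical data, and `names_df`, `nick_long`, `key_table`, and `row_id` are illustrative names. One assumption differs from the loop above: names are compared as whole strings, whereas `grepl` also matches substrings (for example ANN inside ANNE).

```r
# Shortened, hypothetical inputs (already upper-cased)
names_df <- data.frame(
  In_Name    = c("GARY", "WILLIAM", "BILL", "BELLA", "PAUL"),
  Match_Name = c("GARRY", "PAUL", "WILLIAM", "BELLE", "PABLO"),
  stringsAsFactors = FALSE
)
nick <- c("GARRETT,GARRET,GARY,GARRY",
          "BILL,BILLIE,BILLY,WIL,WILL,WILLIAM,WILLIE,WILLY",
          "PAUL,PAULY,PAULIE",
          "BELLA,BELLE,IBBIE,ISSY,IZZY,SABELLA",
          "ISABELLA,ISABELLE,BELLA,BELLE")

# Long form: one row per (group, name); group = row index of 'nick'
parts <- strsplit(nick, ",", fixed = TRUE)
nick_long <- data.frame(group = rep(seq_along(parts), lengths(parts)),
                        name  = unlist(parts),
                        stringsAsFactors = FALSE)

# Every valid "group|name" combination, used as a lookup key
key_table <- paste(nick_long$group, nick_long$name, sep = "|")

# Attach to each pair every group that its In_Name belongs to
names_df$row_id <- seq_len(nrow(names_df))
cand <- merge(names_df, nick_long, by.x = "In_Name", by.y = "name")

# A pair is a hit if Match_Name occurs in one of those same groups
hit <- paste(cand$group, cand$Match_Name, sep = "|") %in% key_table
matched_rows <- unique(cand$row_id[hit])

# Same rule as the loop: identical names do not count as a nickname match
names_df$Match <- as.integer(names_df$row_id %in% matched_rows &
                             names_df$In_Name != names_df$Match_Name)
# names_df$Match is 1 0 1 1 0
```

The join replaces the temporary variable: instead of remembering the rows found for the first name, every candidate group travels along with the pair as an extra row. Whether this is fast enough for a million pairs would have to be measured.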
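Edit: I tried to write a function, but the problem is that the nickname table and the vectors of names being compared have different lengths, so a vectorized comparison does not seem to work.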
functiona <- function (inNames,MatchNames,NickName_Table1){
  if(!is.na(inNames) & !is.na(MatchNames) & inNames != MatchNames)
  {
    if (length(subset(NickName_Table1,grepl(inNames,NickName_Table1)))>= 1)
    {
      possibleMatch <- subset(NickName_Table1,grepl(inNames,NickName_Table1))
      temp1 <- unique(as.data.frame(strsplit(gsub(" ", ",",Reduce(paste,unlist(possibleMatch))),","), stringsAsFactors = F))
      colnames(temp1) <- "Names"
      temp2 <- data.frame(MatchNames, stringsAsFactors = F)
      colnames(temp2) <- "Names_1"
      if(nrow(sqldf("Select a.* from temp1 a left join temp2 b on a.Names=b.Names_1 where b.Names_1 is not NULL"))>= 1)
      {
        return <- 1
      }
      else
        return <- 0
    }
    else
      return <- 0
  }
  else
    return <- 0
}
c <- mapply(functiona,Names$In_Name,Names$Match_Name,NickName_Table$Names)
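A note on the length mismatch: `mapply` walks through all of its vector arguments in parallel, so passing the nickname table as a third vector pairs the i-th name pair with only the i-th table row. Arguments that should be handed over whole on every call can go in `MoreArgs`. The sketch below illustrates this with a simplified, hypothetical helper `is_nick` and shortened data; it is not the function above, and it compares whole names rather than substrings.

```r
# Shortened, hypothetical inputs (already upper-cased)
in_names    <- c("GARY", "BILL", "TOM")
match_names <- c("GARRY", "WILLIAM", "PAUL")
nick <- c("GARRETT,GARRET,GARY,GARRY",
          "BILL,BILLIE,BILLY,WIL,WILL,WILLIAM,WILLIE,WILLY")

# TRUE if both names occur together in at least one row of the table
is_nick <- function(a, b, table) {
  sets <- strsplit(table, ",", fixed = TRUE)
  any(vapply(sets, function(s) a %in% s && b %in% s, logical(1)))
}

# 'table' is passed whole to every call instead of being iterated over
res <- mapply(is_nick, in_names, match_names,
              MoreArgs = list(table = nick))
```

This removes the length problem, but `mapply` is still a loop under the hood, so on its own it may not be much faster than the explicit for loop on a million pairs.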
Please be specific about your question. Explain not only the problem, but also what you have tried and where you are stuck. Read this: http://stackoverflow.com/help/how-to-ask – crabbly