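R: Use plyr to perform fuzzy string matching between matching subsets of two data sources

Let's say I have a list of counties with varying amounts of spelling errors or other issues that differentiate them from the 2010 FIPS dataset (code to create the fips data frame is below), but the states in which the misspelled counties reside are entered correctly. Here's a sample of 21 random observations from my full dataset: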
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
              county state
1          Beauregard    LA
2             De Soto    LA
3              Dekalb    GA
4             Webster    LA
5        Saint Joseph    IN
6      West Feliciana    LA
7   Ketchikan Gateway    AK
8          Evangeline    LA
9       Richmond City    VA
10         Saint Mary    LA
11   Saint Louis City    MO
12             Mclean    KY
13              Union    LA
14          Bienville    LA
15     Covington City    VA
16  Martinsville City    VA
17          Claiborne    LA
18     King And Queen    VA
19             Mclean    ND
20             Mcminn    TN
21     Prince Georges    MD
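I've used adist to create a fuzzy string matching algorithm that matches around 80% of my counties to the county names in fips. However, sometimes it matches two counties with similar spellings that belong to different states (e.g., "Webster, LA" gets matched to "Webster, GA" rather than "Webster Parish, LA").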
# Distance from every county in tomatch (rows) to every county name in fips (columns)
distance <- adist(tomatch$county,
                  fips$countyname,
                  partial = TRUE)
# Smallest distance in each row
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for (i in seq_len(nrow(distance))) {
  # Column index of the first fips name at the minimum distance
  s2.i <- match(min.name[i], distance[i, ])
  s1.i <- i
  matchedcounties <- rbind(data.frame(s2.i = s2.i,
                                      s1.i = s1.i,
                                      s1name = tomatch[s1.i, ]$county,
                                      s2name = fips[s2.i, ]$countyname,
                                      adist = min.name[i]),
                           matchedcounties)
}
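Therefore, I want to restrict the fuzzy string matching so that each county is compared only with the correctly spelled county names from the same state.

My current algorithm builds one big matrix of Levenshtein distances between the two sources and then, for each county, selects the entry with the minimum distance.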
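
The cross-state mismatch seems to follow from how ties are resolved in that matrix. The sketch below uses a two-row toy stand-in for fips (`toy_fips` is hypothetical, not the real file) to show that with `partial = TRUE` the name "Webster" sits at the same distance from both "Webster" and "Webster Parish", so `match()` simply takes whichever row comes first.

```r
# Toy stand-in for fips (hypothetical rows, not the real Census file)
toy_fips <- data.frame(
  state      = c("GA", "LA"),
  countyname = c("Webster", "Webster Parish"),
  stringsAsFactors = FALSE
)

# With partial = TRUE, "Webster" is compared against substrings of each name
d <- adist("Webster", toy_fips$countyname, partial = TRUE)

# match() returns the first column at the minimum distance,
# so a tie is broken by row order and the state is never consulted
first_min <- match(min(d), d[1, ])
toy_fips[first_min, ]
```

Because the state column plays no part in the comparison, any tie between same-named counties is decided by row order alone.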
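
To solve my problem, I'm guessing I need to create a function that ddply can apply to each 'state' group, but I'm confused about how to tell ddply that the group value should be matched against the same state in another data frame. A dplyr solution, or a solution using any other package, would be appreciated as well.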
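
For concreteness, here is a minimal base R sketch of the per-state restriction described above. It is not a plyr or dplyr solution: `match_within_state` is a hypothetical helper name, and the toy data frames only mimic the `county`/`state` and `state`/`countyname` columns used in this question.

```r
# Hypothetical helper: compare each county only with fips rows from the same state
match_within_state <- function(tomatch, fips) {
  rows <- lapply(seq_len(nrow(tomatch)), function(i) {
    candidates <- fips[fips$state == tomatch$state[i], , drop = FALSE]
    if (nrow(candidates) == 0) {
      # No fips rows for this state: report the county as unmatched
      return(data.frame(county = tomatch$county[i], state = tomatch$state[i],
                        countyname = NA_character_, adist = NA_real_,
                        stringsAsFactors = FALSE))
    }
    # Same distance call as in the question, limited to one state's names
    d <- adist(tomatch$county[i], candidates$countyname, partial = TRUE)
    best <- which.min(d)
    data.frame(county = tomatch$county[i], state = tomatch$state[i],
               countyname = candidates$countyname[best], adist = d[best],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy data (hypothetical rows) to exercise the helper
demo_fips <- data.frame(
  state      = c("GA", "LA", "LA"),
  countyname = c("Webster", "Webster Parish", "West Feliciana Parish"),
  stringsAsFactors = FALSE
)
demo_tomatch <- data.frame(
  county = c("Webster", "West Feliciana"),
  state  = c("LA", "LA"),
  stringsAsFactors = FALSE
)
result <- match_within_state(demo_tomatch, demo_fips)
result
```

Looping over rows keeps the sketch short. Computing one distance matrix per state would avoid repeated subsetting on a large dataset, and ties within a state would still be broken by row order.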

Code to create the FIPS dataset:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
Your question would greatly benefit from a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – MrFlick