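I need to update a data frame based on values in two other data frames that are linked to the first one in a chain. The join conditions are complex.
The target DF is t_offices, which has four fields of interest here: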
       administrative_area_level_1 administrative_area_level_2 country       locality
1      Arizona                     Maricopa County             United States Phoenix
2      District of Columbia        <NA>                        United States Washington
3      <NA>                        <NA>                        India         <NA>
4      New York                    Albany County               United States Albany
5      Utrecht                     Nieuwegein                  Netherlands   Nieuwegein
6      Connecticut                 Fairfield County            United States Stamford
707    Illinois                    <NA>                        United States <NA>
4241   Illinois                    <NA>                        United States West Chicago
999998 Alabama                     <NA>                        United States Altoona
999999 Pennsylvania                <NA>                        United States Washington
For U.S. records, I need to replace the NA values in administrative_area_level_2 with the county. That value is in the DF t_places:
      state_ab place_name          county_name                  place_nameshort
1     AL       Abanda CDP          Chambers County              Abanda
2     AL       Abbeville city      Henry County                 Abbeville
3     AL       Adamsville city     Jefferson County             Adamsville
4     AL       Addison town        Winston County               Addison
5     AL       Akron town          Hale County                  Akron
6     AL       Alabaster city      Shelby County                Alabaster
12    AL       Altoona town        Blount County, Etowah County Altoona
4298  DC       Washington city     District of Columbia         Washington
7527  IL       West Chicago city   DuPage County                West Chicago
32611 PA       Washington township Armstrong County             Washington
32612 PA       Washington township Berks County                 Washington
place_nameshort is a truncated version of place_name without the designation (e.g. "city", "town", etc.).
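I join t_offices and t_places on state and locality to get the correct county. This can return multiple counties 1) because county_name may contain several comma-separated counties, and 2) because the truncated place_nameshort may match homonyms within the same state. I need only the cases where the county is unambiguous (a single county is returned).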
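To make the two sources of ambiguity concrete, here is a minimal sketch in base R. The data frame is a small made-up subset modelled on the sample rows in this question, and the helper names (key, multi_county, homonym, places_unambiguous) are hypothetical, not part of the original code.

```r
# Made-up subset of t_places, modelled on the sample rows (an assumption)
t_places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  place_nameshort = c("Altoona", "Washington", "West Chicago",
                      "Washington", "Washington"),
  stringsAsFactors = FALSE
)

# 1) county_name lists several comma-separated counties
multi_county <- grepl(",", t_places$county_name, fixed = TRUE)

# 2) homonyms: the same state_ab + place_nameshort occurs more than once
key <- paste(t_places$state_ab, t_places$place_nameshort, sep = "|")
homonym <- duplicated(key) | duplicated(key, fromLast = TRUE)

# keep only places that resolve to exactly one county
places_unambiguous <- t_places[!multi_county & !homonym, ]
print(places_unambiguous)
```

With this subset, only the DC and IL rows remain: Altoona is dropped for spanning two counties, and both Washington townships in PA are dropped as homonyms.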
Since t_places only contains state_ab, I need a third data frame, r_states, for state_name:
   state_ab state_name
1  AL       Alabama
2  AK       Alaska
3  AZ       Arizona
4  AR       Arkansas
5  CA       California
6  CO       Colorado
9  DC       District of Columbia
17 IL       Illinois
42 PA       Pennsylvania
By joining t_places with r_states on state_ab, I can get the state_name to match against t_offices$administrative_area_level_1.
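That join could look roughly like the following sketch, using base R merge(). Both data frames are made-up subsets modelled on the sample rows, and the name places_statename is just a suggestion.

```r
# Made-up subsets modelled on the sample rows (an assumption)
t_places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA"),
  county_name     = c("Henry County", "District of Columbia",
                      "DuPage County", "Berks County"),
  place_nameshort = c("Abbeville", "Washington", "West Chicago", "Washington"),
  stringsAsFactors = FALSE
)
r_states <- data.frame(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)

# left join on state_ab: every place keeps its row and gains state_name
places_statename <- merge(t_places, r_states, by = "state_ab", all.x = TRUE)
print(places_statename)
```

Using all.x = TRUE keeps places whose state abbreviation is missing from r_states (they get NA as state_name), so nothing is dropped silently.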
Here is my attempt. It is incomplete because it does not handle multiple counties caused by homonyms within a state, and it does not work.
no_county <- (!is.na(t_offices$country)
              & t_offices$country == "United States"
              & !is.na(t_offices$administrative_area_level_1)
              & is.na(t_offices$administrative_area_level_2)
              & !is.na(t_offices$locality))
t_offices$administrative_area_level_2[no_county] <-
  t_places$county_name[!grepl(",", t_places$county_name)
                       & match(t_places$place_nameshort, t_offices$locality[no_county])
                       & match(t_places$state_ab,
                               r_states$state_ab[match(r_states$state_name,
                                                       t_offices$administrative_area_level_1[no_county])])]
Edit: following @r2evans's comment, here is my new attempt at the code, which still does not work:
# split multiple counties into columns
library(splitstackshape)
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = F, type.convert = F)
# merge state names into places
places_statename <- merge(t_places, r_states[,2:3])
# define condition to select t_offices records in U.S. with state and place but no county
no_county <- (
  # country is U.S.
  !is.na(t_offices$country)
  & t_offices$country == "United States"
  # with state
  & !is.na(t_offices$administrative_area_level_1)
  # blank county
  & is.na(t_offices$administrative_area_level_2)
  # with place
  & !is.na(t_offices$locality))
# update blank counties
t_offices$administrative_area_level_2[no_county] <-
  # unambiguous counties
  places_statename$county_name_1[is.na(places_statename$county_name_2)
                                 # locality matches place
                                 & match(t_offices$locality[no_county], places_statename$place_nameshort)
                                 # administrative_area_level_1 matches state
                                 & match(t_offices$administrative_area_level_1[no_county], places_statename$state_name)]
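I suggest you restructure your data in favor of direct joins (via 'merge' or 'dplyr::left_join' and friends). It makes everything easier, more robust, and simpler to work with and troubleshoot. For starters: if 'county_name' can contain multiple comma-separated values, you can split them with 'tidyr::separate' and 'tidyr::gather' (so that joining is more intuitive/simple). It would also help to make the question reproducible; right now, we don't have representative data that meets all of your requirements. – r2evans
@r2evans Thanks for your suggestion! I have added (real and made-up) sample data to make the question reproducible. Is your first suggestion that I should merge t_places and r_states and melt the county names into one table, then join that table with t_offices? – syre
@r2evans Not melting, but converting to multiple columns – syre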
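For what it's worth, the join-based idea from the comments could be sketched as follows in base R. This is only one possible approach under stated assumptions, not a confirmed solution: the three data frames are made-up subsets modelled on the sample rows, and the helper names (places, key, lookup, hit, found) are hypothetical. Instead of combining match() results with &, it builds a single "state|place" key and uses match() once as a lookup.

```r
# Made-up subsets modelled on the sample rows (an assumption)
t_offices <- data.frame(
  administrative_area_level_1 = c("Arizona", "District of Columbia", NA, "Illinois",
                                  "Illinois", "Alabama", "Pennsylvania"),
  administrative_area_level_2 = c("Maricopa County", NA, NA, NA, NA, NA, NA),
  country  = c("United States", "United States", "India", "United States",
               "United States", "United States", "United States"),
  locality = c("Phoenix", "Washington", NA, NA, "West Chicago", "Altoona", "Washington"),
  stringsAsFactors = FALSE
)
t_places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  place_nameshort = c("Altoona", "Washington", "West Chicago", "Washington", "Washington"),
  stringsAsFactors = FALSE
)
r_states <- data.frame(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)

# 1) add state_name to the places
places <- merge(t_places, r_states, by = "state_ab")

# 2) drop ambiguous places: multi-county entries and homonyms within a state
key <- paste(places$state_name, places$place_nameshort, sep = "|")
ambiguous <- grepl(",", places$county_name, fixed = TRUE) |
  duplicated(key) | duplicated(key, fromLast = TRUE)
lookup <- places[!ambiguous, ]
lookup_key <- key[!ambiguous]

# 3) U.S. offices with state and locality but no county
no_county <- !is.na(t_offices$country) &
  t_offices$country == "United States" &
  !is.na(t_offices$administrative_area_level_1) &
  is.na(t_offices$administrative_area_level_2) &
  !is.na(t_offices$locality)

# 4) look up each such office; match() gives NA when no unambiguous place exists
office_key <- paste(t_offices$administrative_area_level_1, t_offices$locality, sep = "|")
hit <- match(office_key[no_county], lookup_key)
found <- !is.na(hit)

# 5) update only the rows that found exactly one county
t_offices$administrative_area_level_2[which(no_county)[found]] <-
  lookup$county_name[hit[found]]
print(t_offices)
```

In this made-up data, the Washington, DC and West Chicago, IL offices get a county, while Altoona (two counties) and Washington, PA (homonyms) stay NA, which seems to be the behaviour asked for.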