2017-01-30
1

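I need to update a data frame based on values in two other data frames that are linked to the first one in a chain, and the join conditions are fairly complex.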

The target data frame t_offices has four fields of interest here:

       administrative_area_level_1 administrative_area_level_2       country     locality
1                          Arizona             Maricopa County United States      Phoenix
2             District of Columbia                        <NA> United States   Washington
3                             <NA>                        <NA>         India         <NA>
4                         New York               Albany County United States       Albany
5                          Utrecht                  Nieuwegein   Netherlands   Nieuwegein
6                      Connecticut            Fairfield County United States     Stamford
707                       Illinois                        <NA> United States         <NA>
4241                      Illinois                        <NA> United States West Chicago
999998                     Alabama                        <NA> United States      Altoona
999999                Pennsylvania                        <NA> United States   Washington

For U.S. records, I need to replace the NA values in administrative_area_level_2 with the county. That value is in the data frame t_places:

      state_ab          place_name                  county_name place_nameshort
1           AL          Abanda CDP              Chambers County          Abanda
2           AL      Abbeville city                 Henry County       Abbeville
3           AL     Adamsville city             Jefferson County      Adamsville
4           AL        Addison town               Winston County         Addison
5           AL          Akron town                  Hale County           Akron
6           AL      Alabaster city                Shelby County       Alabaster
12          AL        Altoona town Blount County, Etowah County         Altoona
4298        DC     Washington city         District of Columbia      Washington
7527        IL   West Chicago city                DuPage County    West Chicago
32611       PA Washington township             Armstrong County      Washington
32612       PA Washington township                 Berks County      Washington

place_nameshort is a truncated version of place_name, without the designation (e.g. "city", "town", etc.).

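I join t_offices to t_places on state and locality in order to get the correct county. This can return multiple counties 1) because county_name may contain several counties separated by commas, and 2) because the truncated place_nameshort may return homonyms within the same state. I need only the cases where the county is unambiguous (a single county is returned).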
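The two sources of ambiguity described above can be flagged separately before any join. The following is only a minimal base R sketch on a few made-up rows modelled on the t_places sample; the helper names multi_county, key, homonym and unambiguous are assumptions introduced for illustration.

```r
# Made-up rows modelled on the t_places sample (not the real data).
t_places <- data.frame(
  state_ab        = c("AL", "AL", "IL", "PA", "PA"),
  place_nameshort = c("Abbeville", "Altoona", "West Chicago",
                      "Washington", "Washington"),
  county_name     = c("Henry County", "Blount County, Etowah County",
                      "DuPage County", "Armstrong County", "Berks County"),
  stringsAsFactors = FALSE
)

# Case 1: one county_name value lists several counties.
multi_county <- grepl(",", t_places$county_name, fixed = TRUE)

# Case 2: the same short name occurs more than once within a state.
key <- paste(t_places$state_ab, t_places$place_nameshort, sep = "|")
homonym <- key %in% key[duplicated(key)]

# Keep only places that resolve to exactly one county.
unambiguous <- t_places[!multi_county & !homonym, ]
```

With these rows, only Abbeville (AL) and West Chicago (IL) remain: Altoona is dropped for listing two counties, and both Washington (PA) rows are dropped as homonyms.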

Since t_places contains only state_ab, I need a third data frame, r_states, to get state_name:

   state_ab           state_name
1        AL              Alabama
2        AK               Alaska
3        AZ              Arizona
4        AR             Arkansas
5        CA           California
6        CO             Colorado
9        DC District of Columbia
17       IL             Illinois
42       PA         Pennsylvania

By joining r_states to t_places on state_ab, I get a state_name that can be matched against t_offices$administrative_area_level_1.
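As an illustration of that intermediate join, here is a small base R sketch using merge() on made-up rows shaped like the samples. It assumes r_states holds just the two columns shown (the real table may have more), and places_statename is simply the name chosen for the result.

```r
# Made-up rows shaped like the t_places and r_states samples.
t_places <- data.frame(
  state_ab        = c("AL", "DC", "IL"),
  place_nameshort = c("Abbeville", "Washington", "West Chicago"),
  county_name     = c("Henry County", "District of Columbia", "DuPage County"),
  stringsAsFactors = FALSE
)
r_states <- data.frame(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)

# Inner join on the shared abbreviation; each place gains its full state name.
places_statename <- merge(t_places, r_states, by = "state_ab")
```

Naming the key with by = "state_ab" avoids relying on column positions, which matters if the real r_states has extra columns.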

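Here is my attempt. It is incomplete, because it does not handle the multiple counties caused by homonyms within a state, and it does not work: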

no_county <- (!is.na(t_offices$country) 
      & t_offices$country == "United States" 
      & !is.na(t_offices$administrative_area_level_1) 
      & is.na(t_offices$administrative_area_level_2) 
      & !is.na(t_offices$locality)) 

t_offices$administrative_area_level_2[no_county] <- 
    t_places$county_name[!grepl(",", t_places$county_name) 
         & match(t_places$place_nameshort, t_offices$locality[no_county]) 
         & match(t_places$state_ab, 
           r_states$state_ab[match(r_states$state_name, 
                 t_offices$administrative_area_level_1[no_county])])] 

Edit: following @r2evans's comment, here is my new attempt at the code, which still does not work:

# split multiple counties into columns 
library(splitstackshape) 
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = F, type.convert = F) 

# merge state names into places 
places_statename <- merge(t_places, r_states[,2:3]) 

# define condition to select t_offices records in U.S. with state and place but no county 
no_county <- (
    # country is U.S. 
    !is.na(t_offices$country) 
    & t_offices$country == "United States" 
    # with state 
    & !is.na(t_offices$administrative_area_level_1) 
    # blank county 
    & is.na(t_offices$administrative_area_level_2) 
    # with place 
    & !is.na(t_offices$locality)) 

# update blank counties 
t_offices$administrative_area_level_2[no_county] <- 
    # unambiguous counties 
    places_statename$county_name_1[is.na(places_statename$county_name_2) 
           # locality matches place 
           & match(t_offices$locality[no_county], places_statename$place_nameshort) 
           # administrative_area_level_1 matches state 
           & match(t_offices$administrative_area_level_1[no_county],places_statename$state_name)] 
+1

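I suggest you restructure your data in favor of direct joins (via `merge` or `dplyr::left_join` and friends). That makes everything easier, more robust, and easier to work with and troubleshoot. To start: if `county_name` can contain multiple comma-separated values, you can split them with `tidyr::separate` and `tidyr::gather` (so the join is more intuitive/simple). It would also help if the question were reproducible; right now, we have no representative data that meets all your requirements. – r2evans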

+0

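@r2evans Thanks for the suggestion! I have added (real and made-up) sample data to make the question reproducible. Is your first suggestion that I should merge t_places and r_states, melt the county names into one table, and then join that table with t_offices? – syre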

+0

@r2evans Not melted, but converted to multiple columns. – syre

Answer

0

Here is my long solution. There may well be a shorter, more elegant one.

# split multiple counties into columns (cSplit returns a data.table) 
library(splitstackshape) 
library(data.table) 
t_places <- cSplit(t_places, "county_name", sep = ", ", drop = FALSE, type.convert = FALSE) 
# subset original places with single county; select columns by name, not position 
places_singlecounty <- t_places[is.na(county_name_2), 
          .(state_ab, place_nameshort, county_name_1)] 
# subset truncated places with single county: keep (state, short name) pairs that occur only once 
places_singlecounty <- merge(places_singlecounty, 
          places_singlecounty[, .N, by = .(state_ab, place_nameshort)][N == 1, .(state_ab, place_nameshort)], 
          by = c("state_ab", "place_nameshort")) 
# merge state names into single-county truncated places 
places_statename <- merge(places_singlecounty, r_states[, c("state_ab", "state_name")], by = "state_ab") 

# define condition to select t_offices records in U.S. with state and place but no county 
no_county <- (
    # country is U.S. 
    !is.na(t_offices$country) 
    & t_offices$country == "United States" 
    # with state 
    & !is.na(t_offices$administrative_area_level_1) 
    # NA county 
    & is.na(t_offices$administrative_area_level_2) 
    # with place 
    & !is.na(t_offices$locality)) 

# update t_offices NA counties based on single-county truncated places 
setDT(t_offices) 
t_offices[no_county, administrative_area_level_2 := 
      places_statename[.(.SD), county_name_1, 
          on = c(state_name = "administrative_area_level_1", 
            place_nameshort = "locality")]]
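
For comparison, the same update can be sketched in base R with a composite key and match(), which keeps the row order of t_offices unchanged. This is a hedged variant, not the data.table method above: it runs on made-up rows modelled on the samples, it flags homonyms across all rows of t_places rather than only among single-county rows, and the names places, lookup, office_key and hit are assumptions introduced for illustration.

```r
# Made-up rows modelled on the three sample tables.
t_offices <- data.frame(
  administrative_area_level_1 = c("Arizona", "District of Columbia",
                                  "Illinois", "Alabama", "Pennsylvania"),
  administrative_area_level_2 = c("Maricopa County", NA, NA, NA, NA),
  country  = rep("United States", 5),
  locality = c("Phoenix", "Washington", "West Chicago", "Altoona", "Washington"),
  stringsAsFactors = FALSE
)
t_places <- data.frame(
  state_ab        = c("AL", "DC", "IL", "PA", "PA"),
  place_nameshort = c("Altoona", "Washington", "West Chicago",
                      "Washington", "Washington"),
  county_name     = c("Blount County, Etowah County", "District of Columbia",
                      "DuPage County", "Armstrong County", "Berks County"),
  stringsAsFactors = FALSE
)
r_states <- data.frame(
  state_ab   = c("AL", "DC", "IL", "PA"),
  state_name = c("Alabama", "District of Columbia", "Illinois", "Pennsylvania"),
  stringsAsFactors = FALSE
)

# Attach full state names, then build a "state|place" key.
places <- merge(t_places, r_states, by = "state_ab")
places$key <- paste(places$state_name, places$place_nameshort, sep = "|")

# Drop multi-county values and keys that occur more than once.
ambiguous <- grepl(",", places$county_name, fixed = TRUE) |
  places$key %in% places$key[duplicated(places$key)]
lookup <- places[!ambiguous, ]

# U.S. offices with a state and a locality but no county.
no_county <- !is.na(t_offices$country) &
  t_offices$country == "United States" &
  !is.na(t_offices$administrative_area_level_1) &
  is.na(t_offices$administrative_area_level_2) &
  !is.na(t_offices$locality)

# Look up each such office; offices without a match stay NA.
office_key <- paste(t_offices$administrative_area_level_1,
                    t_offices$locality, sep = "|")
hit <- match(office_key[no_county], lookup$key)
t_offices$administrative_area_level_2[no_county] <- lookup$county_name[hit]
```

On these rows, Washington (DC) and West Chicago (IL) are filled in, while Altoona (AL) and Washington (PA) stay NA because their county is ambiguous.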