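R: Create and assign duplicate records

I have a collection of media sources to which I must assign county names. For sources tied to a single county (e.g. a local newspaper) this is simple: I created a county-name variable with a switch function that assigns the county name based on the source name. Example: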

switchfun <- function(x) {
  switch(x, 'Morning Call' = 'Lehigh', 'Inquirer' = 'Philadelphia',
         'Daily Ledger' = 'Mercer', 'Null')
}

County.Name <- as.character(lapply(Source, switchfun))
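A vectorized alternative to the `switch` call above is a named character vector used as a lookup table. This is only a sketch: `county_lookup` and `lookup_county` are illustrative names that do not appear in the original post, the `Source` vector is made-up sample input, and unmatched sources come back as `NA` rather than `'Null'`.

```r
# Hypothetical lookup table: names are sources, values are counties
county_lookup <- c("Morning Call" = "Lehigh",
                   "Inquirer"     = "Philadelphia",
                   "Daily Ledger" = "Mercer")

# Illustrative helper: returns the county for each source, NA if unknown
lookup_county <- function(source) {
  # as.character() prevents a factor from being used as integer positions
  unname(county_lookup[as.character(source)])
}

# Made-up sample input; "NPR" has no entry in the lookup table
Source <- factor(c("Morning Call", "NPR", "Inquirer"))
County.Name <- lookup_county(Source)
```

Leaving unmatched sources as `NA` makes it easy to pick out the national sources later with `is.na()`.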

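But I also have sources (NPR, AP, etc.) that I want to assign to every county in my dataset. In effect, this means duplicating any record whose source is "national" and assigning one copy to each county in my dataset.

Current file layout (dput):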

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 1L, 6L 
), .Label = c("Associated Press", "Daily Ledger", "Herald Tribune", 
"Inquirer", "Morning Call", "NPR", "Yahoo News"), class = "factor"), 
County = structure(c(1L, 2L, 4L, 3L, NA, NA, NA), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), 
Score = c(3L, 10L, 4L, 8L, 1L, 3L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -7L 
))

In the current file, NPR, Associated Press, and Yahoo News have no associated county (NA).

Desired file layout (dput):

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 7L, 7L, 
7L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L), .Label = c("Associated Press", 
"Daily Ledger", "Herald Tribune", "Inquirer", "Morning Call", 
"NPR", "Yahoo News"), class = "factor"), County = structure(c(1L, 
2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), Score = c(3L, 
10L, 4L, 8L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -16L 
))

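In the desired layout, I have assigned each national source and its score to each of the four counties in the dataset. For example, Yahoo News and its score of 1 are duplicated four times and associated with Lehigh, Philadelphia, Montgomery, and Mercer counties. The Yahoo News record with an NA county disappears. In my actual dataset I have about 100 counties, so Yahoo News and its associated variables (e.g. Score, Date, Author; I have about 60 variables in total) would be duplicated 100 times. I also want the counties for these new "duplicate" records to go into the County.Name variable that I created with the switch function above. I do not want two county-name fields; I want all of the newly created counties to fall under County.Name.

Source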

2013-08-03 NiuBiBang

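It would be great if you could provide some sample data and show the desired result. –

I think you may be looking for 'merge', but it is hard to say without a better representation of the data. – Roland

Sorry, it was late and I was tired. Updated with more explanation and dput output for reproducibility. – NiuBiBang

If I understand you correctly, this might be one possibility: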

# a (minimal) data frame with all unique source-county combinations 
src_cnt <- data.frame(source = c("Morning Call", "AP", "AP", "AP"), county = c("Lehigh", "Lehigh", "Mercer", "Phila")) 

# a data frame with a unique score for each source 
src_score <- data.frame(source = c("Morning Call", "AP"), score = c(10, 3)) 

merge(src_cnt, src_score)

Edit following the updated question:

# Assuming your current data is named dd 
# select the national sources, i.e. the sources where County is missing 
src_national <- dd$Source[is.na(dd$County)]

# select unique counties 
counties <- unique(dd$County[!is.na(dd$County)]) 

# create all combinations of national sources and counties 
src_cnt <- expand.grid(Source = src_national, County = counties) 

# add score from current data to national sources 
src_cnt2 <- merge(src_cnt, dd[is.na(dd$County), c("Source", "Score")], by = "Source") 

# add national sources to local sources in dd 
dd2 <- rbind(dd[!is.na(dd$County), ], src_cnt2) 

# order by Source and County
# assuming desired data is named `desired` 
library(plyr) 
desired2 <- arrange(df = desired, Source, County) 
dd2 <- arrange(df = dd2, Source, County) 
all.equal(desired2, dd2)

For the last part of your question, you can just rbind the national sources in src_cnt to County.Name, or select the relevant variables from dd2.

Source
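For that last step, one way to carry every column (not just Score) and end up with a single County.Name field is a cross join of the national rows with the county list. The following is a minimal sketch under assumptions: the county column is already named `County.Name`, the `Date` column is a made-up stand-in for the roughly 60 other variables, and `dd`, `local`, `national`, `counties`, and `expanded` are illustrative objects built from invented sample data.

```r
# Made-up sample data: County.Name is NA for national sources
dd <- data.frame(
  Source      = c("Morning Call", "Daily Ledger", "NPR", "Yahoo News"),
  County.Name = c("Lehigh", "Mercer", NA, NA),
  Score       = c(3L, 10L, 6L, 1L),
  Date        = as.Date(c("2013-08-01", "2013-08-01",
                          "2013-08-02", "2013-08-02")),
  stringsAsFactors = FALSE
)

is_national <- is.na(dd$County.Name)
local       <- dd[!is_national, ]

# One-column data frame holding the distinct counties
counties <- data.frame(County.Name = unique(local$County.Name),
                       stringsAsFactors = FALSE)

# Drop the empty County.Name from the national rows, then cross join:
# merge() with by = NULL returns the Cartesian product of its inputs
national <- dd[is_national, setdiff(names(dd), "County.Name")]
expanded <- merge(national, counties, by = NULL)

# Restore the original column order and stack under the local rows
dd2 <- rbind(local, expanded[, names(dd)])
rownames(dd2) <- NULL
```

Because the national rows keep all of their other columns through the cross join, no per-variable merge is needed, and the rows with an `NA` county drop out of `dd2`.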

2013-08-03 09:49:49 Henrik

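I actually also have a unique ID that needs to be taken into account, so I modified it to: 'src_cnt <- expand.grid(Source = src_natl$Source, ID = src_natl$ID, County = counties) # where counties <- c('County1' ... 'County120')'; then 'src_cnt2 <- merge(src_cnt, src_natl, by = c("ID", "Source"))' filters 'src_cnt' down to only the ID and Source combinations that occur in the original dataset. – NiuBiBang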
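The ID-aware variant described in the comment above could also be sketched by crossing only `ID` with the counties, which avoids generating Source/ID pairs that never occurred. This assumes `ID` uniquely identifies each national record; the contents of `src_natl` and `counties` below are invented sample values, and `id_cnt` is an illustrative name.

```r
# Made-up national records; ID is assumed to be unique per record
src_natl <- data.frame(ID     = c(101L, 102L, 103L),
                       Source = c("NPR", "NPR", "AP"),
                       Score  = c(6L, 2L, 3L),
                       stringsAsFactors = FALSE)
counties <- c("Lehigh", "Mercer")

# One row per (ID, County) pair; ID alone identifies the record
id_cnt <- expand.grid(ID = src_natl$ID, County = counties,
                      stringsAsFactors = FALSE)

# Attach Source, Score, and any other columns by ID
src_cnt2 <- merge(id_cnt, src_natl, by = "ID")
```

Each record then appears exactly once per county, even when the same source has several records.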
