2013-08-03 73 views
0

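R: creating and assigning duplicate records

I have a set of media sources to which I must assign county names. For sources that belong to a single county (e.g. a local newspaper), this is straightforward: I create a county-name variable with a switch function that maps each source name to its county. Example: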

switchfun <- function(x) {
  switch(x,
         'Morning Call' = 'Lehigh',
         'Inquirer'     = 'Philadelphia',
         'Daily Ledger' = 'Mercer',
         'Null')
}

County.Name <- as.character(lapply(Source, switchfun))
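As an aside, the same one-to-one mapping can be written as a vectorized lookup on a named character vector, which avoids calling a function once per element. This is a minimal sketch, not the asker's code: `county_lookup` and the sample `Source` vector are hypothetical names and values used only for illustration.

```r
# Lookup table: names are sources, values are counties (pairs taken from the question).
county_lookup <- c("Morning Call" = "Lehigh",
                   "Inquirer"     = "Philadelphia",
                   "Daily Ledger" = "Mercer")

# Hypothetical input; as.character() guards against Source being a factor.
Source <- factor(c("Morning Call", "NPR", "Inquirer", "Daily Ledger"))

# Index the lookup table by name; sources without an entry come back as NA.
County.Name <- unname(county_lookup[as.character(Source)])

# Reproduce the 'Null' fallback used by switchfun.
County.Name[is.na(County.Name)] <- "Null"

County.Name
```

Converting to character first matters because indexing by a factor would use its integer codes rather than its labels.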

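However, I also have sources (NPR, AP, etc.) that I want to assign to every county in my dataset. In effect, this means duplicating any record whose source is "national" and assigning one copy to each county in my dataset.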

Current file layout (dput):

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 1L, 6L 
), .Label = c("Associated Press", "Daily Ledger", "Herald Tribune", 
"Inquirer", "Morning Call", "NPR", "Yahoo News"), class = "factor"), 
County = structure(c(1L, 2L, 4L, 3L, NA, NA, NA), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), 
Score = c(3L, 10L, 4L, 8L, 1L, 3L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -7L 
)) 

In the current file, NPR, Associated Press, and Yahoo News have no associated county (NA).

Desired file layout (dput):

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 7L, 7L, 
7L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L), .Label = c("Associated Press", 
"Daily Ledger", "Herald Tribune", "Inquirer", "Morning Call", 
"NPR", "Yahoo News"), class = "factor"), County = structure(c(1L, 
2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), Score = c(3L, 
10L, 4L, 8L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -16L 
)) 

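In the desired layout, each national source and its score are assigned to each of the four counties in the dataset. For example, Yahoo News and its score of 1 are duplicated four times and associated with Lehigh, Philadelphia, Montgomery, and Mercer counties. The record with Yahoo News and an NA county disappears. In my actual dataset I have about 100 counties, so Yahoo News and its associated variables (e.g. Score, Date, Author, etc.; I have about 60 variables in total) would be duplicated 100 times. I also want the counties of these new "duplicated" records to go into the County.Name variable that I created with the switch function above. I don't want two county-name fields; I want all of the newly created counties under County.Name.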

+5

It would be great if you could provide some sample data and show the desired result. –

+3

I think you may be looking for 'merge', but it is hard to say without a better representation of your data. – Roland

+0

Sorry, it was late and I was tired. Updated with more explanation and output for reproducibility. – NiuBiBang

Answers

1

If I understand you correctly, this might be one possibility:

# a (minimal) data frame with all unique source-county combinations 
src_cnt <- data.frame(source = c("Morning Call", "AP", "AP", "AP"), county = c("Lehigh", "Lehigh", "Mercer", "Phila")) 

# a data frame with a unique score for each source 
src_score <- data.frame(source = c("Morning Call", "AP"), score = c(10, 3)) 

merge(src_cnt, src_score) 

Edit following the updated question:

# Assuming your current data is named dd 
# select the national sources, i.e. the sources where County is missing 
src_national <- dd$Source[is.na(dd$County)]

# select unique counties 
counties <- unique(dd$County[!is.na(dd$County)]) 

# create all combinations of national sources and counties 
src_cnt <- expand.grid(Source = src_national, County = counties) 

# add score from current data to national sources 
src_cnt2 <- merge(src_cnt, dd[is.na(dd$County), c("Source", "Score")], by = "Source") 

# add national sources to local sources in dd 
dd2 <- rbind(dd[!is.na(dd$County), ], src_cnt2) 

# order by Source and County
# assuming desired data is named `desired` 
library(plyr) 
desired2 <- arrange(df = desired, Source, County) 
dd2 <- arrange(df = dd2, Source, County) 
all.equal(desired2, dd2) 

For the last part of your question, you could simply rbind the national sources in src_cnt onto County.Name, or select the relevant variables from dd2.
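Since the real data has about 60 variables, it may be convenient to carry every column along without naming each one. The sketch below is one possible variant, not the method above: it relies on `merge()` returning the Cartesian product when `by = NULL`. The data frame is the question's sample data rebuilt by hand with character columns, and `local`, `national`, and `national_all` are names introduced here for illustration.

```r
# Sample data from the question, rebuilt by hand (strings kept as character).
dd <- data.frame(
  Source = c("Morning Call", "Daily Ledger", "Inquirer", "Herald Tribune",
             "Yahoo News", "Associated Press", "NPR"),
  County = c("Lehigh", "Mercer", "Philadelphia", "Montgomery", NA, NA, NA),
  Score  = c(3L, 10L, 4L, 8L, 1L, 3L, 6L),
  stringsAsFactors = FALSE
)

# Split into local rows (county known) and national rows (county missing).
is_national <- is.na(dd$County)
local    <- dd[!is_national, ]
# Drop the empty County column but keep every other variable.
national <- dd[is_national, setdiff(names(dd), "County"), drop = FALSE]

# One-column data frame holding the distinct counties.
counties <- data.frame(County = unique(local$County), stringsAsFactors = FALSE)

# by = NULL requests the Cartesian product: every national row with every county.
national_all <- merge(national, counties, by = NULL)

# Restore the original column order, then stack under the local rows.
dd2 <- rbind(local, national_all[names(local)])
nrow(dd2)
```

If the county column in the real data is called County.Name, the same steps should apply with that name substituted for County, which keeps everything in a single county field.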
+0

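I actually have a unique ID that also needs to be taken into account, so I modified this to: 'src_cnt <- expand.grid(Source = src_natl$Source, ID = src_natl$ID, County = county) # where county <- c('County1' ... 'County120')'; then 'src_cnt2 <- merge(src_cnt, src_natl, by = c("ID", "Source"))' filters 'src_cnt' down to only the ID and Source combinations that occur in the original dataset. – NiuBiBang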
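A possible simplification of the ID variant, assuming ID really does identify each record uniquely: cross only ID with the counties and merge on ID alone. This also sidesteps repeated rows, because `expand.grid()` does not deduplicate its inputs, so a source that appears under several IDs would be crossed more than once. The data below is hypothetical and only illustrates the shape.

```r
# Hypothetical national records; ID is assumed unique per record.
src_natl <- data.frame(ID     = c(101L, 102L, 103L),
                       Source = c("NPR", "AP", "NPR"),
                       Score  = c(6L, 3L, 2L),
                       stringsAsFactors = FALSE)

# Hypothetical county names standing in for the real list.
county <- c("Lehigh", "Mercer")

# Every ID paired with every county: 3 IDs x 2 counties = 6 rows.
src_cnt <- expand.grid(ID = src_natl$ID, County = county,
                       stringsAsFactors = FALSE)

# Attach Source, Score, and any other variables by joining on the unique ID.
src_cnt2 <- merge(src_cnt, src_natl, by = "ID")
nrow(src_cnt2)
```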