2014-12-02 40 views

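Merging category code values into a dataset in R

I have a political donations dataset in which the industry categories are stored as alphanumeric codes. A separate text file lists how each alphanumeric code maps to an industry name, a sector name, and an industry category name.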

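For example, "A1200" is the sugar cane category in the crop production industry of the agribusiness sector. I would like to know how to pair each alphanumeric code with its sector, industry, and category values.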

Right now, the code values dataset looks like this:

Catcode  Catname     Catorder  Industry         Sector
A1200    Sugar cane  A01       Crop Production  Agribusiness

and the industry donations dataset looks like this:

Business name  Amount donated  Year  Category
Sarah Farms    1000            2010  A1200

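The category dataset has about 444 rows and the donations dataset has about 1M rows. How do I merge them so that the donations dataset looks like this? The category code would be the common field.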

Catcode  Catname     Catorder  Industry         Sector        Business name  Amount donated  Year  Category
A1200    Sugar cane  A01       Crop Production  Agribusiness  Sarah Farms    1000            2010  A1200

I am fairly new to these forums, so if there is a better way to ask this question, please let me know. Thanks for your help!


Try the 'merge()' function with the 'by.x' and 'by.y' arguments. Also see http://stackoverflow.com/q/5963269/946850 for how to improve the question. – krlmlr 2014-12-02 02:51:06
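A minimal base R sketch of what this comment suggests. The data frame names `codes` and `donations` are assumptions, and the second donation row is made up to show what happens to a code with no match. Putting the donations first and setting `all.x = TRUE` keeps every donation row.

```r
# Hypothetical stand-ins for the two files described in the question.
codes <- data.frame(
  Catcode  = "A1200",
  Catname  = "Sugar cane",
  Catorder = "A01",
  Industry = "Crop Production",
  Sector   = "Agribusiness",
  stringsAsFactors = FALSE
)

donations <- data.frame(
  BusinessName  = c("Sarah Farms", "Example Co"),  # "Example Co" is made up
  AmountDonated = c(1000, 500),
  Year          = c(2010, 2010),
  Category      = c("A1200", "Z9999"),             # "Z9999" has no entry in codes
  stringsAsFactors = FALSE
)

# by.x names the key column in the first data frame, by.y in the second.
# all.x = TRUE keeps donation rows whose code is missing from codes.
merged <- merge(donations, codes,
                by.x = "Category", by.y = "Catcode",
                all.x = TRUE)
print(merged)
```

The key column keeps the name given in `by.x`, and the unmatched donation gets NA in the columns that come from `codes`.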

Answers

2

If speed is a concern, you may want to use data.table or dplyr. Here I have modified the sample data a little to give you some ideas.

df1 <- data.frame(Catcode = c("A1200", "B1500", "C1800"), 
        Catname = c("Sugar", "Salty", "Butter"), 
        Catorder = c("cane A01", "cane A01", "cane A01"), 
        Industry = c("Crop Production", "Crop Production", "Crop Production"), 
        Sector = c("Agribusiness", "Agribusiness", "Agribusiness"), 
        stringsAsFactors = FALSE) 

# Catcode Catname Catorder  Industry  Sector 
#1 A1200 Sugar cane A01 Crop Production Agribusiness 
#2 B1500 Salty cane A01 Crop Production Agribusiness 
#3 C1800 Butter cane A01 Crop Production Agribusiness 

df2 <- data.frame(BusinessName = c("Sarah Farms", "Ben Farms"), 
        AmountDonated = c(100, 200), 
        Year = c(2010, 2010), 
        Category = c("A1200", "B1500"), 
        stringsAsFactors = FALSE) 

# BusinessName AmountDonated Year Category 
#1 Sarah Farms   100 2010 A1200 
#2 Ben Farms   200 2010 B1500 

library(dplyr) 
library(data.table) 

# 1) dplyr option 
# Catcode C1800 will be dropped since it does not exist in both data frames. 
inner_join(df1, df2, by = c("Catcode" = "Category")) 

#  Catcode Catname Catorder  Industry  Sector BusinessName AmountDonated Year 
#1 A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms   100 2010 
#2 B1500 Salty cane A01 Crop Production Agribusiness Ben Farms   200 2010 

# Catcode C1800 remains, since left_join() keeps every row of df1.
left_join(df1, df2, by = c("Catcode" = "Category")) 

#  Catcode Catname Catorder  Industry  Sector BusinessName AmountDonated Year 
#1 A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms   100 2010 
#2 B1500 Salty cane A01 Crop Production Agribusiness Ben Farms   200 2010 
#3 C1800 Butter cane A01 Crop Production Agribusiness   <NA>   NA NA 

# 2) data.table option 
# Convert data.frame to data.table 
setDT(df1) 
setDT(df2) 

#Set columns for merge 
setkey(df1, "Catcode") 
setkey(df2, "Category") 

# Look up each row of df2 in df1; C1800 has no donation, so it is dropped.
df1[df2]

# Catcode Catname Catorder  Industry  Sector BusinessName AmountDonated Year 
#1: A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms   100 2010 
#2: B1500 Salty cane A01 Crop Production Agribusiness Ben Farms   200 2010 

# Look up each row of df1 in df2; C1800 remains, with NA in the donation columns.
df2[df1]
# BusinessName AmountDonated Year Category Catname Catorder  Industry  Sector 
#1: Sarah Farms   100 2010 A1200 Sugar cane A01 Crop Production Agribusiness 
#2: Ben Farms   200 2010 B1500 Salty cane A01 Crop Production Agribusiness 
#3:   NA   NA NA C1800 Butter cane A01 Crop Production Agribusiness 
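
The joins above start from the code table. Since the question wants every donation kept, here is a hedged sketch that puts the donations on the left instead and then lists any donations whose code was not found. The names `codes`, `donations`, `labelled` and `unmatched` are assumptions, and the "Example Co" row is made up for illustration.

```r
library(dplyr)

# Hypothetical sample data in the same shape as df1 and df2 above.
codes <- data.frame(
  Catcode = c("A1200", "B1500", "C1800"),
  Catname = c("Sugar", "Salty", "Butter"),
  stringsAsFactors = FALSE
)

donations <- data.frame(
  BusinessName  = c("Sarah Farms", "Ben Farms", "Example Co"),
  AmountDonated = c(100, 200, 300),
  Category      = c("A1200", "B1500", "Z9999"),  # "Z9999" is not in codes
  stringsAsFactors = FALSE
)

# Keep every donation and attach the code description where one exists.
labelled <- left_join(donations, codes, by = c("Category" = "Catcode"))

# Donations whose category code has no entry in the code table.
unmatched <- anti_join(donations, codes, by = c("Category" = "Catcode"))

print(labelled)
print(unmatched)
```

With only about 444 codes against roughly 1M donations, checking `unmatched` is a cheap way to spot typos or missing codes before relying on the joined columns.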
0

I think you are asking how to write the query, aren't you?

-- code_values and industry_donations are placeholder names; use your own tables.
SELECT *
FROM code_values a
LEFT JOIN industry_donations b
ON a.Catcode = b.Category
0

As krlmlr suggested:

> merge(df1, df2, by.x = "Catcode", by.y = "Category", all = TRUE) 
    Catcode Catname Catorder  Industry  Sector Business_name Amount_donated Year 
1 A1200 Sugar_cane  A01 Crop_Production Agribusiness Sarah_Farms   1000 2010 

But avoid spaces in column names and values. I replaced them with _.
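
One way the column names could be cleaned up before merging is sketched below. This is only an illustration: the `donations` data frame is a hypothetical stand-in built with `check.names = FALSE` so that the original names with spaces survive.

```r
# Hypothetical donations data whose column names contain spaces.
donations <- data.frame(
  "Business name"  = "Sarah Farms",
  "Amount donated" = 1000,
  Year             = 2010,
  Category         = "A1200",
  check.names = FALSE,        # keep the names exactly as typed
  stringsAsFactors = FALSE
)

# Replace every space in the column names with an underscore.
names(donations) <- gsub(" ", "_", names(donations), fixed = TRUE)

print(names(donations))
```

The same `gsub()` call could be applied to a character column if the values should be changed as well.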