2015-12-28 64 views
3

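How to recode a continuous variable into ranges

I need to recode a continuous variable into categories. I normally use the `cut` function, but `cut` requires me to specify the breaks. I am looking for a way to use a different set of breaks depending on other categorical variables in my data frame.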

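In my example the variable is `Cost`, and the breaks are in a second table (the `Cost.range` column), where I have a different set of breaks for each `Region` and each `Category`.

Example: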

Region     Product    Category  Cost
Country A  Product 1  CAT A     731
Country B  Product 1  CAT A     659
Country C  Product 1  CAT A     385
Country D  Product 1  CAT A     763
Country A  Product 2  CAT A     701
Country B  Product 2  CAT A     759
Country C  Product 2  CAT A     580
Country D  Product 2  CAT A     147
Country A  Product 3  CAT B     645
Country B  Product 3  CAT B     657
Country C  Product 3  CAT B     424


Region     Category  Cost.range  Range
Country A  CAT A     10          R1
Country A  CAT A     50          R2
Country A  CAT A     200         R3
Country A  CAT A     1000        R4
Country A  CAT B     20          R1
Country A  CAT B     100         R2
Country A  CAT B     400         R3
Country A  CAT B     1500        R4

Code to generate the example:

Region <- c("Country A","Country B","Country C","Country D","Country A","Country B","Country C","Country D","Country A","Country B","Country C","Country D","Country A","Country B","Country C","Country D") 
Product <- c("Product 1","Product 1","Product 1","Product 1","Product 2","Product 2","Product 2","Product 2","Product 3","Product 3","Product 3","Product 3","Product 4","Product 4","Product 4","Product 4") 
Category <- c("CAT A","CAT A","CAT A","CAT A","CAT A","CAT A","CAT A","CAT A","CAT B","CAT B","CAT B","CAT B","CAT B","CAT B","CAT B","CAT B") 
Cost <- c(731,659,385,763,701,759,580,147,645,657,424,34,850,463,160,550) 

Table1 <- data.frame(Region, Product, Category, Cost) 

Region <- c("Country A","Country A","Country A","Country A","Country A","Country A","Country A","Country A") 
Category <- c("CAT A","CAT A","CAT A","CAT A","CAT B","CAT B","CAT B","CAT B") 
Cost.range <- c(10,50,200,1000,20,100,400,1500) 
Range <- c("R1","R1","R3","R4","R1","R2","R3","R4") 

Table2 <- data.frame(Region, Category, Cost.range, Range) 
+1

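You can use `by`, which also applies a function to each category. Could you provide your data in a copy-pasteable form, along with the code you have tried? –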

+0

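Thank you, I have edited my post to include the code. I looked at the `by` documentation, but as I am new to R I don't see how to use it. Could you explain? – akruug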

+0

I thought I would use `cut`, but the labels in the `Range` column are not unique. Is this by design? –

Answers

1

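This is not the most elegant solution (I would love to see a better approach), but it should achieve the result you are looking for.

The `select()` and `distinct()` functions from the dplyr package find the possible combinations of `Region` and `Category`. These combinations are then used to subset both tables, and the `cut()` function is applied to each subset.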

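library('dplyr')
library('data.table')

# Work on data.table copies so that := can update rows by reference
dt1 <- data.table(Table1)
dt2 <- data.table(Table2)

# Every Region/Category combination that has breaks defined in Table2
t2d <- Table2 %>% select(Region, Category) %>% distinct

for (i in seq_len(nrow(t2d))) {
    region   <- as.character(t2d$Region[i])
    category <- as.character(t2d$Category[i])
    # Breaks for this combination
    dt2_range_subset <- dt2[Region == region & Category == category, Cost.range]
    # Bin the matching rows of dt1; rows without breaks in Table2 stay NA
    dt1[Region == region & Category == category,
        Cost_factor := cut(Cost, dt2_range_subset)]
}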
+0

Thanks for this solution. Is it possible to show the `Range` column from Table2 in the result (e.g. R2) instead of a bin such as (100,800]? – akruug

+0

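It is, but the implementation depends on the behaviour you expect (which is why it is important to include that in your question). In the current format the output will produce three ranges (e.g. for Country A, CAT A you will get (10,50], (50,200] and (200,1000]), but you have four labels to apply. – NGaffney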
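
To illustrate the mismatch described in this comment, here is a small sketch using the Country A, CAT A values from the question. The variable names are only for illustration, and `dig.lab = 4` is set so that the upper bound is not printed in scientific notation.

```r
breaks <- c(10, 50, 200, 1000)        # Cost.range values for Country A, CAT A
labels <- c("R1", "R2", "R3", "R4")   # Range labels shown in the printed table

# Four break points delimit only three intervals
bins <- cut(c(731, 701), breaks = breaks, dig.lab = 4)
nlevels(bins)                         # 3

# Passing four labels for three intervals makes cut() stop with an error
res <- try(cut(c(731, 701), breaks = breaks, labels = labels), silent = TRUE)
inherits(res, "try-error")            # TRUE
```

So either one label has to be dropped or an extra lower bound has to be added before the labels can be applied.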

+0

Sorry, I was not clear in my question. Each label should apply to the corresponding range, e.g. [0,10] -> R1, (10,50] -> R2, (50,200] -> R3, (200,1000] -> R4 – akruug
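
The mapping described in this last comment could be sketched in base R roughly as follows. This is only a sketch under some assumptions: the `Range` labels are the unique R1 to R4 values shown in the printed table (not the duplicated "R1" from the generating code), every `Cost.range` value is a positive upper bound, and 0 is the lowest possible cost. The helper name `label_cost` is hypothetical.

```r
# Data from the question; Table2 uses the labels from the printed table (assumption)
Table1 <- data.frame(
  Region   = rep(c("Country A", "Country B", "Country C", "Country D"), times = 4),
  Product  = rep(paste("Product", 1:4), each = 4),
  Category = rep(c("CAT A", "CAT B"), each = 8),
  Cost     = c(731, 659, 385, 763, 701, 759, 580, 147,
               645, 657, 424, 34, 850, 463, 160, 550),
  stringsAsFactors = FALSE
)
Table2 <- data.frame(
  Region     = "Country A",
  Category   = rep(c("CAT A", "CAT B"), each = 4),
  Cost.range = c(10, 50, 200, 1000, 20, 100, 400, 1500),
  Range      = rep(c("R1", "R2", "R3", "R4"), times = 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the Range label whose interval contains each cost
label_cost <- function(costs, region, category, ranges) {
  rows <- ranges[ranges$Region == region & ranges$Category == category, ]
  # No breaks defined for this group: nothing to label
  if (nrow(rows) == 0) return(rep(NA_character_, length(costs)))
  rows <- rows[order(rows$Cost.range), ]
  # Intervals are [0, b1], (b1, b2], ..., (b[n-1], bn]; labels = FALSE gives the index
  bin <- cut(costs, breaks = c(0, rows$Cost.range),
             labels = FALSE, include.lowest = TRUE)
  rows$Range[bin]   # costs above the last bound give NA
}

# Apply the helper to every Region/Category group found in Table1
Table1$Range <- NA_character_
groups <- unique(Table1[, c("Region", "Category")])
for (i in seq_len(nrow(groups))) {
  idx <- Table1$Region == groups$Region[i] & Table1$Category == groups$Category[i]
  Table1$Range[idx] <- label_cost(Table1$Cost[idx], groups$Region[i],
                                  groups$Category[i], Table2)
}

label_cost(c(5, 10, 30, 150, 2000), "Country A", "CAT A", Table2)
# "R1" "R1" "R2" "R3" NA
```

Looking the label up by interval index, rather than passing `labels =` to `cut()`, keeps the sketch independent of whether the labels are unique. Rows for regions that have no breaks in `Table2` are left as `NA`.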