2015-09-26

How to collapse/recode a variable in R

I'm only in an introductory R class, so this may be quite basic.

I'm using the Outlook on Life dataset and am interested in income. Respondents chose one of the following 19 options:

Less than $5,000  
$5,000 to $7,499  
$7,500 to $9,999  
$10,000 to $12,499 
$12,500 to $14,999 
$15,000 to $19,999 
$20,000to $24,999  
$25,000 to $29,999 
$30,000 to $34,999 
$35,000 to $39,999 
$40,000 to $49,999 
$50,000 to $59,999 
$60,000 to $74,999 
$75,000 to $84,999 
$85,000 to $99,999 
$100,000 to $124,999 
$125,000 to $149,999 
$150,000 to $174,999 
$175,000 or more 

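I want to collapse and simplify these into the following groups, just to make the plots easier to read:

  1. Under the poverty line ($0 - $24,999)
  2. Working class ($25,000 - $34,999)
  3. Lower middle class ($35,000 - $60,000)
  4. Middle class ($60,000 - $100,000)
  5. Upper middle class ($100,000 - $150,000)
  6. Top 5% ($150,000+)

How would I go about recoding this?

Thanks!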

+2

Try the cut function. – Chris

+4

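Your intervals are problematic. If someone makes 22,000, they would pick group 7 (20k - 24,999), and you would want them under the poverty line. But someone making 24k would also pick group 7, yet they belong in the working class. How would you tell the difference? –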

+0

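Yes, that is a problem. I can massage my desired groupings so they fit the pre-established intervals better. So I could have "under the poverty line" go up to 24,999, and then "working class" up to 34,999. – Katherine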

Answer

2

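The easiest way to recode a factor is to realize that the levels function, in its assignment form, can accept a named list that remaps the factor's levels: each name is a new level, and each element holds the old levels to merge into it.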

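I assume your data is already a factor (since, as you say, "respondents chose one of the following 19 options"), which means using the cut function doesn't make much sense: cut bins numeric values, and you only have category labels.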
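For contrast, here is a minimal sketch of what cut does when raw dollar amounts are available. The income values below are made up for illustration and are not part of the survey data, and the break points are simply one reading of the six groups in the question:

```r
# Hypothetical raw incomes; the survey in the question does not provide these.
income <- c(4000, 26000, 59999, 60000, 175000)

# cut() needs numeric input. right = FALSE makes each interval closed on the
# left, so 60000 falls in [60000, 100000) rather than in the group below it.
binned <- cut(income,
              breaks = c(0, 25000, 35000, 60000, 100000, 150000, Inf),
              labels = c("Under poverty line", "Working class",
                         "Lower middle class", "Middle class",
                         "Upper middle class", "Top 5 percent"),
              right = FALSE)
binned
```

With only the 19 category labels to work with, there is no numeric vector to pass to cut, so remapping the factor levels is the more direct route.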

Here is a simple example of assigning a list to levels:

z <- gl(3, 2, 12) # [1] 1 1 2 2 3 3 1 1 2 2 3 3, Levels: 1 2 3 
levels(z) <- list(A = c(1,3), B = 2) 
z # [1] A A B B A A A A B B A A, Levels: A B 

As you can see from the example above, we have recoded levels 1 and 3 into group A and level 2 into group B. Your problem can be solved in a similar way:

# Simulate 100 survey responses drawn from the 19 income categories
groups <- as.factor(sample(c("Less than $5,000", 
"$5,000 to $7,499", 
"$7,500 to $9,999", 
"$10,000 to $12,499", 
"$12,500 to $14,999", 
"$15,000 to $19,999", 
"$20,000to $24,999", 
"$25,000 to $29,999", 
"$30,000 to $34,999", 
"$35,000 to $39,999", 
"$40,000 to $49,999", 
"$50,000 to $59,999", 
"$60,000 to $74,999", 
"$75,000 to $84,999", 
"$85,000 to $99,999", 
"$100,000 to $124,999", 
"$125,000 to $149,999", 
"$150,000 to $174,999", 
"$175,000 or more"), size=100, replace=TRUE)) 

levels(groups) <- list(
    "Under poverty line"=c("Less than $5,000", 
     "$5,000 to $7,499", 
     "$7,500 to $9,999", 
     "$10,000 to $12,499", 
     "$12,500 to $14,999", 
     "$15,000 to $19,999", 
     "$20,000to $24,999"), 
    "Working class"=c("$25,000 to $29,999", 
        "$30,000 to $34,999"), 
    "Lower middle class"=c("$35,000 to $39,999", 
         "$40,000 to $49,999", 
         "$50,000 to $59,999"), 
    "Middle class"=c("$60,000 to $74,999", 
        "$75,000 to $84,999", 
        "$85,000 to $99,999"), 
    "Upper middle class"=c("$100,000 to $124,999", 
         "$125,000 to $149,999"), 
    "Top 5 percent"=c("$150,000 to $174,999", 
        "$175,000 or more") 
)
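
It may be worth checking the result before plotting. The sketch below uses a small stand-in factor rather than the real data, and the "Refused" level is a hypothetical addition used only to show one caveat: as far as I can tell, any level you leave out of the list ends up as NA, so a typo in a label would silently drop those respondents.

```r
# Small stand-in factor; "Refused" is a hypothetical level, not from the question.
raw <- factor(c("$25,000 to $29,999", "$30,000 to $34,999",
                "$175,000 or more", "Refused"))

# Work on a copy so the original labels stay available for comparison.
recoded <- raw
levels(recoded) <- list(
  "Working class" = c("$25,000 to $29,999", "$30,000 to $34,999"),
  "Top 5 percent" = "$175,000 or more"
)

# Cross-tabulate old against new labels; useNA = "ifany" adds an NA column
# whenever some original level was not mentioned in the list.
check <- table(raw, recoded, useNA = "ifany")
check
```

The new levels come out in the order they appear in the list, which is usually the order you want on a plot axis.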