2017-04-18 118 views
0

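R: Is there an equivalent of Stata's codebookout command?

In Stata, I can use the codebookout command to create an Excel workbook that saves the name, label, and storage type of every variable in an existing dataset, together with the corresponding values and value labels.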

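I would like to find an equivalent function in R. So far I have come across the memisc package, which has a function called codebook, but it does not do the same thing as the Stata command.

For example, in Stata the codebook output looks like this (see below; this is what I want):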

Variable Name   Variable Label   Answer Label   Answer Code   Variable Type
hhid            hhid             Open ended                   String
inter_month     inter_month      Open ended                   long
year            year             Open ended                   long
org_unit        org_unit                                      long
                                 Balaka         1
                                 Blantyre       2
                                 Chikwawa       3
                                 Chiradzulu     4

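That is, each column of the data frame is evaluated to produce values for 5 different columns:

  • Variable Name: the name of the column
  • Variable Label: also the name of the column
  • Answer Label: the unique values in the column. If there are no unique values, it is considered open ended
  • Answer Code: the number assigned to each category in Answer Label. Blank if Answer Label is not categorical.
  • Variable Type: int, str, long (date)...

Here is my attempt: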
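
To make the five fields concrete, here is a minimal sketch of how they might map onto a single factor column in R. The column `org_unit` and its values are made up for illustration, and the sketch assumes categorical variables are stored as factors.

```r
# Hypothetical example column: a factor holding district names
org_unit <- factor(c("Blantyre", "Balaka", "Chikwawa", "Balaka"))

answer_label  <- levels(org_unit)         # one label per category, in level order
answer_code   <- seq_along(answer_label)  # 1, 2, 3: position of each level
variable_type <- class(org_unit)          # "factor"

# Answer Label and Answer Code side by side
data.frame(answer_label, answer_code)
```

A non-factor column would instead get the single Answer Label "Open ended" and a blank Answer Code.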

CreateCodebook <- function(dF){ 
    numbercols <- length(colnames(dF)) 

    table <- data.frame() 

    for (i in 1:length(colnames(dF))){ 
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else "" 
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended" 
    VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                rep("",length(AnswerCode) - 1)) else colnames(dF)[i] 
    VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                rep("",length(AnswerCode) - 1)) else colnames(dF)[i] 
    VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i], 
                rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i] 

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType) 
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label") 
    table <- rbind(table, df) 

    } 
    return(table) 
} 

Unfortunately, I get the following warning messages:

Warning messages: 
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) : 
    invalid factor level, NA generated 
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) : 
    invalid factor level, NA generated 

The output I get has the answer codes and labels scrambled:

   Variable Name Variable Label Variable Type Answer Code Answer Label 
hhid     hhid   hhid Open ended    character 
month     month   month Open ended     integer 
year     year   year Open ended     integer 
org_unit   org_unit  org_unit Open ended    character 
v000     v000   v000 Open ended    character 
v001     v001   v001 Open ended     integer 
v002     v002   v002 Open ended     integer 
v003     v003   v003 Open ended     integer 
v005     v005   v005 Open ended     integer 
v006     v006   v006 Open ended     integer 
v007     v007   v007 Open ended     integer 
v021     v021   v021 Open ended     numeric 
2285     v024   v024  central  <NA>  factor 
1             north  <NA>    
7119            south  <NA>    
11      v025   v025   rural  <NA>  factor 
1048     v025   v025   urban  <NA>  factor 
district_name district_name district_name Open ended    character 
coords_x1   coords_x1  coords_x1 Open ended     numeric 
coords_x2   coords_x2  coords_x2 Open ended     numeric 
itn_color   itn_color  itn_color Open ended     numeric 
piped     piped   piped Open ended     numeric 
sanit     sanit   sanit Open ended     numeric 
sanit_cd   sanit_cd  sanit_cd Open ended     numeric 
water     water   water Open ended     numeric 
+0

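Can you show how you have tried to answer this question so far? You could start by writing some code... (otherwise this is "find an off-site resource" (off-topic) or "write the code for me" (off-topic)...)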

+0

Basically I have a data frame (it can be any data frame, it doesn't matter) and I apply the codebook function to that df. But the output is not what I want.

+1

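Sorry, I read too quickly and didn't see that you had mentioned `memisc::codebook` in the original version of the question. Still, I'm afraid that (if you can't make more progress on your own) this question may not be a good fit for SO, since you basically need a custom, very specific output.

Answers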

1

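I decided to take a stab at this for my own amusement. I used the built-in Titanic data set. However, I have a question about one of your definitions: you say "if there are no unique values, it is considered open ended". But any variable of length > 0 has some unique values: do you mean "if every value is unique"? Even that definition would not necessarily work as expected: in the Titanic data set the response is integer-valued, and only 22 of the 32 total values are unique. I don't think anyone would really want those enumerated, so I test for factor type instead (but if you really want to, you can substitute the commented-out length(u)==length(x) line below).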
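
The counts quoted above can be checked with a few lines of base R. This is only a small sketch of the check, using the built-in `Titanic` table; the names `n_total` and `n_unique` are chosen here for illustration.

```r
# Flatten the built-in Titanic contingency table into a data frame
xx <- as.data.frame.table(Titanic)

n_total  <- nrow(xx)                  # number of rows (cells of the table)
n_unique <- length(unique(xx$Freq))   # distinct values of the response

c(total = n_total, unique = n_unique)

# Freq has repeated values, so an "every value is unique" rule would
# enumerate it as if it were categorical; it is not a factor.
is.factor(xx$Freq)
```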

## utility function: pad vector with blanks to specified length 
pad <- function(x,n,p="") { 
    return(c(x,rep(p,n-length(x)))) 
} 
## process a single column 
proc_col <- function(x,nm) { 
    u <- unique(x) 
    ## if (length(u)==length(x)) { 
    if (!is.factor(x)) { 
     n <- 1 
     u <- "open ended" 
     cc <- "" 
    } else { 
     cc <- as.numeric(u) 
     n <- length(u) 
    } 
    dd <- data.frame(`Variable Name`=pad(nm,n), 
       `Variable Label`=pad(nm,n), 
       `Answer Label`=u, 
       `Answer Code`=cc, 
       `Variable Type`=pad(class(x),n), 
       stringsAsFactors=FALSE) 
    return(dd) 
} 
## process all columns 
proc_df <- function(x) { 
    L <- Map(proc_col,x,names(x)) 
    dd <- do.call(rbind,L) 
    rownames(dd) <- NULL 
    return(dd) 
} 

Example:

xx <- as.data.frame.table(Titanic) 
proc_df(xx) 

## Variable.Name Variable.Label Answer.Label Answer.Code Variable.Type 
## 1   Class   Class   1st   1  factor 
## 2          2nd   2    
## 3          3rd   3    
## 4          Crew   4    
## 5   Sex   Sex   Male   1  factor 
## 6          Female   2    
## 7   Age   Age  Child   1  factor 
## 8          Adult   2    
## 9  Survived  Survived   No   1  factor 
## 10          Yes   2    
## 11   Freq   Freq open ended     numeric 

I did not leave a blank row before the list of code values and so on, but perhaps you can make those adjustments yourself?
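
One possible way to make that adjustment is sketched below: each factor gets a header row with a blank Answer Label and Answer Code, followed by one row per level, as in the Stata layout. The function names `proc_col_blank` and `codebook_blank` are hypothetical, and the sketch uses `levels(x)` rather than `unique(x)`, so unused levels are listed too.

```r
# Sketch: header row per factor, then one row per level
proc_col_blank <- function(x, nm) {
  if (!is.factor(x)) {
    # non-categorical column: a single "Open ended" row
    return(data.frame(Variable.Name = nm, Variable.Label = nm,
                      Answer.Label = "Open ended", Answer.Code = "",
                      Variable.Type = class(x)[1],
                      stringsAsFactors = FALSE))
  }
  lev <- levels(x)
  k <- length(lev)
  data.frame(Variable.Name  = c(nm, rep("", k)),
             Variable.Label = c(nm, rep("", k)),
             Answer.Label   = c("", lev),                       # blank first row
             Answer.Code    = c("", as.character(seq_len(k))),  # codes as text
             Variable.Type  = c(class(x)[1], rep("", k)),
             stringsAsFactors = FALSE)
}

# Apply to every column and stack the pieces
codebook_blank <- function(df) {
  out <- do.call(rbind, Map(proc_col_blank, df, names(df)))
  rownames(out) <- NULL
  out
}

cb <- codebook_blank(as.data.frame.table(Titanic))
head(cb, 6)
```

Because every column of the result is character, it can be written out with `write.csv(cb, "codebook.csv", row.names = FALSE)` (the file name is just an example) and opened in Excel.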

+0

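Thank you very much, Ben! I will definitely upvote and accept this answer. For my own benefit, I also came up with a solution. I was very close, but I was getting a warning message.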

0

Here is my own solution, a bit of a hack:

CreateCodebook <- function(dF){ 
    numbercols <- length(colnames(dF)) 

    table <- data.frame() 

    for (i in 1:length(colnames(dF))){ 
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else "" 
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended" 
    VariableName <- if (length(AnswerCode) > 1) c(colnames(dF)[i], 
                rep("",length(AnswerCode) - 1)) else colnames(dF)[i] 
    VariableLabel <- if (length(AnswerCode) > 1) c(colnames(dF)[i], 
                rep("",length(AnswerCode) - 1)) else colnames(dF)[i] 
    VariableType <- if (length(AnswerCode) > 1) c(sapply(dF, class)[i], 
                rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i] 

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType, stringsAsFactors = FALSE) 
    # labels listed in the same order as the columns built above
    names(df) <- c("Variable Name", "Variable Label", "Answer Label", "Answer Code", "Variable Type") 
    table <- rbind(table, df) 

    } 
    rownames(table) <- 1:nrow(table) 
    return(table) 
} 

Output:

Variable Name Variable Label Answer Label Answer Code Variable Type 
1   brid   brid Open ended    character 
2   month   month Open ended     integer 
3   year   year Open ended     integer 
4  org_unit  org_unit Open ended    character 
5   v000   v000 Open ended    character 
6   v001   v001 Open ended     integer 
7   v002   v002 Open ended     integer 
8   v003   v003 Open ended     integer 
9   v005   v005 Open ended     integer 
10   v006   v006 Open ended     integer 
11   v007   v007 Open ended     integer 
12   v021   v021 Open ended     numeric 
13   v024   v024  central   1  factor 
14          north   2    
15          south   3    
16   v025   v025   rural   1  factor 
17          urban   2    
18   bidx   bidx Open ended     integer 
19 district_name district_name Open ended    character 
20  coords_x1  coords_x1 Open ended     numeric 
21  coords_x2  coords_x2 Open ended     numeric 
22   anc4   anc4 Open ended     numeric 
23 antimal_48  antimal_48 Open ended     numeric 
24   carep   carep Open ended     numeric 
25   csec   csec Open ended     numeric 
26   dptv   dptv Open ended     numeric 
27  ebreast  ebreast Open ended     numeric 
28  fans_48  fans_48 Open ended     numeric 
29  ideliv   ideliv Open ended     numeric 
30   iptp   iptp Open ended     numeric 
31  iron90   iron90 Open ended     numeric 
32  measlesv  measlesv Open ended     numeric 
33   ors   ors Open ended     numeric 
34   ort   ort Open ended     numeric 
35   pncwm   pncwm Open ended     numeric 
36  sstools  sstools Open ended     numeric 
37   tt    tt Open ended     numeric 
38   vita   vita Open ended     numeric 
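
The visible difference from the first attempt is `stringsAsFactors = FALSE`. A plausible explanation for the earlier warning, sketched below under that assumption, is that in R versions before 4.0 `data.frame()` turned strings into factors by default, so the blank Answer Code became a factor and the integer codes added later were not valid levels. The variable names here are illustrative only.

```r
# A factor holding the blank code, as a pre-4.0 data.frame() would create
code_factor <- factor("")

# Integers that are not existing levels become NA.
# suppressWarnings() only keeps this sketch quiet; without it R reports
# "invalid factor level, NA generated".
suppressWarnings(code_factor[2:4] <- 1:3)
is.na(code_factor)

# A character vector accepts the same values, coercing them to text
code_char <- ""
code_char[2:4] <- 1:3
code_char
```

With character columns the codes survive as text, which matches the clean Answer Code column in the output above.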