2012-08-26 35 views
1

Splitting a CSV file into different columns using R

This is a follow-up question to Pivoting a CSV file using R.

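In that question, I wanted to split a single column (type) into multiple columns based on the values in another column (repository_name). The following input data was used.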

    type   created_at repository_name 
1   IssuesEvent 2012-03-11 06:48:31  bootstrap 
2   IssuesEvent 2012-03-11 06:48:31  bootstrap 
3 IssueCommentEvent 2012-03-11 07:03:57  bootstrap 
4 IssueCommentEvent 2012-03-11 07:03:57  bootstrap 
5 IssueCommentEvent 2012-03-11 07:03:57  bootstrap 
6  IssuesEvent 2012-03-11 07:03:58  bootstrap 
7   WatchEvent 2012-03-11 07:18:45  hogan.js 
8   WatchEvent 2012-03-11 07:18:45  hogan.js 
9   WatchEvent 2012-03-11 07:18:45  hogan.js 
10 IssueCommentEvent 2012-03-11 07:03:57  bootstrap 

The complete CSV file is available at https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/all_events.csv.

Here is the dput() of the first 30 rows of the CSV:

structure(list(type = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 
2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 
1L, 4L, 4L, 4L, 2L, 2L, 2L), .Label = c("ForkEvent", "IssueCommentEvent", 
"IssuesEvent", "WatchEvent"), class = "factor"), created_at = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L), .Label = c("2012-03-11 06:48:31", 
"2012-03-11 06:52:50", "2012-03-11 07:03:57", "2012-03-11 07:03:58", 
"2012-03-11 07:15:44", "2012-03-11 07:18:45", "2012-03-11 07:19:01", 
"2012-03-11 07:23:56", "2012-03-11 07:32:43", "2012-03-11 07:38:52" 
), class = "factor"), repository_name = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 1L), .Label = c("bootstrap", 
"hogan.js", "twemproxy"), class = "factor")), .Names = c("type", 
"created_at", "repository_name"), class = "data.frame", row.names = c(NA, 
-30L)) 

That question was answered by @flodel, who proposed this code:

data.split <- split(events.raw$type, events.raw$repository_name) 
data.split 

list.to.df <- function(arg.list) { 
    max.len <- max(sapply(arg.list, length)) 
    arg.list <- lapply(arg.list, `length<-`, max.len) 
    as.data.frame(arg.list) 
} 

df.out <- list.to.df(data.split) 
df.out 

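However, I would now like to arrange the list so that the events (type) appear in one column per repo (repository_name) per month (extracted from the created_at column), like this: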

bootstrap_2012_03 bootstrap_2012_04 hogan.js_2012_03 
1 IssuesEvent   PushEvent   PushEvent 
2 IssuesEvent   IssuesEvent  IssuesEvent 
3 IssueCommentEvent WatchEvent   IssuesEvent 

Some other assumptions are:

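  • The timestamps are only for ordering and do not need to be synchronized across a row
  • Even if "IssuesEvent" is repeated 10 times I need to keep all of them, because I will be doing sequence analysis with the R package TraMineR
  • The columns can be of unequal length
  • There is no relationship between the columns for different repos ("repository_name")
  • Data from different months for the same repository are completely independent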

How can I do this in R?

+3

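When you asked your earlier question, it was also suggested that you [provide a reproducible example of your data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Data stored in this format is not easy to copy and paste into R without extra work from potential answerers. – A5C1D2H2I1M1N2O1R2T1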

+0

I forgot about that. The file I am working with can be found here: https://github.com/aronlindberg/VOSS-Sequencing-Toolkit/blob/master/twitter_exploratory_analysis/all_events.csv – histelheim

+3

I suggest you use `dput()` to paste a sample of your data into the question. – Andrie

Answer

3

Instead of splitting by the repository_name column, first create a new column that combines repository_name and the month:

events.raw$month  <- format(as.Date(events.raw$created_at), "%Y_%m") 
events.raw$repo.month <- paste(events.raw$repository_name, 
           events.raw$month, sep = "_") 

head(events.raw) 
#   type   created_at repository_name month  repo.month 
# 1 IssuesEvent 2012-03-11 06:48:31  bootstrap 2012_03 bootstrap_2012_03 
# 2 IssuesEvent 2012-03-11 06:48:31  bootstrap 2012_03 bootstrap_2012_03 
# 3 IssuesEvent 2012-03-11 06:48:31  bootstrap 2012_03 bootstrap_2012_03 
# 4 IssuesEvent 2012-03-11 06:52:50  bootstrap 2012_03 bootstrap_2012_03 
# 5 IssuesEvent 2012-03-11 06:52:50  bootstrap 2012_03 bootstrap_2012_03 
# 6 IssuesEvent 2012-03-11 06:52:50  bootstrap 2012_03 bootstrap_2012_03 
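
To see how the key is built in isolation, here is a minimal sketch on two made-up values. The vectors `stamps` and `repos` are illustrative only and are not taken from the CSV.

```r
# Hypothetical inputs, standing in for created_at and repository_name.
stamps <- c("2012-03-11 06:48:31", "2012-04-02 10:00:00")
repos  <- c("bootstrap", "hogan.js")

# as.Date() keeps only the date part; format() then renders year and month.
month <- format(as.Date(stamps), "%Y_%m")

# Combine repository and month into a single grouping key.
key <- paste(repos, month, sep = "_")
key
```

Each distinct value of this key becomes one column in the final data frame.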

Then use the same approach I suggested last time:

data.split <- split(events.raw$type, events.raw$repo.month) 

list.to.df <- function(arg.list) { 
    max.len <- max(sapply(arg.list, length)) 
    arg.list <- lapply(arg.list, `length<-`, max.len) 
    as.data.frame(arg.list) 
} 

df.out <- list.to.df(data.split) 
head(df.out) 
# bootstrap_2012_03 hogan.js_2012_03 twemproxy_2012_03 
# 1  IssuesEvent  WatchEvent  WatchEvent 
# 2  IssuesEvent  WatchEvent  WatchEvent 
# 3  IssuesEvent  WatchEvent  WatchEvent 
# 4  IssuesEvent    <NA>    <NA> 
# 5  IssuesEvent    <NA>    <NA> 
# 6  IssuesEvent    <NA>    <NA>
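
The question says the timestamps matter only for ordering. If the rows are not already in chronological order, it may help to sort them before splitting, since split() keeps the row order within each group. The following is a rough sketch of that idea on a small made-up data frame (`events` and `df.sorted` are hypothetical names, and the sample rows are not from the CSV); it assumes the timestamps follow the "YYYY-MM-DD HH:MM:SS" layout shown in the question.

```r
# Hypothetical sample with the same three columns as events.raw.
events <- data.frame(
  type = c("WatchEvent", "IssuesEvent", "PushEvent",
           "IssueCommentEvent", "ForkEvent"),
  created_at = c("2012-03-11 07:18:45", "2012-03-11 06:48:31",
                 "2012-04-02 10:00:00", "2012-03-11 07:03:57",
                 "2012-03-12 09:30:00"),
  repository_name = c("bootstrap", "bootstrap", "bootstrap",
                      "bootstrap", "hogan.js"),
  stringsAsFactors = FALSE
)

# Parse the timestamps and sort the rows chronologically.
events$created_at <- as.POSIXct(events$created_at,
                                format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
events <- events[order(events$created_at), ]

# Build the repo/month key and split the event types by it.
events$repo.month <- paste(events$repository_name,
                           format(events$created_at, "%Y_%m"), sep = "_")
data.split <- split(events$type, events$repo.month)

# Pad every vector with NA up to the longest one, then bind as columns.
max.len <- max(sapply(data.split, length))
df.sorted <- as.data.frame(lapply(data.split, `length<-`, max.len),
                           stringsAsFactors = FALSE)
df.sorted
```

Rows with identical timestamps keep their original relative order, because order() is stable when ties occur.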