Stacking columns with similar names in R

I have a CSV file whose horrible format I cannot change (simplified here):

Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three 
1,1,1.5,"5 Things",2,2.5,"10 Things" 
2,5,5.5,"10 Things",6,6.5,"20 Things" 
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three 
3,9,9.5,"15 Things",10,10.5,"30 Things"

My desired output is a new CSV containing:

inc,label,one,two,three 
1,"a",1,1.5,"5 Things" 
2,"a",5,5.5,"10 Things" 
3,"a",9,9.5,"15 Things" 
1,"b",2,2.5,"10 Things" 
2,"b",6,6.5,"20 Things" 
3,"b",10,10.5,"30 Things"

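Basically:

Lowercase the headers.
Strip the header prefixes and preserve them by adding them to a new column.
Remove the header repetitions in later rows.
Stack each set of columns that share the latter part of their names (e.g., the a_One and b_One values should be merged into the same column).
During this process, preserve the Inc values from the original row (there may be more than one of these, in different positions).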

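With caveats:

I don't know the column names in advance (many files, many different columns). If they are to be used as the logic for stripping the repeated header rows, they need to be parsed.
There may or may not be more than one column with attributes like Inc that need to be preserved as everything gets stacked. In general, Inc stands for any column that does not have a prefix like a_ or b_. I already have a regular expression that strips these prefixes.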

So far, I have done this:

> wip_path <- 'C:/path/to/horrible.csv' 
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE) 
> rawwip 
    V1 V2 V3  V4 V5 V6  V7 
1 Inc a_One a_Two a_Three b_One b_Two b_Three 
2 1  1 1.5 5 Things  2 2.5 10 Things 
3 2  5 5.5 10 Things  6 6.5 20 Things 
4 Inc a_One a_Two a_Three b_One b_Two b_Three 
5 3  9 9.5 15 Things 10 10.5 30 Things 

> skips <- which(rawwip$V1==rawwip[1,1]) 
> skips 
[1] 1 4 

> filwip <- rawwip[-skips,] 
> filwip 
    V1 V2 V3  V4 V5 V6  V7 
2 1 1 1.5 5 Things 2 2.5 10 Things 
3 2 5 5.5 10 Things 6 6.5 20 Things 
5 3 9 9.5 15 Things 10 10.5 30 Things 

> rawwip[1,] 
    V1 V2 V3  V4 V5 V6  V7 
1 Inc a_One a_Two a_Three b_One b_Two b_Three

But then when I try to apply tolower() to these strings, I get:

> tolower(rawwip[1,]) 
[1] "4" "4" "4" "4" "4" "4" "4"

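This is very unexpected.

So my questions are:

1) How do I access the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulation functions?

2) Once I have done that, what is the most efficient way to stack the columns that share names while preserving the inc value for each row?

Keep in mind that there will be well over a thousand repeated columns, which can be filtered down to perhaps 20 shared column names. I won't know the position of each stackable column in advance. This needs to be determined within the script.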

Source

2015-05-12 Shawn

You can use the base reshape() function. For example, with the input

dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three 
1,1,1.5,"5 Things",2,2.5,"10 Things" 
2,5,5.5,"10 Things",6,6.5,"20 Things" 
inc,a_one,a_two,a_three,b_one,b_two,b_three 
3,9,9.5,"15 Things",10,10.5,"30 Things"')

you can do

dx <- reshape(subset(dd, Inc!="inc"), 
    varying=Map(function(x) paste(c("a","b"), x, sep="_"), c("One","Two","Three")), 
    v.names=c("One","Two","Three"), 
    idvar="Inc",  
    timevar="label", 
    times = c("a","b"), 
    direction="long") 
dx

to get

Inc label One Two  Three 
1.a 1  a 1 1.5 5 Things 
2.a 2  a 5 5.5 10 Things 
3.a 3  a 9 9.5 15 Things 
1.b 1  b 2 2.5 10 Things 
2.b 2  b 6 6.5 20 Things 
3.b 3  b 10 10.5 30 Things
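The call above hard-codes the prefixes ("a", "b") and the shared names ("One", "Two", "Three"), whereas the question says these are not known in advance. The following is a minimal sketch of deriving them from the header instead. It is not part of the original answer, and it assumes that every prefix is separated from the shared name by the first underscore, that every prefix carries every shared name, that repeated header lines are exact copies of the first line, and that the unprefixed columns together identify each row uniquely. The names csv_text, labels, stubs and id_cols are illustrative.

```r
# Sample data from the question, held in a string for a self-contained example.
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'

lines <- strsplit(csv_text, "\n", fixed = TRUE)[[1]]

# Keep the first header line; drop every later line identical to it.
keep <- c(TRUE, lines[-1] != lines[1])
dd <- read.csv(text = lines[keep], stringsAsFactors = FALSE)
names(dd) <- tolower(names(dd))

# Columns with a "<prefix>_" part get stacked; the rest are id columns.
prefixed <- grepl("^[^_]+_", names(dd))
id_cols  <- names(dd)[!prefixed]
labels   <- unique(sub("_.*$", "", names(dd)[prefixed]))     # "a", "b"
stubs    <- unique(sub("^[^_]+_", "", names(dd)[prefixed]))  # "one", "two", "three"

# One list element per shared name, holding that name's column under each prefix.
long <- reshape(dd, direction = "long",
                varying = lapply(stubs, function(s) paste(labels, s, sep = "_")),
                v.names = stubs,
                timevar = "label",
                times   = labels,
                idvar   = id_cols)
rownames(long) <- NULL
long
```

Because the repeated header lines are removed before parsing, read.csv can infer numeric types directly and no later type conversion is needed in this variant.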

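Because the input data is messy (embedded headers), this creates every column as a factor. You can try converting to the proper data types with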

dx[]<-lapply(lapply(dx, as.character), type.convert)
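The same factor issue likely explains the "4" values in the question. rawwip[1,] is a one-row data frame, that is, a list of length-one factors, and tolower() coerces a list with as.character(), which yields each factor's internal integer code rather than its label. The sketch below is an illustration rather than part of the original answer. It assumes the columns were read as factors (read.csv's default before R 4.0.0, forced here with stringsAsFactors = TRUE) and converts each column to character before lowercasing. The names csv_text and header are illustrative.

```r
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'

# Reproduce the question's setup: no header row, every column a factor.
rawwip <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = TRUE)

# Convert each one-element factor to its label first, then lowercase.
header <- tolower(vapply(rawwip[1, ], as.character, character(1)))
unname(header)
```

Reading the first line with readLines() or scan(), as the other answer does, avoids the factor conversion altogether.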

Source

2015-05-12 20:17:27 MrFlick

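Perhaps my biggest problem with learning R is not being able to find comprehensive documentation. For example, in my current resources I can't find documentation on the Map() used above. Could you point me to a documentation source for that function? Google has not been helpful. – Shawn

In R, just type ?Map to bring up the documentation for that function. – MrFlick

Great. That helps when I know what I'm looking for. Where would someone go if they don't yet know what they're looking for? =) I have been using the index [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/), but it is missing some things I have found elsewhere. – Shawn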

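I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.

Here is the approach. I am assuming that your data is stored in a file named "somefile.txt" in your working directory.

The packages we need: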

library(splitstackshape) # for merged.stack 
library(SOfun)   # for read.mtable

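First, grab a vector of the names. While we are at it, we change the name structure from "a_one" to "one_a", which is a more convenient format for both merged.stack and reshape.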

theNames <- gsub("(.*)_(.*)", "\\2_\\1", 
       tolower(scan(what = "", sep = ",", 
           text = readLines("somefile.txt", n = 1))))
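To see what the gsub() call does without needing the file, here is a small sketch that applies the same pattern to the header line from the question. The variable names header_line, fields and swapped are illustrative. Note that because the first (.*) is greedy, a name containing several underscores would be split at its last underscore.

```r
header_line <- "Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three"

# Split the header into fields and lowercase them.
fields <- tolower(strsplit(header_line, ",", fixed = TRUE)[[1]])

# Swap the two parts around the underscore: "a_one" becomes "one_a".
# Names without an underscore (such as "inc") do not match and stay unchanged.
swapped <- gsub("(.*)_(.*)", "\\2_\\1", fields)
swapped
```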

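Second, use read.mtable to read in the data. We create the data chunks by identifying all the lines that start with a letter. If your actual data does not match that, you can use a more specific regular expression.

This creates a list of data.frames, so we use do.call(rbind, ...) to put it into a single data.frame: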

theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",") 

theData <- setNames(do.call(rbind, theData), theNames)
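If installing a GitHub-only package is not an option, the chunking step can be approximated in base R. The sketch below is not read.mtable itself, only an illustration of the same idea: treat every line that starts with a letter as a header, read the data lines between headers as separate chunks, and bind the chunks together. It assumes that no data line starts with a letter and that all chunks have the same number of columns. The names csv_text, is_header, chunk_id and stacked are illustrative.

```r
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'

lines <- strsplit(csv_text, "\n", fixed = TRUE)[[1]]

# Header lines start with a letter; each one opens a new chunk.
is_header <- grepl("^[A-Za-z]", lines)
chunk_id  <- cumsum(is_header)

# Read the data lines of each chunk on their own, without a header row.
chunks <- lapply(
  split(lines[!is_header], chunk_id[!is_header]),
  function(chunk) read.csv(text = chunk, header = FALSE, stringsAsFactors = FALSE)
)

# Bind the chunks into a single data.frame with default names V1, V2, ...
stacked <- do.call(rbind, chunks)
rownames(stacked) <- NULL
stacked
```

The result can then be given the swapped names with setNames(), as in the line above.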

This is what the data look like now:

theData 
#            inc one_a two_a three_a one_b two_b three_b 
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1 1  1 1.5 5 Things  2 2.5 10 Things 
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2 2  5 5.5 10 Things  6 6.5 20 Things 
# inc,a_one,a_two,a_three,b_one,b_two,b_three  3  9 9.5 15 Things 10 10.5 30 Things

From here, you can use merged.stack from "splitstackshape"...

merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_") 
# inc .time_1 one two  three 
# 1: 1  a 1 1.5 5 Things 
# 2: 1  b 2 2.5 10 Things 
# 3: 2  a 5 5.5 10 Things 
# 4: 2  b 6 6.5 20 Things 
# 5: 3  a 9 9.5 15 Things 
# 6: 3  b 10 10.5 30 Things

...or reshape from base R:

reshape(theData, direction = "long", idvar = "inc", 
     varying = 2:ncol(theData), sep = "_") 
#  inc time one two  three 
# 1.a 1 a 1 1.5 5 Things 
# 2.a 2 a 5 5.5 10 Things 
# 3.a 3 a 9 9.5 15 Things 
# 1.b 1 b 2 2.5 10 Things 
# 2.b 2 b 6 6.5 20 Things 
# 3.b 3 b 10 10.5 30 Things

Source

2015-05-13 02:17:56 A5C1D2H2I1M1N2O1R2T1
