Here is an answer using data.table.
library(data.table)
dat <- fread("Origcol
PMID
LID
STAT
MH
RN
OT
PST
LID
STAT
MH
PMID
OT
PST
LID
DEP
RN
PMID
PST")
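# Remember each row's original position
dat[, old_order := 1:.N]
# Every record ends with a "PST" row; the leading 0 marks the start of the first record
pst_index <- c(0, which(dat$Origcol == "PST"))
# Give each row a group id: record x spans pst_index[x+1] - pst_index[x] rows
dat[, grp := unlist(lapply(1:(length(pst_index)-1),
                           function(x) rep(x,
                                           times = (pst_index[x+1] - pst_index[x]))))]
# Encode the desired within-record order as factor levels
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT",
                                            "MH", "RN", "OT",
                                            "DEP", "PST"))]
# Sort by record, then by the factor order
dat[order(grp, Origcol)]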
Result:
    Origcol old_order grp
 1:    PMID         1   1
 2:     LID         2   1
 3:    STAT         3   1
 4:      MH         4   1
 5:      RN         5   1
 6:      OT         6   1
 7:     PST         7   1
 8:    PMID        11   2
 9:     LID         8   2
10:    STAT         9   2
11:      MH        10   2
12:      OT        12   2
13:     PST        13   2
14:    PMID        17   3
15:     LID        14   3
16:      RN        16   3
17:     DEP        15   3
18:     PST        18   3
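As a side note, the group index built with `lapply` above can probably be computed without a loop. The sketch below uses a small made-up vector (`origcol` and its values are only for illustration) and checks that two vectorized variants give the same ids as the loop. Like the original code, the `rep`/`diff` variant assumes the last row is a "PST" row.

```r
# Toy column with the same layout as the example: each record ends with "PST"
origcol <- c("PMID", "LID", "PST", "LID", "PMID", "PST", "RN", "PST")

# Loop-based version, as in the answer
pst_index <- c(0, which(origcol == "PST"))
grp_loop <- unlist(lapply(1:(length(pst_index) - 1),
                          function(x) rep(x, times = pst_index[x + 1] - pst_index[x])))

# Variant 1: repeat each group id by the length of its record
grp_rep <- rep(seq_along(diff(pst_index)), times = diff(pst_index))

# Variant 2: count the "PST" rows seen strictly before each row, then add 1
is_pst <- origcol == "PST"
grp_cumsum <- cumsum(c(FALSE, head(is_pst, -1))) + 1L

# All three give the same group ids on this toy input
stopifnot(identical(grp_loop, grp_rep), identical(grp_loop, grp_cumsum))
```

Inside the data.table call this would read something like `dat[, grp := cumsum(c(FALSE, head(Origcol == "PST", -1))) + 1L]`. The `cumsum` variant should also cope with a trailing record that has no closing "PST" row.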
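The advantage of this approach is that data.table does many of its operations by reference, so it stays fast as you scale up. You said you have 14 million rows, so let's try it. First, generate some synthetic data of that size: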
# One record in the canonical order
dat_big <- data.table(Origcol = c("PMID", "LID", "STAT", "MH", "RN", "OT", "PST"))
# 10,000 shuffled records, each still ending in "PST"
dat_big_add <- rbindlist(lapply(1:10000,
                                function(x) data.table(Origcol = c(sample(c("PMID", "LID", "STAT",
                                                                            "MH", "RN", "OT")),
                                                                   "PST"))))
# Stack the canonical record and 20 copies of the shuffled block
dat_big <- rbindlist(c(list(dat_big), rep(list(dat_big_add), 20)))
# Stack 10 copies of that to reach about 14 million rows
dat <- rbindlist(rep(list(dat_big), 10))
We now have:
          Origcol
       1:    PMID
       2:     LID
       3:    STAT
       4:      MH
       5:      RN
      ---
14000066:    STAT
14000067:      MH
14000068:      OT
14000069:    PMID
14000070:     PST
Apply the same code as above:
dat[, old_order := 1:.N]
pst_index <- c(0, which(dat$Origcol == "PST"))
dat[, grp := unlist(lapply(1:(length(pst_index)-1),
                           function(x) rep(x,
                                           times = (pst_index[x+1] - pst_index[x]))))]
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT",
                                            "MH", "RN", "OT",
                                            "DEP", "PST"))]
dat[order(grp, Origcol)]
Now we get:
          Origcol old_order     grp
       1:    PMID         1       1
       2:     LID         2       1
       3:    STAT         3       1
       4:      MH         4       1
       5:      RN         5       1
      ---
14000066:    STAT  14000066 2000010
14000067:      MH  14000067 2000010
14000068:      RN  14000064 2000010
14000069:      OT  14000068 2000010
14000070:     PST  14000070 2000010
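Note that `dat[order(grp, Origcol)]` returns a sorted copy and leaves `dat` itself in its original row order. If you want the table reordered in place, data.table's `setorder()` does that by reference. A minimal sketch on a small stand-in table (the values here are made up for illustration):

```r
library(data.table)

# Small stand-in for the example data (hypothetical values)
dat <- data.table(Origcol = c("LID", "PMID", "PST", "RN", "PMID", "PST"))
dat[, grp := c(1L, 1L, 1L, 2L, 2L, 2L)]
dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT",
                                            "MH", "RN", "OT",
                                            "DEP", "PST"))]

# Returns a new, sorted data.table; dat is not modified by this line
sorted_copy <- dat[order(grp, Origcol)]

# Reorders the rows of dat itself, by reference, with no assignment needed
setorder(dat, grp, Origcol)
```

At 14 million rows, skipping the extra copy may save some memory, though I have not measured the difference here.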
How long does it take?
library(microbenchmark)
microbenchmark(
  "data.table" = {
    dat[, old_order := 1:.N]
    pst_index <- c(0, which(dat$Origcol == "PST"))
    dat[, grp := unlist(lapply(1:(length(pst_index)-1),
                               function(x) rep(x,
                                               times = (pst_index[x+1] - pst_index[x]))))]
    dat[, Origcol := factor(Origcol, levels = c("PMID", "LID", "STAT",
                                                "MH", "RN", "OT",
                                                "DEP", "PST"))]
    dat[order(grp, Origcol)]
  },
  times = 10)
And it takes:
Unit: seconds
       expr      min       lq     mean  median       uq      max neval
 data.table 5.755276 5.813267 6.059665 5.87151 6.034506 7.310169    10
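That is under 10 seconds for 14 million rows (about 6 seconds on average in this run). Generating the test data took much longer.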
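Since generating the test data was the slow part, here is one possible way to speed that step up: build the shuffled records as plain character vectors and only wrap the result in a data.table at the end, rather than creating one small data.table per record. This is a sketch rather than something I timed; `tags`, `n_records` and `origcol` are names made up for the example, and the size is kept small.

```r
set.seed(1)  # only so the sketch is reproducible

tags <- c("PMID", "LID", "STAT", "MH", "RN", "OT")
n_records <- 1000  # hypothetical size; scale up as needed

# Each column of the matrix is one record: shuffled tags followed by "PST"
records <- vapply(seq_len(n_records),
                  function(i) c(sample(tags), "PST"),
                  character(length(tags) + 1))

# Flatten column by column so the records follow each other in one long vector
origcol <- as.vector(records)

# Wrap it in a data.table once, at the end, e.g.:
# dat <- data.table::data.table(Origcol = origcol)
```

Every seventh element of `origcol` is "PST", so the grouping code above applies unchanged.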
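'test' doesn't look like a 'data.frame': it has no column names or row numbers – HubertL
It is 24 million observations/rows and 1 column – sweetmusicality
I don't know how to add column names and row numbers on stackoverflow (without doing it manually) – sweetmusicality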