How to read specific rows of a CSV file with the fread function

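I have a large CSV file of doubles (10 million rows by 500 columns), and I only want to read a few thousand rows of it, at various positions between 1 and 10 million. The rows are defined by a binary vector V of length 10 million, which takes the value 0 if I don't want to read the row and 1 if I do.

How do I get the I/O function fread from the data.table package to do this? I ask because fread is so fast compared with every other I/O approach.

The best solution to this question, Reading specific rows of large matrix data file, gives the following: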

read.csv(pipe(paste0("sed -n '" , paste0(c(1 , which(V == 1) + 1) , collapse = "p; ") , "p' C:/Data/target.csv" , collapse = "")) , head=TRUE)

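where C:/Data/target.csv is the large CSV file and V is the vector of 0s and 1s.

However, I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even when V equals 1 for only a small fraction of the total number of rows.

So, since fread on the whole matrix beats the solution above, how do I combine fread (and specifically fread) with row sampling?

This is not a duplicate, because it is only about the function fread.

Here is my problem setup: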
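To make the sed approach easier to follow, here is a minimal sketch of how the command string is assembled for a small, hypothetical V (three wanted rows out of 50). It only builds and prints the string; it does not run sed. The integer arithmetic is my own tweak rather than part of the linked solution: pasting doubles can render a line number such as 100000 as 1e+05, which sed would not understand.

```r
# Hypothetical small selection vector: keep data rows 1, 5 and 29 out of 50.
V <- rep(0L, 50)
V[c(1, 5, 29)] <- 1L

# Line 1 of the file is the header, so data row i sits on file line i + 1.
# Integer arithmetic (1L) keeps large line numbers from being pasted in
# scientific notation.
file_lines <- c(1L, which(V == 1L) + 1L)

# Each wanted line number gets a trailing "p" (sed's print command).
cmd <- paste0("sed -n '", paste0(file_lines, collapse = "p; "),
              "p' C:/Data/target.csv")
print(cmd)
# [1] "sed -n '1p; 2p; 6p; 30p' C:/Data/target.csv"
```

With a few thousand wanted rows the script simply grows to a few thousand `Np;` entries, and sed still scans the file line by line.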

library(data.table)
#create csv
csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
#my csv has a header:
colnames(csv) <- LETTERS[1:5]
#save csv
write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
#create vector of 0s and 1s that I want to read the CSV from
read_vec <- rep(0,50)
read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
#the following is the effect that I want, but I want an efficient approach to it:
csv <- read.csv("/home/user/test_csv.csv") #inefficient!
csv <- csv[which(read_vec==1),] #inefficient!
#the alternative approach, too slow when scaled up!
#(fread does not accept a connection such as pipe(), so the shell command is passed via cmd=)
csv <- fread(cmd=paste0("sed -n '" , paste0(c(1 , which(read_vec == 1) + 1) , collapse = "p; ") , "p' /home/user/test_csv.csv" , collapse = "") , header=TRUE)
#the fastest approach yet still not optimal because it needs to read all rows
csv <- data.matrix(fread('/home/user/test_csv.csv'))
csv <- csv[which(read_vec==1),]

Source

2014-02-15 user2763361

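This approach takes a vector v (corresponding to your read_vec), identifies the sequences of consecutive rows to read, feeds those to sequential calls to fread(...), and rbinds the results together.

If the rows you want are randomly scattered throughout the file, this may not be any faster. However, if the rows come in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)), there will be only a few calls to fread(...), and you may see a significant improvement.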
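As a small illustration of what "blocks" means here, the sketch below run-length encodes a hypothetical 12-row selection and turns it into (start, length) pairs. The variable names are mine and the selection is made up for the example.

```r
# Hypothetical selection: rows 1-3, row 6 and rows 9-10 out of 12.
v <- 1:12 %in% c(1:3, 6, 9:10)

runs   <- rle(v)                    # alternating runs of TRUE / FALSE
ends   <- cumsum(runs$lengths)      # last row of each run
starts <- ends - runs$lengths + 1   # first row of each run

# Keep only the runs made of wanted rows.
blocks <- data.frame(start  = starts[runs$values],
                     length = runs$lengths[runs$values])
print(blocks)
# three blocks: start 1, 6, 9 with length 3, 1, 2
```

Six wanted rows collapse into three blocks, so only three reads would be needed; a selection with no adjacent rows would need one read per row.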

# create sample dataset
set.seed(1)
m <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)

# runs: run-length encoding of v (named runs to avoid masking base::seq)
runs <- rle(v)
idx <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=runs$lengths[which(runs$values)])

library(data.table)
# one fread call per block; skip=start skips the header line plus the start-1 rows before the block
result <- do.call(rbind,apply(indx,1, function(x) return(fread("test.csv",nrows=x[2],skip=x[1]))))
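For readers who want to check the block-wise idea end to end, here is a self-contained sketch on a small, hypothetical 100-row file written to a temporary path. It is a variation on the approach rather than the answer's exact code: the file is written without row names, `header = FALSE` and `col.names` are set explicitly so every block gets the same column names, and the pieces are combined with `rbindlist`.

```r
library(data.table)

# Hypothetical sample file: 100 rows, no row-name column.
set.seed(1)
full <- data.table(id = 1:100, x = rnorm(100), y = rnorm(100))
path <- tempfile(fileext = ".csv")
fwrite(full, path)

# Rows to read, converted to (start, length) blocks via run-length encoding.
v      <- 1:100 %in% c(1:10, 42, 60:75)
runs   <- rle(v)
starts <- (cumsum(runs$lengths) - runs$lengths + 1)[runs$values]
lens   <- runs$lengths[runs$values]

# One fread() call per block. skip = start jumps over the header line plus
# the (start - 1) data rows before the block; header = FALSE stops fread
# from treating the first row of a block as column names.
pieces <- Map(function(start, len) {
  fread(path, skip = start, nrows = len, header = FALSE,
        col.names = names(full))
}, starts, lens)

subset_dt <- rbindlist(pieces)

# Compare against reading everything and subsetting in memory.
stopifnot(identical(subset_dt$id, full$id[v]))
```

Each call reopens the file and has to move past `skip` lines before parsing, which is one reason the gain depends on the wanted rows forming a small number of blocks.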

Source

2014-02-15 18:21:50 jlhoward

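This looks promising. Thanks. – user2763361

Nice approach. It took me some time to understand the base R "apply" function, but this was a great learning experience @jlhoward –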
