Quickly sampling a large number of rows from a large DataFrame in Python

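I have a very large DataFrame (about 1.1M rows) and I am trying to sample it.

I have a list of indices (about 70,000 of them) that I want to select from the whole DataFrame.

This is what I have tried so far, but all of these methods take too much time:

Method 1 - using pandas: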

import pandas

sample = pandas.read_csv("data.csv", index_col=0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]

Method 2:

I tried writing all of the sampled rows to another CSV.

f = open("data.csv", 'r')
out = open("sampled_date.csv", 'w')
out.write(f.readline())  # copy the header row

total = 0  # number of data rows read so far
while 1:
    line = f.readline().strip()

    if line == '':
        break
    total += 1
    arr = line.split(",")

    if int(arr[0]) in sample_index_array:
        out.write(line + "\n")  # write the matching row unchanged

f.close()
out.close()

Can anyone suggest a better approach? Or how could I modify this to make it faster?

Thanks

Source

2016-09-24 user324

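If I understand you correctly, you can convert your labels into a pandas Index object, then feed that object into the DataFrame to slice it directly. – pylang

It seems you could benefit from the simple selection methods. We don't have your data, so here is an example of selecting a subset using a pandas Index object and the .iloc selection method.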

import pandas as pd 
import numpy as np 

# Large Sample DataFrame 
df = pd.DataFrame(np.random.randint(0,100,size=(1000000, 4)), columns=list('ABCD')) 
df.info() 

# Results 
<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 1000000 entries, 0 to 999999 
Data columns (total 4 columns): 
A 1000000 non-null int32 
B 1000000 non-null int32 
C 1000000 non-null int32 
D 1000000 non-null int32 
dtypes: int32(4) 
memory usage: 15.3 MB 


# Convert a sample list of indices to an `Index` object 
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776] 
idxs = pd.Index(indices) 
subset = df.iloc[idxs, :] 
subset 

# Output 
        A   B   C   D
1       9  33  62  17
2      44  73  85  11
3      56  83  85  79
10      5  72   3  82
20     72  22  61   2
30     75  15  51  11
67     82  12  18   5
78     95   9  86  81
900    23  51   3   5
2176   30  89  67  26
78776  54  88  56  17

In your case, try this:

df = pd.read_csv("data.csv", index_col = 0).reset_index() 
idx = pd.Index(sample_index_array)    # assuming a list 
sample = df.iloc[idx, :]
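
One caveat worth checking: .iloc selects by integer position, so the snippet above assumes the values in sample_index_array are row positions. If they are instead values of the 'Id' column (as the isin call in the question suggests), a label-based lookup may be closer to what is wanted. Here is a minimal sketch, assuming a small made-up frame in place of data.csv and unique 'Id' values:

```python
import pandas as pd

# Hypothetical stand-in for data.csv: 'Id' values do not match row positions.
df = pd.DataFrame({"Id": [10, 20, 30, 40, 50], "A": [1, 2, 3, 4, 5]})
sample_index_array = [20, 50, 99]  # 99 is deliberately absent from the data

# Make 'Id' the index so rows can be looked up by label.
indexed = df.set_index("Id")

# Keep only the labels that exist; .loc raises KeyError on missing labels.
wanted = indexed.index.intersection(sample_index_array)
sample = indexed.loc[wanted]
print(sample)
```

Whether this beats the isin mask from the question depends on the data, so it is worth timing both on the real file.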

The .iat and .at methods are even faster, but they require scalar indices.
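
As a small illustration of that restriction, .at and .iat each take exactly one row and one column and return a single value, so they suit point lookups rather than pulling out thousands of rows at once. The frame below is made up for the example:

```python
import pandas as pd

# Hypothetical toy frame with string row labels.
df = pd.DataFrame({"A": [5, 6, 7], "B": [8, 9, 10]}, index=["x", "y", "z"])

by_label = df.at["y", "B"]   # label-based scalar lookup
by_position = df.iat[1, 1]   # position-based scalar lookup (row 1, column 1)
print(by_label, by_position)
```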

Source

2016-09-24 15:06:56 pylang

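Thanks! This should work! Out of curiosity, is there a way to slice out these rows while reading the data? – user324

If you are asking about reading in an already filtered subset, you can use the 'skiprows' option of read_csv (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), but I don't think there is a 'use_rows' option. I would post an issue on GitHub to request that feature. – pylang

OK. I'll try skiprows. Thanks! – user324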
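
For reference, one way the skiprows idea from these comments could look is sketched below. It assumes a pandas version in which skiprows accepts a callable, and it uses a small in-memory CSV as a hypothetical stand-in for data.csv. Note that it selects rows by their position in the file, not by their 'Id' value:

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv.
csv_text = "Id,A\n100,1\n101,2\n102,3\n103,4\n104,5\n"

# Zero-based positions of the data rows to keep (the header is not counted).
keep_positions = {1, 3}

def skip(line_number):
    # Line 0 is the header and is always kept; data row k sits on line k + 1.
    return line_number > 0 and (line_number - 1) not in keep_positions

subset = pd.read_csv(io.StringIO(csv_text), skiprows=skip)
print(subset)
```

Keeping the wanted positions in a set makes each membership test cheap, which matters when the callable is invoked once per line of a file with over a million rows.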
