2016-10-10
I'm trying to load a CSV data file:
ACCEPT,[email protected],t,[email protected],0,UK,3600000,3,1475917200000,1475920800000,MON,9,0,0,0 

in the following way:

from numpy import genfromtxt

dataset = genfromtxt('./training_set.csv', delimiter=',', dtype='a20, a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')
print(dataset) 
target = [x[0] for x in dataset] 
train = [x[1:] for x in dataset] 

On the last line above, I get an error:

--------------------------------------------------------------------------- 
IndexError        Traceback (most recent call last) 
<ipython-input-66-5d58edf06039> in <module>() 
     4 print(dataset) 
     5 target = [x[0] for x in dataset] 
----> 6 train = [x[1:] for x in dataset] 
     7 
     8 #rf = RandomForestClassifier(n_estimators=100) 

<ipython-input-66-5d58edf06039> in <listcomp>(.0) 
     4 print(dataset) 
     5 target = [x[0] for x in dataset] 
----> 6 train = [x[1:] for x in dataset] 
     7 
     8 #rf = RandomForestClassifier(n_estimators=100) 

IndexError: invalid index 

How can I deal with this?


'dataset' is a 1d structured array. You access its fields by name, not by column number or slice. – hpaulj

Answers

1

With that dtype you have created a structured array: it is 1d, with a compound dtype.

I have a sample structured array from another question:

In [26]: data 
Out[26]: 
array([(b'1Q11', 252.0, 0.0166), (b'2Q11', 212.4, 0.0122), 
     (b'3Q11', 425.9, 0.0286), (b'4Q11', 522.3, 0.0322), 
     (b'1Q12', 263.2, 0.0185), (b'2Q12', 238.6, 0.0131), 
     ... 
     (b'1Q14', 264.5, 0.0179), (b'2Q14', 211.2, 0.0116)], 
     dtype=[('Qtrs', 'S4'), ('Y', '<f8'), ('X', '<f8')]) 

A single record is:

In [27]: data[0] 
Out[27]: (b'1Q11', 252.0, 0.0166) 

While I can access an element within a record by number, a record does not accept a slice:

In [36]: data[0][1] 
Out[36]: 252.0 
In [37]: data[0][1:] 
.... 
IndexError: invalid index 
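
The behaviour above can be reproduced with a small sketch. The two-row array here is a hypothetical stand-in modelled on the sample shown earlier, and the exact wording of the error message may differ between NumPy versions.

```python
import numpy as np

# Hypothetical two-row stand-in for the sample array shown above.
data = np.array(
    [(b'1Q11', 252.0, 0.0166), (b'2Q11', 212.4, 0.0122)],
    dtype=[('Qtrs', 'S4'), ('Y', '<f8'), ('X', '<f8')],
)

print(data.shape)      # (2,): one dimension, one record per row
record = data[0]
print(type(record))    # a numpy.void scalar, not a tuple or a 1-d array

print(record[1])       # an integer index into a record works

try:
    record[1:]         # a slice of a record is rejected
    slice_failed = False
except IndexError as err:
    slice_failed = True
    print('IndexError:', err)

# Field names work on a single record and on the whole array.
print(record['X'])
print(data['X'])
```

The point is that the "columns" are fields of the dtype rather than a second array dimension, so there is nothing for `[1:]` to slice along.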

The preferred way to access the elements of a structured record is by field name:

In [38]: data[0]['X'] 
Out[38]: 0.0166 

Such a name also lets me access that field across all records:

In [39]: data['X'] 
Out[39]: 
array([ 0.0166, 0.0122, 0.0286, ... 0.0116]) 

Reading several fields requires a list of field names (and is wordier than 2d slicing):

In [42]: data.dtype.names[1:] 
Out[42]: ('Y', 'X') 

In [44]: data[list(data.dtype.names[1:])] 
Out[44]: 
array([(252.0, 0.0166), (212.4, 0.0122),... (211.2, 0.0116)], 
     dtype=[('Y', '<f8'), ('X', '<f8')]) 
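
If the selected fields all share one numeric type, one possible follow-up is to convert the selection into an ordinary 2d array, which does accept slices. This is a sketch only: it assumes a NumPy version that ships `numpy.lib.recfunctions.structured_to_unstructured` (1.16 or later), and the two-row array is a hypothetical stand-in for the sample above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical two-row stand-in for the sample array shown above.
data = np.array(
    [(b'1Q11', 252.0, 0.0166), (b'2Q11', 212.4, 0.0122)],
    dtype=[('Qtrs', 'S4'), ('Y', '<f8'), ('X', '<f8')],
)

numeric_names = list(data.dtype.names[1:])   # ['Y', 'X']
subset = data[numeric_names]                 # still a structured array

# Both selected fields are float64, so they can be packed into a 2-d array.
matrix = rfn.structured_to_unstructured(subset)
print(matrix.shape)    # (2, 2)
print(matrix[:, 1:])   # ordinary 2-d slicing now works
```

This does not help directly with mixed string and integer columns like the ones in the question; there the fields would have to be encoded numerically first.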

===============

With your sample line (replicated 3 times) I can load:

In [53]: dataset=np.genfromtxt(txt,dtype=None,delimiter=',') 
In [54]: dataset 
Out[54]: 
array([ (b'ACCEPT', b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0), 
     (b'ACCEPT', b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0), 
     (b'ACCEPT', b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0)], 
     dtype=[('f0', 'S6'), ('f1', 'S15'), ('f2', 'S1'), ('f3', 'S8'), ('f4', '<i4'), ('f5', 'S2'), ('f6', '<i4'), ('f7', '<i4'), ('f8', '<i8'), ('f9', '<i8'), ('f10', 'S3'), ('f11', '<i4'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')]) 

dtype=None produces something similar to your explicit dtype.

To get the output you want (as arrays rather than lists):

target = dataset['f0'] 
names=dataset.dtype.names[1:] 
train = dataset[list(names)] 
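
Put together as a self-contained sketch, the load and the split might look like this. The `txt` variable is not defined in the session above, so here it is assumed to be an in-memory file; the two e-mail addresses are placeholders because the real ones are obscured in the question; and `encoding='utf-8'` is an added assumption so that string columns come back as `str` rather than bytes.

```python
import io
import numpy as np

# Placeholder addresses: the real ones are obscured in the question.
line = ('ACCEPT,a@example.com,t,b@ex.com,0,UK,3600000,3,'
        '1475917200000,1475920800000,MON,9,0,0,0')
txt = io.StringIO('\n'.join([line] * 3))   # the sample line, three times

# dtype=None lets genfromtxt infer one field per column (f0, f1, ...).
dataset = np.genfromtxt(txt, dtype=None, delimiter=',', encoding='utf-8')

target = dataset['f0']                     # first column, as a plain 1-d array
names = dataset.dtype.names[1:]            # every field name except the first
train = dataset[list(names)]               # structured array of the other 14 fields

print(target)
print(train.dtype.names)
```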

=====================

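You can also refine the dtype to make the task simpler. Define 2 fields, with the second field containing most of the csv columns. genfromtxt handles this kind of dtype nesting, as long as the total number of fields is correct.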

In [106]: dt=[('target','a20'), 
     ('train','a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')] 
In [107]: dataset=np.genfromtxt(txt,dtype=dt,delimiter=',') 
In [108]: dataset 
Out[108]: 
array([ (b'ACCEPT', (b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0)), 
...], 
     dtype=[('target', 'S20'), ('train', [('f0', 'S20'), ('f1', 'S20'), ('f2', 'S8'), ('f3', '<i8'), ('f4', 'S20'), ('f5', '<i8'), ('f6', '<i8'), ('f7', '<i8'), ('f8', '<i8'), ('f9', 'S3'), ('f10', '<i8'), ('f11', '<i8'), ('f12', '<i8'), ('f13', '<i8')])]) 

Now you only need to select the 2 top-level fields:

In [109]: dataset['target'] 
Out[109]: 
array([b'ACCEPT', b'ACCEPT', b'ACCEPT'], 
     dtype='|S20') 

In [110]: dataset['train'] 
Out[110]: 
array([ (b'[email protected]', b't', b'[email protected]', 0, b'UK', 3600000, 3, 1475917200000, 1475920800000, b'MON', 9, 0, 0, 0), 
...], 
     dtype=[('f0', 'S20'), ('f1', 'S20'), ...]) 

I could nest further, grouping the i8 columns into groups of 4:

dt=[('target','a20'), ('train','a20, a20, a8, i8, a20, (4,)i8, a3, (4,)i8')] 
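
A rough sketch of what that grouped dtype gives, under a few assumptions: the input is an in-memory file with placeholder e-mail addresses (the real ones are obscured), and the `a` codes are swapped for `U` (unicode) codes to go with `encoding='utf-8'`. Neither choice comes from the session above.

```python
import io
import numpy as np

# Placeholder addresses: the real ones are obscured in the question.
line = ('ACCEPT,a@example.com,t,b@ex.com,0,UK,3600000,3,'
        '1475917200000,1475920800000,MON,9,0,0,0')
txt = io.StringIO('\n'.join([line] * 3))

# Same layout as the dtype above, with 'U' string codes instead of 'a'.
dt = [('target', 'U20'),
      ('train', 'U20, U20, U8, i8, U20, (4,)i8, U3, (4,)i8')]
dataset = np.genfromtxt(txt, dtype=dt, delimiter=',', encoding='utf-8')

train = dataset['train']
print(train.dtype.names)   # eight nested fields, f0 through f7
print(train['f5'].shape)   # (3, 4): four i8 columns grouped per record
print(train['f5'][0])
print(train['f7'][0])
```

Each grouped field is a regular integer array with one row per record, so `train['f5'][:, 1:]` is an ordinary 2d slice.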
1
In [42]: dataset = np.genfromtxt('./np_inf.txt', delimiter=',', dtype='a20, a20, a20, a8, i8, a20, i8, i8, i8, i8, a3, i8, i8, i8, i8')

In [43]: [x[0] for x in dataset] 
Out[43]: ['ACCEPT', 'ACCEPT', 'ACCEPT'] 

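The problem is that the entries of dataset are of type np.void, which is not very useful. It does not allow slicing, evidently, but you can iterate over it: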

In [56]: type(dataset[0]) 
Out[56]: numpy.void 

In [57]: len(dataset[0]) 
Out[57]: 15 

In [58]: z = [[y for j, y in enumerate(x) if j > 0] for x in dataset] 

In [59]: z[0] 
Out[59]: 
['[email protected]', 
't', 
'[email protected]', 
0, 
'UK', 
3600000, 
3, 
1475917200000, 
1475920800000, 
'MON', 
9, 
0, 
0, 
0] 
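
A shorter route to the same kind of list output, sketched under the assumption that plain Python lists are acceptable: `tolist()` turns each record into an ordinary tuple, and tuples do accept slices. The small array and its values are hypothetical stand-ins for the one loaded above.

```python
import numpy as np

# Hypothetical stand-in for the structured array loaded by genfromtxt.
dataset = np.array(
    [('ACCEPT', 't', 0, 'UK'), ('REJECT', 'f', 1, 'US')],
    dtype='U20, U20, i8, U20',
)

rows = dataset.tolist()                   # list of plain Python tuples
target = [row[0] for row in rows]         # first column
train = [list(row[1:]) for row in rows]   # remaining columns; tuples can be sliced

print(target)
print(train[0])
```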

But you would probably be better off keeping the data as an array with a structured dtype rather than using lists.

Better still, consider using pandas and pd.read_csv.