2016-12-27 88 views

Converting a subset of a numpy structured array to a numpy array without copying

Suppose I have the following numpy structured array:

In [250]: x 
Out[250]: 
array([(22, 2, -1000000000, 2000), (22, 2, 400, 2000), 
     (22, 2, 804846, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000), 
     (55, 5, 1000, 5000), (55, 5, 8900, 5000), (55, 5, 11400, 5000), 
     (33, 3, 14500, 3000), (33, 3, 40550, 3000), (33, 3, 40990, 3000), 
     (33, 3, 44400, 3000)], 
     dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]) 

I want to convert a subset of the above array to a regular numpy array. It is essential for my application that no copies are created (only views).

Fields are retrieved from the above structured array with the following function:

def fields_view(array, fields): 
    return array.getfield(numpy.dtype(
     {name: array.dtype.fields[name] for name in fields} 
    )) 

If I am interested in the fields 'f2' and 'f3', I do the following:

In [251]: y=fields_view(x,['f2','f3']) 
In [252]: y 
Out[252]: 
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0), 
     (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0), 
     (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
     dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12}) 

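There is a way to get an ndarray directly from the 'f2' and 'f3' fields of the original structured array. However, for my application it is necessary to build this intermediate structured array, because the data subset is an attribute of a class.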

I can't convert the intermediate structured array to a regular numpy array without making a copy.

In [253]: y.view(('<f4', len(y.dtype.names))) 
--------------------------------------------------------------------------- 
ValueError        Traceback (most recent call last) 
<ipython-input-54-f8fc3a40fd1b> in <module>() 
----> 1 y.view(('<f4', len(y.dtype.names))) 

ValueError: new type not compatible with array. 

The following function can also be used to convert a record array to an ndarray:

def recarr_to_ndarr(x,typ): 

    fields = x.dtype.names 
    shape = x.shape + (len(fields),) 
    offsets = [x.dtype.fields[name][1] for name in fields] 
    assert not any(np.diff(offsets, n=2)) 
    strides = x.strides + (offsets[1] - offsets[0],) 
    y = np.ndarray(shape=shape, dtype=typ, buffer=x, 
       offset=offsets[0], strides=strides) 
    return y 

However, I get the following error:

In [254]: recarr_to_ndarr(y,'<f4') 
--------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-65-2ebda2a39e9f> in <module>() 
----> 1 recarr_to_ndarr(y,'<f4') 

<ipython-input-62-8a9eea8e7512> in recarr_to_ndarr(x, typ) 
     8  strides = x.strides + (offsets[1] - offsets[0],) 
     9  y = np.ndarray(shape=shape, dtype=typ, buffer=x, 
---> 10    offset=offsets[0], strides=strides) 
    11  return y 
    12 

TypeError: expected a single-segment buffer object 

The function works fine if I create a copy:

In [255]: recarr_to_ndarr(np.array(y),'<f4') 
Out[255]: 
array([[ 2.00000000e+00, -1.00000000e+09], 
     [ 2.00000000e+00, 4.00000000e+02], 
     [ 2.00000000e+00, 8.04846000e+05], 
     [ 2.00000000e+00, 8.00000000e+02], 
     [ 5.00000000e+00, 9.00000000e+02], 
     [ 5.00000000e+00, 1.00000000e+03], 
     [ 5.00000000e+00, 8.90000000e+03], 
     [ 5.00000000e+00, 1.14000000e+04], 
     [ 3.00000000e+00, 1.45000000e+04], 
     [ 3.00000000e+00, 4.05500000e+04], 
     [ 3.00000000e+00, 4.09900000e+04], 
     [ 3.00000000e+00, 4.44000000e+04]], dtype=float32) 

There seems to be no difference between the two arrays:

In [66]: y 
Out[66]: 
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0), 
     (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0), 
     (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
     dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12}) 

In [67]: np.array(y) 
Out[67]: 
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0), 
     (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0), 
     (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
     dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12}) 

Answers


This answer is a bit long-winded. I start with what I know from earlier work on array views, and then try to relate that to your functions.

================

In your case all fields are 4 bytes long, both the floats and the ints. That means I can view the array as all ints or all floats:

In [1431]: x 
Out[1431]: 
array([(22, 2.0, -1000000000.0, 2000), (22, 2.0, 400.0, 2000), 
     (22, 2.0, 804846.0, 2000), (44, 2.0, 800.0, 4000), 
     (55, 5.0, 900.0, 5000), (55, 5.0, 1000.0, 5000), 
     (55, 5.0, 8900.0, 5000), (55, 5.0, 11400.0, 5000), 
     (33, 3.0, 14500.0, 3000), (33, 3.0, 40550.0, 3000), 
     (33, 3.0, 40990.0, 3000), (33, 3.0, 44400.0, 3000)], 
     dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]) 
In [1432]: x.view('i4') 
Out[1432]: 
array([  22, 1073741824, -831624408,  2000,   22, 
     1073741824, 1137180672,  2000,   22, 1073741824, 
     1229225696,  2000,   44, 1073741824, 1145569280, 
     ....  3000]) 
In [1433]: x.view('f4') 
Out[1433]: 
array([ 3.08285662e-44, 2.00000000e+00, -1.00000000e+09, 
     2.80259693e-42, 3.08285662e-44, 2.00000000e+00, 
    .... 4.20389539e-42], dtype=float32) 

This view is 1d. I can reshape it and slice out the 2 float columns:

In [1434]: x.shape 
Out[1434]: (12,) 
In [1435]: x.view('f4').reshape(12,-1) 
Out[1435]: 
array([[ 3.08285662e-44, 2.00000000e+00, -1.00000000e+09, 
      2.80259693e-42], 
     [ 3.08285662e-44, 2.00000000e+00, 4.00000000e+02, 
      2.80259693e-42], 
     ... 
     [ 4.62428493e-44, 3.00000000e+00, 4.44000000e+04, 
      4.20389539e-42]], dtype=float32) 

In [1437]: x.view('f4').reshape(12,-1)[:,1:3] 
Out[1437]: 
array([[ 2.00000000e+00, -1.00000000e+09], 
     [ 2.00000000e+00, 4.00000000e+02], 
     [ 2.00000000e+00, 8.04846000e+05], 
     [ 2.00000000e+00, 8.00000000e+02], 
     ... 
     [ 3.00000000e+00, 4.44000000e+04]], dtype=float32) 

That this is a view can be verified by doing some in-place math and looking at the result in x:

In [1439]: y=x.view('f4').reshape(12,-1)[:,1:3] 
In [1440]: y[:,0] += .5 
In [1441]: y 
Out[1441]: 
array([[ 2.50000000e+00, -1.00000000e+09], 
     [ 2.50000000e+00, 4.00000000e+02], 
     ... 
     [ 3.50000000e+00, 4.44000000e+04]], dtype=float32) 
In [1442]: x 
Out[1442]: 
array([(22, 2.5, -1000000000.0, 2000), (22, 2.5, 400.0, 2000), 
     (22, 2.5, 804846.0, 2000), (44, 2.5, 800.0, 4000), 
     (55, 5.5, 900.0, 5000), (55, 5.5, 1000.0, 5000), 
     (55, 5.5, 8900.0, 5000), (55, 5.5, 11400.0, 5000), 
     (33, 3.5, 14500.0, 3000), (33, 3.5, 40550.0, 3000), 
     (33, 3.5, 40990.0, 3000), (33, 3.5, 44400.0, 3000)], 
     dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]) 

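If the field sizes differed, this might not be possible, for example if the floats were 8 bytes. The key is to picture how the structured data is stored, and to ask whether it can be viewed as a simple dtype with multiple columns. The field selection also has to be equivalent to a basic slice. Working with ['f1','f4'] would be equivalent to advanced indexing with [:,[0,3]], which has to be a copy.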
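To make the basic-slice versus advanced-index distinction concrete, here is a minimal sketch. The small array and the names `flat`, `middle` and `outer` are made up for illustration; `np.shares_memory` is used only to report whether the result still points into the same buffer.

```python
import numpy as np

dt = [('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]
x = np.zeros(4, dtype=dt)
x['f2'] = [2, 2, 5, 3]
x['f3'] = [400, 800, 900, 14500]

# Every field is 4 bytes, so each 16-byte record becomes a row of 4 float32 slots.
flat = x.view('<f4').reshape(len(x), -1)   # shape (4, 4), same buffer as x

middle = flat[:, 1:3]      # basic slice: the f2 and f3 columns
outer = flat[:, [0, 3]]    # advanced index: the f1 and f4 columns

print(np.shares_memory(middle, flat))  # True: still a view
print(np.shares_memory(outer, flat))   # False: advanced indexing copied
```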

==========

'Direct' field indexing is:

z = x[['f2','f3']].view('f4').reshape(12,-1) 
z -= .5 

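This modifies z, but with a FutureWarning. It also does not modify x; z has become a copy. I can see the same thing by looking at z.__array_interface__['data'], the data buffer location, and comparing it with that of x and y.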
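The buffer-address comparison can be sketched as follows. `data_address` is a hypothetical helper, not part of numpy, and the sketch deliberately avoids multi-field indexing itself, since whether `x[['f2','f3']]` copies has differed between NumPy releases. It compares a known view against an explicit copy instead.

```python
import numpy as np

def data_address(a):
    """Address of the first byte that array `a` reads from (hypothetical helper)."""
    return a.__array_interface__['data'][0]

dt = [('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]
x = np.zeros(3, dtype=dt)

y = x.view('<f4').reshape(len(x), -1)[:, 1:3]   # view onto the f2, f3 columns
c = y.copy()                                     # independent buffer

print(data_address(y) - data_address(x))   # 4: y starts at the f2 field of x
print(data_address(c) == data_address(y))  # False: the copy lives elsewhere
```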

=================

fields_view does create a structured view:

In [1480]: w=fields_view(x,['f2','f3']) 
In [1481]: w.__array_interface__['data'] 
Out[1481]: (151950184, False) 
In [1482]: x.__array_interface__['data'] 
Out[1482]: (151950184, False) 

It can be used to modify x, e.g. w['f2'] -= .5. So it is more versatile than the 'direct' x[['f2','f3']].

The dtype of w is

dtype({'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12}) 

Adding print(shape, typ, offsets, strides) to recarr_to_ndarr, I get (Py 3)

In [1499]: recarr_to_ndarr(w,'<f4') 
(12, 2) <f4 [4, 8] (16, 4) 
.... 
ValueError: ndarray is not contiguous 

In [1500]: np.ndarray(shape=(12,2), dtype='<f4', buffer=w.data, offset=4, strides=(16,4)) 
... 
BufferError: memoryview: underlying buffer is not contiguous 

The contiguous problem must refer to the values shown in w.flags:

In [1502]: w.flags 
Out[1502]: 
    C_CONTIGUOUS : False 
    F_CONTIGUOUS : False 
    .... 

It is interesting that w.dtype.descr converts the 'offsets' into an unnamed field:

In [1506]: w.__array_interface__ 
Out[1506]: 
{'data': (151950184, False), 
'descr': [('', '|V4'), ('f2', '<f4'), ('f3', '<f4')], 
'shape': (12,), 
'strides': (16,), 
'typestr': '|V12', 
'version': 3} 

One way or another, w has a non-contiguous data buffer, which can't be used to create a new array. Flattened, the data buffer looks like

xoox|xoox|xoox|... 
# x 4 bytes we want to skip 
# o 4 bytes we want to use 
# | invisible boundary between records in x 

The y that I constructed above has

In [1511]: y.__array_interface__ 
Out[1511]: 
{'data': (151950188, False), 
'descr': [('', '<f4')], 
'shape': (12, 2), 
'strides': (16, 4), 
'typestr': '<f4', 
'version': 3} 

So it accesses the o bytes with a 4 byte offset, (16,4) strides, and a (12,2) shape.
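The offset and strides arithmetic can be checked by hand. The sketch below uses a made-up two-record array and a hypothetical `element` helper; it reads float32 values straight out of a byte copy of the records at position `offset + i*strides[0] + j*strides[1]`. Note that `tobytes()` copies, so this only illustrates the addressing, not a view.

```python
import struct
import numpy as np

dt = [('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]
x = np.array([(22, 2.0, 400.0, 2000), (55, 5.0, 900.0, 5000)], dtype=dt)

raw = x.tobytes()   # 2 records * 16 bytes, laid out as xoox|xoox

def element(i, j, offset=4, strides=(16, 4)):
    """Read float32 element (i, j) from the raw bytes (hypothetical helper)."""
    pos = offset + i * strides[0] + j * strides[1]
    return struct.unpack_from('<f', raw, pos)[0]

rows = [[element(i, j) for j in range(2)] for i in range(len(x))]
print(rows)   # [[2.0, 400.0], [5.0, 900.0]]
```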

If I modify the ndarray call to use the original x.data, it works:

In [1514]: xx=np.ndarray(shape=(12,2), dtype='<f4', buffer=x.data, offset=4, strides=(16,4)) 
In [1515]: xx 
Out[1515]: 
array([[ 2.00000000e+00, -1.00000000e+09], 
     [ 2.00000000e+00, 4.00000000e+02], 
      .... 
     [ 3.00000000e+00, 4.44000000e+04]], dtype=float32) 

with the same array_interface as my y

In [1516]: xx.__array_interface__ 
Out[1516]: 
{'data': (151950188, False), 
'descr': [('', '<f4')], 
'shape': (12, 2), 
'strides': (16, 4), 
'typestr': '<f4', 
'version': 3} 
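The call above can be wrapped so that the offset and strides come from the dtype of the original array rather than being typed in by hand. This is only a sketch: `fields_as_ndarray` is a hypothetical name, and it assumes a contiguous 1-D structured array whose selected fields all have the requested dtype and are evenly spaced.

```python
import numpy as np

def fields_as_ndarray(base, fields, typ):
    """2-D view of same-typed, evenly spaced fields of `base` (hypothetical helper).

    `base` must be a contiguous 1-D structured array, because its own
    buffer (base.data) is reused; no data is copied.
    """
    typ = np.dtype(typ)
    info = [base.dtype.fields[name] for name in fields]   # (dtype, offset) pairs
    offsets = [entry[1] for entry in info]
    if any(entry[0] != typ for entry in info):
        raise ValueError("all selected fields must have dtype %s" % typ)
    steps = set(np.diff(offsets).tolist())
    if len(steps) > 1:
        raise ValueError("field offsets must be evenly spaced")
    step = steps.pop() if steps else typ.itemsize
    return np.ndarray(shape=base.shape + (len(fields),), dtype=typ,
                      buffer=base.data, offset=offsets[0],
                      strides=base.strides + (step,))

dt = [('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]
x = np.zeros(3, dtype=dt)
x['f2'] = [1, 2, 3]
x['f3'] = [10, 20, 30]

v = fields_as_ndarray(x, ['f2', 'f3'], '<f4')
v[:, 0] += 0.5          # in-place change shows up in x['f2']
print(x['f2'])
```

Taking the buffer from the original array rather than from the getfield result is the whole point: the original is contiguous, so the buffer protocol accepts it.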

Thank you very much for this detailed answer! It helped me understand how to solve my problem; see the updated post. – snowleopard


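hpaulj is right that the problem is that the subset of the structured array is not contiguous. Interestingly, I figured out a way to make the array subset contiguous with the following function: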

def view_fields(a, fields): 
     """ 
     `a` must be a numpy structured array. 
     `fields` is the collection of field names to keep. 

     Returns a view of the array `a` (not a copy). 
     """ 
     dt = a.dtype 
     formats = [dt.fields[name][0] for name in fields] 
     offsets = [dt.fields[name][1] for name in fields] 
     itemsize = a.dtype.itemsize 
     newdt = np.dtype(dict(names=fields, 
           formats=formats, 
           offsets=offsets, 
           itemsize=itemsize)) 
     b = a.view(newdt) 
     return b 

In [5]: view_fields(x,['f2','f3']).flags 
Out[5]: 
    C_CONTIGUOUS : True 
    F_CONTIGUOUS : True 
    OWNDATA : False 
    WRITEABLE : True 
    ALIGNED : True 
    UPDATEIFCOPY : False 

The old function:

In [10]: fields_view(x,['f2','f3']).flags 
Out[10]: 
    C_CONTIGUOUS : False 
    F_CONTIGUOUS : False 
    OWNDATA : False 
    WRITEABLE : True 
    ALIGNED : True 
    UPDATEIFCOPY : False
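The difference in the flags comes down to itemsize: the getfield version has an itemsize of 12 while the records are still 16 bytes apart, whereas the view keeps the full 16-byte itemsize and only hides the unwanted fields. A minimal sketch of that idea follows, with a made-up three-record array and the dtype built inline instead of through view_fields:

```python
import numpy as np

dt = [('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')]
x = np.zeros(3, dtype=dt)
x['f2'] = [2, 5, 3]

# Same layout as view_fields builds: keep f2 and f3 at their original
# offsets and keep the full record size, so stride == itemsize.
padded = np.dtype(dict(names=['f2', 'f3'],
                       formats=['<f4', '<f4'],
                       offsets=[4, 8],
                       itemsize=x.dtype.itemsize))
b = x.view(padded)

print(b.dtype.itemsize, b.strides)   # 16 (16,)
print(b.flags['C_CONTIGUOUS'])       # True

b['f2'] += 0.5                       # still a view: x changes too
print(x['f2'])
```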