Cython: storing unicode in a numpy array

I'm new to cython, and I keep running into the same problem when I try to store unicode in a numpy array.

Here is an example of the problem:

import numpy as np 
cimport numpy as np 

cpdef pass_array(np.ndarray[ndim=1,dtype=np.unicode] a): 
    pass 

cpdef access_unicode_item(np.ndarray a): 
    cdef unicode item = a[0]

Example errors:

In [3]: unicode_array = np.array([u"array",u"of",u"unicode"],dtype=np.unicode) 

In [4]: pass_array(unicode_array) 
ValueError: Does not understand character buffer dtype format string ('w') 

In [5]: access_unicode_item(unicode_array) 
TypeError: Expected unicode, got numpy.unicode_

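The problem seems to be that the values are not real unicode objects but numpy.unicode_ values. Is there a way to store the values in the array as proper unicode (so that I can type individual items for use in cython code)?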

Source

2016-02-29 kazimir.r

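If you want to work with Python 'unicode' objects in your Cython code, the simplest approach is to give the Numpy array an 'object' dtype. If you want to keep a fixed-length unicode array, perhaps you could use [PyUnicode_FromUnicode](https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_FromUnicode) where necessary? – 2016-03-01 14:12:12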
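
A minimal sketch of the 'object' dtype suggestion, written in plain Python 3 rather than Cython. That is an assumption on my part: the question uses Python 2, where the built-in type is `unicode` and the NumPy scalar is `np.unicode_`, while on Python 3 the corresponding names are `str` and `np.str_`. With an object array, each element is a reference to an ordinary Python string, so indexing returns the built-in type instead of a NumPy scalar:

```python
import numpy as np

# Fixed-width unicode array: indexing returns a NumPy scalar.
fixed = np.array(["array", "of", "unicode"])
# Object array: each slot references an ordinary Python str.
boxed = np.array(["array", "of", "unicode"], dtype=object)

fixed_item_type = type(fixed[0])
boxed_item_type = type(boxed[0])

print(fixed.dtype, fixed_item_type)
print(boxed.dtype, boxed_item_type)
```

The cost is that the array no longer keeps the characters in one contiguous fixed-width buffer.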

In Py2.7:

In [375]: arr=np.array([u"array",u"of",u"unicode"],dtype=np.unicode) 

In [376]: arr 
Out[376]: 
array([u'array', u'of', u'unicode'], 
     dtype='<U7') 

In [377]: arr.dtype 
Out[377]: dtype('<U7') 

In [378]: type(arr[0]) 
Out[378]: numpy.unicode_ 

In [379]: type(arr[0].item()) 
Out[379]: unicode

In general, x[0] returns an element of x wrapped in a numpy scalar type. In this case, np.unicode_ is a subclass of unicode.

In [384]: isinstance(arr[0],np.unicode_) 
Out[384]: True 

In [385]: isinstance(arr[0],unicode) 
Out[385]: True

I think you would run into the same sort of issue between np.int32 and int. But I haven't worked enough with cython to be sure.
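
As a rough illustration of the scalar-versus-builtin distinction, here is a Python 3 sketch (an assumption: on Python 3, `np.str_` and `str` take the place of `np.unicode_` and `unicode` from the Py2.7 session). It shows that an indexed item is an instance of the built-in string type but not exactly that type, which appears to be what the `Expected unicode, got numpy.unicode_` error is complaining about, and that `.item()` or `.tolist()` converts to built-in objects:

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

scalar = arr[0]           # NumPy scalar (np.str_), a subclass of str
plain = arr[0].item()     # exact built-in str
as_list = arr.tolist()    # list of built-in str objects

ints = np.array([1, 2, 3], dtype=np.int32)
int_scalar = ints[0]          # NumPy scalar (np.int32)
int_plain = ints[0].item()    # exact built-in int

print(type(scalar), type(plain))
print(type(int_scalar), type(int_plain))
```

Converting with `.item()` creates a new Python object per element, so it is better suited to occasional access than to tight loops.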

Where have you seen cython code that specifies a string (unicode or bytes) dtype?

http://docs.cython.org/src/tutorial/numpy.html has expressions like

# We now need to fix a datatype for our arrays. I've used the variable 
# DTYPE for this, which is assigned to the usual NumPy runtime 
# type info object. 
DTYPE = np.int 
# "ctypedef" assigns a corresponding compile-time type to DTYPE_t. For 
# every type in the numpy module there's a corresponding compile-time 
# type with a _t-suffix. 
ctypedef np.int_t DTYPE_t 
.... 
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f):

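The purpose of the [] part is to make indexing more efficient. The tutorial explains:

What we need to do then is to type the contents of the ndarray objects. We do this with a special "buffer" syntax which must be told the datatype (first argument) and number of dimensions ("ndim" keyword-only argument, if not provided then one-dimensional is assumed).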

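I don't think np.unicode will help, because it doesn't specify a character length. A full string dtype has to include the number of characters, e.g. <U7 in my example.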
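
To make the point about character length concrete, here is a small Python 3 sketch (an assumption: modern NumPy, where `np.str_` stands in for `np.unicode`). It compares the dtype NumPy infers for the example array with the dtype built from the bare string type:

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])

# NumPy sizes the dtype from the longest string: 7 characters here.
chars_per_item = arr.dtype.itemsize // 4   # 4 bytes per stored character

# The bare string type carries no length of its own.
generic = np.dtype(np.str_)

print(arr.dtype, chars_per_item)
print(generic.kind, generic.itemsize)
```

Both dtypes have kind 'U', but only the first one fixes an item size, which is what a typed buffer declaration would need.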

We need to find working examples that pass string arrays, either in the cython documentation or in other SO cython questions.

For some operations, you could treat the unicode array as an array of int32.

In [397]: arr.nbytes 
Out[397]: 84

That is 3 strings * 7 characters/string * 4 bytes/character.

In [398]: arr.view(np.int32).reshape(-1,7) 
Out[398]: 
array([[ 97, 114, 114, 97, 121, 0, 0], 
     [111, 102, 0, 0, 0, 0, 0], 
     [117, 110, 105, 99, 111, 100, 101]])
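
The integer view can be sketched in Python 3 as follows. This is an illustration under assumptions, not a tested Cython recipe: it uses `np.uint32` instead of `np.int32` so that code points are unsigned, and it assumes the strings contain no embedded NUL characters, so that zeros only ever mean padding.

```python
import numpy as np

arr = np.array(["array", "of", "unicode"])
width = arr.dtype.itemsize // 4   # characters per item (7 here)

# Reinterpret the same buffer as code points; no data is copied.
codes = arr.view(np.uint32).reshape(len(arr), width)

# Short strings are zero-padded, so counting non-zero entries gives each length.
lengths = (codes != 0).sum(axis=1)

# Going back: drop the padding and rebuild a Python string from the code points.
first = "".join(chr(int(c)) for c in codes[0] if c != 0)

print(codes.shape, lengths.tolist(), first)
```

A two-dimensional integer array like `codes` is the kind of input that the buffer syntax from the tutorial is designed to type.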

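Cython gives you the biggest speed improvement when you can bypass Python functions and methods. That includes bypassing much of the Python string and unicode functionality.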

Source

2016-02-29 21:29:20 hpaulj
