Numpy array conditional matching

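I need to match two very large Numpy arrays (one has 20,000 rows, the other about 100,000 rows), and I am trying to build a script that does this efficiently. Simply looping over the arrays is very slow; can someone suggest a better way? Here is what I am trying to do: the arrays datesSecondDict and pwfs2Dates contain datetime values. I need to take each datetime value from pwfs2Dates (the smaller array) and see whether there is a similar datetime value (plus or minus 5 minutes) in datesSecondDict (there may be more than one). If there is one (or more), I fill a new array (the same size as pwfs2Dates) with the value (one of the values) from valsSecondDict (which is just the array of numerical values corresponding to datesSecondDict). Below is a solution by @unutbu and @joaquin that worked for me (thanks, guys!):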

import time 
import datetime as dt 
import numpy as np 

def combineArs(dict1, dict2): 
    """Combine data from 2 dictionaries into a list. 
    dict1 contains primary data (e.g. seeing parameter). 
    The function compares each timestamp in dict1 to dict2 
    to see if there is a matching timestamp record(s) 
    in dict2 (plus/minus 5 minutes). 
    ==If yes: a list called data gets appended with the 
    corresponding parameter value from dict2. 
    (Note that if there are more than 1 record matching, 
    the first occurring value gets appended to the list).
    ==If no: a list called data gets appended with 0.""" 
    # Specify the keys to use  
    pwfs2Key = 'pwfs2:dc:seeing' 
    dimmKey = 'ws:seeFwhm' 

    # Create an iterator for primary dict 
    datesPrimDictIter = iter(dict1[pwfs2Key]['datetimes']) 

    # Take the first timestamp value in primary dict 
    nextDatePrimDict = next(datesPrimDictIter) 

    # Split the second dictionary into lists 
    datesSecondDict = dict2[dimmKey]['datetime'] 
    valsSecondDict = dict2[dimmKey]['values'] 

    # Define time window 
    fiveMins = dt.timedelta(minutes = 5) 
    data = [] 
    #st = time.time() 
    for i, nextDateSecondDict in enumerate(datesSecondDict):
        try:
            while nextDatePrimDict < nextDateSecondDict - fiveMins:
                # If there is no match: append zero and move on
                data.append(0)
                nextDatePrimDict = next(datesPrimDictIter)
            while nextDatePrimDict < nextDateSecondDict + fiveMins:
                # If there is a match: append the value of second dict
                data.append(valsSecondDict[i])
                nextDatePrimDict = next(datesPrimDictIter)
        except StopIteration:
            break
    data = np.array(data) 
    #st = time.time() - st  
    return data

Thanks, Aina.

Source

2011-12-19 Aina

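Are the array dates sorted?

If yes, you can speed up the comparisons by breaking out of the inner loop once its dates are later than the date given by the outer loop. That way you make a single pass over the data instead of looping over the dimVals items len(pwfs2Vals) times.
If not, you could transform the current pwfs2Dates array into, for example, an array of pairs [(date, array_index), ...]. You can then sort all your arrays by date to make the one-pass comparison described above, while still having the original indexes needed to set data[i].

For example, if the arrays are already sorted (I use lists here; I am not sure you need arrays for this). (Edited: now uses an iterator so that pwfs2Dates is not looped from the beginning at each step.):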

pdates = iter(enumerate(pwfs2Dates)) 
i, datei = pdates.next() 

for datej, valuej in zip(dimmDates, dimVals): 
    while datei < datej - fiveMinutes: 
        i, datei = pdates.next() 
    while datei < datej + fiveMinutes: 
        data[i] = valuej 
        i, datei = pdates.next()

Otherwise, if they are not ordered, you create the sorted, indexed lists like this:

pwfs2Dates = sorted([(date, idx) for idx, date in enumerate(pwfs2Dates)]) 
dimmDates = sorted([(date, idx) for idx, date in enumerate(dimmDates)])

The code would then be:
(Edited: now uses an iterator so that pwfs2Dates is not looped from the beginning at each step.):

pdates = iter(pwfs2Dates) 
datei, i = pdates.next() 

for datej, j in dimmDates: 
    while datei < datej - fiveMinutes: 
        datei, i = pdates.next() 
    while datei < datej + fiveMinutes: 
        data[i] = dimVals[j] 
        datei, i = pdates.next()

Great!

Note that dimVals:
```
dimVals = np.array(dict1[dimmKey]['values']) 
```
is not used in your code and can be removed.
Note also that your code becomes much simpler if you loop over the array itself instead of using xrange.

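Edit: the answer from unutbu addresses some weak points in the code above. I list them here for completeness:

Use of next: next(iterator) is preferred to iterator.next(). iterator.next() is an exception to the conventional naming rule, which was fixed in py3k by renaming the method to iterator.__next__().
Check for the end of the iterator with try/except: once all items in the iterator have been consumed, the next call to next() raises a StopIteration exception. Use try/except to break out of the loop gracefully when that happens. For the specific case of the OP's question this is not an issue, because the two arrays are the same size, so the for loop finishes at the same time as the iterator and no exception is raised. However, there could be cases where dict1 and dict2 are not the same size, and then an exception can be raised. The question is which is better: using try/except, or preparing the arrays before looping by trimming them to the length of the shorter one.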
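One possible middle ground for the try/except question above, sketched under assumptions: the two-argument form next(iterator, default) returns the default instead of raising StopIteration, so the loop can test for a sentinel rather than catch an exception. The function name combine_with_sentinel and the sample timestamps are hypothetical, and padding the tail with zeros is a choice made for this sketch, not something the code above does.

```python
import datetime as dt


def combine_with_sentinel(marks, values, pdates, delta=dt.timedelta(minutes=5)):
    """Hypothetical helper: one value (or 0) per entry of pdates.

    marks and pdates are assumed to be sorted in ascending order.
    """
    it = iter(pdates)
    data = []
    datei = next(it, None)  # None is the sentinel for "iterator exhausted"
    for datej, val in zip(marks, values):
        while datei is not None and datei < datej - delta:
            data.append(0)  # too early for datej: no match
            datei = next(it, None)
        while datei is not None and datei < datej + delta:
            data.append(val)  # inside the window around datej
            datei = next(it, None)
        if datei is None:
            break
    # Timestamps later than every window get 0 (assumption of this sketch).
    while datei is not None:
        data.append(0)
        datei = next(it, None)
    return data


day = dt.datetime(2011, 12, 19)
marks = [day.replace(hour=12, minute=0), day.replace(hour=12, minute=20)]
values = [1, 2]
pdates = [day.replace(hour=12, minute=m) for m in (2, 10, 21, 50)]
result = combine_with_sentinel(marks, values, pdates)
print(result)
```

With the sentinel, the result always has the same length as pdates, whichever of the two inputs runs out first.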

Source

2011-12-19 18:07:49 joaquin

Thanks so much, it worked perfectly! – Aina 2011-12-20 20:48:33

I think you can do this with one fewer loop:

import datetime 
import numpy 

# Test data 

# Create an array of dates spaced at 1 minute intervals 
m = range(1, 21) 
n = datetime.datetime.now() 
a = numpy.array([n + datetime.timedelta(minutes=i) for i in m]) 

# A smaller array with three of those dates 
m = [5, 10, 15] 
b = numpy.array([n + datetime.timedelta(minutes=i) for i in m]) 

# End of test data 

def date_range(date_array, single_date, delta): 
    plus = single_date + datetime.timedelta(minutes=delta) 
    minus = single_date - datetime.timedelta(minutes=delta) 
    return date_array[(date_array < plus) * (date_array > minus)] 

dates = [] 
for i in b: 
    dates.append(date_range(a, i, 5)) 

# concatenate also works when the per-date match arrays differ in length
all_matches = numpy.unique(numpy.concatenate(dates))

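There is surely a better way to collect and merge the matches, but you get the idea... You could also use numpy.argwhere((a < plus) * (a > minus)) to return the indexes instead of the dates, and use those indexes to grab the whole row and put it into your new array.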
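A minimal sketch of the index-based variant just described, assuming a second array of values aligned row by row with the dates. The names index_range and vals, the fixed start time, and the sample values are hypothetical.

```python
import datetime
import numpy

# Dates spaced at 1 minute intervals, starting from a fixed (hypothetical) time
n = datetime.datetime(2011, 12, 19, 12, 0)
a = numpy.array([n + datetime.timedelta(minutes=i) for i in range(1, 21)])
# Hypothetical values aligned with a: vals[k] belongs to a[k]
vals = numpy.arange(1, 21) * 10


def index_range(date_array, single_date, delta):
    """Return the indexes of dates strictly within +/- delta minutes."""
    plus = single_date + datetime.timedelta(minutes=delta)
    minus = single_date - datetime.timedelta(minutes=delta)
    # argwhere returns an (N, 1) array for 1-D input; ravel flattens it
    return numpy.argwhere((date_array < plus) & (date_array > minus)).ravel()


idx = index_range(a, n + datetime.timedelta(minutes=10), 5)
matched_vals = vals[idx]  # use the indexes to pull the matching rows
print(idx, matched_vals)
```

Because the function returns indexes rather than dates, the same idx can be applied to any other array that shares the row order of a.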

Source

2011-12-19 18:49:24 Benjamin

Building on joaquin's idea:

import datetime as dt

def combineArs(dict1, dict2, delta=dt.timedelta(minutes=5)):
    marks = dict1['datetime']
    values = dict1['values']
    pdates = iter(dict2['datetime'])

    data = []
    datei = next(pdates)
    # zip replaces itertools.izip, which exists only in Python 2
    for datej, val in zip(marks, values):
        try:
            while datei < datej - delta:
                # No match: append zero and move on
                data.append(0)
                datei = next(pdates)
            while datei < datej + delta:
                # Match: append the value paired with datej
                data.append(val)
                datei = next(pdates)
        except StopIteration:
            break
    return data

dict1 = { 'ws:seeFwhm': 
      {'datetime': [dt.datetime(2011, 12, 19, 12, 0, 0), 
         dt.datetime(2011, 12, 19, 12, 1, 0), 
         dt.datetime(2011, 12, 19, 12, 20, 0), 
         dt.datetime(2011, 12, 19, 12, 22, 0), 
         dt.datetime(2011, 12, 19, 12, 40, 0), ], 
      'values': [1, 2, 3, 4, 5] } } 
dict2 = { 'pwfs2:dc:seeing': 
      {'datetime': [dt.datetime(2011, 12, 19, 12, 9), 
         dt.datetime(2011, 12, 19, 12, 19), 
         dt.datetime(2011, 12, 19, 12, 29), 
         dt.datetime(2011, 12, 19, 12, 39), 
         ], } } 

if __name__ == '__main__': 
    dimmKey = 'ws:seeFwhm' 
    pwfs2Key = 'pwfs2:dc:seeing'  
    print(combineArs(dict1[dimmKey], dict2[pwfs2Key]))

yields

[0, 3, 0, 5]
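As an aside that is not part of the code above: for sorted inputs, the same window test can probably be vectorized with numpy.searchsorted on datetime64 arrays. This is a sketch under assumptions. The function name match_within_window is hypothetical, both date sequences must be sorted ascending, and the secondary sequence must not be empty. Unlike the loop, it always returns one entry per primary timestamp.

```python
import datetime as dt
import numpy as np


def match_within_window(prim_dates, sec_dates, sec_vals, minutes=5):
    """Hypothetical helper: for each primary date, take the value of the
    first secondary date d with p - window < d <= p + window, else 0."""
    prim = np.asarray(prim_dates, dtype='datetime64[s]')
    sec = np.asarray(sec_dates, dtype='datetime64[s]')  # sorted, non-empty
    vals = np.asarray(sec_vals)
    delta = np.timedelta64(minutes, 'm')
    # Index of the first secondary date strictly later than p - delta
    idx = np.searchsorted(sec, prim - delta, side='right')
    safe = np.minimum(idx, len(sec) - 1)  # keep indexes in bounds
    matched = (idx < len(sec)) & (sec[safe] <= prim + delta)
    return np.where(matched, vals[safe], 0)


day = dt.datetime(2011, 12, 19, 12, 0)
sec_dates = [day + dt.timedelta(minutes=m) for m in (0, 1, 20, 22, 40)]
sec_vals = [1, 2, 3, 4, 5]
prim_dates = [day + dt.timedelta(minutes=m) for m in (9, 19, 29, 39)]
result = match_within_window(prim_dates, sec_dates, sec_vals)
print(result.tolist())
```

The sample data repeats the timestamps from the example above. The binary search costs roughly O(n log m) for n primary and m secondary timestamps, with no Python-level loop.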

Source

2011-12-19 19:09:25 unutbu

+1 for making it actually work – joaquin 2011-12-20 21:03:12
