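Numpy way to sort out a messy array for plotting

I have two arrays of plot data that are stored in an unsorted order, so the plotted line jumps discontinuously from one place to another. I have tried one example of finding the closest point in a 2D array: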

import numpy as np 

def distance(pt_1, pt_2): 
    pt_1 = np.array((pt_1[0], pt_1[1])) 
    pt_2 = np.array((pt_2[0], pt_2[1])) 
    return np.linalg.norm(pt_1-pt_2) 

def closest_node(node, nodes): 
    nodes = np.asarray(nodes) 
    dist_2 = np.sum((nodes - node)**2, axis=1) 
    return np.argmin(dist_2) 

a = [] 
for x in range(50000): 
    a.append((np.random.randint(0,1000),np.random.randint(0,1000))) 
some_pt = (1, 2) 

closest_node(some_pt, a)

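Can I use it somehow to "clean" my data? (In the code above, a can be my data.)

Some example data from my calculations is shown below; the plot I obtained after applying radial_sort_line (from Joe Kington's answer) follows it: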

array([[ 2.08937872e+001, 1.99020033e+001, 2.28260611e+001, 
      6.27711094e+000, 3.30392288e+000, 1.30312878e+001, 
      8.80768833e+000, 1.31238275e+001, 1.57400130e+001, 
      5.00278061e+000, 1.70752624e+001, 1.79131456e+001, 
      1.50746185e+001, 2.50095731e+001, 2.15895974e+001, 
      1.23237801e+001, 1.14860312e+001, 1.44268222e+001, 
      6.37680265e+000, 7.81485403e+000], 
     [ -1.19702178e-001, -1.14050879e-001, -1.29711421e-001, 
      8.32977493e-001, 7.27437322e-001, 8.94389885e-001, 
      8.65931116e-001, -6.08199292e-002, -8.51922900e-002, 
      1.12333841e-001, -9.88131292e-324, 4.94065646e-324, 
     -9.88131292e-324, 4.94065646e-324, 4.94065646e-324, 
      0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 
     -4.94065646e-324, 0.00000000e+000]])

After applying radial_sort_line, I got the following plot:

Source

2016-02-24 Ohm

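Can you post your data, or how you obtain it? –

The data in the example above is 'a' – Ohm

'a' is not what produces the plot you are showing –

This is actually a tougher problem in general than you might think.

In your exact case, you might be able to get away with sorting by the y-values. It is hard to tell for sure from the plot.

For shapes that are roughly circular like this one, a better approach is to do a radial sort.

For example, let's generate some data that is somewhat similar to yours: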

import numpy as np 
import matplotlib.pyplot as plt 

t = np.linspace(.2, 1.6 * np.pi) 
x, y = np.cos(t), np.sin(t) 

# Shuffle the points... 
i = np.arange(t.size) 
np.random.shuffle(i) 
x, y = x[i], y[i] 

fig, ax = plt.subplots() 
ax.plot(x, y, color='lightblue') 
ax.margins(0.05) 
plt.show()

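Okay, now let's try to undo the shuffle with a radial sort. We'll use the centroid of the points as the center, calculate the angle from it to each point, and then sort by that angle: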

x0, y0 = x.mean(), y.mean() 
angle = np.arctan2(y - y0, x - x0) 

idx = angle.argsort() 
x, y = x[idx], y[idx] 

fig, ax = plt.subplots() 
ax.plot(x, y, color='lightblue') 
ax.margins(0.05) 
plt.show()

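Okay, pretty close! If we were working with a closed polygon, we would be done.

However, we have one problem: this closes the wrong gap. We would rather have the sort start at the position of the largest gap in the line.

Therefore, we need to calculate the gap between each pair of adjacent points on our new line and re-order the points so that the line starts just after the largest gap: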

# Append the first point so the wrap-around gap (last -> first) is measured too 
dx = np.diff(np.append(x, x[0])) 
dy = np.diff(np.append(y, y[0])) 
max_gap = np.hypot(dx, dy).argmax() + 1 

x = np.append(x[max_gap:], x[:max_gap]) 
y = np.append(y[max_gap:], y[:max_gap])

Resulting in:

As a complete, stand-alone example:

import numpy as np 
import matplotlib.pyplot as plt 

def main(): 
    x, y = generate_data() 
    plot(x, y).set(title='Original data') 

    x, y = radial_sort_line(x, y) 
    plot(x, y).set(title='Sorted data') 

    plt.show() 

def generate_data(num=50): 
    t = np.linspace(.2, 1.6 * np.pi, num) 
    x, y = np.cos(t), np.sin(t) 

    # Shuffle the points... 
    i = np.arange(t.size) 
    np.random.shuffle(i) 
    x, y = x[i], y[i] 

    return x, y 

def radial_sort_line(x, y): 
    """Sort unordered verts of an unclosed line by angle from their center.""" 
    # Radial sort 
    x0, y0 = x.mean(), y.mean() 
    angle = np.arctan2(y - y0, x - x0) 

    idx = angle.argsort() 
    x, y = x[idx], y[idx] 

    # Split at opening in line 
    # Append the first point so the wrap-around gap (last -> first) is measured too 
    dx = np.diff(np.append(x, x[0])) 
    dy = np.diff(np.append(y, y[0])) 
    max_gap = np.hypot(dx, dy).argmax() + 1 

    x = np.append(x[max_gap:], x[:max_gap]) 
    y = np.append(y[max_gap:], y[:max_gap]) 
    return x, y 

def plot(x, y): 
    fig, ax = plt.subplots() 
    ax.plot(x, y, color='lightblue') 
    ax.margins(0.05) 
    return ax 

main()
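
The approach can also be checked without plotting. The following is only a minimal sketch under stated assumptions: the helper name radial_sort_open_line is hypothetical (it is not part of the code above), it uses np.roll to measure the wrap-around gap, and it assumes the points lie on a single, roughly circular, open line.

```python
import numpy as np

def radial_sort_open_line(x, y):
    """Order the points of an open, roughly circular line (hypothetical helper)."""
    # Sort by angle around the centroid.
    angle = np.arctan2(y - y.mean(), x - x.mean())
    idx = angle.argsort()
    x, y = x[idx], y[idx]
    # Distance from each point to the next one, including last -> first.
    gaps = np.hypot(np.roll(x, -1) - x, np.roll(y, -1) - y)
    # Start the line just after the largest gap.
    start = (gaps.argmax() + 1) % x.size
    return np.roll(x, -start), np.roll(y, -start)

# Same kind of test data as generate_data(), shuffled with a fixed seed.
rng = np.random.default_rng(0)
t = np.linspace(0.2, 1.6 * np.pi, 50)
perm = rng.permutation(t.size)
x, y = radial_sort_open_line(np.cos(t)[perm], np.sin(t)[perm])

# Largest distance between consecutive points after sorting.
steps = np.hypot(np.diff(x), np.diff(y))
print(steps.max())
```

If the sort worked, the largest step between consecutive points is about the spacing of the original arc, with no long jump across the opening.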

Source

2016-02-24 16:54:02

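This works well, but in some cases, when I plot the data, it fills in regions where there are multiple "y" values in some direction. – Ohm

If we assume that the data are 2D and that the x axis should be in increasing order, then you can:

sort the x-axis data, e.g. x_old, and store the result in a different variable, e.g. x_new
for each element in x_new, find its index in the x_old array
re-order the elements of the y-axis array according to the indices you got in the previous step

I would do this with Python lists instead of numpy arrays, because the list.index method is easier to work with than numpy.where.

E.g. (assuming that x_old and y_old are your existing numpy variables for the x and y axes, respectively)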

import numpy as np 

x_new_tmp = x_old.tolist() 
y_new_tmp = y_old.tolist() 

x_new = sorted(x_new_tmp) 

y_new = [y_new_tmp[x_new_tmp.index(i)] for i in x_new]

Then you can plot x_new and y_new.

Source
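
The same three steps can be sketched with numpy alone by using argsort. This is an illustrative sketch, not the code above: the small arrays are made-up stand-ins for x_old and y_old, and kind='stable' is an added choice so that points with equal x keep their original relative order.

```python
import numpy as np

# Made-up stand-ins for x_old and y_old (note the duplicated x value 1.0).
x_old = np.array([3.0, 1.0, 2.0, 1.0])
y_old = np.array([30.0, 10.0, 20.0, 11.0])

# argsort returns the indices that would sort x_old; applying the same
# indices to y_old keeps every (x, y) pair together.
order = np.argsort(x_old, kind='stable')
x_new, y_new = x_old[order], y_old[order]

print(x_new)
print(y_new)
```

One difference from the list-based version: list.index always returns the first match, so with duplicated x values it would pick the same y twice, whereas the index array from argsort keeps each pair distinct. Sorting by y instead only requires computing the order from y_old.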

2016-02-24 16:31:23 Xxxo

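For the image shown in the question, sorting by x value won't work here. It might work with the y values – tom

As tom pointed out, this won't work for x in this case, but it might for y. In any case, if you have numpy arrays, don't use lists for this. Use 'x, y = x[y.argsort()], y[y.argsort()]' instead. –

Sorting the data by angle relative to their center, as in @JoeKington's solution, might have problems with some parts of the data: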

In [1]: 

import scipy.spatial as ss 
import matplotlib.pyplot as plt 
import numpy as np 
import re 
%matplotlib inline 
In [2]: 

data=np.array([[ 2.08937872e+001, 1.99020033e+001, 2.28260611e+001, 
        6.27711094e+000, 3.30392288e+000, 1.30312878e+001, 
        8.80768833e+000, 1.31238275e+001, 1.57400130e+001, 
        5.00278061e+000, 1.70752624e+001, 1.79131456e+001, 
        1.50746185e+001, 2.50095731e+001, 2.15895974e+001, 
        1.23237801e+001, 1.14860312e+001, 1.44268222e+001, 
        6.37680265e+000, 7.81485403e+000], 
       [ -1.19702178e-001, -1.14050879e-001, -1.29711421e-001, 
        8.32977493e-001, 7.27437322e-001, 8.94389885e-001, 
        8.65931116e-001, -6.08199292e-002, -8.51922900e-002, 
        1.12333841e-001, -9.88131292e-324, 4.94065646e-324, 
       -9.88131292e-324, 4.94065646e-324, 4.94065646e-324, 
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000, 
       -4.94065646e-324, 0.00000000e+000]]) 
In [3]: 

plt.plot(data[0], data[1]) 
plt.title('Unsorted Data') 
Out[3]: 
<matplotlib.text.Text at 0x10a5c0550>

Note that the points with x values between 15 and 20 are not sorted correctly.

In [10]: 

#Calculate the angle in degrees in [0, 360) 
sort_index = np.angle(np.dot((data.T-data.mean(1)), np.array([1.0, 1.0j])), deg=True) 
sort_index = np.where(sort_index>=0, sort_index, sort_index+360) 

#sorted the data by angle and plot them 
sort_index = sort_index.argsort() 
plt.plot(data[0][sort_index], data[1][sort_index]) 
plt.title('Data Sorted by angle relatively to the centroid') 

plt.plot(data[0], data[1], 'r+') 
Out[10]: 
[<matplotlib.lines.Line2D at 0x10b009e10>]

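We can sort the data with a nearest-neighbour approach, but because x and y are on very different scales, the choice of distance metric becomes an important issue. We will simply try all the distance metrics available in scipy to get an idea: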

In [7]: 

def sort_dots(metrics, ax, start): 
    dist_m = ss.distance.squareform(ss.distance.pdist(data.T, metrics)) 

    total_points = data.shape[1] 
    points_index = set(range(total_points)) 
    sorted_index = [] 
    target = start 
    ax.plot(data[0, target], data[1, target], 'o', markersize=16) 

    points_index.discard(target) 
    while len(points_index)>0: 
        candidate = list(points_index) 
        nneigbour = candidate[dist_m[target, candidate].argmin()] 
        points_index.discard(nneigbour) 
        points_index.discard(target) 
        #print points_index, target, nneigbour 
        sorted_index.append(target) 
        target = nneigbour 
    sorted_index.append(target) 

    ax.plot(data[0][sorted_index], data[1][sorted_index]) 
    ax.set_title(metrics) 
In [6]: 

dmetrics = re.findall(r"pdist\(X,\s+'(.*)'", ss.distance.pdist.__doc__) 
In [8]: 

f, axes = plt.subplots(4, 6, figsize=(16,10), sharex=True, sharey=True) 
axes = axes.ravel() 
for metrics, ax in zip(dmetrics, axes): 
    try: 
        sort_dots(metrics, ax, 5) 
    except: 
        ax.set_title(metrics + '(unsuitable)')

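It looks like the standardized Euclidean and Mahalanobis metrics give the best results. Note that we chose the 6th data point (index 5) as the starting point; it is the data point with the largest y value (use argmax to get the index, of course).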
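
As a small illustration of the argmax remark, the starting indices can be looked up directly. This sketch repeats the example data from the question, with the exponents written in plain decimal form; the names start_y and start_x are made up for the illustration.

```python
import numpy as np

# Example data from the question: row 0 holds x, row 1 holds y.
data = np.array([
    [20.8937872, 19.9020033, 22.8260611, 6.27711094, 3.30392288,
     13.0312878, 8.80768833, 13.1238275, 15.7400130, 5.00278061,
     17.0752624, 17.9131456, 15.0746185, 25.0095731, 21.5895974,
     12.3237801, 11.4860312, 14.4268222, 6.37680265, 7.81485403],
    [-0.119702178, -0.114050879, -0.129711421, 0.832977493, 0.727437322,
     0.894389885, 0.865931116, -0.0608199292, -0.0851922900, 0.112333841,
     -9.88131292e-324, 4.94065646e-324, -9.88131292e-324, 4.94065646e-324,
     4.94065646e-324, 0.0, 0.0, 0.0, -4.94065646e-324, 0.0],
])

start_y = data[1].argmax()  # index of the point with the largest y value
start_x = data[0].argmax()  # index of the point with the largest x value
print(start_y, start_x)
```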

In [9]: 

f, axes = plt.subplots(4, 6, figsize=(16,10), sharex=True, sharey=True) 
axes = axes.ravel() 
for metrics, ax in zip(dmetrics, axes): 
    try: 
        sort_dots(metrics, ax, 13) 
    except: 
        ax.set_title(metrics + '(unsuitable)')

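This is what happens if you choose the point with the largest x value (index 13) as the starting point. It looks like the Mahalanobis metric is better than the standardized Euclidean one, as it is not affected by the starting point we choose.

Source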
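
To make the scale issue concrete, here is a hedged sketch of what the 'seuclidean' metric does with its default settings: it behaves like plain Euclidean distance computed after each coordinate has been divided by its sample standard deviation. The five points are made up for the illustration and only mimic the different x and y ranges of the real data.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Made-up points whose x range is much larger than their y range.
pts = np.array([[5.0, 0.11],
                [6.3, 0.83],
                [13.0, 0.89],
                [20.9, -0.12],
                [25.0, 0.0]])

# Standardized Euclidean distance between every pair of points.
d_seuclid = pdist(pts, 'seuclidean')

# Dividing each column by its sample standard deviation (ddof=1) and then
# using plain Euclidean distance gives the same numbers.
scaled = pts / pts.std(axis=0, ddof=1)
d_scaled = pdist(scaled, 'euclidean')

print(np.allclose(d_seuclid, d_scaled))
```

With the unscaled 'euclidean' metric the x differences would dominate every distance, which is why the choice of metric matters for the nearest-neighbour ordering.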

2016-03-04 22:29:08

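Nice idea. Can it also be used to split the data into different curves? The real data should contain a semicircular curve, open to the right, together with a flat horizontal line. – Ohm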
