2012-06-07 65 views
2

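Weighted average of similar elements in a vector in R

I have two vectors, x and w. w is a numeric vector of weights, the same length as x, giving the weights to use for the elements of x.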

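I want to take the weighted average of those elements of x that differ from each other only slightly (by, say, 1e-1 or 1e-2), in order to reduce the length of x. For example, the vectors look like this: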

w = c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04, 
     2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04, 
     4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03, 
     1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02, 
     1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03, 
     3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01, 
     3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04, 
     7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04, 
     8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04, 
     5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07, 
     3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07, 
     2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07, 
     2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 
     2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03) 

x = c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900, 
      0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500, 
      1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027, 
      2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953, 
      2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953, 
      3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358, 
      3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813, 
      0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596, 
      2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608, 
      3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866) 

I know how to sort the weight vector according to x, but how can I identify the similar values in x and then take their weighted average?

+0

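It isn't clear what you are asking, particularly given that values of e-01 and e-02 are not small; the e-07 values are small. But: if you just want the weighted average, go ahead and multiply x and w, then take mean() of the result.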

+1

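My interpretation is that the goal is to find "similar values in vector x"... if those values are combined, that reduces the length of x. Intuitively, this could mean that two similar elements x1 and x2 with weights w1 and w2 are replaced by a compromise value x with weight w1 + w2, such that the overall weighted average stays the same...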

+0

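@TimP: Thanks for your comment. Your interpretation is correct when there are two similar values, but in practice I may have more than two (for example, three or four similar values near 0.4, and three or four similar values near 1.7). I want to find these similar values and get the weighted average of each group, with weight w1 + w2 + w3 + ... The code you wrote finds two similar values. What should I do when there are more than two?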

Answers

2

REVISED ANSWER

How about something like this...? (See the code below.)

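I have called your original vectors origx and origw, so that the reordered versions are x and w. The code works destructively on temporary copies of x and w (called xtemp and wtemp) and builds the new x and w (i.e. the "shorter" vectors you are after) in the variables xnew and wnew.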

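In short, the code looks at xtemp, finds the first gap that exceeds a threshold size (e.g. 0.05), and groups together all the elements from the start of xtemp up to that "large" gap. (If there is no such gap, it takes the whole of xtemp as one group.) The code then converts the group into a single weight called wgroup (the sum of the group's weights) and a single representative x value called xgroup (chosen so that xgroup * wgroup equals the weighted sum of all the group's elements). We then append xgroup and wgroup to the vectors xnew and wnew, eliminate the current group (by removing it from xtemp and wtemp), and carry on in the same way until everything has been grouped.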
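To make the collapsing step concrete, here is a minimal sketch of it in isolation. The numbers and the names xg and wg are made up for illustration and are not taken from the question's data.

```r
# Toy group of three nearby values and their weights (hypothetical numbers)
xg <- c(0.40, 0.42, 0.44)
wg <- c(2, 1, 1)

wgroup <- sum(wg)                 # combined weight of the group
xgroup <- sum(xg * wg) / wgroup   # weighted mean of the group

# xgroup * wgroup reproduces the group's weighted sum
print(all.equal(xgroup * wgroup, sum(xg * wg)))

# Base R's weighted.mean() gives the same representative value
print(all.equal(xgroup, weighted.mean(xg, wg)))
```

Because each group keeps its total weight and its weighted sum, the weighted average of the whole data set should be the same before and after merging.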

Give it a test run and see what you think :)

origw = c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04, 
      2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04, 
      4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03, 
      1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02, 
      1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03, 
      3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01, 
      3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04, 
      7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04, 
      8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04, 
      5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07, 
      3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07, 
      2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07, 
      2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 
      2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03) 

origx = c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900, 
      0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500, 
      1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027, 
      2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953, 
      2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953, 
      3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358, 
      3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813, 
      0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596, 
      2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608, 
      3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866) 

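# Sort x into ascending order and carry the weights along
reord = order(origx) 
x = origx[reord] 
w = origw[reord] 

# Accumulators for the merged ("shorter") vectors
xnew = wnew = c() 

thresh = 0.05   # a gap larger than this separates one group from the next
xtemp = x 
wtemp = w 
while (length(xtemp) > 0) { 
    # Index of the first gap wider than thresh (NA if there is none)
    nextgap = which(diff(xtemp) > thresh)[1] 
    if (!is.na(nextgap)) { 
        group = seq_len(nextgap)      # everything up to that gap
    } else { 
        group = seq_along(xtemp)      # no large gap left: take all the rest
    } 
    # Collapse the group: weighted mean of x, total weight
    xgroup = sum((xtemp*wtemp)[group])/sum(wtemp[group]) 
    wgroup = sum(wtemp[group]) 
    xnew = c(xnew, xgroup) 
    wnew = c(wnew, wgroup) 
    # Drop the processed group and continue with the remainder
    xtemp = xtemp[-group] 
    wtemp = wtemp[-group] 
} 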
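For comparison, the same gap-based grouping can probably be written without an explicit loop. The sketch below is an alternative formulation, not the code from the answer: merge_close is a hypothetical helper name, the toy data are invented, and it assumes x is non-empty and contains no NA values.

```r
# Hypothetical helper: merge runs of sorted x values whose consecutive
# differences are all <= thresh, returning merged values and summed weights.
merge_close <- function(x, w, thresh = 0.05) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  # A new group starts at the first element and after every gap > thresh
  grp <- cumsum(c(TRUE, diff(x) > thresh))
  wnew <- as.vector(tapply(w, grp, sum))             # total weight per group
  xnew <- as.vector(tapply(x * w, grp, sum)) / wnew  # weighted mean per group
  list(x = xnew, w = wnew)
}

# Toy data: three values near 0.4, one isolated value, two values near 1.7
x <- c(1.71, 0.40, 0.42, 1.00, 0.44, 1.73)
w <- c(1, 2, 1, 5, 1, 3)
res <- merge_close(x, w)
print(res$x)
print(res$w)

# The overall weighted mean is preserved by the merge
print(all.equal(weighted.mean(x, w), weighted.mean(res$x, res$w)))
```

Like the loop version, this groups by gaps between neighbours, so a long chain of closely spaced values can end up in one group whose total spread is larger than thresh.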

OLD answer follows (superseded by the above...)

I suggest reordering x and w so that x is in strict numerical order, and then using the diff function:

reord = order(x) 
x2 = x[reord] 
w2 = w[reord] 
which(diff(x2)<0.01) 

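The last command above shows which elements of x2 (the sorted version of x) are within 0.01 of the next-highest element. The first value is 2, because elements 2 and 3 of x2 are such a pair: x2[2]=0.4399527 and x2[3]=0.4429813.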

Also, if you do

sort(diff(x2)) 

you can see all the differences arranged in numerical order, which may help you decide what a suitable cutoff would be.
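One way to turn that inspection into a decision is to count how many groups each candidate cutoff would produce. This is only a sketch on invented data: n_groups is a hypothetical helper, and the x2 defined here is a toy vector rather than the sorted vector from the question.

```r
# Toy sorted values (hypothetical, not the vectors from the question)
x2 <- sort(c(0.40, 0.42, 0.44, 1.00, 1.71, 1.73, 2.50))

# Number of groups for a given cutoff: one more than the number of
# consecutive gaps that exceed it
n_groups <- function(x_sorted, thresh) sum(diff(x_sorted) > thresh) + 1

cutoffs <- c(0.01, 0.05, 0.1, 0.6, 1)
tab <- data.frame(cutoff = cutoffs,
                  groups = sapply(cutoffs, function(t) n_groups(x2, t)))
print(tab)
```

A range of cutoffs over which the group count stays constant suggests that the grouping is not very sensitive to the exact value chosen.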

+0

Thank you very much. This is what I was aiming for.

+0

No problem. Give it a test run and let me know if everything works :)

+1

I gave it a test run. It works fine.