2017-10-05
2

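SQL: split data by continuously increasing sequence, then subset each chunk by a pattern

I have data that I am trying to identify patterns in. However, the data in each table is incomplete (rows are missing). I want to split the table into chunks of complete data and then find the pattern within each chunk. I have a column called sequence that I can use to tell whether the data is complete.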

The data looks like this:

Sequence  Position 
1    open 
2    closed 
3    open 
4    open 
5    closed 
8    closed 
9    open 
11    open 
13    closed 
14    open 
15    open 
18    closed 
19    open 
20    closed 

First, I want to split the data into complete sections:

Sequence  Position 
    1    open 
    2    closed 
    3    open 
    4    open 
    5    closed 
--------------------------- 
    8    closed 
    9    open 
--------------------------- 
    11    open 
--------------------------- 
    13    closed 
    14    open 
    15    open 
--------------------------- 
    18    closed 
    19    open 
    20    closed 
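One common way to get this split is the "gaps and islands" trick: inside a run of consecutive numbers, the sequence value minus its row number stays constant, so that difference can act as a block id. The sketch below is only an illustration, run through Python's sqlite3 module so that it is self-contained. The table name t, the alias block_id, and the use of SQLite (which needs version 3.25 or later for window functions) are assumptions, not part of the question.

```python
import sqlite3

# Sample rows from the question: (sequence, position)
rows = [
    (1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
    (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
    (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (sequence INTEGER PRIMARY KEY, position TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# sequence - ROW_NUMBER() is constant within a run of consecutive sequences,
# so it identifies each unbroken block (hypothetical alias: block_id).
query = """
SELECT sequence,
       position,
       sequence - ROW_NUMBER() OVER (ORDER BY sequence) AS block_id
FROM t
ORDER BY sequence
"""

blocks = {}
for seq, pos, block_id in conn.execute(query):
    blocks.setdefault(block_id, []).append(seq)
conn.close()

chunks = list(blocks.values())
print(chunks)
# [[1, 2, 3, 4, 5], [8, 9], [11], [13, 14, 15], [18, 19, 20]]
```

The same expression should carry over to other databases that support ROW_NUMBER(), although the surrounding syntax may differ.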

Then I want to identify the pattern closed, open, ..., open, closed: going from closed to open for n rows (where n is at least 1) and then back to closed.

From the sample data this would leave:

 Sequence  Position 
     2    closed 
     3    open 
     4    open 
     5    closed 
    --------------------------- 
     18    closed 
     19    open 
     20    closed 

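This gives me a final table I can analyse, because I know it contains no broken sequences. I also have another column in which position is stored as a binary value, if that is easier to work with.

The tables are large, so although I think I could write loops to work out the result, I do not think that approach would be efficient enough. Alternatively, I could pull the whole table into R and build the result table there, but that means pulling everything into R first, so I am wondering whether this is feasible in SQL.

Edit: here is different sample data that is more representative: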

Sequence  Position 
    1    open 
    2    closed 
    3    open 
    4    open 
    5    closed 
    8    closed 
    9    open 
    11    open 
    13    closed 
    14    open 
    15    open 
    18    closed 
    19    open 
    20    closed 
    21    closed 
    22    closed 
    23    closed 
    24    open 
    25    open 
    26    closed 
    27    open 

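Note that this should give the same result as before, but also with

    23    closed
    24    open
    25    open
    26    closed

21, 22 and 27 are not included because they do not fit the closed, open, ..., open, closed pattern.

But if we also had 28 closed, we would want 27 and 28, because there is no gap and the pattern fits. If instead of 28 it was 29 closed, we would not want 27 or 29 (because although the pattern is right, the sequence is broken).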
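The rules above can be pinned down with a small reference implementation, which may be handy for checking the output of a SQL query on a sample. This is a Python sketch, not the SQL being asked for. The function name complete_cycles is hypothetical, and it assumes that the closing row of one cycle may also start the next one (which is how 26, 27, 28 would qualify).

```python
def complete_cycles(rows):
    """Return every closed, open, ..., open, closed run with no sequence gap.

    rows: iterable of (sequence, position) tuples, where position is
    'open' or 'closed'. Each cycle is returned as a list of sequences.
    """
    cycles = []
    start = None      # most recent 'closed' row that could begin a cycle
    opens = []        # 'open' rows seen since `start`
    prev_seq = None
    for seq, pos in sorted(rows):
        if prev_seq is not None and seq != prev_seq + 1:
            start, opens = None, []   # a gap discards any cycle in progress
        if pos == "closed":
            if start is not None and opens:
                cycles.append([start] + opens + [seq])
            start, opens = seq, []    # this row may begin the next cycle
        elif start is not None:
            opens.append(seq)
        prev_seq = seq
    return cycles


# The representative sample from the edit above
sample = [
    (1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
    (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
    (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
    (21, "closed"), (22, "closed"), (23, "closed"), (24, "open"),
    (25, "open"), (26, "closed"), (27, "open"),
]

print(complete_cycles(sample))
# [[2, 3, 4, 5], [18, 19, 20], [23, 24, 25, 26]]
print(complete_cycles(sample + [(28, "closed")])[-1])
# [26, 27, 28]
```

Appending (29, "closed") instead of (28, "closed") leaves the result unchanged, because the gap after 27 discards the cycle in progress.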

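To add some context, think of a machine that goes from stopped, to running, to stopped. We record this data, but there are gaps in the recording, which show up as breaks in the sequence. Data can go missing in the middle of a stopped, running, stopped cycle, and sometimes recording starts when the machine is already running, or stops before the machine has stopped. I do not want that data, because it is not a complete stopped, running, stopped cycle. I only want the complete cycles in which the sequence is continuous. This means I can turn my original data set into one that consists of complete cycles, one after another.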

I suggest you set up a SQL Fiddle or Rextester. –

What exactly do you mean by split? Do you want a separate table for each chunk? –

No, just a 'select' to filter the data – Olivia

Answers

1

You can use this.

DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10)) 

INSERT INTO @MyTable 
VALUES 
(1,'open'), 
(2,'closed') , 
(3,'open'), 
(4,'open'), 
(5,'closed'), 
(8,'closed'), 
(9,'open'), 
(11,'open'), 
(13,'closed'), 
(14,'open') , 
(15,'open'), 
(18,'closed'), 
(19,'open'), 
(20,'closed'), 
(21,'closed'), 
(22,'closed'), 
(23,'closed'), 
(24,'open'), 
(25,'open'), 
(26,'closed'), 
(27,'open') 


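;WITH CTE AS(
    -- Flag a 'closed' row that directly follows another 'closed' row
    SELECT * , 
     CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark 
    FROM @MyTable 
) 
,CTE_2 AS 
( 
    -- Shift each sequence by the running count of flags, so that
    -- repeated 'closed' rows are pushed apart by an artificial gap
    SELECT 
     [New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING)) 
     , [Sequence] 
     , Position 
    FROM CTE 
) 
,CTE_3 AS (
    -- Number the rows in order of the shifted sequence
    SELECT *, 
    RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence]) 
    FROM CTE_2 
) 
,CTE_4 AS 
( 
    -- New_Sequence - RN is constant within an unbroken run;
    -- keep the first and last 'closed' sequence of each run
    SELECT ([New_Sequence] - RN) G 
    , MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq 
    , MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq 
    FROM CTE_3 
    GROUP BY ([New_Sequence] - RN) 
) 
-- Return the rows between the first and last 'closed' of each run,
-- skipping runs that contain fewer than two 'closed' rows
SELECT 
    CTE.Sequence, CTE.Position 
FROM CTE_4 
    INNER JOIN CTE ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq) 
WHERE 
    CTE_4.MaxCloseSq > CTE_4.MinCloseSq 
    AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL) 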

Result:

Sequence Position 
----------- ---------- 
2   closed 
3   open 
4   open 
5   closed 
---   --- 
18   closed 
19   open 
20   closed 
---   --- 
23   closed 
24   open 
25   open 
26   closed 
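This does not seem to work on my real data. My data has much longer runs of repeated closed and/or open rows, but the format is the same. What is going on? Say I have 1000 closed, then 1000 open, and so on – Olivia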

Can you add more test data? –

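Sorry, I noticed that the problem is with my data. I created the sequence using 'round(((round(convert(float,datetime),5) - 42961.58227)* 99999.97 + 1),1)' but noticed some duplicate/odd dates, so I am just going to remove them and then try again – Olivia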

0

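I think there is actually a fairly simple way to look at this. You can identify the sequence numbers of the closing rows by:

  • Looking at the previous close
  • Looking at the cumulative count of opens at the previous close and at the current close
  • Doing arithmetic to make sure that all the intermediate opens are in the data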
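The arithmetic in these steps can be sketched outside SQL as follows. This is a Python illustration only, under the assumption that the rows are already sorted by sequence. The function name closing_pairs is hypothetical, and the check for at least one open in between is taken from the question's requirement.

```python
def closing_pairs(rows):
    """Return (previous_close, current_close) pairs that bound a full cycle.

    rows: list of (sequence, position) tuples sorted by sequence.
    """
    pairs = []
    cume_opens = 0        # running count of 'open' rows seen so far
    prev_closed = None    # (sequence, cume_opens) at the previous close
    for seq, pos in rows:
        if pos == "open":
            cume_opens += 1
            continue
        if prev_closed is not None:
            prev_seq, prev_opens = prev_closed
            opens_between = cume_opens - prev_opens   # opens actually recorded
            slots_between = seq - prev_seq - 1        # sequence numbers in between
            # Equal counts mean no row is missing between the two closes;
            # at least one slot means at least one open in between.
            if opens_between == slots_between and slots_between >= 1:
                pairs.append((prev_seq, seq))
        prev_closed = (seq, cume_opens)
    return pairs


sample = [
    (1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
    (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
    (15, "open"), (18, "closed"), (19, "open"), (20, "closed"),
    (21, "closed"), (22, "closed"), (23, "closed"), (24, "open"),
    (25, "open"), (26, "closed"), (27, "open"),
]

print(closing_pairs(sample))
# [(2, 5), (18, 20), (23, 26)]
```

Between two consecutive closes every recorded row is an open, so comparing the number of recorded opens with the number of sequence slots is enough to detect a gap.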

This turns into the query:

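select c.*
from (select t.*,
             -- previous 'closed' row: its sequence and its running count of opens
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   -- running count of 'open' rows up to and including this row
                   sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
            from t
           ) t
     ) c
where position = 'closed' and
      -- every sequence number between the two closes is present and is an open
      (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
      -- at least one open in between
      sequence > prev_sequence + 1;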

Now that you have identified the sequences, you can join back to fetch the original rows:

select t.*
from t join
     (select c.sequence, c.prev_sequence
      from (select t.*,
                   lag(sequence) over (partition by position order by sequence) as prev_sequence,
                   lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
            from (select t.*,
                         sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
                  from t
                 ) t
           ) c
      where c.position = 'closed' and
            (c.cume_opens - c.prev_cume_opens) = c.sequence - c.prev_sequence - 1 and
            c.sequence > c.prev_sequence + 1
     ) seqs
     -- each matching pair of closes bounds one complete cycle
     on t.sequence between seqs.prev_sequence and seqs.sequence;

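I admit I have not tested this. However, I do think the idea works. One caveat is that it will pick up multiple "closed" periods for each sequence group.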