2016-06-20
2

How to get the closest date after every half hour

I have a very large dataset that looks like this:

Column A 
     Date 
2016-02-29 15:59:59.674 
2016-02-29 15:59:59.695 
2016-02-29 15:59:59.716 
2016-02-29 15:59:59.752 
2016-02-29 15:59:59.804 
2016-02-29 15:59:59.869 
2016-02-29 15:59:59.888 
2016-02-29 15:59:59.941 
2016-02-29 16:00:00.081 <-- get closest date since .081 < .941 
2016-02-29 16:00:00.168 
2016-02-29 16:00:00.189 
2016-02-29 16:00:00.198 
2016-02-29 16:00:00.247 
2016-02-29 16:00:00.311 
2016-02-29 16:00:00.345 
2016-02-29 16:00:00.357 

and for the other half an hour 

2016-02-29 16:29:58.628 
2016-02-29 16:29:58.639 
2016-02-29 16:29:58.689 
2016-02-29 16:29:58.706 
2016-02-29 16:29:58.761 
2016-02-29 16:29:58.865 
2016-02-29 16:29:59.142 
2016-02-29 16:29:59.542 
2016-02-29 16:29:59.578 
2016-02-29 16:30:00.171 <-- Get this date since .171 < .578 
2016-02-29 16:30:00.209 
2016-02-29 16:30:00.217 
2016-02-29 16:30:00.245 
2016-02-29 16:30:00.254 
2016-02-29 16:30:00.347 
2016-02-29 16:30:00.422 
2016-02-29 16:30:00.457 
2016-02-29 16:30:00.491 
2016-02-29 16:30:00.555 
2016-02-29 16:30:00.557 
2016-02-29 16:30:00.645 

The dataset has about 5,468,389 rows in total, which is too many for Excel to import into a single column, so I am trying to process part of the data.

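Is there another way that would let me process all of the data? I tried reading and writing the text file directly, but whenever I try to read a date I get a Type Mismatch error because of the format. I have not solved this in Python for the same reason, and also because I am not familiar with Python, so I would like to do this in Excel VBA.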

I am also not very sure about the logic, so I need some help.

Option Explicit 

Sub Get_Closest_Dates() 

Application.ScreenUpdating = False 

Dim WI As Worksheet, WO As Worksheet 
Dim i As Long, ct As Long 
Dim num1 As Integer, num2 As Integer, num3 As Integer 
Dim df1, df2 


Set WI = Sheet1 'INPUT SHEET 
Set WO = Sheet2 'OUTPUT SHEET 

WI.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS" 
WO.Range("A:A").NumberFormat = "YYYY-MM-DD HH:MM:SS" 

WI.Range("B1") = "HOUR" 
WI.Range("C1") = "MINUTE" 

With WI 

    .Range("B2").Formula = "=HOUR(A2)" 
    .Range("B2:B" & Rows.Count).FillDown 

    .Range("C2").Formula = "=MINUTE(A2)" 
    .Range("C2:C" & Rows.Count).FillDown 

ct = WO.Range("A" & Rows.Count).End(xlUp).Row + 1 

For i = 2 To 10000 

    num1 = .Range("C" & i).Value 'get Minutes 
    num2 = .Range("C" & i + 1).Value 

    If (num1 = 29 And num2 = 30) Then 

     df1 = 0.5 - TimeValue(.Range("A" & i)) 
     df2 = TimeValue(.Range("A" & i + 1)) - 0.5 

     If df1 < df2 Then 
      WO.Range("A" & ct) = .Range("A" & i) 
      ct = ct + 1 
     Else 
      WO.Range("A" & ct) = .Range("A" & i + 1) 
      ct = ct + 1 
     End If 

    End If 


    If (num1 = 59 And num2 = 0) Then 
     df1 = 1 - TimeValue(.Range("A" & i)) 
     df2 = TimeValue(.Range("A" & i + 1)) - 1 

     If df1 < df2 Then 
      WO.Range("A" & ct) = .Range("A" & i) 
      ct = ct + 1 
     Else 
      WO.Range("A" & ct) = .Range("A" & i + 1) 
      ct = ct + 1 
     End If 
    End If 

Next i 

End With 

Application.ScreenUpdating = True 
MsgBox "Process Completed" 

End Sub 

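I also don't know how to get the milliseconds part of the date, which would let me avoid computing the difference between two dates.

For example, given 15:59:59.674, how can I get 674 from the time?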

+0

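I'm a bit confused about how you can load this into Excel at all. I believe the maximum row limit is 1,048,576, which makes your dataset several million rows too large.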

+0

@GaryEvans 5468389 - 1048576 = 4419813 –

+0

Are you sure it has more than 4 million rows?

Answers

1

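It seems your first problem is getting the data into Excel. Given that Excel may not be the best program for handling this much data (a database program such as Access might be better), you will need to either split the data across multiple columns or worksheets, or take a sample of the data.

You chose sampling, so I will do the sampling and testing while the data is being read.

You will also have to deal with the Excel/VBA limitations in handling date/time stamps that include milliseconds.

For the purpose of testing the data, however, there is no need to look at the milliseconds. As long as your data is sorted in ascending order, the first line whose date/time stamp is equal to or greater than a 30-minute increment is the earliest one after that increment.

The code below should read only those lines of the large file that meet that criterion. Please read the comments for additional information.

The lines are collected into a Collection; then a results array is declared and populated, and the results are written to the worksheet.

If each line consists of multiple fields, rather than the single field shown, you would declare the results array to hold all of the columns and populate it at that point, when writing the results.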
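A minimal sketch of the multi-field case described above, assuming every line uses the same single-character delimiter. The function name LinesToArray and the comma default are illustrative assumptions, not part of the answer's code or of the asker's data.

```vba
Option Explicit

'Sketch: turn a Collection of delimited text lines into a 2-D Variant array
'that can be written to a worksheet range in one assignment.
'Assumption: all lines share one delimiter (a comma by default, which is a guess).
'Lines with fewer fields than the widest line leave their remaining cells empty.
Function LinesToArray(colLines As Collection, _
                      Optional sDelim As String = ",") As Variant
    Dim vOut() As Variant
    Dim vFields As Variant
    Dim vLine As Variant
    Dim lRow As Long, lCol As Long, lMaxCols As Long

    If colLines.Count = 0 Then Exit Function    'nothing to do: returns Empty

    'First pass: find the widest line so the array can hold every column
    For Each vLine In colLines
        vFields = Split(CStr(vLine), sDelim)
        If UBound(vFields) + 1 > lMaxCols Then lMaxCols = UBound(vFields) + 1
    Next vLine
    If lMaxCols = 0 Then lMaxCols = 1           'only empty lines were found

    ReDim vOut(1 To colLines.Count, 1 To lMaxCols)

    'Second pass: one array row per line, one array column per field
    For Each vLine In colLines
        lRow = lRow + 1
        vFields = Split(CStr(vLine), sDelim)
        For lCol = 0 To UBound(vFields)
            vOut(lRow, lCol + 1) = Trim$(vFields(lCol))
        Next lCol
    Next vLine

    LinesToArray = vOut
End Function
```

With a tab-delimited file, for instance, the call would be V = LinesToArray(colLines, vbTab), followed by resizing the results range to UBound(V, 1) rows and UBound(V, 2) columns before assigning V to its .Value.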

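Using the sequence collection, then array, then write to worksheet will be much faster than writing each line to the worksheet one at a time as it is processed.

There are ways to speed up the code, and also ways to handle possible "out of memory" errors, but that depends on your real data and on the results of this simple code. At some point we will also need to convert the date/time stamps to "real" date/time values, depending on what you want to do with the data afterwards.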

==========================================

Option Explicit 
'Set Reference to Microsoft Scripting Runtime 
Sub GetBigData() 
    Dim FSO As FileSystemObject 
    Dim TS As TextStream 
    Dim vFileName As Variant 
    Dim sLine As String 
    Dim dtLineTime As Date 
    Dim dtNextTime As Date 
    Dim colLines As Collection 

vFileName = Application.GetOpenFilename("Text Files(*.txt), *.txt") 
If vFileName = False Then Exit Sub 

Set FSO = New FileSystemObject 
Set TS = FSO.OpenTextFile(vFileName, ForReading, False, TristateFalse) 
Set colLines = New Collection 

With TS 
    'Assumes date/time stamps are contiguous 
    'skip any header lines 
    Do 
     sLine = .ReadLine 
    Loop Until InStr(sLine, ".") > 0 

'Compute first "NextTime" 
' note that it might be the first entry 
' comment out the If line below if you always want the first entry 
' but would need to add logic if using other time increments 
dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1)) 
dtNextTime = Int(dtLineTime) + TimeSerial(Hour(dtLineTime), Int(Minute(dtLineTime)/30) * 30, 0) 
'Minute() returns 0 to 59, so the boundary minutes to test for are 0 and 30 
If Not (Minute(dtLineTime) = 30 Or Minute(dtLineTime) = 0) Then dtNextTime = dtNextTime + TimeSerial(0, 30, 0) 

Do 
    'Due to floating-point rounding, test equality against a very small tolerance 
    'Any value well below 1 second (1/86400 of a day) will do 
    If Abs(dtLineTime - dtNextTime) < 0.00000001 Or _ 
     dtLineTime > dtNextTime Then 
      colLines.Add sLine 
      dtNextTime = dtNextTime + TimeSerial(0, 30, 0) 
    End If 
    'Leave the loop only after the current line has been tested, 
    'so that the last line of the file is not skipped 
    If .AtEndOfStream Then Exit Do 
    sLine = .ReadLine 
    dtLineTime = CDate(Left(sLine, InStr(sLine, ".") - 1)) 
Loop 

.Close 
End With 

'Write the collection to the worksheet 
Dim V As Variant 
Dim wsResults As Worksheet, rResults As Range 
Dim I As Long 

Set wsResults = Worksheets("sheet1") 
Set rResults = wsResults.Cells(1, 1) 

ReDim V(1 To colLines.Count, 1 To 1) 
Set rResults = rResults.Resize(UBound(V, 1), UBound(V, 2)) 

For I = 1 To UBound(V, 1) 
    V(I, 1) = CStr(colLines(I)) 
Next I 

With rResults 
    .EntireColumn.Clear 
    .NumberFormat = "@" 
    .Value = V 
    .EntireColumn.AutoFit 
End With 

End Sub 

==========================================
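The comments in the code note that extra logic would be needed for time increments other than 30 minutes. One possible sketch of that logic follows. It is an alternative to the boundary calculation in the code, not the answer's own method, and the name NextBoundary and its lMinutes parameter are hypothetical. It assumes the increment divides a day evenly.

```vba
Option Explicit

'Sketch: return the first multiple of lMinutes (counted from midnight)
'that is at or after dtStamp.
'Assumption: lMinutes divides 24 hours evenly (for example 5, 15, 30 or 60).
Function NextBoundary(dtStamp As Date, Optional lMinutes As Long = 30) As Date
    Dim lSecs As Long       'whole seconds since midnight
    Dim lStep As Long       'increment expressed in seconds

    lStep = lMinutes * 60
    'The & suffix forces Long arithmetic and avoids an Integer overflow
    lSecs = Hour(dtStamp) * 3600& + Minute(dtStamp) * 60& + Second(dtStamp)

    'Round up to the next multiple of the step (unchanged if already on one)
    lSecs = ((lSecs + lStep - 1) \ lStep) * lStep

    'DateAdd accepts large second counts, unlike TimeSerial's Integer arguments
    NextBoundary = DateAdd("s", lSecs, Int(dtStamp))
End Function
```

For a stamp of 2016-02-29 15:59:59 and the default 30 minutes this returns 2016-02-29 16:00:00. Unlike the If test in the code, it also looks at the seconds, so a first stamp of 16:00:15 maps to 16:30:00 rather than 16:00:00.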

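Edit: added a timestamp conversion function. It can be applied at the point where the data is copied from the Collection object to the variant array. E.g.: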

V(I, 1) = ConvertTimeStamp(colLines(I)) 

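Since the value returned is of the Double data type, you also need to format the worksheet column appropriately, rather than leaving it as text: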

.NumberFormat = "yyyy-mm-dd hh:mm:ss.000" 

We have to return the value as a Double because the VBA Date data type does not support milliseconds.

==============================

Private Function ConvertTimeStamp(sTmStmp As String) As Double 
    Dim dtPart As Date 
    Dim dMS As Double 'milliseconds 
    Dim V As Variant 

'Convert the date and time 
V = Split(sTmStmp, ".") 
dtPart = CDate(V(0)) 
dMS = V(1) 

ConvertTimeStamp = dtPart + dMS/86400/1000 

End Function 

==============================
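For the question of pulling 674 out of 15:59:59.674, the same text-splitting idea can return the milliseconds on their own. This is a hedged sketch: the name GetMilliseconds is hypothetical, and it assumes the fraction follows the last "." in the stamp, as in the sample data.

```vba
Option Explicit

'Sketch: return the millisecond part of a text stamp such as
'"2016-02-29 15:59:59.674" as a whole number (674 for that stamp).
'Assumption: the fractional seconds follow the last "." in the string.
Function GetMilliseconds(sTmStmp As String) As Long
    Dim lPos As Long
    Dim sFrac As String

    lPos = InStrRev(sTmStmp, ".")
    If lPos = 0 Then Exit Function      'no fraction present: returns 0

    'Drop surrounding spaces, then keep exactly three digits,
    'padding short fractions so that ".5" is read as 500 ms
    sFrac = Left$(Trim$(Mid$(sTmStmp, lPos + 1)) & "000", 3)

    'Convert only if the three characters really are digits
    If sFrac Like "###" Then GetMilliseconds = CLng(sFrac)
End Function
```

Because it works on the original text, it avoids the rounding that can occur when the milliseconds are recovered from a Double date value.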

0

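If you reverse the sort order, you can use the MATCH function to find the index of the entry in the list that is just greater than (immediately after) a particular time. Something like:

=MATCH(HalfHourValue,RangeContainingTimes,-1)

You do have to reverse the order, and it gives you the index rather than the actual value.

To get the milliseconds of the entry you just found, something like the following should work:

=RIGHT(TEXT(INDEX(RangeContainingTimes,IxFromAbove,1),"HH:MM:SS.000"),3)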