2012-12-10 73 views

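I am working with stock ticker data that runs to more than 6MM rows. I want to grab all the data for one symbol, do the processing I need, and output the results. How can I find the file pointer's position so that I can identify where each symbol starts in the file?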

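I have written code that tells me the starting line of each ticker (see the code below). I think it would be more efficient to know the position at which a new symbol starts (rather than the line number), so that I can use seek(#) to jump straight to the start of a ticker. I am also curious how to extend this logic to read a ticker's whole block of data (start_position to end_position).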

import csv
data_line = 1        # file line number of the current row (the header is line 0)
ticker_start = 0
ticker_end = 0
cur_sec_ticker = ""
ticker_dl = []       # holds "line_number,ticker" for the start of each ticker
with open(r'C:\temp\sample_data.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)     # skip the header row so "Symbol" is not recorded as a ticker
    for row in reader:
        if cur_sec_ticker != row[1]:  # only process a new ticker
            ticker_fr = str(data_line) + ',' + row[1]  # prep line for inserting into array

            # desired line for inserting into array, ticker_end would be the last
            # of the current ticker data block, which is the start of the next ticker
            # block (ticker_start - 1)
            #ticker_fr = str(ticker_start) + str(ticker_end) + str(data_line) + ',' + row[1]

            print(ticker_fr)
            ticker_dl.append(ticker_fr)
            cur_sec_ticker = row[1]
        data_line += 1
print(ticker_dl)

Below is a small sample of what the data file looks like:

seq,Symbol,Date,Open,High,Low,Close,Volume,MA200Close,MA50Close,PrimaryLast,filter_$ 
1,A,1/1/2008,36.74,36.74,36.74,36.74,0, , ,1,1 
2,A,1/2/2008,36.67,36.8,36.12,36.3,1858900, , ,1,1 
3,A,1/3/2008,36.3,36.35,35.87,35.94,1980100, , ,1,1 
1003,AA,1/1/2008,36.55,36.55,36.55,36.55,0, , ,1,1 
1004,AA,1/2/2008,36.46,36.78,36,36.13,7801600, , ,1,1 
1005,AA,1/3/2008,36.18,36.67,35.74,36.19,7169000, , ,1,1 
2005,AAN,4/20/2009,20,20.7,18.2067,18.68,808700, , ,1,1 
2006,AAN,4/21/2009,18.7,19.06,18.6533,18.9933,530200, , ,1,1 
2007,AAN,4/22/2009,19.2867,19.6267,18.54,19.1333,801100, , ,1,1 
2668,AAP,1/1/2008,37.99,37.99,37.99,37.99,0, , ,1,1 
2669,AAP,1/2/2008,37.99,38.15,37.17,37.59,1789200, , ,1,1 
2670,AAP,1/3/2008,37.58,38.16,37.35,37.95,1584700, , ,1,1 
3670,AAR,1/1/2008,22.94,22.94,22.94,22.94,0, , ,1,1 
3671,AAR,1/2/2008,23.1,23.38,22.86,23.15,17100, , ,1,1 
3672,AAR,1/3/2008,23,23,22,22.16,45600, , ,1,1 
6886,ABB,1/1/2008,28.8,28.8,28.8,28.8,0, , ,1,1 
6887,ABB,1/2/2008,29,29.11,28.23,28.64,4697700, , ,1,1 
6888,ABB,1/3/2008,27.92,28.35,27.79,28.08,5240100, , ,1,1 

The counterpart of seek() is tell(). – SpacedMonkey
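A minimal sketch of that pairing, assuming Python 3, a file opened in binary mode, and lines read with readline() rather than through the csv module. The sample bytes and the temporary file are hypothetical stand-ins for the real data file.

```python
import os
import tempfile

# Hypothetical sample data, written to a temp file so the sketch is self-contained.
sample = b"seq,Symbol\n1,A\n2,A\n1003,AA\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path, "rb") as f:   # binary mode: offsets are plain byte counts
    f.readline()              # header
    f.readline()              # "1,A"
    f.readline()              # "2,A"
    aa_start = f.tell()       # offset at which the first "AA" row begins
    first_read = f.readline()
    f.seek(aa_start)          # jump straight back to the saved offset
    second_read = f.readline()

os.remove(path)
```

Both reads return the same row, because seek() restores the position that tell() reported.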

Answers


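In general, you can get the current position of a file object with its tell method. However, that will probably be hard to use with your current code, which hands the file reading off to the csv module. It is difficult even when reading line by line, because the underlying file object may be read in chunks larger than a single line (the readline and readlines methods do some caching behind the scenes to hide this from you).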

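I would be tempted to skip the whole idea of reading specific bytes, but if it is really worth it for your program, you will probably need to take charge of reading the file yourself, so that you know exactly where you are in the file at all times. tell is probably not necessary.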

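Something like this might work: it reads the data in blocks, then splits it into lines and values, while keeping track of how many bytes have been read so far: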

def generate_values(f):
    # f must be opened in binary mode ('rb') so that the yielded
    # positions are byte offsets that can be passed to f.seek()
    buf = b""  # a buffer of data read from the file
    pos = 0    # the position of our buffer within the file

    while True:  # loop until we return at the end of the file
        new_data = f.read(4096)  # read up to 4k bytes at a time

        if not new_data:  # quit if we got nothing
            if buf:
                yield pos, buf.decode().split(",")  # handle any data after last newline
            return

        buf += new_data
        line_start = 0  # index into buf

        try:
            while True:  # loop until an exception is raised at end of buf
                line_end = buf.index(b"\n", line_start)  # find end of line
                line = buf[line_start:line_end]  # excludes the newline

                if line:  # skips blank lines
                    yield pos + line_start, line.decode().split(",")  # yield pos,data tuple

                line_start = line_end + 1
        except ValueError:  # raised by `index()`
            pass

        # line_start is the index just past the last complete line (0 if the
        # chunk held no newline at all), so advance by that amount
        pos += line_start
        buf = buf[line_start:]  # keep left over data from end of the buffer

This may need some tweaking if your file uses line endings other than \n, but that shouldn't be too hard.


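Thanks, I see where you have taken the logic and I understand the reason. Given my code, could I do something similar by accumulating the length of each line? I would still be able to work out which ticker I am on, and could at least get the starting file position of each ticker. –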


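@Dr.EMG: In theory you might be able to reconstruct the length of a line, but in practice it may be hard to get right, because you don't control the line reading, the value splitting, or other details, and a few bytes could go astray here or there without you having a chance to notice. If you want to stick with the `csv` module, I suggest you avoid dealing with file positions and simply use the rows as it reads them. – Blckknght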
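One way to follow that suggestion is sketched below: let the csv module read the rows and group consecutive rows by symbol, with no file positions involved. This assumes the file is already sorted so that each symbol's rows are contiguous, as in the sample data. The name iter_ticker_blocks and the inline sample lines are hypothetical.

```python
import csv
from itertools import groupby

def iter_ticker_blocks(lines):
    """Yield (symbol, rows) for each run of consecutive rows sharing a symbol."""
    reader = csv.reader(lines)
    next(reader, None)                      # skip the header row
    rows = (row for row in reader if row)   # ignore blank lines
    for symbol, block in groupby(rows, key=lambda row: row[1]):
        yield symbol, list(block)

# Hypothetical sample; any iterable of text lines works, including an open file.
sample = [
    "seq,Symbol,Date,Close\n",
    "1,A,1/1/2008,36.74\n",
    "2,A,1/2/2008,36.3\n",
    "1003,AA,1/1/2008,36.55\n",
]
blocks = list(iter_ticker_blocks(sample))
```

For the real file, the same function could be given a handle opened with open(path, newline=""), so only one symbol's rows are held in memory at a time.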


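Because of the csv parsing, implementing my suggestion does not work (as you hinted). I could merge the two processes: use readline() to get each line's length and accumulate it, then parse that line with the csv parser to identify which ticker I am on. I could drop capturing the end position, since I can also work out where a block ends from where the next symbol starts, by pairing the current and next entries of the ticker_dl array. –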
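A rough sketch of that combined approach, assuming Python 3, a file opened in binary mode, rows grouped by symbol, and no quoted fields that contain embedded newlines (those would break per-line parsing). The names build_ticker_index and read_block are hypothetical, and io.BytesIO stands in for the real file.

```python
import csv
import io

def build_ticker_index(f):
    """Map each symbol to the (start, end) byte offsets of its block."""
    index = {}
    offset = len(f.readline())       # skip the header, but count its bytes
    current, start = None, offset
    while True:
        line = f.readline()
        if not line:                 # end of file
            break
        row = next(csv.reader([line.decode()]), [])
        if len(row) > 1 and row[1] != current:
            if current is not None:  # the previous block ends where this one starts
                index[current] = (start, offset)
            current, start = row[1], offset
        offset += len(line)          # accumulate the raw byte length of each line
    if current is not None:
        index[current] = (start, offset)
    return index

def read_block(f, start, end):
    """Return the raw bytes of one symbol's block."""
    f.seek(start)
    return f.read(end - start)

# Hypothetical in-memory stand-in for a file opened with open(path, "rb").
data = b"seq,Symbol,Close\n1,A,36.74\n2,A,36.3\n1003,AA,36.55\n"
f = io.BytesIO(data)
index = build_ticker_index(f)
aa_block = read_block(f, *index["AA"])
```

Because each block's end is recorded as the next symbol's start, only the start offsets need to be tracked while scanning.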
