2016-08-05

Filling in missing fields using import csv

My dataset looks like this:

W000000457,, 
,9/18/2016 11:28,37 
,4/21/2016 0:07,54 
,11/5/2016 12:05,42 
,7/14/2016 15:43,54 
W000000457 - Count,,100 
2069320,, 
,12/10/2016 0:22,12 
,9/25/2016 14:07,28 
,1/24/2016 6:54,59 
2069320 - Count,,100 
111,, 
,1/16/2016 10:25,58 
,6/11/2016 4:17,43 
,4/21/2016 7:56,47 
,3/17/2016 3:48,20 
111 - Count,,100 

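The columns are ID, date, and amount. I need to do two main cleaning/massaging steps on the data:

1) Use the ID from the first row of each group to fill in the rows below it
2) Remove every row that has "Count" in row[0]

My goal is to get this: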

W000000457,9/18/2016 11:28,37 
W000000457,4/21/2016 0:07,54 
W000000457,11/5/2016 12:05,42 
W000000457,7/14/2016 15:43,54 
2069320,12/10/2016 0:22,12 
2069320,9/25/2016 14:07,28 
2069320,1/24/2016 6:54,59 
111,1/16/2016 10:25,58 
111,6/11/2016 4:17,43 
111,4/21/2016 7:56,47 
111,3/17/2016 3:48,20 

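Here is my code so far:

import csv

# Python 3: open in text mode with newline='' for the csv module
with open('data.txt', newline='') as f_in:
    reader = csv.reader(f_in)
    row = next(reader)
    last_row = row
    for row in reader:
        # Fill every empty field from the previous row
        row = [x if x else y for x, y in zip(row, last_row)]
        if 'COUNT' not in row[0].upper():
            print(row)
        last_row = row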

This gets me close, but the problem is the records that fall between different IDs. For example:

W000000457,, 
,1/24/2016 6:54,59 
2069320 - Count,,100 
111,, 
,1/16/2016 10:25,58 

becomes (with my code):

W000000457,1/24/2016 6:54,59 
111,1/24/2016 6:54,100 
111,1/16/2016 10:25,58 

The first row for ID 111 is not a real record; its date and amount are carried over from the previous rows.
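To see where the bogus row comes from, the fill step can be applied by hand to the rows around an ID change. This is only a minimal Python 3 illustration: the helper name `fill_from_previous` is hypothetical and simply wraps the list comprehension from the code above.

```python
def fill_from_previous(row, last_row):
    """Fill every empty field in `row` from `last_row` (the question's logic)."""
    return [x if x else y for x, y in zip(row, last_row)]

# The "Count" row is not printed, but it still becomes last_row,
# and its empty date is filled from the data row before it.
count_row = fill_from_previous(['2069320 - Count', '', '100'],
                               ['W000000457', '1/24/2016 6:54', '59'])

# The next ID row has an empty date and amount, so both get stale values.
id_row = fill_from_previous(['111', '', ''], count_row)

print(count_row)  # ['2069320 - Count', '1/24/2016 6:54', '100']
print(id_row)     # ['111', '1/24/2016 6:54', '100']
```

Because every empty field is filled, not just the ID, the header row for each new ID inherits a date and an amount that never belonged to it.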

Or, for the full example above, I get the following, where the rows marked with ** are bogus values:

W000000457,9/18/2016 11:28,37
W000000457,4/21/2016 0:07,54
W000000457,11/5/2016 12:05,42
W000000457,7/14/2016 15:43,54
**2069320,7/14/2016 15:43,100**
2069320,12/10/2016 0:22,12
2069320,9/25/2016 14:07,28
2069320,1/24/2016 6:54,59
**111,1/24/2016 6:54,100**
111,1/16/2016 10:25,58
111,6/11/2016 4:17,43
111,4/21/2016 7:56,47
111,3/17/2016 3:48,20

Any ideas on how I should handle this?

I'm considering either removing the first instance of each ID, or finding a way to fill in only field [0] from my csv reader instead of every field.
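The second idea, filling only field [0] and never the date or amount, might look roughly like the sketch below. It is a minimal Python 3 sketch rather than a definitive solution: the function name `clean_rows` and the inline sample data are assumptions for illustration, and it assumes that a row carrying an ID never carries a date or amount, as in the data shown above.

```python
import csv
import io

def clean_rows(lines):
    """Yield [id, date, amount] rows, carrying only the ID forward."""
    current_id = None
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        row = [field.strip() for field in row]
        if 'COUNT' in row[0].upper():
            continue  # drop the "<id> - Count" summary rows
        if row[0]:
            current_id = row[0]  # ID row: remember the ID, emit nothing
            continue
        yield [current_id] + row[1:]

# Hypothetical sample in the same layout as the dataset above.
sample = io.StringIO(
    "W000000457,,\n"
    ",9/18/2016 11:28,37\n"
    "W000000457 - Count,,100\n"
    "111,,\n"
    ",1/16/2016 10:25,58\n"
    "111 - Count,,100\n"
)
for cleaned in clean_rows(sample):
    print(','.join(cleaned))
```

Since the ID is tracked in its own variable instead of being copied from the previous row, the "Count" rows and ID rows can be skipped without leaking their values into later records.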

Answers


For CSV-style data, I'd suggest using pandas.

Read the data:

import pandas as pd 
from io import StringIO 

df = pd.read_csv(StringIO('''W000000457,, 
,9/18/2016 11:28,37 
,4/21/2016 0:07,54 
,11/5/2016 12:05,42 
,7/14/2016 15:43,54 
W000000457 - Count,,100 
2069320,, 
,12/10/2016 0:22,12 
,9/25/2016 14:07,28 
,1/24/2016 6:54,59 
2069320 - Count,,100 
111,, 
,1/16/2016 10:25,58 
,6/11/2016 4:17,43 
,4/21/2016 7:56,47 
,3/17/2016 3:48,20 
111 - Count,,100'''), names=['col1', 'col2', 'col3']) 

Forward-fill the NaN entries in the first column (recent pandas versions deprecate fillna(method='ffill') in favor of ffill()):

df['col1'] = df['col1'].ffill()

Filter out the rows whose first column contains 'Count':

df = df[~df['col1'].str.contains('Count')] 

Drop the rows that still contain NaN:

df = df.dropna() 

Final result:

          col1             col2  col3
1   W000000457  9/18/2016 11:28  37.0
2   W000000457   4/21/2016 0:07  54.0
3   W000000457  11/5/2016 12:05  42.0
4   W000000457  7/14/2016 15:43  54.0
7      2069320  12/10/2016 0:22  12.0
8      2069320  9/25/2016 14:07  28.0
9      2069320   1/24/2016 6:54  59.0
12         111  1/16/2016 10:25  58.0
13         111   6/11/2016 4:17  43.0
14         111   4/21/2016 7:56  47.0
15         111   3/17/2016 3:48  20.0
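The steps above could also be collected into one function. This is a sketch under assumptions, not part of the original answer: the function name `clean_csv`, the column names `id`, `date`, `amount`, and the inline sample are illustrative. It reads the ID column as text, and casts the amount back to an integer, since the NaN values cause pandas to parse that column as float (hence 37.0 in the result above).

```python
import io
import pandas as pd

def clean_csv(source):
    """Return a tidy id/date/amount frame from the grouped CSV layout."""
    df = pd.read_csv(source, names=['id', 'date', 'amount'], dtype={'id': str})
    # Carry each ID down to the rows beneath it.
    df['id'] = df['id'].ffill()
    # Drop the "<id> - Count" summary rows (plain substring match, no regex).
    df = df[~df['id'].str.contains('Count', regex=False)]
    # ID rows have no date or amount; drop them.
    df = df.dropna(subset=['date', 'amount'])
    # Amounts were parsed as float because of the NaNs; restore integers.
    df = df.astype({'amount': int})
    return df.reset_index(drop=True)

# Hypothetical sample in the same layout as the dataset above.
sample = io.StringIO(
    "W000000457,,\n"
    ",9/18/2016 11:28,37\n"
    "W000000457 - Count,,100\n"
    "111,,\n"
    ",1/16/2016 10:25,58\n"
    "111 - Count,,100\n"
)
print(clean_csv(sample).to_csv(index=False, header=False))
```

Passing a file path such as 'data.txt' instead of the `StringIO` object should work the same way, because `pd.read_csv` accepts both.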