2016-05-23 37 views
1

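Pandas: reading a txt file without headers

I'm new to pandas and thought it would be a good idea to give it a spin, but as so often, the first attempt is not as easy as it looks.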

I basically tried what is described in: Pandas read in table without headers

And I get the following error (unfortunately, since I set header to None, "Usecols do not match names" doesn't ring a bell):

ValueError: Usecols do not match names.

Here is my code:

import numpy as np 

DATA_FOLDER = 'season_1/training_data/' 

#data = np.loadtxt(DATA_FOLDER + 'order_data/order_data_sample', 
#     dtype={'names': ('order_id', 'driver_id', 'passenger_id', 'start_district_hash', 
#         'dest_distric_hash', 'price', 'time'), 
#...      'formats': ('S32', 'S32', 'S32', 'S32', 'S32', 'f6', 'f4')}) 

import pandas as pd 
df = pd.read_csv(DATA_FOLDER + 'order_data/order_data_sample', parse_dates=[6], header=None, usecols=[3, 4, 6]) 
df 

And my data:

97ebd0c6680f7c0535dbfdead6e51b4b dd65fa250fca2833a3a8c16d2cf0457c ed180d7daf639d936f1aeae4f7fb482f 4725c39a5e5f4c188d382da3910b3f3f 3e12208dd0be281c92a6ab57d9a6fb32 24 2016-01-01 13:37:23 
92c3ac9251cc9b5aab90b114a1e363be c077e0297639edcb1df6189e8cda2c3d 191a180f0a262aff3267775c4fac8972 82cc4851f9e4faa4e54309f8bb73fd7c b05379ac3f9b7d99370d443cfd5dcc28 2 2016-01-01 09:47:54 
abeefc3e2aec952468e2fd42a1649640 86dbc1b68de435957c61b5a523854b69 7029e813bb3de8cc73a8615e2785070c fff4e8465d1e12621bc361276b6217cf fff4e8465d1e12621bc361276b6217cf 9 2016-01-01 18:24:02 
cb31d0be64cda3cc66b46617bf49a05c 4fadfa6eeaa694742de036dddf02b0c4 21dc133ac68e4c07803d1c2f48988a83 4b7f6f4e2bf237b6cc58f57142bea5c0 4b7f6f4e2bf237b6cc58f57142bea5c0 11 2016-01-01 22:13:27 
139d492189ae5a933122c098f63252b3 NULL 26963cc76da2d8450d8f23fc357db987 fc34648599753c9e74ab238e9a4a07ad 87285a66236346350541b8815c5fae94 4 2016-01-01 17:00:06 

I hope I've used the right tags for this...

+0

Your delimiter is not a comma (which is what read_csv assumes) but whitespace. Use the sep keyword argument, or read the read_table documentation. – mdurant
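A minimal sketch of the point made in this comment. The two rows below are shortened, made-up stand-ins for the question's data (same layout, no header line); the names `sample`, `by_comma` and `by_space` are illustrative only. With the default comma separator every line ends up in a single column, so `usecols=[3, 4, 6]` has nothing to match against; with a whitespace separator the eight fields are split as expected.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the question's file:
# space-separated fields, no header line.
sample = (
    "id1 drv1 pas1 start1 dest1 24 2016-01-01 13:37:23\n"
    "id2 drv2 pas2 start2 dest2 2 2016-01-01 09:47:54\n"
)

# Default separator is a comma. The data contains no commas,
# so each line is read as one single column.
by_comma = pd.read_csv(io.StringIO(sample), header=None)

# Regex separator for "one or more whitespace characters":
# each field becomes its own column (date and time are separate fields).
by_space = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+")

print(by_comma.shape)
print(by_space.shape)
```

Because the date and the time are separated by a space too, a whitespace separator splits the timestamp into two columns (6 and 7).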

+0

@mdurant Thanks, that's very helpful! I didn't know that. – mark

Answers

4

You can use read_csv and add the names parameter to set new column names. Then you have to set parse_dates=['c']:

import pandas as pd 
import io 

temp=u"""97ebd0c6680f7c0535dbfdead6e51b4b dd65fa250fca2833a3a8c16d2cf0457c ed180d7daf639d936f1aeae4f7fb482f 4725c39a5e5f4c188d382da3910b3f3f 3e12208dd0be281c92a6ab57d9a6fb32 24 2016-01-01 13:37:23 
92c3ac9251cc9b5aab90b114a1e363be c077e0297639edcb1df6189e8cda2c3d 191a180f0a262aff3267775c4fac8972 82cc4851f9e4faa4e54309f8bb73fd7c b05379ac3f9b7d99370d443cfd5dcc28 2 2016-01-01 09:47:54 
abeefc3e2aec952468e2fd42a1649640 86dbc1b68de435957c61b5a523854b69 7029e813bb3de8cc73a8615e2785070c fff4e8465d1e12621bc361276b6217cf fff4e8465d1e12621bc361276b6217cf 9 2016-01-01 18:24:02 
cb31d0be64cda3cc66b46617bf49a05c 4fadfa6eeaa694742de036dddf02b0c4 21dc133ac68e4c07803d1c2f48988a83 4b7f6f4e2bf237b6cc58f57142bea5c0 4b7f6f4e2bf237b6cc58f57142bea5c0 11 2016-01-01 22:13:27 
139d492189ae5a933122c098f63252b3 NULL 26963cc76da2d8450d8f23fc357db987 fc34648599753c9e74ab238e9a4a07ad 87285a66236346350541b8815c5fae94 4 2016-01-01 17:00:06""" 
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 sep=r"\s+",             # whitespace separator (same effect as delim_whitespace=True)
                 header=None,            # the file has no header row
                 usecols=[3, 4, 6],      # read only columns 3, 4 and 6
                 names=['a', 'b', 'c'],  # names for the selected columns
                 parse_dates=['c'])      # parse column c as datetime


print (df) 
            a         b \ 
0 4725c39a5e5f4c188d382da3910b3f3f 3e12208dd0be281c92a6ab57d9a6fb32 
1 82cc4851f9e4faa4e54309f8bb73fd7c b05379ac3f9b7d99370d443cfd5dcc28 
2 fff4e8465d1e12621bc361276b6217cf fff4e8465d1e12621bc361276b6217cf 
3 4b7f6f4e2bf237b6cc58f57142bea5c0 4b7f6f4e2bf237b6cc58f57142bea5c0 
4 fc34648599753c9e74ab238e9a4a07ad 87285a66236346350541b8815c5fae94 

      c 
0 2016-01-01 
1 2016-01-01 
2 2016-01-01 
3 2016-01-01 
4 2016-01-01 

print (df.dtypes) 
a   object 
b   object 
c datetime64[ns] 
dtype: object 

If you need the time too, add column d and use parse_dates=[['c', 'd']]:

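# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 sep=r"\s+",                  # whitespace separator
                 header=None,                 # the file has no header row
                 usecols=[3, 4, 6, 7],        # column 7 holds the time of day
                 names=['a', 'b', 'c', 'd'],  # names for the selected columns
                 parse_dates=[['c', 'd']])    # combine c and d into one datetime column c_d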


print (df) 
        c_d         a \ 
0 2016-01-01 13:37:23 4725c39a5e5f4c188d382da3910b3f3f 
1 2016-01-01 09:47:54 82cc4851f9e4faa4e54309f8bb73fd7c 
2 2016-01-01 18:24:02 fff4e8465d1e12621bc361276b6217cf 
3 2016-01-01 22:13:27 4b7f6f4e2bf237b6cc58f57142bea5c0 
4 2016-01-01 17:00:06 fc34648599753c9e74ab238e9a4a07ad 

            b 
0 3e12208dd0be281c92a6ab57d9a6fb32 
1 b05379ac3f9b7d99370d443cfd5dcc28 
2 fff4e8465d1e12621bc361276b6217cf 
3 4b7f6f4e2bf237b6cc58f57142bea5c0 
4 87285a66236346350541b8815c5fae94 

print (df.dtypes) 
c_d datetime64[ns] 
a    object 
b    object 
dtype: object 
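As far as I know, recent pandas releases (2.2 and later) deprecate both delim_whitespace and the nested-list form of parse_dates used above, so this is worth checking against the release notes for your version. A hedged alternative sketch: read the date and time columns as plain strings and combine them afterwards with pd.to_datetime. The sample rows and the names `sample` and `df2` are made up for illustration; the column names follow the answer.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the question's file.
sample = (
    "id1 drv1 pas1 start1 dest1 24 2016-01-01 13:37:23\n"
    "id2 drv2 pas2 start2 dest2 2 2016-01-01 09:47:54\n"
)

# Read date (c) and time (d) as ordinary string columns.
df2 = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",
    header=None,
    usecols=[3, 4, 6, 7],
    names=["a", "b", "c", "d"],
)

# Join "date time" into one string per row and parse it in a single step,
# then drop the two helper columns.
df2["c_d"] = pd.to_datetime(df2["c"] + " " + df2["d"])
df2 = df2.drop(columns=["c", "d"])

print(df2)
print(df2.dtypes)
```

Unlike the nested parse_dates form, the combined column ends up last rather than first; reorder with `df2[["c_d", "a", "b"]]` if the order matters.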
+0

Thanks for helping me, before I even got to read_table! pandas is also fast, so it works well. I basically used your solution; the only thing I changed was sep='\t'. – mark

+0

Super. If the separator is a tab, then use '\t'. Glad I could help! Good luck! – jezrael
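A short sketch of the tab-separated case discussed in these comments, assuming the real file uses tabs between fields and a single space inside the timestamp (the thread does not show the raw file, so this is an assumption). The rows and the names `sample_tsv` and `df_tab` are made up. With sep='\t' the timestamp stays in one field, so column 6 alone carries both date and time.

```python
import io

import pandas as pd

# Made-up tab-separated rows; the timestamp field contains a space.
sample_tsv = (
    "id1\tdrv1\tpas1\tstart1\tdest1\t24\t2016-01-01 13:37:23\n"
    "id2\tdrv2\tpas2\tstart2\tdest2\t2\t2016-01-01 09:47:54\n"
)

df_tab = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",               # split on tabs only, not on the space in the timestamp
    header=None,            # no header row
    usecols=[3, 4, 6],      # start hash, destination hash, full timestamp
    names=["a", "b", "c"],  # names for the selected columns
    parse_dates=["c"],      # column c holds "date time" in one field
)

print(df_tab)
print(df_tab.dtypes)
```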