2017-10-09 62 views

Trying to parse a .dat file and store it in a 2-D array with pandas

I have an example that parses a file with a similar format.
Example data (.data):

+ Naoki Abe 
- Myriam Abramson 
+ David W. Aha 
+ Kamal M. Ali 
- Eric Allender 

Here is the example Python code that stores it in a two-dimensional array:

import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='path/to/.data/file', 
    header=None, 
    sep=',') 

# separate names from classes 
vals = df.loc[:,:].values 
names = [n[0][2:] for n in vals] 
cls = [n[0][0] for n in vals] 

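As I understand it, this Python code loads the data into the variable df and extracts the string associated with each person into vals. It then splits each string in vals into names and cls. The names and cls lists hold these components, so that the i-th person's name is in names[i] and the associated class is in cls[i].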

However, I want to use a similar approach to parse another, similar dataset (.dat):

-1 this is comment1 blah blah blah (it is a big paragraph) 
-1 this is comment2 blah blah blah (it is a big paragraph) 
-1 this is comment3 blah blah blah (it is a big paragraph) 

So I modified the example as follows:

# read in the dataset 
df = pd.read_csv(
    engine='python', 
    filepath_or_buffer='data/Pro1/train.dat', 
    header=None, 
    sep='\t+') 

# separate names from classes 
vals = df.loc[:,:].values 
comm = [n[0][2:] for n in vals] 
rates = [n[:1][0] for n in vals] 

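I get this error message: TypeError: 'long' object has no attribute '__getitem__', raised at the line comm = [n[0][2:] for n in vals]
I searched for the error message, and the explanation I found was that I am trying to store an int into a string (?). What I am trying to store is the whole comment paragraph, which is a string. In the original example, it stored the name as a string. My other problem is that, since I have to parse a .dat file, I am guessing that a TAB rather than a space follows the -1, and I am not sure whether the index ranges I set on the array are correct.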

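My experience: I am not a Python expert, as you have probably figured out. I can read code well enough, but I have to do some research whenever I write it. Python is my only option right now for this kind of data analysis.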

Answer


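The first file contains no comma separators, so each line in the file yields a single string, for example '+ Naoki Abe'. That is why you can use string slicing to separate the name from the rest of the string.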

>>> import pandas as pd 
>>> df = pd.read_csv('temp.csv', header=None, sep=',') 
>>> vals = df.loc[:,:].values 
>>> vals 
array([['+ Naoki Abe'], 
     ['- Myriam Abramson'], 
     ['+ David W. Aha'], 
     ['+ Kamal M. Ali'], 
     ['- Eric Allender']], dtype=object) 
>>> names = [n[0][2:] for n in vals] 
>>> names 
['Naoki Abe', 'Myriam Abramson', 'David W. Aha', 'Kamal M. Ali', 'Eric Allender'] 
>>> cls = [n[0][0] for n in vals] 
>>> cls 
['+', '-', '+', '+', '-'] 
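
The same split can also be written with the pandas string accessor instead of list comprehensions. This is only a sketch: the io.StringIO buffer stands in for the real file (an assumption, so that the snippet runs on its own), and the column label 0 is simply what pandas assigns when header=None.

```python
import io
import pandas as pd

# Stand-in for the .data file (assumption: one "<sign> <name>" record per line).
sample = "+ Naoki Abe\n- Myriam Abramson\n+ David W. Aha\n"

# No commas in the data, so every line lands in a single column labelled 0.
df = pd.read_csv(io.StringIO(sample), header=None, sep=",")

cls = df[0].str[0].tolist()     # first character of each line: the class sign
names = df[0].str[2:].tolist()  # everything after the sign and the space

print(cls)
print(names)
```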

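I too suspect that a tab character separates the -1 from the rest of each line. The result is that pandas splits each line at the tab. The first element of each row is then the integer -1 rather than a string, so n[0][2:] tries to slice an integer, which is what the TypeError is reporting. In this case, as long as the tab is declared as the separator, you cannot use string slicing.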

>>> df2 = pd.read_csv('temp2.csv', engine='python', header=None, sep='\t') 
>>> vals2 = df2.loc[:,:].values 
>>> vals2 
array([[-1, 'this is comment1 blah blah blah (it is a big paragraph)'], 
     [-1, 'this is comment2 blah blah blah (it is a big paragraph)'], 
     [-1, 'this is comment3 blah blah blah (it is a big paragraph)']], dtype=object) 
>>> first = [val[0] for val in vals2] 
>>> first 
[-1, -1, -1] 
>>> second = [val[1] for val in vals2] 
>>> second 
['this is comment1 blah blah blah (it is a big paragraph)', 'this is comment2 blah blah blah (it is a big paragraph)', 'this is comment3 blah blah blah (it is a big paragraph)'] 
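
To make the cause of the error concrete, here is a small sketch that reproduces the failed slice and then reads the two columns directly. The sample text and the column names "rate" and "comment" are assumptions made for illustration; they do not come from the original data file.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.dat (assumption: a tab follows the rating).
sample = (
    "-1\tthis is comment1 blah blah blah\n"
    "-1\tthis is comment2 blah blah blah\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                 names=["rate", "comment"])  # names chosen for illustration

# Reproduce the failure: the first cell is a number, so it cannot be sliced.
row = df.values[0]
try:
    row[0][2:]
    slicing_failed = False
except (TypeError, IndexError):
    slicing_failed = True

# With the tab as separator, the columns are already split; no slicing needed.
rates = df["rate"].tolist()
comments = df["comment"].tolist()
print(slicing_failed, rates, comments)
```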

But do not despair!

There is a way to handle both data files in a similar fashion.

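Use sep=r'\s+' so that tabs and spaces are handled alike (the raw-string prefix keeps Python from treating \s as an escape sequence). pandas will then split each line into a row of separate tokens. All you need to do now is pick out the first item and reassemble the others.

>>> df3 = pd.read_csv('temp2.csv', engine='python', header=None, sep=r'\s+')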
>>> vals3 = df3.loc[:,:].values 
>>> vals3 
array([[-1, 'this', 'is', 'comment1', 'blah', 'blah', 'blah', '(it', 'is', 
     'a', 'big', 'paragraph)'], 
     [-1, 'this', 'is', 'comment2', 'blah', 'blah', 'blah', '(it', 'is', 
     'a', 'big', 'paragraph)'], 
     [-1, 'this', 'is', 'comment3', 'blah', 'blah', 'blah', '(it', 'is', 
     'a', 'big', 'paragraph)']], dtype=object) 
>>> first = [val[0] for val in vals3] 
>>> first 
[-1, -1, -1] 
>>> second = [' '.join(val[1:]) for val in vals3] 
>>> second 
['this is comment1 blah blah blah (it is a big paragraph)', 'this is comment2 blah blah blah (it is a big paragraph)', 'this is comment3 blah blah blah (it is a big paragraph)'] 
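
One caveat, offered as my understanding rather than something tested against the real file: splitting on whitespace creates one column per word, so the approach above relies on every line having the same number of words, as in the sample. Real comment paragraphs usually vary in length. A minimal sketch that avoids the issue splits each line only once, at the first run of whitespace, in plain Python. The function name split_label and the sample strings are hypothetical.

```python
import io

def split_label(lines):
    """Split each line into (label, rest) at the first run of whitespace."""
    labels, texts = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # None means "any whitespace": tab or space
        labels.append(parts[0])
        texts.append(parts[1] if len(parts) > 1 else "")
    return labels, texts

# Stand-ins for the two files (assumed layouts, not the real data).
names_file = io.StringIO("+ Naoki Abe\n- Myriam Abramson\n")
comments_file = io.StringIO("-1\tshort comment\n-1 a much longer comment here\n")

cls, names = split_label(names_file)
rates, comments = split_label(comments_file)
rates = [int(r) for r in rates]  # labels come back as strings; convert if needed
print(cls, names, rates, comments)
```

Because the split happens only once per line, the number of words in the comment no longer matters.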

One last remark: I would question the choice of pandas over the csv module for this.
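
For comparison, a sketch of what the csv-module route might look like. It assumes the file really is tab-delimited, which neither the question nor this answer has confirmed, and it uses an in-memory sample in place of train.dat. Passing quoting=csv.QUOTE_NONE is my own choice here, so that quote characters inside a comment are kept as ordinary text.

```python
import csv
import io

# Hypothetical stand-in for train.dat (assumption: tab-delimited).
sample = (
    "-1\tthis is comment1 blah blah blah\n"
    "-1\tthis is comment2, with a comma\n"
)

rates, comments = [], []
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
for row in reader:
    if not row:
        continue  # skip blank lines
    rates.append(int(row[0]))  # first field: the numeric rating
    comments.append(row[1])    # second field: the whole comment paragraph

print(rates)
print(comments)
```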