2017-08-23 152 views
0

I have data organized as in the table below (a pandas DataFrame):

  product/productId           B000EVS4TY 
1   product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1... 
2   product/price           unknown 
3   review/userId          A2SRVDDDOQ8QJL 
4  review/profileName           MJ23447 
5  review/helpfulness            2/4 
6    review/score            4.0 
7    review/time           1206576000 
8   review/summary        Delicious cookie mix 
9    review/text I thought it was funny that I bought this pro... 
10  product/productId           B0000DF3IX 
11   product/title       Paprika Hungarian Sweet 
12   product/price           unknown 
13   review/userId          A244MHL2UN2EYL 
14  review/profileName       P. J. Whiting "book cook" 
15  review/helpfulness            0/0 
16   review/score            5.0 
17    review/time           1127088000 

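I want to convert it into a DataFrame in which the entries of the first column

 product/productId           
     product/title 
     product/price            
     review/userId          
    review/profileName            
    review/helpfulness             
     review/score                
     review/time           
     review/summary        
      review/text 

become the column headers, with the values from the table arranged under their corresponding header.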

+2

I think you need to transpose: df.T – Vaishali

+0

I don't understand whether the sample rows you provided are stored in some file format. Does the file have a column separator? – Pedro

+0

The data is in (.txt) format, stacked (continuous) –

Answer

0

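I still have some doubts about your file, but since my two suggestions are very similar, I'll try to cover both scenarios you might be in.

If your file does not actually contain the line numbers, this should do it: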

import pandas as pd

filepath = "./untitled.txt"      # change this to your file path
column_separator = r"\s{3,}"     # regex: 3 or more whitespace characters; caveats are explained below

# engine="python" is needed for regex separators; setting it explicitly avoids a pandas warning
# header=None makes pandas treat every line as data
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)

df = df.set_index(0)            # use column 0 (the labels) as the DataFrame index
df = df.T                       # transpose: many rows + 1 column becomes many columns + 1 row
df = df.reset_index(drop=True)  # make the single row start at index 0 instead of 1

# the last 3 lines can be written as:
# df = df.set_index(0).T.reset_index(drop=True)
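
To see what the `\s{3,}` separator does before pointing pandas at a real file, the pattern can be tried on single lines with Python's `re` module. This is only a sketch: `split_fields` is a hypothetical helper, `bad_line` is a made-up line rather than one from the question's data, and plain `re.split` is used here as a stand-in for illustration, not as a description of pandas internals.

```python
import re

# Same pattern as column_separator above: a run of 3 or more whitespace characters.
SEPARATOR = re.compile(r"\s{3,}")


def split_fields(line):
    """Split one line on runs of 3+ whitespace characters (hypothetical helper)."""
    return SEPARATOR.split(line.strip())


# A well-behaved line: label and value separated by a long run of spaces.
ok_line = "product/price           unknown"

# A made-up value that itself contains 3 consecutive spaces.
bad_line = "review/summary        Delicious   cookie mix"

print(split_fields(ok_line))   # two fields
print(split_fields(bad_line))  # three fields: the value was split as well
```

A line that yields more fields than the others is the kind of input that makes the regex approach fragile.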

If you do have line numbers in the file, then we need to make a few small adjustments:

import pandas as pd

filepath = "./untitled1.txt"
column_separator = r"\s{3,}"

# index_col=0 uses the line numbers as the index, leaving the labels in column 1
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df = df.set_index(1).T.reset_index(drop=True)  # all 3 steps in 1 line, for brevity
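  • In this last case, I'd advise you to change the file so that every line has a line number (in the example you provided, the numbering starts on the second line, which is probably a result of how the header was handled when the data was exported from whatever tool you are using).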

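  • Regarding the regex option, note that "\s{3,}" looks for any block of 3 or more consecutive whitespace characters to determine the column separator. The problem is that we then depend on the data to find the columns. For example, if any value happens to contain 3 consecutive spaces, pandas will raise an exception, because that line will have one more column than the others. One workaround is to increase the count to some other "appropriate" number, but we would still depend on the data (for example, in your data every row, including "review/text", would need enough spaces between label and value for the two columns to be identified).

Edit, after understanding what you mean by "stacked":

Whichever "line-number scenario" you have, you need to make sure every register always has the same number of columns, and then reshape the continuous DataFrame with something similar to this: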

number_of_columns = 10  # every "register" (record) must have this many fields, otherwise this will break
new_shape = (-1, number_of_columns)  # "however many rows it takes", by 10 columns
final_df = pd.DataFrame(
    data=df.values.reshape(new_shape),
    columns=df.columns.tolist()[:number_of_columns],  # the first record's labels name the columns
)

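Again, take care that all registers have the same number of columns (for example, assuming 10 columns, a file containing only the data you provided will not work, because its second record stops at review/time and the 18 rows cannot be reshaped into rows of 10). Also, this solution assumes every register uses the same column names.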
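
As a minimal, self-contained sketch of the reshape step, the example below uses made-up in-memory data with three fields per record instead of a file. The names `labels`, `values`, and `fields_per_record` are assumptions for illustration only, and the two checks are one possible way to fail early when the records are not uniform.

```python
import numpy as np
import pandas as pd

# Made-up stacked data: two complete records of three fields each.
labels = ["product/productId", "product/price", "review/score"] * 2
values = ["B000EVS4TY", "unknown", "4.0", "B0000DF3IX", "unknown", "5.0"]

fields_per_record = 3  # assumption: every record has exactly these fields, in this order

# Fail early if the stacked rows cannot be cut into whole records.
if len(values) % fields_per_record != 0:
    raise ValueError("row count is not a multiple of fields_per_record")

# Every record must repeat the same labels in the same order.
label_grid = np.array(labels).reshape(-1, fields_per_record)
if not (label_grid == label_grid[0]).all():
    raise ValueError("records do not share the same labels")

# One row per record; the first record's labels become the column names.
records = pd.DataFrame(
    np.array(values).reshape(-1, fields_per_record),
    columns=labels[:fields_per_record],
)
print(records)
```

With the question's sample data the first check would raise, since 18 rows do not divide into records of 10.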