Organizing data (pandas DataFrame)

I have the following tabular data:

  product/productId           B000EVS4TY 
1   product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1... 
2   product/price           unknown 
3   review/userId          A2SRVDDDOQ8QJL 
4  review/profileName           MJ23447 
5  review/helpfulness            2/4 
6    review/score            4.0 
7    review/time           1206576000 
8   review/summary        Delicious cookie mix 
9    review/text I thought it was funny that I bought this pro... 
10  product/productId           B0000DF3IX 
11   product/title       Paprika Hungarian Sweet 
12   product/price           unknown 
13   review/userId          A244MHL2UN2EYL 
14  review/profileName       P. J. Whiting "book cook" 
15  review/helpfulness            0/0 
16   review/score            5.0 
17    review/time           1127088000

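I want to convert it into a DataFrame such that the entries in the first column

product/productId
product/title
product/price
review/userId
review/profileName
review/helpfulness
review/score
review/time
review/summary
review/text

become the column headers, with the values from the table arranged under each corresponding header.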

Source

2017-08-23 Tushar Khandelwal

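I think you need to transpose, df.T – Vaishali

I don't understand whether the sample rows you provided are stored in some file format. Does the file have any column separator? – Pedro

The data is stacked (continuous) in a .txt file –

I still have some doubts about your file, but since my suggestions for both cases are very similar, I will try to cover the two scenarios you might be in.

If your file does not actually contain line numbers, this should do it: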

import pandas as pd

filepath = "./untitled.txt"  # you need to change this to your file path
column_separator = r"\s{3,}"  # we'll use a regex; some caveats of this are explained below

# engine="python" suppresses a pandas warning about regex separators
# header=None makes sure all lines are treated as data
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)

df = df.set_index(0)  # take column 0 and use it as the DataFrame index
df = df.T  # transpose: goes from multiple rows + 1 column to multiple columns + 1 row
df = df.reset_index(drop=True)  # just so the first row starts at index 0 instead of 1

# you could do the last 3 lines in one go with:
# df = df.set_index(0).T.reset_index(drop=True)
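
To try these steps without a file on disk, here is a minimal sketch that reads a small in-memory sample instead. The sample text is made up to mirror the layout in the question, and it assumes that each key and its value are separated by at least three spaces; `sample` and `wide` are names chosen only for this illustration.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question's layout (no line numbers,
# key and value separated by at least three spaces).
sample = (
    "product/productId   B000EVS4TY\n"
    "product/price   unknown\n"
    "review/score   4.0\n"
)

df = pd.read_csv(io.StringIO(sample), sep=r"\s{3,}", engine="python", header=None)

# Keys become the column headers, values become a single row.
wide = df.set_index(0).T.reset_index(drop=True)
wide.columns.name = None  # drop the leftover "0" label on the column axis
print(wide)
```

The result should be a one-row DataFrame whose columns are the three keys from the sample.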

If you do have line numbers, then we need to make a few small adjustments:

filepath = "./untitled1.txt"
column_separator = r"\s{3,}"

# index_col=0 uses the line numbers as the index, so the keys end up in column 1
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df = df.set_index(1).T.reset_index(drop=True)  # all 3 steps in 1 line, for brevity

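In this last case, I would advise you to change the file so that every line has a line number (in the example you provided, the numbering starts at the second line; this may come from how headers are handled when exporting the data in whatever tool you are using).

Regarding the regex, the caveat is that "\s{3,}" looks for any block of 3 or more consecutive whitespace characters to determine the column separator. The problem is that we depend on the data to find the columns. For instance, if any value happens to contain 3 consecutive spaces, pandas will raise an exception, because that line will have more columns than the others. One workaround could be to increase the threshold to some other "appropriate" number, but then we still depend on the data (for instance, in your example, with more than 3, "review/text" would still have enough spaces for the two columns to be identified).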
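
To make the caveat concrete, here is a small sketch using only the standard `re` module, with the same pattern applied to two made-up lines. It does not call pandas; it only shows that a value containing three spaces in a row gets split into an extra field, which is the situation that trips up the parser. Both sample lines are hypothetical.

```python
import re

separator = re.compile(r"\s{3,}")

# Hypothetical lines: the second value happens to contain three spaces in a row.
clean_line = "review/summary        Delicious cookie mix"
risky_line = "review/summary        Delicious   cookie mix"

clean_fields = separator.split(clean_line)
risky_fields = separator.split(risky_line)

print(len(clean_fields), clean_fields)  # two fields: key and value
print(len(risky_fields), risky_fields)  # three fields: the value was split in two
```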

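Edit, after realising what you meant by "stacked":

Whichever "line number scenario" applies to you, you need to make sure every register always has the same number of columns, and then reshape the continuous DataFrame with something similar to this: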

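number_of_columns = 10  # make sure all "registers" have the same number of columns, otherwise this will break
new_shape = (-1, number_of_columns)  # this tuple means "whatever number of rows", by 10 columns
# the first 10 column names are reused as the headers of the reshaped frame
final_df = pd.DataFrame(data=df.values.reshape(new_shape), columns=df.columns.tolist()[:number_of_columns])

Again, take care that all registers have the same number of columns (for example, a file containing only the data you provided would not work, assuming 10 columns). This solution also assumes the columns have the same names in every register.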
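
If some registers may be incomplete, as in the sample from the question, one possible alternative to the reshape is to number the registers and pivot. This is a different technique from the one above, sketched under the assumption that every register starts with a "product/productId" line; the `long_df` data and the `key`, `value` and `record` names are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format data: two registers, the second one incomplete.
long_df = pd.DataFrame(
    {
        "key": ["product/productId", "product/price", "review/score",
                "product/productId", "product/price"],
        "value": ["B000EVS4TY", "unknown", "4.0", "B0000DF3IX", "unknown"],
    }
)

# Assumption: every register starts with a "product/productId" line,
# so a running count of those lines gives a register number.
long_df["record"] = (long_df["key"] == "product/productId").cumsum() - 1

# One row per register, one column per key; missing fields become NaN.
wide = long_df.pivot(index="record", columns="key", values="value")
wide = wide[long_df["key"].drop_duplicates().tolist()]  # restore the original field order
wide.columns.name = None
print(wide)
```

Unlike the reshape, this does not need a fixed number of columns per register, but it does depend on the first key appearing exactly once in each register.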

Source

2017-08-24 14:27:02 Pedro
