2016-09-17 57 views

Reshaping a pandas DataFrame: new row every 76 entries

I am new to Python and pandas, and I am playing around with a heart disease dataset from UCI: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data

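There are 76 attributes per person and 303 people, so I want to end up with one row per person and 76 columns. I can't arrange the data into a DataFrame that way, because the file appears to be laid out in rows of 9.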

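I have tried importing the dataset into a pandas DataFrame using either a space or a newline as the separator, but I still can't stop it from splitting the data after every 8 values: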

df = pd.read_table('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data', sep=' ') 

The resulting DataFrame looks like this:

1254 0 40 1 1.1 0.1 0.2 
-9.0 2 140.0 0.0 289 -9.0 -9.0 -9.0 
0.0 -9 -9.0 0.0 12 16.0 84.0 0.0 
0.0 0 0.0 0.0 150 18.0 -9.0 7.0 
172.0 86 200.0 110.0 140 86.0 0.0 0.0 
0.0 -9 26.0 20.0 -9 -9.0 -9.0 -9.0 

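I would appreciate any suggestions on how to split the data so that a new row starts after every 76th value. Every 76th value is the string 'name', which marks the end of one person's data. Thanks!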


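This is doable, but it is painful DataFrame rubikscubing. Since the input file is not that large, I would process the input string and replace \n and name to get aligned rows to feed to read_table – Boud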
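The preprocessing idea in the comment above can be tried on a small in-memory sample before touching the real file. This is only a sketch: the sample text, its four-field records, and the helper name `realign` are made up for illustration, whereas the real file has 76 fields per record.

```python
import io
import pandas as pd

# Hypothetical miniature of the file layout: each record ends with the
# token "name" and is wrapped across several physical lines.
sample = "1 0 40\nname\n2 1 49\nname\n"

def realign(text, marker="name"):
    """Join all physical lines, then break after each end-of-record marker."""
    flat = text.replace("\n", " ")
    return flat.replace(f" {marker} ", f" {marker}\n")

# After realigning, every record sits on one line and can be parsed as usual.
df = pd.read_csv(io.StringIO(realign(sample)), sep=r"\s+", header=None)
print(df.shape)
```

Note that the last record is only terminated correctly if the text ends with a newline, because the replacement looks for the marker with a space on both sides.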

Answer


As @Boud has already said, it is easier to preprocess your data than to massage an "incorrectly built" DF:

import io
import requests
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'
r = requests.get(url)
r.raise_for_status()  # raises on 4xx/5xx responses, does nothing otherwise

# Join all physical lines, then start a new line after each 'name' marker
data = r.text.replace('\n', ' ').replace(' name ', ' name\n')

# A raw string avoids the invalid-escape warning for '\s'
df = pd.read_table(io.StringIO(data), sep=r'\s+', header=None)
print(df)

Output:

In [20]: df 
Out[20]: 
     0 1 2 3 4 5 6 7 8 9 ... 66 67 68 69 70 71 72 73 74 75 
0 1254 0 40 1 1 0 0 -9 2 140 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
1 1255 0 49 0 1 0 0 -9 3 160 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
2 1256 0 37 1 1 0 0 -9 2 130 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
3 1257 0 48 0 1 1 1 -9 4 138 ... 2 -9 1 1 1 1 1 -9.0 -9.0 name 
4 1258 0 54 1 1 0 1 -9 3 150 ... 1 -9 1 1 1 1 1 -9.0 -9.0 name 
5 1259 0 39 1 1 0 1 -9 3 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
6 1260 0 45 0 0 1 0 -9 2 130 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
7 1261 0 54 1 1 0 0 -9 2 110 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
8 1262 0 37 1 1 1 1 -9 4 140 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
9 1263 0 48 0 1 0 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
10 1264 0 37 0 1 0 1 -9 3 130 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
11 1265 0 58 1 1 0 0 -9 2 136 ... -9 2 1 1 1 7 1 -9.0 -9.0 name 
12 1266 0 39 1 1 0 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
13 1267 0 49 1 1 1 1 -9 4 140 ... 2 -9 1 1 1 1 1 -9.0 -9.0 name 
14 1268 0 42 0 1 0 1 -9 3 115 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
15 1269 0 54 0 1 1 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
16 1270 0 38 1 1 1 1 -9 4 110 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
17 1271 0 43 0 1 0 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
18 1272 0 60 1 1 1 1 -9 4 100 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
19 1273 0 36 1 1 0 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
20 1274 0 43 0 0 0 0 -9 1 100 ... -9 -9 1 1 1 1 2 -9.0 -9.0 name 
21 1275 0 44 1 1 0 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
22 1276 0 49 0 1 0 0 -9 2 124 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
23 1277 0 44 1 1 0 0 -9 2 150 ... 2 -9 1 1 1 1 1 67.0 -9.0 name 
24 1278 0 40 1 1 0 1 -9 3 130 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
.. ... .. .. .. .. .. .. .. .. ... ... .. .. .. .. .. .. .. ... ... ... 
269 1032 0 54 1 1 1 0 -9 4 130 ... -9 2 1 1 1 7 1 66.0 -9.0 name 
270 1033 0 47 0 1 0 0 -9 3 130 ... -9 -9 1 1 1 1 1 68.0 -9.0 name 
271 1034 0 45 1 1 1 1 -9 4 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
272 1035 0 32 0 1 0 0 -9 2 105 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
273 1036 0 55 1 1 1 1 -9 4 140 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
274 1037 0 55 1 1 0 0 -9 3 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
275 1038 0 45 0 0 0 0 -9 2 180 ... -9 -9 1 1 1 1 1 70.0 -9.0 name 
276 1039 0 59 1 1 0 1 -9 3 180 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
277 1041 0 51 1 1 0 0 -9 3 135 ... 2 -9 1 1 3 8 2 -9.0 -9.0 name 
278 1042 0 52 1 1 1 1 -9 4 170 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
279 1043 0 57 0 1 1 1 -9 4 180 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
280 1044 0 54 0 1 0 0 -9 2 130 ... -9 -9 1 1 1 1 3 -9.0 -9.0 name 
281 1045 0 60 1 1 0 0 -9 3 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
282 1046 0 49 1 1 1 1 -9 4 150 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
283 1047 0 51 0 1 0 1 -9 3 130 ... -9 -9 1 1 1 1 1 61.0 -9.0 name 
284 1048 0 55 0 0 0 0 -9 2 110 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
285 1049 0 42 1 1 1 1 -9 4 140 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
286 1050 0 51 0 1 0 1 -9 3 110 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
287 1051 0 59 1 1 1 1 -9 4 140 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
288 1052 0 53 1 1 0 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
289 1053 0 48 0 0 0 0 -9 2 -9 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
290 1054 0 36 1 1 0 0 -9 2 120 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
291 5001 0 48 1 0 0 0 -9 3 110 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
292 5000 0 47 0 0 0 0 -9 2 140 ... -9 -9 1 1 1 1 1 -9.0 -9.0 name 
293 5002 0 53 1 1 1 1 -9 4 130 ... 1 1 1 1 1 1 1 -9.0 -9.0 name 

[294 rows x 76 columns]
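As a possible alternative to string replacement, the tokens can be regrouped by a fixed record width. The sketch below assumes that no field contains whitespace and that every record has the same number of fields. The helper name `records_from_tokens` and the tiny sample are hypothetical; for the file in the question, `width` would be 76.

```python
import pandas as pd

def records_from_tokens(text, width, marker="name"):
    """Split on any whitespace and regroup the tokens into rows of `width`."""
    tokens = text.split()
    if len(tokens) % width != 0:
        raise ValueError("token count is not a multiple of the record width")
    rows = [tokens[i:i + width] for i in range(0, len(tokens), width)]
    # Sanity check: the last field of every record should be the marker.
    if any(row[-1] != marker for row in rows):
        raise ValueError("a record does not end with the marker")
    # Drop the constant marker column and convert the rest to numbers.
    return pd.DataFrame(rows).iloc[:, :-1].apply(pd.to_numeric)

# Made-up sample with 4 fields per record, wrapped across lines.
sample = "1 0 40\nname\n2 1 49.5\nname\n"
df = records_from_tokens(sample, width=4)
print(df.dtypes)
```

The marker check makes a misaligned file fail loudly instead of silently producing shifted columns, and dropping the marker column leaves only numeric data.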