2014-11-05 91 views
-1

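Reading specific tuples from a file in Python

Using seek and tell does not work for this, because tell returns the current position in bytes; I need the line number, not the position of the file pointer.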

I have a file, glass.csv, and I need to cluster the dataset. Each line in the file ends with a number such as 1, 2, 3, ... as shown below:

65,1.52172,13.48,3.74,0.90,72.01,0.18,9.61,0.00,0.07,1 
66,1.52099,13.69,3.59,1.12,71.96,0.09,9.40,0.00,0.00,1 
67,1.52152,13.05,3.65,0.87,72.22,0.19,9.85,0.00,0.17,1 
68,1.52152,13.05,3.65,0.87,72.32,0.19,9.85,0.00,0.17,1 
69,1.52152,13.12,3.58,0.90,72.20,0.23,9.82,0.00,0.16,1 
70,1.52300,13.31,3.58,0.82,71.99,0.12,10.17,0.00,0.03,1 
71,1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12,2 
72,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,0.00,0.32,2 
73,1.51593,13.09,3.59,1.52,73.10,0.67,7.83,0.00,0.00,2 
74,1.51631,13.34,3.57,1.57,72.87,0.61,7.89,0.00,0.00,2 
142,1.51851,13.20,3.63,1.07,72.83,0.57,8.41,0.09,0.17,2 
143,1.51662,12.85,3.51,1.44,73.01,0.68,8.23,0.06,0.25,2 
144,1.51709,13.00,3.47,1.79,72.72,0.66,8.18,0.00,0.00,2 
145,1.51660,12.99,3.18,1.23,72.97,0.58,8.81,0.00,0.24,2 
146,1.51839,12.85,3.67,1.24,72.57,0.62,8.68,0.00,0.35,2 
147,1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00,3 
148,1.51610,13.33,3.53,1.34,72.67,0.56,8.33,0.00,0.00,3 
149,1.51670,13.24,3.57,1.38,72.70,0.56,8.44,0.00,0.10,3 
150,1.51643,12.16,3.52,1.35,72.89,0.57,8.53,0.00,0.00,3 

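From the tuples that have 1 as the last number, I need to take a certain share and save it in one file (train.txt), and save the remainder in another file (test.txt). Likewise, I need to take the same share of the tuples that have 2 as the last number and append it to the first file, i.e. train.txt, with the rest going to test.txt.

I cannot get the second set of tuples; instead, the first result itself gets appended again.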

+0

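Nothing above asks for 70% of each file to go into one file and 30% into another. Also, does it have to be the first 70%, in which case you need to count the rows first, or is the first 7 out of every 10 close enough? – 2014-11-06 20:19:48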

+0

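Please go through this link -> archive.ics.uci.edu/ml/machine-learning-databases/glass/...... This is my dataset, and this is the one for which I mentioned the 70-30 split. Each tuple ends with 1 or 2, etc. I need to store the first 70% in train.txt and the remaining 30% in test.txt. After that, the tuples with 2 and 3 as the last value need to be retrieved on the same 70-30 basis and appended to the files above. I hope this makes my question specific. – Devi 2014-11-07 08:27:11

Answers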

0

The default behaviour when reading a text file is to go line by line. You can do something like this:

with open('input.csv', 'r') as f, open('output_1.csv', 'w') as output_1, open('output_2.csv', 'w') as output_2:
    for line in f:
        # Split the comma-separated line into its fields
        line_fields = line.strip().split(',')
        # The fields are text, so compare against '1' and '2', not 1 and 2
        if line_fields[-1] == '1':
            output_1.write(line)
            continue
        if line_fields[-1] == '2':
            output_2.write(line)

Alternatively, you can use the csv module, which is easier: https://docs.python.org/2/library/csv.html
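
As a rough illustration of the csv-module variant, here is a minimal sketch. It assumes Python 3; the function name `split_by_label` and its `out_paths` argument are made up for this example and are not part of the answer above.

```python
import csv


def split_by_label(in_path, out_paths):
    """Copy each row to the file registered for its last field.

    out_paths maps a label string (e.g. '1') to an output file name
    (hypothetical argument, chosen for this sketch). Rows whose label
    has no entry in out_paths are skipped.
    """
    files = {label: open(path, 'w', newline='')
             for label, path in out_paths.items()}
    try:
        writers = {label: csv.writer(f, lineterminator='\n')
                   for label, f in files.items()}
        with open(in_path, newline='') as infile:
            for row in csv.reader(infile):
                if not row:
                    continue  # skip blank lines
                # The sample data has trailing spaces, so strip every field
                row = [field.strip() for field in row]
                writer = writers.get(row[-1])
                if writer is not None:
                    writer.writerow(row)
    finally:
        for f in files.values():
            f.close()
```

It could be called as `split_by_label('input.csv', {'1': 'output_1.csv', '2': 'output_2.csv'})`. The labels are kept as strings because csv.reader returns every field as text.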

+1

Since it is a text field, you really should compare the last field with '1' etc., or convert it to int first. – 2014-11-05 12:39:44

+0

Edited, thanks Steve. – Emam 2014-11-05 12:45:25

+0

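Go through this link -> http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data... I would like to split each group of tuples that has 1, 2, 3, 5, 6 or 7 as the last value, i.e. the first 70% into the train.txt file and the remaining 30% into test.txt. Once the split and write are done for 1, the same process should be applied next to the tuples with 2 as the last value; this time the values should be "appended" to the files above. Similarly for 3, 5, 6 and 7. Can you modify the code above to work this way? – Devi 2014-11-06 10:27:30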

0

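The easiest way, assuming you have a large file and cannot simply load the whole thing, is to sort it using one file per value. If it is a small(ish) input file, then just load it as a comma-separated file using the csv module.

As a quick and dirty approach (assuming a small file):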

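data = []
with open('glass.csv', 'r') as infile:
    for line in infile:
        # Convert every comma-separated field to float (the label 1 becomes 1.0)
        linedata = [float(val) for val in line.strip().split(',')]
        data.append(linedata)

# Sort the rows by their last field, so rows with the same label are adjacent
adata = sorted(data, key=lambda items: items[-1])
## Then open both your output files and write the required rows to each.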
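
The 70-30 split requested in the comments could be sketched roughly as follows. This is only one possible reading of the request: it assumes "the first 70%" means the first rows of each label in file order, that the cut is rounded down, and that the labels are integers. The names `split_train_test` and `train_percent` are hypothetical.

```python
from collections import defaultdict


def split_train_test(in_path, train_path, test_path, train_percent=70):
    """Write the first train_percent of each label's rows to train_path
    and the remaining rows to test_path."""
    groups = defaultdict(list)  # label -> rows in file order
    with open(in_path) as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            label = line.split(',')[-1].strip()
            groups[label].append(line)

    # Both outputs are opened once and every label is written in turn,
    # so later labels are added after earlier ones instead of overwriting.
    with open(train_path, 'w') as train, open(test_path, 'w') as test:
        for label in sorted(groups, key=int):  # assumption: integer labels
            rows = groups[label]
            # Integer arithmetic; assumption: round the cut down
            cut = len(rows) * train_percent // 100
            for row in rows[:cut]:
                train.write(row + '\n')
            for row in rows[cut:]:
                test.write(row + '\n')
```

A call such as `split_train_test('glass.csv', 'train.txt', 'test.txt')` holds the whole file in memory, which matches the small-file assumption above. Opening each output file once also avoids the problem of a second pass overwriting or duplicating the first result.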

+0

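But I need the values to be stored in two files, i.e. train.txt and test.txt. The tuples ending in 1 are taken and split in the ratio 70:30, i.e. 70 for training and 30 for testing. The same goes for the tuples ending in 2, 3, 5, 6 and 7, but their splits should be appended to the files above and must not overwrite them. Thanks in advance. – Devi 2014-11-06 13:15:44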

+0

This is the link to my input dataset -> archive.ics.uci.edu/ml/machine-learning-databases/glass/... – Devi 2014-11-06 13:17:26