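How to extract specific lines from a data file

I have a problem that I suspect has a simple solution. I am building a model and want to test its accuracy with 10-fold cross-validation. To do this, I have to split my training corpus 90%/10% into a training part and a test part, then train my model on the 90% and test it on the 10%. I want to do this 10 times, with a different 90%/10% split each time, so that in the end every piece of the corpus has been used as test data. Then I will average the results of the ten 10% tests.

I tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I haven't gotten it to work. What I have done is count the total number of lines in the file and then divide that number by 10 to get the size of each of the ten test sets I want to extract.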

trainFile = open("danish.train")
numberOfLines = 0

for line in trainFile:
    numberOfLines += 1

lengthTest = numberOfLines // 10  # integer division

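For my own training file, I found that it contains 3638 lines, so each test set should consist of roughly 363 lines.

How do I write lines 1-363, lines 364-726, and so on to different test files?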

Source

2013-02-05 Johanna

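So which part isn't working for you? I tried your code (with my own text file) and it told me the size of each "test chunk". Do you need help writing the part "How do I write lines 1-363, lines 364-726, and so on to different test files?" Is that it? Or is something else not working before you even get there? –

Once you have the line count, go back to the beginning of the file and start copying lines to danish.train.part-01. Whenever the line number is a multiple of the 10% test-set size, open a new file for the next part.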

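#!/usr/bin/env python2.7

trainFile = open("danish.train")
numberOfLines = 0

for line in trainFile:
    numberOfLines += 1

# integer division; assumes the file has at least 10 lines
lengthTest = numberOfLines // 10

# rewind file to beginning
trainFile.seek(0)

numberOfLines = 0
file_number = 0
output = None
for line in trainFile:
    # start a new part every lengthTest lines, but never more than 10 parts,
    # so any leftover lines end up in the last part
    if numberOfLines % lengthTest == 0 and file_number < 10:
        if output is not None:
            output.close()
        file_number += 1
        output = open('danish.train.part-%02d' % file_number, 'w')

    numberOfLines += 1
    output.write(line)

if output is not None:
    output.close()
trainFile.close()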

With this input file (sorry, I don't speak Danish!):

one 
two 
three 
four 
five 
six 
seven 
eight 
nine 
ten 
eleven 
twelve 
thirteen 
fourteen 
fifteen 
sixteen 
seventeen 
eighteen 
nineteen 
twenty 
twenty-one 
twenty-two 
twenty-three 
twenty-four 
twenty-five 
twenty-six 
twenty-seven 
twenty-eight 
twenty-nine 
thirty

this creates the files

danish.train.part-01 
danish.train.part-02 
danish.train.part-03 
danish.train.part-04 
danish.train.part-05 
danish.train.part-06 
danish.train.part-07 
danish.train.part-08 
danish.train.part-09 
danish.train.part-10

and part 5, for example, contains:

thirteen 
fourteen 
fifteen
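If only one particular range of lines is needed (for example lines 364-726), the standard library's itertools.islice can pull it out without counting by hand. This is a minimal sketch, not part of the answer above; the helper name extract_lines is made up for illustration, and it works on any iterable of lines, including an open file.

```python
from itertools import islice

def extract_lines(lines, first, last):
    """Return lines first..last (1-based, inclusive) from an iterable of lines."""
    # islice uses 0-based, half-open indices, hence first - 1
    return list(islice(lines, first - 1, last))

# Small in-memory demonstration
sample = ["line %d\n" % i for i in range(1, 11)]
print(extract_lines(sample, 4, 6))  # ['line 4\n', 'line 5\n', 'line 6\n']

# With a real file it would look like this (file name taken from the question):
# with open("danish.train") as f:
#     chunk = extract_lines(f, 364, 726)
```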

Source

2013-02-05 18:50:50 andrewdotn

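Thank you very much for your help! It works perfectly. I also have a related question. Each time I extract a chunk from the training file to save it as test data, I need to create an accompanying new training file from which exactly that chunk has been removed (so that I can train the model on the new training file and then test it on the other file). I have been trying to edit your code, but so far it doesn't do what I want. How can I extend the code above to do this? – Johanna

Untested, but here's the basic idea: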
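One possible way to handle the follow-up in the comment above is to walk the file once per fold and send each line either to the test file or to the matching training file. This is only a sketch under assumptions: the function name write_fold and the '.test-NN' / '.train-NN' file suffixes are invented here, and the last fold is assumed to take any leftover lines.

```python
def write_fold(path, fold, k=10):
    """Write the test and training files for one fold (fold is 1-based).

    The '.test-NN' and '.train-NN' suffixes are hypothetical names.
    """
    with open(path) as f:
        total = sum(1 for _ in f)
    size = total // k
    start = (fold - 1) * size
    # assumption: the last fold also takes the leftover lines
    stop = total if fold == k else start + size

    test_name = '%s.test-%02d' % (path, fold)
    train_name = '%s.train-%02d' % (path, fold)
    with open(path) as src:
        with open(test_name, 'w') as test, open(train_name, 'w') as train:
            for i, line in enumerate(src):
                if start <= i < stop:
                    test.write(line)   # line belongs to this fold's test chunk
                else:
                    train.write(line)  # everything else is training data
    return test_name, train_name

# Usage (file name taken from the question):
# for fold in range(1, 11):
#     write_fold("danish.train", fold)
```

Each call rereads the input, which keeps memory use low at the cost of ten passes over the file.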

def getNthSeg(fpath, n, segSize):
    """Get the nth (0-based) segment of segSize many lines"""
    answer = []
    with open(fpath) as f:
        for i, line in enumerate(f):
            # segment n covers line indices segSize*n up to segSize*(n+1) - 1
            if segSize*n <= i < segSize*(n+1):
                answer.append(line)
    return answer

def getFolds(fpath, k):
    """In your case, k is 10"""
    with open(fpath) as f:
        numLines = len(f.readlines())
    segSize = numLines // k  # integer division; leftover lines fall in no fold
    answer = []
    for n in range(k):
        fold = getNthSeg(fpath, n, segSize)
        answer.append(fold)
    return answer
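Because the segment size is the line count divided by k and rounded down, a 3638-line file split into 10 segments of 363 leaves 8 lines that belong to no fold. One way around this, sketched here with a made-up helper name fold_bounds, is to compute the fold boundaries with divmod and give the first few folds one extra line each.

```python
def fold_bounds(num_lines, k):
    """Return k (start, stop) index pairs that together cover range(num_lines)."""
    base, extra = divmod(num_lines, k)
    bounds = []
    start = 0
    for n in range(k):
        # the first `extra` folds get one additional line
        stop = start + base + (1 if n < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

bounds = fold_bounds(3638, 10)
print(bounds[0], bounds[-1])  # (0, 364) (3275, 3638)
```

The pairs are 0-based and half-open, so they can be used directly as slice indices or in a comparison like start <= i < stop.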

Source

2013-02-05 18:50:27 inspectorG4dget

Thanks for your help! – Johanna

If your file isn't very large, you can split it 90/10 like this:

with open("danish.train") as trainFile:
    lines = list(trainFile)
N = len(lines)
testing = lines[:N // 10]   # first 10% of the lines
training = lines[N // 10:]  # remaining 90%
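The same slicing idea can be extended to all ten folds and to the averaging step described in the question. The following is a rough sketch, not a definitive implementation: cross_validate is a made-up name, and score stands for a hypothetical function supplied by the caller that trains on the first argument, tests on the second, and returns a number. The last fold is assumed to absorb any leftover lines.

```python
def cross_validate(lines, score, k=10):
    """Average score(training, testing) over k contiguous folds.

    `score` is a hypothetical callable: it should train a model on
    `training`, evaluate it on `testing`, and return a number.
    """
    size = len(lines) // k
    results = []
    for i in range(k):
        start = i * size
        # assumption: the last fold absorbs any leftover lines
        stop = len(lines) if i == k - 1 else start + size
        testing = lines[start:stop]
        training = lines[:start] + lines[stop:]
        results.append(score(training, testing))
    return sum(results) / float(k)

# Toy check with a stand-in "score" that just reports the test-set size
demo = cross_validate(list(range(30)), lambda training, testing: len(testing))
print(demo)  # 3.0
```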

Source

2013-02-05 18:54:04 bogatron
