Python: fast iteration through files

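I need to iterate through two files many millions of times, counting how often word pairs occur across the whole files. (The goal is to build a contingency table for two words so I can compute a Fisher's exact test score.)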

I'm currently using

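src = tuple(open('src.txt', 'r'))  # all lines of each file, held in memory
tgt = tuple(open('tgt.txt', 'r'))
w1count = 0
w2count = 0
w1 = 'someword'
w2 = 'anotherword'
# zip is lazy in Python 3; under Python 2 use itertools.izip instead.
for x, y in zip(src, tgt):
    if w1 in x:  # substring test against the whole line
        w1count += 1
    if w2 in y:
        w2count += 1
    # .....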

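While this isn't bad, I'd like to know whether there is a faster way to iterate through the two files, hopefully a significantly faster one.

Thanks in advance for your help.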

2013-10-17 ytrewq

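You need to provide more information. Please clarify your specific problem or add details that highlight exactly what you need. As currently written, it's hard to tell exactly what you're asking.

@InbarRose I've added more information. Please let me know if it's still not enough :) – ytrewq

Well, a lot of information is still missing. For every variable used in the code you show here, you should show its declaration. For example: what are src, tgt, w1, w2, w1count and w2count?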

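I still don't quite understand what you're trying to do, but here is some example code that might point you in the right direction.

We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.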

import collections
import itertools
import re

def find_words(line):
    # Yield each run of word characters in the line, lowercased.
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()       # word counts for src.txt
counts2 = collections.Counter()       # word counts for tgt.txt
counts_pairs = collections.Counter()  # (src word, tgt word) pair counts

with open("src.txt") as f1, open("tgt.txt") as f2:
    # zip is lazy in Python 3; under Python 2 use itertools.izip instead.
    for line1, line2 in zip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        # Count every pairing of a word on line1 with a word on line2.
        counts_pairs.update(itertools.product(words1, words2))

print(counts1["someword"])
print(counts1["anotherword"])
print(counts_pairs["someword", "anotherword"])

2013-10-17 11:03:24

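Thanks soooo much!!!!!! – ytrewq

Sorry, one more question. After running this program, how do I retrieve the count for each word or word pair? – ytrewq

btw I had to change your code to yield str(word).lower() – ytrewq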
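On the question of retrieving the counts afterwards: a Counter behaves like a dictionary, so counts can be read by key or iterated over. The following is only a minimal sketch; the tiny word lists are made-up stand-ins for the tokenized lines of src.txt and tgt.txt, not data from the thread.

```python
import collections
import itertools

# Made-up stand-ins for the tokenized, line-aligned contents of the two files.
src_lines = [["a", "b"], ["a", "c"]]
tgt_lines = [["x"], ["x", "y"]]

counts1 = collections.Counter()
counts_pairs = collections.Counter()
for words1, words2 in zip(src_lines, tgt_lines):
    counts1.update(words1)
    counts_pairs.update(itertools.product(words1, words2))

# Single lookups: a Counter returns 0 for keys it has never seen.
a_count = counts1["a"]
missing_count = counts1["zzz"]
pair_count = counts_pairs["a", "x"]

# Every word with its count, most frequent first.
top_words = counts1.most_common()
# Every pair with its count, as a plain dictionary keyed by (word1, word2).
all_pairs = dict(counts_pairs)

print(a_count, missing_count, pair_count)
print(top_words)
```

Note that these are occurrence counts: a word that appears twice on one line is counted twice, which differs from the per-line counting in the question.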

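In general, if your data is small enough to fit into memory, then your best option is:

Preprocess the data into memory
Iterate over the in-memory structure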
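One possible shape for such an in-memory structure, offered as a sketch rather than as what the answer had in mind, is an index from each word to the set of line numbers it occurs on. Each word-pair query then becomes a set intersection instead of another pass over the files. The names build_index and contingency_table are hypothetical, the sample lines are made up, and words are matched as whole tokens, which is not the same as the substring test in the question.

```python
import collections
import re

def build_index(lines):
    """Map each lowercased word to the set of line numbers it occurs on."""
    index = collections.defaultdict(set)
    for lineno, line in enumerate(lines):
        for word in re.findall(r"\w+", line.lower()):
            index[word].add(lineno)
    return index

def contingency_table(index1, index2, w1, w2, n_lines):
    """2x2 table of line counts: rows = w1 present/absent, cols = w2 present/absent."""
    lines1 = index1.get(w1, set())
    lines2 = index2.get(w2, set())
    both = len(lines1 & lines2)
    only1 = len(lines1) - both
    only2 = len(lines2) - both
    neither = n_lines - both - only1 - only2
    return [[both, only1], [only2, neither]]

# Made-up stand-ins for the lines of src.txt and tgt.txt.
src = ["the cat sat", "a dog ran", "the cat ran"]
tgt = ["le chat", "un chien", "le chat court"]

src_index = build_index(src)  # built once, reused for every query
tgt_index = build_index(tgt)
table = contingency_table(src_index, tgt_index, "cat", "chat", len(src))
print(table)
```

A table in this form could then be handed to a Fisher's exact test routine such as scipy.stats.fisher_exact, assuming SciPy is available.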

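If the files are large, you may be able to preprocess them into a data structure (e.g. a more compact representation of the data), save it to a separate file in a format that loads much faster, such as pickle, and then work from that file.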
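The pickle idea can be sketched roughly as follows. The helper load_or_build and the file name counts.pickle are hypothetical, and build_counts is a made-up stand-in for the expensive pass over the real files. The sketch does not refresh the cache when the source files change, and pickle files should only be loaded from sources you trust.

```python
import collections
import os
import pickle
import tempfile

def load_or_build(cache_path, build):
    """Return the cached object if cache_path exists, else build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data

def build_counts():
    # Stand-in for the slow preprocessing step (made-up data).
    return collections.Counter(["a", "b", "a"])

cache_path = os.path.join(tempfile.mkdtemp(), "counts.pickle")
first = load_or_build(cache_path, build_counts)   # builds and writes the cache
second = load_or_build(cache_path, build_counts)  # loads from the cache file
print(first == second)
```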

2013-10-17 10:02:47

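My files are 37MB and 36MB. Are they small enough to fit into memory? – ytrewq

@CosmicRabbitMediaInc: Almost certainly. But I think changing your algorithm would be the right approach.

@SvenMarnach thanx. Any suggestions on how to change the algorithm? – ytrewq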

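As an out-of-the-box suggestion: have you tried turning the files into Pandas DataFrames? I'm assuming you have already turned the input into a list of words (by removing punctuation such as . and , and then using input.split(' ') or something similar). You could then make DataFrames from the lists, perform a word count on each, and do a Cartesian join.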

import pandas as pd

# src and tgt are assumed to be flat lists of words, as described above.
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()  # one row per distinct word, with its count
df_1 = df_1.reset_index()

df_2 = pd.DataFrame(tgt, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()

# A constant key on both sides turns the merge into a Cartesian join.
df_1['link'] = 1
df_2['link'] = 1

result_df = pd.merge(left=df_1, right=df_2, on='link')
del result_df['link']

I've used something like this for market basket analysis, and it worked well.

2013-10-17 10:18:40 Carst
