2013-07-23 75 views
1

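I was asked to find the number of occurrences of the string "And" in a large file. The file is 10 GB and only 1 GB of RAM is available. How would I do this efficiently? I answered that we need to read the file in chunks of 100 MB each, count the occurrences of "And" in each chunk, and keep a running total. The interviewer was not satisfied with my answer; he pointed me to how the Unix command grep works and asked me to write similar code in Python, but I did not know the answer. I would appreciate an answer to this question.

Finding a string in a large file that does not fit in memory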

+2

[This](http://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-python) may help.

+0

Don't forget to check the chunk boundaries if you are not reading line by line.
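The boundary problem mentioned in this comment can be sketched as follows. This is only an illustration, not the interviewer's expected answer: the name `count_in_chunks` and its `chunk_size` parameter are made up here, and the sketch assumes the file is read in binary mode and that the pattern is given as bytes (fine for an ASCII-compatible encoding). The idea is to carry the last `len(pattern) - 1` bytes of each chunk over to the next read, so a match split across two chunks is still seen, while never re-scanning bytes that already belong to a counted match.

```python
import io


def count_in_chunks(stream, pattern, chunk_size=100 * 1024 * 1024):
    """Count non-overlapping occurrences of `pattern` (bytes) in a binary stream."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    keep = len(pattern) - 1  # longest tail that could begin a split match
    total = 0
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = 0
        # Scan left to right, jumping past each match (non-overlapping count).
        while True:
            hit = buf.find(pattern, pos)
            if hit == -1:
                break
            total += 1
            pos = hit + len(pattern)
        # Carry over only bytes that are not part of a counted match.
        start = max(pos, len(buf) - keep)
        tail = buf[start:]
    return total


# Small in-memory demo: "And" is split across the 4-byte chunks "xxAn" / "d An" / "d".
sample = io.BytesIO(b"xxAnd And")
print(count_in_chunks(sample, b"And", chunk_size=4))  # prints 2
```

For a real file, something like `with open("file.txt", "rb") as f: count_in_chunks(f, b"And")` would keep memory use at roughly one chunk plus a few carried-over bytes.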

Answers

4

If you use generators, you can walk through a large file and process it without loading it all into memory.

A simple grep command:

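def command(f):
    # Decorator: turns a per-line generator function f into a command
    # that reads the named files and prints whatever f yields.
    def g(filenames, **kwa):
        lines = readfiles(filenames)
        lines = (outline for line in lines for outline in f(line, **kwa))
        # lines = (line for line in lines if line is not None)
        printlines(lines)
    return g


def readfiles(filenames):
    # Yield lines lazily, so only one line is held in memory at a time.
    for f in filenames:
        with open(f) as handle:
            for line in handle:
                yield line


def printlines(lines):
    for line in lines:
        print(line.rstrip("\n"))


@command
def grep(line, pattern):
    # Yield the line if it contains the pattern, like a fixed-string grep.
    if pattern in line:
        yield line


if __name__ == '__main__':
    import sys
    pattern = sys.argv[1]
    filenames = sys.argv[2:]
    grep(filenames, pattern=pattern)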
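The pipeline above prints matching lines, as grep does, whereas the question asks for a count. As a rough sketch (the helper names `count_matching_lines` and `count_occurrences` are invented here, not part of the answer), the same lazy stream of lines can feed either kind of count. The two differ whenever a line contains the pattern more than once: `grep -c` reports matching lines, not occurrences.

```python
def count_matching_lines(lines, pattern):
    """Number of lines that contain `pattern` at least once."""
    return sum(1 for line in lines if pattern in line)


def count_occurrences(lines, pattern):
    """Total non-overlapping occurrences of `pattern` across all lines."""
    return sum(line.count(pattern) for line in lines)


# Any iterable of lines works, including a lazy generator over a file.
demo = ["And then\n", "nothing here\n", "And And\n"]
print(count_matching_lines(demo, "And"))  # prints 2
print(count_occurrences(demo, "And"))     # prints 3
```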
5

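Iterating over a file yields its lines one at a time. This case is easy because the search string contains no end-of-line character, so we don't need to worry about matches that span lines.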

with open("file.txt") as fin:
    print(sum(line.count('And') for line in fin))

str.count is applied to each line:

 
>>> help(str.count) 
Help on method_descriptor: 

count(...) 
    S.count(sub[, start[, end]]) -> int 

    Return the number of non-overlapping occurrences of substring sub in 
    string S[start:end]. Optional arguments start and end are interpreted 
    as in slice notation.
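To illustrate what "non-overlapping" means in the help text above, here is a small sketch. The helper `count_overlapping` is hypothetical and is shown only for contrast. For "And" the distinction does not matter, because the string cannot overlap with itself, but note that `str.count` is case-sensitive.

```python
# str.count is case-sensitive and counts non-overlapping matches.
print("And, And and Andy".count("And"))  # prints 3 ("and" is skipped, "Andy" counts)
print("aaaa".count("aa"))                # prints 2


def count_overlapping(text, sub):
    """Count matches of `sub` in `text`, allowing them to overlap."""
    count = 0
    pos = text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + 1)  # advance one character, not len(sub)
    return count


print(count_overlapping("aaaa", "aa"))   # prints 3
```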