Match every word in a txt file

I'm working on a Project Euler problem (for fun). It comes with a 46KB txt file that contains a single line with a list of over 5000 names, in a format like this:

"MARIA","SUSAN","ANGELA","JACK"...

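My plan is to write a method that extracts each name and appends it to a Python list. Is a regular expression the best weapon for this problem?
I looked up the Python re docs, but I'm having a hard time figuring out the right regex.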

Source

2011-10-04 William Li

If the file's format is as you describe it, i.e.

it is a single line
the format is like this: "MARIA","SUSAN","ANGELA","JACK"

then this should work:

>>> import csv
>>> lines = csv.reader(open('words.txt', 'r'), delimiter=',')
>>> words = next(lines)  # next(lines) works on Python 2.6+ and 3; lines.next() is Python 2 only
>>> words
['MARIA', 'SUSAN', 'ANGELA', 'JACK']

Source

2011-10-04 01:43:09 varunl

Thanks, it works perfectly!

That looks like a format the csv module can help with. Then you don't have to write any regex at all.

Source

2011-10-04 01:11:07 imm

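A regex will get the job done, but it would be inefficient. Using csv will work, but it might not handle 5000 cells in a single line very well. At the very least, it has to load the entire file and keep the whole list of names in memory (which probably isn't a problem for you, since this is a very small amount of data). If you want an iterator for relatively large files (much larger than 5000 names), a state machine will do the trick: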

def parse_chunks(chunks, quote='"', delim=',', escape='\\'):
    in_quote = False
    in_escaped = False

    buffer = ''

    for chunk in chunks:
        for byte in chunk:
            if in_escaped:
                # Done with the escape char, add it to the buffer
                buffer += byte
                in_escaped = False
            elif byte == escape:
                # The next character will be added literally and not parsed
                in_escaped = True
            elif in_quote:
                if byte == quote:
                    in_quote = False
                else:
                    buffer += byte
            elif byte == quote:
                in_quote = True
            elif byte in (' ', '\n', '\t', '\r'):
                # Ignore whitespace outside of quotes
                pass
            elif byte == delim:
                # Done with this block of text
                yield buffer
                buffer = ''
            else:
                buffer += byte

    if in_quote:
        raise ValueError('Found unbalanced quote char %r' % quote)
    elif in_escaped:
        raise ValueError('Found unbalanced escape char %r' % escape)

    # Yield the last bit in the buffer
    yield buffer

data = r""" 
"MARIA","SUSAN", 
"ANG 
ELA","JACK",,TED,"JOE\"" 
""" 
print(list(parse_chunks(data)))

# ['MARIA', 'SUSAN', 'ANG\nELA', 'JACK', '', 'TED', 'JOE"'] 

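# Read fixed-size chunks if you know the file has only one long line or
# don't care about line parsing. Iterating over the file object itself
# would yield whole lines, so a single-line file would arrive in one piece.
buffer_size = 4096
with open('myfile.txt', 'r') as f:
    # iter(callable, sentinel) keeps calling f.read() until it returns ''
    for name in parse_chunks(iter(lambda: f.read(buffer_size), '')):
        print(name)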
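
For comparison, since this answer says a regex would also get the job done, here is a minimal sketch of what that could look like. It assumes the simple format from the question (every name wrapped in double quotes, no escaped quotes inside a name); the `text` string is a stand-in for the file contents, not the real data.

```python
import re

# Stand-in for the file contents, in the format shown in the question.
text = '"MARIA","SUSAN","ANGELA","JACK"'

# Capture whatever sits between each pair of double quotes.
# Assumption: names never contain an escaped quote character.
names = re.findall(r'"([^"]*)"', text)

print(names)  # ['MARIA', 'SUSAN', 'ANGELA', 'JACK']
```

Unlike the state machine above, this needs the whole text in memory at once, which is fine for a 46KB file.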

Source

2011-10-04 01:56:57 six8

If you can do it more simply, then do it more simply. There is no need for the csv module. I don't think 5000 names or 46KB is enough to worry about.

names = []
with open("names.txt", "r") as f:
    # In case there is more than one line...
    for line in f:
        # extend rather than assign, so names from earlier lines are kept
        names.extend(x.strip().replace('"', '') for x in line.split(","))

print(names)
# should print ['name1', ... , ...]
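
One caveat worth knowing before choosing between the approaches: a plain split on commas only works while no quoted field contains a comma. The sample in the question shows no such field, so this may never matter for that file. The sketch below uses a made-up input line (`"SMITH, JOHN"` is hypothetical, not from the data) to show the difference.

```python
import csv
import io

# Hypothetical input: one field has a comma inside its quotes.
line = '"MARIA","SMITH, JOHN","JACK"'

# Plain split cuts the quoted field in two.
split_names = [x.strip().replace('"', '') for x in line.split(",")]

# csv.reader honours the quotes and keeps the field whole.
csv_names = next(csv.reader(io.StringIO(line)))

print(split_names)  # ['MARIA', 'SMITH', 'JOHN', 'JACK']
print(csv_names)    # ['MARIA', 'SMITH, JOHN', 'JACK']
```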

Source

2011-10-04 01:58:34 dicato
