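Python: searching for words in different files

I want to write a script that collects the names of all files in a directory and then searches each of them for a given set of words. Every time a word is found, the script should print the name of the file and the complete line that contains the word. In addition, I want to write the number of times each word was found to a new file.

This is what I have so far: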

import os

print(os.listdir('./texts'), '\n\n\n')

suchwort = {"computational": 0, "linguistics": 0, "processing": 0, "chunking": 0, "coreference": 0, "html": 0, "machine": 0}
hitlist = './hits.txt'


with open(hitlist, 'a+') as hits:
    for elem in os.listdir('./texts'):
        with open(os.path.join("./texts", elem)) as fh:
            for line in fh:
                words = line.split(' ')
                print(elem, " : ", line)
                for n in words:
                    if n in suchwort:
                        if n in suchwort.keys():
                            suchwort[n] += 1
                        else:
                            suchwort[n] = 1
    for k in suchwort:
        print(k, ":", suchwort[k], file=hits)

In the new file (hits.txt) the result is:

chunking : 0 
machine : 9 
html : 0 
processing : 4 
linguistics : 12 
coreference : 1 
computational : 12

The values seem to be wrong, however, because the word "html" does occur in one of the files.

Source

2015-11-03 Tuấn Phạm

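Unrelated, but `if n in suchwort.keys():` is unnecessary, because that is exactly what `if n in suchwort:` already does.

Back to the question: could this be a casing issue? Try `if n.lower() in suchwort:` instead and see if that helps.

Thanks, but it is not a casing issue; I am only searching for "html" in lowercase.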

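The problem is caused by the way the file is iterated line by line. In the snippet below, each "line" still ends with its newline character, so splitting it leaves that trailing newline attached to the last word of the line.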

with open(os.path.join("./texts", elem)) as fh:
    for line in fh:
        words = line.split(' ')

If you print the "repr" of words,

print(repr(words))

you will see that the last word contains the trailing newline,

['other', 'word\n']

instead of the expected:

['other', 'word']

To fix this, you can call "strip" on each line before processing it:

line = line.strip()

This removes leading and trailing whitespace from the string, including the newline.
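A minimal sketch of how the fix might look in a counting loop. The helper name `count_keywords` and the sample lines are made up for illustration and are not part of the original script; the lines could just as well come from an open file.

```python
def count_keywords(lines, keywords):
    """Count how often each keyword occurs as a space-separated word."""
    counts = {word: 0 for word in keywords}
    for line in lines:
        # strip() drops the trailing newline, so the last word compares cleanly
        for word in line.strip().split(' '):
            if word in counts:
                counts[word] += 1
    return counts


# Hypothetical sample data standing in for the lines of a file
sample = ["some machine\n", "plain html\n"]
result = count_keywords(sample, ["html", "machine", "chunking"])
print(result)  # {'html': 1, 'machine': 1, 'chunking': 0}
```

Calling `line.split()` with no argument has a similar effect, since it splits on any run of whitespace, newlines included.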

Source

2015-11-04 00:30:50 memoselyk

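Thanks! Now I see what I did wrong: because I only split on " ", the word "html" could not be found.

Given that extra information, you could use a regular expression to get whole words, `re.findall(r"\b[a-z]+\b", line, re.I)`, instead of `line.split()`. – memoselyk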
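As a rough illustration of the regular-expression suggestion, here is a small sketch. The names `WORD_RE` and `find_words` and the sample line are assumptions for the example, not something from the thread.

```python
import re

# Runs of ASCII letters delimited by word boundaries, matched case-insensitively
WORD_RE = re.compile(r"\b[a-z]+\b", re.IGNORECASE)


def find_words(line):
    """Return the words of a line in lowercase, ignoring punctuation."""
    return [w.lower() for w in WORD_RE.findall(line)]


# Hypothetical sample line with punctuation and markup around the words
words = find_words("Plain HTML, then <html> tags.\n")
print(words)  # ['plain', 'html', 'then', 'html', 'tags']
```

Compared with splitting on spaces, this also finds "html" when it is followed by a comma or wrapped in angle brackets.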

import glob
import multiprocessing as mp

def filesearcher(qIn, qOut):
    # Take file paths from qIn until the None sentinel arrives
    for fpath in iter(qIn.get, None):
        # Fresh per-file statistics: hit count and the matching lines
        keywords = {"computational": {'count': 0, 'lines': []},
                    "linguistics": {'count': 0, 'lines': []},
                    "processing": {'count': 0, 'lines': []},
                    "chunking": {'count': 0, 'lines': []},
                    "coreference": {'count': 0, 'lines': []},
                    "html": {'count': 0, 'lines': []},
                    "machine": {'count': 0, 'lines': []}}

        with open(fpath) as infile:
            for line in infile:
                for word in line.split():
                    word = word.lower()
                    if word not in keywords:
                        continue
                    keywords[word]['count'] += 1
                    keywords[word]['lines'].append(line)
        # Queue.put takes a single item, so send the pair as one tuple
        qOut.put((fpath, keywords))
    qOut.put(None)  # tell main() that this worker is done


def main():
    numProcs = 4  # fiddle to taste
    qIn, qOut = [mp.Queue() for _ in range(2)]
    procs = [mp.Process(target=filesearcher, args=(qIn, qOut)) for _ in range(numProcs)]
    for p in procs:
        p.start()
    for fpath in glob.glob('./texts/*'):
        qIn.put(fpath)
    for _ in procs:
        qIn.put(None)  # one sentinel per worker

    done = 0
    while done < numProcs:
        d = qOut.get()
        if d is None:
            done += 1
            continue
        fpath, stats = d
        print("showing results for", fpath)
        for word, info in stats.items():
            print(word, ":", info['count'])
            for line in info['lines']:
                print('\t', line)

    for p in procs:
        p.join()


if __name__ == '__main__':
    main()

Source

2015-11-03 20:17:07 inspectorG4dget
