Split a text file into sections, then search those sections for key phrases

I'm new to Python and already a fan of the language. I have a program that does the following:

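1. Opens a text file whose sections are separated by lines of asterisks (***).
2. Uses the split() function to break the text file into the sections separated by those asterisk lines. The asterisk lines are uniform throughout the text file.

I want my code to iterate through these sections and do the following:

3. I have a dictionary whose keys are the "key phrases". The value of every key in the dictionary starts at 0.
4. The code needs to go through each section created by the split and check whether each key in the dictionary is found in that section. If a key phrase is found, the value for that key is incremented by 1.
5. Once the code has gone through a section, counted how many of the keys appear in it, and updated the values accordingly, it should print the dictionary keys and the counts (values) for that section, set the values back to 0, and move on to the next section of text, starting again from #3.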

My code is:

from bs4 import BeautifulSoup
import re
import time
import random
import glob, os
import string


termz = {'does not exceed' : 0, 'shall not exceed' : 0, 'not exceeding' : 0,
    'do not exceed' : 0, 'not to exceed' : 0, 'shall at no time exceed' : 0,
    'shall not be less than' : 0, 'not less than' : 0}
with open('Q:/hello/place/textfile.txt', 'r') as f:
    sections = f.read().split('**************************************************')
    for p in sections[1:]:
        for eachKey in termz.keys():
            if eachKey in p:
                termz[eachKey] = termz.get(eachKey) + 1
                print(termz)


#print(len(sections)) #there are thirty sections

#should be if code encounters ***** then it resets the counters and just moves on....
#so far only can count the phrases over the entire text file....

#GO BACK TO .SPLIT()
# termz = dict.fromkeys(termz,0) #resets the counter

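What it spits out sort of works, but it is not tracking the first section, the last section, or even the whole file. I don't know what it is doing.

The print statement at the end is in the wrong place. The line termz = dict.fromkeys(termz,0) is a way I found to reset the dictionary values to 0, but it is commented out because I don't know how to work it in. Essentially, I'm struggling with Python control structures. If someone could point me in the right direction, that would be great.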

Source

2017-07-06 Th3SniperSpirit

Your code is very close. See the comments below:

termz = {
    'does not exceed': 0,
    'shall not exceed': 0,
    'not exceeding': 0,
    'do not exceed': 0,
    'not to exceed': 0,
    'shall at no time exceed': 0,
    'shall not be less than': 0,
    'not less than': 0
}

with open('Q:/hello/place/textfile.txt', 'r') as f:
    sections = f.read().split('**************************************************')

    # Skip the first section. (I assume this is on purpose?)
    for p in sections[1:]:
        for eachKey in termz:
            if eachKey in p:
                # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
                termz[eachKey] += 1

        # Move this outside of the inner loop
        print(termz)

        # After printing the results for that section, reset the counts
        termz = dict.fromkeys(termz, 0)
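
The same per-section logic can also be sketched as a function that builds a fresh dictionary for every section, so there are no counters to reset. This is only an illustration: the names `count_phrases_by_section`, `DELIMITER`, `sample` and `phrases` are made up here, and the 50-asterisk delimiter is an assumption that should be adjusted to match the real file. Like the code above, it records only whether a phrase appears in a section (0 or 1).

```python
DELIMITER = '*' * 50  # assumption: adjust to the exact separator line in the file


def count_phrases_by_section(text, phrases, delimiter=DELIMITER):
    """Return one {phrase: 0 or 1} dict per section of text."""
    results = []
    for section in text.split(delimiter):
        # A new dict per section means nothing has to be reset afterwards.
        results.append({phrase: int(phrase in section) for phrase in phrases})
    return results


# Hypothetical sample text with three sections.
sample = 'ignored preamble' + DELIMITER + 'shall not exceed' + DELIMITER + 'not less than'
phrases = ['shall not exceed', 'not less than']

# Skip the first section, as the code above does.
for number, counts in enumerate(count_phrases_by_section(sample, phrases)[1:], start=2):
    print(number, counts)
```

Returning a list of dictionaries keeps every section's result available after the loop, which may help when checking which sections were actually processed.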

Edit

Sample input and output:

# Named sample_input rather than input to avoid shadowing the built-in input()
sample_input = '''
Section 1:

This section is ignored.
does not exceed
**************************************************
Section 2:

shall not exceed
not to exceed
**************************************************
Section 3:

not less than'''

termz = {
    'does not exceed': 0,
    'shall not exceed': 0,
    'not exceeding': 0,
    'do not exceed': 0,
    'not to exceed': 0,
    'shall at no time exceed': 0,
    'shall not be less than': 0,
    'not less than': 0
}

sections = sample_input.split('**************************************************')

# Skip the first section. (I assume this is on purpose?)
for p in sections[1:]:
    for eachKey in termz:
        if eachKey in p:
            # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
            termz[eachKey] += 1

    # Move this outside of the inner loop
    print(termz)

    # After printing the results for that section, reset the counts
    termz = dict.fromkeys(termz, 0)

# OUTPUT (the order of the keys depends on the Python version):
# {'not exceeding': 0, 'shall not exceed': 1, 'not less than': 0, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 1, 'do not exceed': 0, 'does not exceed': 0}
# {'not exceeding': 0, 'shall not exceed': 0, 'not less than': 1, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 0, 'do not exceed': 0, 'does not exceed': 0}
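
Note that `if eachKey in p` adds at most 1 per phrase per section, however often the phrase appears. If the goal is the number of occurrences, `str.count` is one option. The sketch below uses a made-up `section` string and a shortened `phrases` list; lowercasing the text is an extra assumption, since the code above is case-sensitive.

```python
# Hypothetical section text, not taken from the asker's file.
section = ('The load shall not exceed 5 kN. '
           'The span Shall Not Exceed 3 m and is not less than 1 m.')
phrases = ['shall not exceed', 'not less than', 'not to exceed']

# str.count returns the number of non-overlapping occurrences of the phrase.
# Lowercasing the section first makes the match case-insensitive.
lowered = section.lower()
occurrences = {phrase: lowered.count(phrase) for phrase in phrases}

print(occurrences)
# {'shall not exceed': 2, 'not less than': 1, 'not to exceed': 0}
```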

Source

2017-07-06 18:38:24 smarx

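Thanks @smarx. It actually outputs the same thing as before... it just prints the dictionary out in one go (which confused me for a while), and on top of that, the output looks fairly random... it doesn't cover the first section, the last section, or anything in order. – Th3SniperSpirit

You may need to share your input. (Maybe make a short dummy version of the file.) I really don't see how the output could be the same... we moved a `print` statement outside of the loop. – smarx

See my edit... I included a sample input and the program's output. It seems to work fine, so I imagine your input is different. – smarx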
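
One way a real file could differ from the sample (purely an assumption, since the asker's input was never posted) is that the separator lines are not exactly the string passed to `split()`, in which case the text is not split where expected. A minimal sketch that splits on any line consisting only of asterisks, using `re.split`; the minimum of three asterisks and the sample `text` are arbitrary choices for illustration.

```python
import re

# Hypothetical text whose separator lines have different lengths.
text = 'intro\n*****\nshall not exceed\n**********\nnot less than\n'

# Split on whole lines made up of 3 or more asterisks (trailing spaces or
# tabs allowed). re.MULTILINE lets ^ and $ match at every line boundary.
sections = re.split(r'^\*{3,}[ \t]*$', text, flags=re.MULTILINE)

print(len(sections))  # 3
for section in sections:
    print(repr(section.strip()))
```

Printing `len(sections)` and the start of each section is a quick check that the file was split where expected before any counting is done.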

if eachKey in p:
    termz[eachKey] += 1  # might do it
    print(termz)

Source

2017-07-06 18:37:52

Definitely a simplified version of that line – Th3SniperSpirit
