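How do I count characters after a condition is met?

In Python, I'm trying to take a text file and go through each character. When I find an uppercase letter, I want to keep track of the number of characters until I find a '?', '!' or '.'. Basically, I'm reading in large text files and trying to work out how many sentences there are and the total number of characters, so I can find the average sentence length. (I know there will be some errors with things like "Mr." or "E.g.", but I can live with the bugs. The data set is so large that the error will be negligible.) How do I count the characters once the condition is met?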

import sys

char = ''
for line in sys.stdin:
    words = line
    for char in words:
        if char.isupper():
            # read each char until you see a ?, ! or . and keep track
            # of the number of characters in the sentence.
            pass

Source

2015-04-04 princess_slayer

http://stackoverflow.com/questions/3549075/regex-to-find-all-sentences-of-text – 2015-04-04 00:28:07

Are you counting across line breaks, or does each sentence sit entirely within a given line? – geoelectric 2015-04-04 00:41:56

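You may want to use the nltk module to tokenize sentences rather than trying to reinvent the wheel. It covers all sorts of corner cases, such as parentheses and other odd sentence structures.

It has a sentence tokenizer, nltk.sent_tokenize. Note that before using it you have to download the English model with nltk.download().

Here is how you would use NLTK to solve your problem: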

import sys
import nltk

sentences = nltk.sent_tokenize(sys.stdin.read())

print(sum(len(s) for s in sentences) / float(len(sentences)))

Source

2015-04-04 00:53:25

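This solution is for when you want to go line by line from stdin, as your current code does. It uses a two-state machine to keep counting across line breaks.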

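import sys

in_a_sentence = False  # True from an uppercase letter until the next terminator
count = 0
lengths = []

for line in sys.stdin:
    for char in line:
        if char.isupper():
            in_a_sentence = True
        elif char in '.?!' and in_a_sentence:
            # End of sentence: the +1 counts the terminator itself.
            lengths.append(count + 1)
            in_a_sentence = False
            count = 0

        if in_a_sentence:
            count += 1

print(lengths)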

Output:

mbp:scratch geo$ python ./count.py
This is a test of the counter. This test includes
line breaks. See? Pretty awesome,
huh!
^D[30, 31, 4, 20]
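
For reuse, the same two-state loop can be wrapped in a function that takes any iterable of lines, which also makes it easy to get the average the question asks for. This is only a sketch: the names `sentence_lengths` and `average` are made up for illustration, and terminators seen outside a sentence (for example the dots of an ellipsis) are ignored.

```python
def sentence_lengths(lines):
    """Return the length of each sentence in an iterable of lines.

    A sentence starts at an uppercase letter and ends at the next
    '.', '?' or '!'. State is carried across lines, so a newline
    inside a sentence counts as one character.
    """
    in_a_sentence = False
    count = 0
    lengths = []
    for line in lines:
        for char in line:
            if char.isupper():
                in_a_sentence = True
            elif char in '.?!' and in_a_sentence:
                lengths.append(count + 1)  # +1 for the terminator
                in_a_sentence = False
                count = 0
            if in_a_sentence:
                count += 1
    return lengths


def average(values):
    """Mean of a list of numbers, or 0.0 for an empty list."""
    return sum(values) / float(len(values)) if values else 0.0


# The sample input from the run above, as it would arrive from stdin.
sample = [
    "This is a test of the counter. This test includes\n",
    "line breaks. See? Pretty awesome,\n",
    "huh!\n",
]

lengths = sentence_lengths(sample)
print(lengths)           # [30, 31, 4, 20]
print(average(lengths))  # 21.25
```

To run it on real input, pass `sys.stdin` (or an open file object) instead of `sample`.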

However, if you are able to read the whole thing into a variable at once, you could do something more like:

import re
import sys

data = sys.stdin.read()
# An uppercase letter, then anything up to and including the next terminator.
lengths = [len(x) for x in re.findall(r'[A-Z][^.?!]*[.?!]', data)]

print(lengths)

That will give you the same results.
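
As a quick check of that claim, here is a sketch that applies the regex to the sample text from the earlier run and then computes the average sentence length. The function name `regex_sentence_lengths` is hypothetical. One difference worth knowing about: `[A-Z]` only matches ASCII capitals, whereas `str.isupper()` also accepts letters such as `É`, so the two approaches can disagree on non-English text.

```python
import re

# Uppercase ASCII letter, any run of non-terminators (newlines included),
# then a single terminator.
SENTENCE = re.compile(r'[A-Z][^.?!]*[.?!]')


def regex_sentence_lengths(text):
    """Return the length of every regex-matched sentence in text."""
    return [len(match) for match in SENTENCE.findall(text)]


text = (
    "This is a test of the counter. This test includes\n"
    "line breaks. See? Pretty awesome,\n"
    "huh!\n"
)

lengths = regex_sentence_lengths(text)
print(lengths)                              # [30, 31, 4, 20]
print(sum(lengths) / float(len(lengths)))   # 21.25

# A sentence that starts with a non-ASCII capital is not matched at all.
print(regex_sentence_lengths(u"\u00c9lan vital."))  # []
```

Because `[^.?!]` matches newlines too, no extra flags are needed for sentences that span several lines.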

Source

2015-04-04 01:08:29 geoelectric
