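Pythonic: collecting arbitrary strings - indexer
First off, the code below runs as is. I'm more of a Ruby programmer, so I'm still feeling my way around in Python, and I'm sure there must be a more DRY way to do what I'm doing below.
I'm building an indexer that creates a dictionary of the terms repeated in a document along with a count for each, and then prints the results one entry per line. Right now it supports phrases of up to four words. Is there a better way to abstract this logic so that I can do the same thing for phrases of arbitrary length without adding more and more conditionals?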
import sys

file=open(sys.argv[1],"r")
wordcount = {}
last_word = ""
last_last_word = ""
last_last_last_word = ""
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
    if last_last_last_word != "":
        if "{} {} {} {}".format(last_last_last_word,last_last_word,last_word,word) not in wordcount:
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] = 1
        else:
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] += 1
    last_last_last_word = last_last_word
    if last_last_word != "":
        if last_last_word + " " + last_word + " " + word not in wordcount:
            wordcount[last_last_word + " " + last_word + " " + word ] = 1
        else:
            wordcount[last_last_word + " " + last_word + " " + word ] += 1
    last_last_word = last_word
    if last_word != "":
        if last_word + " " + word not in wordcount:
            wordcount[last_word + " " + word] = 1
        else:
            wordcount[last_word + " " + word] += 1
    last_word = word

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v
I'm including a fuller sample of input and output. Apologies for the length, but code like this tends to produce a lot of output.
This input:
this is a sample input file an input file will always be all lower case with no punctuation
produces this output:
file 2
input 2
input file 2
an input file 1
all 1
lower case 1
be 1
is 1
file will always 1
an 1
sample 1
case 1
always be all lower 1
this is a 1
will always be 1
sample input file 1
will always 1
is a sample 1
all lower 1
lower case with no 1
no 1
with 1
with no 1
file will always be 1
with no punctuation 1
lower 1
be all lower case 1
no punctuation 1
an input file will 1
input file an 1
file an 1
input file an input 1
always be 1
file an input file 1
be all 1
is a 1
input file will 1
file will 1
an input 1
input file will always 1
will always be all 1
always be all 1
lower case with 1
a sample 1
a sample input file 1
a sample input 1
is a sample input 1
be all lower 1
a 1
sample input file an 1
sample input 1
case with no punctuation 1
all lower case with 1
this 1
always 1
file an input 1
case with 1
case with no 1
will 1
all lower case 1
punctuation 1
this is 1
this is a sample 1
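Note that every single word has been counted, along with every pair of consecutive words, every triple, and every quadruple. I'd like to DRY up this code so that I can make it return counts for word groups of arbitrary size.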
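To make the goal concrete, here is a minimal sketch of one way the generalization could look. It is not the code above: it swaps the hand-maintained `last_word` variables for a sliding window built with `zip`, and the plain dict for `collections.Counter`. The names `count_phrases` and `max_len` are made up for illustration.

```python
from collections import Counter


def count_phrases(words, max_len=4):
    """Count every run of 1..max_len consecutive words (hypothetical helper)."""
    counts = Counter()
    for n in range(1, max_len + 1):
        # Slide a window of n words across the list. zip stops at the
        # shortest slice, so windows that would run past the end are dropped.
        windows = zip(*(words[i:] for i in range(n)))
        counts.update(" ".join(window) for window in windows)
    return counts


text = ("this is a sample input file an input file will always be "
        "all lower case with no punctuation")
counts = count_phrases(text.split(), max_len=4)
```

With this shape, supporting longer phrases only means passing a larger `max_len`, and `counts.most_common()` returns the entries sorted by count, highest first, which could stand in for the `sorted(...)` call in the final loop.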
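So what do you mean by "four word phrases"? Can you give us an example of the input and the expected output? –
I think he means phrases made up of four words. – Pablo
@Pablo: Then how would the four-word phrases be picked out? - To the OP: do you just mean the chunks you get from 'file.read().split()'? –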