2016-01-01 21 views
4

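Pythonic: Collecting arbitrary strings - indexer

First off, the code below runs as-is. I'm more of a Ruby programmer, so I'm still feeling my way around Python, and I'm sure there must be a more DRY way to do what I'm doing below.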

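I'm building an indexer that creates a dictionary of the terms that repeat in a document, along with a count, and then prints the count for each entry. Right now it supports phrases of up to four words. Is there a better way to abstract this logic so that I can do the same thing for phrases of arbitrary length without adding more and more conditionals?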

import sys
file=open(sys.argv[1],"r")
wordcount = {}
last_word = ""
last_last_word = ""
last_last_last_word = ""

for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

    if last_last_last_word != "":
        if "{} {} {} {}".format(last_last_last_word,last_last_word,last_word,word) not in wordcount:
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] = 1
        else:
            wordcount[last_last_last_word + " " + last_last_word + " " + last_word + " " + word ] += 1
    last_last_last_word = last_last_word

    if last_last_word != "":
        if last_last_word + " " + last_word + " " + word not in wordcount:
            wordcount[last_last_word + " " + last_word + " " + word ] = 1
        else:
            wordcount[last_last_word + " " + last_word + " " + word ] += 1
    last_last_word = last_word

    if last_word != "":
        if last_word + " " + word not in wordcount:
            wordcount[last_word + " " + word] = 1
        else:
            wordcount[last_word + " " + word] += 1
    last_word = word

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

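I'm including a fuller sample of input and output. I apologize for the length, but by its nature this code tends to produce a lot of output.

This input:

this is a sample input file an input file will always be all lower case with no punctuation 

produces this output: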

file 2 
input 2 
input file 2 
an input file 1 
all 1 
lower case 1 
be 1 
is 1 
file will always 1 
an 1 
sample 1 
case 1 
always be all lower 1 
this is a 1 
will always be 1 
sample input file 1 
will always 1 
is a sample 1 
all lower 1 
lower case with no 1 
no 1 
with 1 
with no 1 
file will always be 1 
with no punctuation 1 
lower 1 
be all lower case 1 
no punctuation 1 
an input file will 1 
input file an 1 
file an 1 
input file an input 1 
always be 1 
file an input file 1 
be all 1 
is a 1 
input file will 1 
file will 1 
an input 1 
input file will always 1 
will always be all 1 
always be all 1 
lower case with 1 
a sample 1 
a sample input file 1 
a sample input 1 
is a sample input 1 
be all lower 1 
a 1 
sample input file an 1 
sample input 1 
case with no punctuation 1 
all lower case with 1 
this 1 
always 1 
file an input 1 
case with 1 
case with no 1 
will 1 
all lower case 1 
punctuation 1 
this is 1 
this is a sample 1 

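Note that every single word has been counted, along with every pair of words, every group of three words and every group of four words. I want to DRY up this code so that it can return counts for word groups of any size.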

+2

So what do you mean by "four word phrases"? Can you give us an example of the input and the expected output? –

+0

I think he means phrases made of four words. – Pablo

+0

@Pablo: Then how should the four-word phrases be picked out? - To the OP: do you just mean splitting the result of 'file.read().split()' into chunks? –

Answers

0

Here you go. I think this is what you were looking for.

string="this is a sample input file an input file will always be all lower case with no punctuation" 

def words(count): 
    return [" ".join(string.split()[a:b]) for a in range(len(string.split())) for b in range(a+count+1) if len(string.split()[a:b]) == count] 

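It works by slicing the input text, and it returns a list of phrases of the requested length.

Call the function with the length of the sequences you are looking for.

lst = words(3) 

Then loop over the results to get the counts: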

for word in set(lst): 
    print word, lst.count(word) 

an input file 1 
file will always 1 
is a sample 1 
be all lower 1 
file an input 1 
with no punctuation 1 
input file will 1 
lower case with 1 
this is a 1 
always be all 1 
will always be 1 
sample input file 1 
a sample input 1 
all lower case 1 
case with no 1 
input file an 1 

Yes, as the comments point out, this is not a very suitable approach, so I have to apologize for that.
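On the cost raised in the comments: calling lst.count(word) once per distinct phrase rescans the whole list every time. As a rough sketch of a single-pass alternative for one fixed phrase length, assuming Python 3 (the name ngram_counts is made up for illustration and is not from the answer above):

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive words in text in a single pass."""
    words = text.split()
    # Zipping n staggered views of the word list yields each n-word window;
    # zip stops at the shortest view, so no partial window is produced.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)

sample = ("this is a sample input file an input file will always be "
          "all lower case with no punctuation")
for phrase, count in ngram_counts(sample, 3).most_common(3):
    print(phrase, count)
```

Counter does the tallying while the windows are generated, so the list of phrases is walked only once.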

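You said you want to extract phrases of arbitrary length, so in case my first assumption was wrong, here is another solution that gives you the counts of the phrase combinations without using the .count() method.

Be aware that with this approach the entire text also counts as one phrase, so make sure you really know which phrase lengths you want.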

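words_list = string.split()
words_dict = {}

# a is the exclusive end index of each slice, so it has to run up to
# len(words_list) for phrases ending on the last word to be counted.
for a in range(len(words_list) + 1):
    for b in range(a):
        phrase = " ".join(words_list[b:a])
        if phrase in words_dict:
            words_dict[phrase] += 1
        else:
            words_dict[phrase] = 1

for i in words_dict:
    print i, words_dict[i]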

This gives you phrases of every length.
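If only phrases up to some maximum length are wanted, the same start/end slicing idea can be capped. The following is a sketch under that assumption, written for Python 3; count_phrases_up_to and max_len are illustrative names, not part of the answer above:

```python
from collections import Counter

def count_phrases_up_to(text, max_len):
    """Count every phrase of 1..max_len consecutive words in text."""
    words = text.split()
    counts = Counter()
    for start in range(len(words)):
        # The end index stops at max_len words or at the end of the text,
        # whichever comes first.
        last_end = min(start + max_len, len(words))
        for end in range(start + 1, last_end + 1):
            counts[" ".join(words[start:end])] += 1
    return counts

sample = ("this is a sample input file an input file will always be "
          "all lower case with no punctuation")
counts = count_phrases_up_to(sample, 4)
for phrase, count in counts.most_common(3):
    print(phrase, count)
```

With max_len set to 4 this should cover the same one- to four-word phrases as the code in the question, while a larger max_len needs no extra conditionals.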

+0

This does not match the output given in the question. – UtsavShah

+0

It is also very inefficient; calling list.count is a really bad way to get the counts. –

+0

Let me try something else, just a moment. – Rockybilly

0

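Here is a quick refactoring of your code; defaultdict is your friend.

It takes the maximum number of words per phrase as the second command-line argument.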

import sys 
from collections import defaultdict 

file=open(sys.argv[1],"r") 

wordcount = defaultdict(int) 
wordlist = ["" for i in range(int(sys.argv[2]))] 

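def check(wordcount, wordlist, word):
    # Slide the window first: drop the oldest slot and add the new word,
    # so the window never holds more words than requested.
    wordlist = wordlist[1:] + [word]
    for i, w in enumerate(wordlist):
        if w != "":
            # Join with single spaces so the keys carry no trailing space.
            wordcount[" ".join(wordlist[i:])] += 1

    return wordlist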

for word in file.read().split(): 
    wordlist = check(wordcount, wordlist, word) 

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True): 
    print k,v 
+0

This still isn't DRY. –

+0

Also, I think you have broken the logic for four-word phrases. –

+0

@DavidHoelzer How does it look now? – UtsavShah

0

Updated to make it lazy

from collections import Counter 
import itertools 
import operator as op 


def count_phrases(words, phrase_len):
    # One Counter per phrase length, summed into a single Counter.
    return reduce(op.add,
                  (Counter(tuple(words[i:i+l]) for i in xrange(len(words)-l+1))
                   for l in phrase_len))

Example:

words = "a b c a a".split()
for phrase, count in count_phrases(words, [1, 2]).iteritems():
    print " ".join(phrase), count

Output:

b c 1 
a 3 
c 1 
b 1 
c a 1 
a a 1 
a b 1 
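For readers on Python 3, where reduce lives in functools and xrange and iteritems no longer exist, the same idea can be sketched as follows. Note the swap: this version feeds one Counter through Counter.update instead of summing a Counter per length, which is a different technique from the reduce call above. The name phrase_lens is an assumption made for clarity.

```python
from collections import Counter

def count_phrases(words, phrase_lens):
    """Count word tuples of each requested length in a list of words."""
    counts = Counter()
    for n in phrase_lens:
        # range() is empty when n exceeds len(words), so oversized
        # lengths simply contribute nothing.
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

words = "a b c a a".split()
result = count_phrases(words, [1, 2])
for phrase, count in result.most_common():
    print(" ".join(phrase), count)
```

Keeping the keys as tuples and joining them only when printing means the spaces between words are added at output time.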
+0

Why the downvote? –

+0

I didn't downvote, but the logic is broken. The spaces are required, and your solution no longer preserves them. – UtsavShah

+0

@UtsavShah How do the spaces come into it? –

0

Check this:

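from collections import Counter

def parser(data, size):
    chunked = data.split()
    phrases = []
    # The range stops at len(chunked) - size, so the final window of each
    # line is not emitted (see the output below).
    for i in xrange(len(chunked) - size):
        phrase = ' '.join(chunked[i:size+i])
        phrases.append(phrase)
    return phrases

def parse_file(fname, size):
    result = []
    with open(fname, 'r') as f:
        for data in f.readlines():
            # xrange(1, size) covers phrase lengths 1 .. size-1.
            for i in xrange(1, size):
                result += parser(data.strip(), i)

    return Counter(result)


result = parse_file('file.txt', 4)
print sorted(result.items(), key=lambda x: x[1], reverse=True)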

[('file', 2), 
('input', 2), 
('input file', 2), 
('an input file', 1), 
('all', 1), 
('always be all', 1), 
('is', 1), 
('an', 1), 
('sample', 1), 
('this is a', 1), 
('will always be', 1), 
('sample input file', 1), 
('will always', 1), 
('is a sample', 1), 
('all lower', 1), 
('no', 1), 
('with no', 1), 
('lower case', 1), 
('case', 1), 
('input file will', 1), 
('case with no', 1), 
('input file an', 1), 
('file an', 1), 
('be', 1), 
('always be', 1), 
('be all lower', 1), 
('be all', 1), 
('lower', 1), 
('is a', 1), 
('an input', 1), 
('a sample input', 1), 
('lower case with', 1), 
('a sample', 1), 
('file will', 1), 
('with', 1), 
('a', 1), 
('file will always', 1), 
('sample input', 1), 
('this', 1), 
('always', 1), 
('file an input', 1), 
('case with', 1), 
('will', 1), 
('all lower case', 1), 
('this is', 1)] 
+0

You have opened a file without a context manager and forgotten to close it. –

+1

The files this will run on are going to be huge. Reading the whole file into memory first does not look optimal. –

+0

You could also use yield. I can update the code if that is the only problem –

0

My two cents

import sys
file=open(sys.argv[1],"r")
wordcount = {}
nb_words = 4
last_words = []

for word in file.read().split():
    # Keep the most recent nb_words words, oldest first.
    last_words.append(word)
    if len(last_words) > nb_words:
        last_words.pop(0)
    # Every suffix of the window is a phrase ending at the current word.
    for i in range(len(last_words)):
        key = ' '.join(last_words[i:])
        if key not in wordcount:
            wordcount[key] = 1
        else:
            wordcount[key] += 1

for k,v in sorted(wordcount.items(), key=lambda x:x[1], reverse=True):
    print k,v

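I wrote a loop to replace the individual variables, so now you have a parameter (nb_words) that lets you go beyond 4 words. Edit: after some bug fixes, I'm now sure it produces the same output.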
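The manual trimming of the list can also be left to collections.deque, whose maxlen argument discards the oldest item automatically. A minimal Python 3 sketch of that variation (count_with_window is a hypothetical name, and it takes any iterable of words rather than reading sys.argv):

```python
from collections import Counter, deque

def count_with_window(words, max_len):
    """Count all phrases of 1..max_len consecutive words from an iterable."""
    counts = Counter()
    window = deque(maxlen=max_len)  # the oldest word falls off automatically
    for word in words:
        window.append(word)
        recent = list(window)
        # Every suffix of the window is a phrase ending at the current word.
        for start in range(len(recent)):
            counts[" ".join(recent[start:])] += 1
    return counts

sample = ("this is a sample input file an input file will always be "
          "all lower case with no punctuation")
window_counts = count_with_window(sample.split(), 4)
for phrase, count in window_counts.most_common(3):
    print(phrase, count)
```

Because only the last max_len words are held at any time, the function works the same whether the words come from a list or from a generator.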

3

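If you are concerned about a huge file (possibly one that does not even have line endings to allow line-by-line iteration), you can memory-map it (which keeps memory usage low), use a regular expression to isolate all the lowercase words, build a sliding window of N words, and then update a Counter accordingly, for example: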

import re 
import mmap 
from itertools import islice, izip, tee 
from collections import Counter 
from pprint import pprint 

def word_grouper(filename, size):
    counts = Counter()
    with open(filename) as fin:
        mm = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
        words = (m.group() for m in re.finditer('[a-z]+', mm))
        sliding = [islice(w, n, None) for n, w in enumerate(tee(words, size+1))]
        for slide in izip(*sliding):
            counts.update(slide[:n] for n in range(1, len(slide)))

    return counts

counts = word_grouper('input filename', 4) 
# do appropriate formatting instead of just `pprint`ing 
pprint(counts.most_common()) 

Example output (where the input file contains your sample string):

[(('file',), 2), 
(('input', 'file'), 2), 
(('input',), 2), 
(('a', 'sample', 'input'), 1), 
(('file', 'will', 'always', 'be'), 1), 
(('sample', 'input', 'file', 'an'), 1), 
(('this', 'is', 'a', 'sample'), 1), 
(('this', 'is'), 1), 
(('will',), 1), 
(('lower', 'case', 'with'), 1), 
(('an', 'input', 'file', 'will'), 1), 
(('sample', 'input'), 1), 
(('is', 'a'), 1), 
(('all', 'lower', 'case', 'with'), 1), 
(('input', 'file', 'will'), 1), 
(('an',), 1), 
(('always', 'be'), 1), 
(('lower', 'case', 'with', 'no'), 1), 
(('an', 'input'), 1), 
(('be', 'all', 'lower'), 1), 
(('this',), 1), 
(('be', 'all', 'lower', 'case'), 1), 
(('this', 'is', 'a'), 1), 
(('sample',), 1), 
(('sample', 'input', 'file'), 1), 
(('will', 'always', 'be', 'all'), 1), 
(('a',), 1), 
(('a', 'sample'), 1), 
(('is', 'a', 'sample'), 1), 
(('will', 'always'), 1), 
(('lower',), 1), 
(('lower', 'case'), 1), 
(('file', 'an'), 1), 
(('file', 'an', 'input'), 1), 
(('file', 'will'), 1), 
(('is',), 1), 
(('all', 'lower'), 1), 
(('input', 'file', 'an', 'input'), 1), 
(('always', 'be', 'all', 'lower'), 1), 
(('an', 'input', 'file'), 1), 
(('input', 'file', 'an'), 1), 
(('be', 'all'), 1), 
(('input', 'file', 'will', 'always'), 1), 
(('be',), 1), 
(('all',), 1), 
(('always', 'be', 'all'), 1), 
(('is', 'a', 'sample', 'input'), 1), 
(('always',), 1), 
(('all', 'lower', 'case'), 1), 
(('file', 'an', 'input', 'file'), 1), 
(('file', 'will', 'always'), 1), 
(('a', 'sample', 'input', 'file'), 1), 
(('will', 'always', 'be'), 1)]
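Where memory-mapping is not an option (mmap needs a real file descriptor, so it does not help with a pipe or stdin), a similar low-memory effect can be approximated by reading fixed-size chunks and carrying any half-read word over to the next chunk. This is only a sketch for Python 3 text-mode files; iter_words, count_stream and chunk_size are assumed names, and it splits on whitespace rather than using the [a-z]+ pattern above.

```python
import io
from collections import Counter, deque

def iter_words(fileobj, chunk_size=65536):
    """Yield whitespace-separated words, reading fileobj chunk by chunk."""
    tail = ""  # partial word carried over from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split()
        tail = ""
        # A chunk that does not end in whitespace may have cut a word in half,
        # so hold its last piece back until the next chunk arrives.
        if not chunk[-1].isspace():
            tail = parts.pop()
        for word in parts:
            yield word
    if tail:
        yield tail

def count_stream(fileobj, max_len, chunk_size=65536):
    """Count phrases of 1..max_len words without loading the whole file."""
    counts = Counter()
    window = deque(maxlen=max_len)
    for word in iter_words(fileobj, chunk_size):
        window.append(word)
        recent = list(window)
        for start in range(len(recent)):
            counts[" ".join(recent[start:])] += 1
    return counts

# A tiny chunk size is used here only to exercise the carry-over logic.
demo = io.StringIO("this is a sample input file an input file")
demo_counts = count_stream(demo, 4, chunk_size=5)
print(demo_counts["input file"])
```

Memory use is bounded by the chunk size plus the Counter itself, which still grows with the number of distinct phrases.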