迭代字符串列表中的字符的最快對象

我遍歷單詞列表來查找單詞之間最常用的字符（即在列表中[hello, hank],'h'計數爲出現兩次，而'l'計爲出現一次。）。一個python列表工作正常，但我也在研究NumPy（dtype數組？）和Pandas。看起來Numpy可能會走，但是還有其他的軟件包需要考慮嗎？我該怎樣才能使這個功能更快？迭代字符串列表中的字符的最快對象

守則問：

def mostCommon(guessed, li): 
    count = Counter() 
    for words in li: 
      for letters in set(words): 
       count[letters]+=1 
    return count.most_common()[:10]

感謝。

來源

2017-01-13 ZtoYi

燦你解釋「最常見的獨特性格」是什麼意思？幷包含一些示例輸入和輸出數據 –

@Chris_Rands如果您需要更多，請使用示例編輯lmk。 – ZtoYi

所以你只想要最頻繁的角色或所有角色的頻率？ –

下面是使用它的views-concept一個NumPy的方法 -

def tabulate_occurrences(a):   # Case sensitive 
    chars = np.asarray(a).view('S1') 
    valid_chars = chars[chars!=''] 
    unqchars, count = np.unique(valid_chars, return_counts=1) 
    return pd.DataFrame({'char':unqchars, 'count':count}) 

def topNchars(a, N = 10):    # Case insensitive 
    s = np.core.defchararray.lower(a).view('uint8') 
    unq, count = np.unique(s[s!=0], return_counts=1) 
    sidx = count.argsort()[-N:][::-1] 
    h = unq[sidx] 
    return [str(unichr(i)) for i in h]

採樣運行 -

In [322]: a = ['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS'] 

In [323]: tabulate_occurrences(a) # Case sensitive 
Out[323]: 
    char count 
0 I  3 
1 S  2 
2 e  2 
3 i  1 
4 o  1 
5 r  2 
6 s  2 
7 u  1 
8 y  1 

In [533]: topNchars(a, 5)   # Case insensitive 
Out[533]: ['s', 'i', 'r', 'e', 'y'] 

In [534]: topNchars(a, 10)  # Case insensitive 
Out[534]: ['s', 'i', 'r', 'e', 'y', 'u', 'o']

來源

2017-01-13 19:21:16 Divakar

Hm。我只是計時，看起來比常規列表解決方案慢。 – ZtoYi

@ZtoYi您是否需要實際發生的前10位字符的計數，或者只是想查找這些字符？ – Divakar

只是人物會很好。 – ZtoYi

-1

只是做

counter = Counter(''.join(li)) 
most_common = counter.most_common()

，大功告成

來源

2017-01-13 18:49:15

這看起來是非常快的已經，並在O(n)運行。我所看到的唯一真正的改進機會是將li分成多個部分，從而將此過程並行化。

來源

2017-01-13 18:54:16

這是一個O（n）的問題，但想知道是否使用不同的對象可能有助於提高速度。（即這篇文章http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists） – ZtoYi

選項1

def pir1(li): 
    sets = [set(s) for s in li] 
    ul = np.array(list(set.union(*sets))) 
    us = np.apply_along_axis(set, 1, ul[:, None]) 
    c = (sets >= us).sum(1) 
    a = c.argsort()[:-11:-1] 
    return ul[a]

選項2

def pir2(li): 
    return Counter(chain.from_iterable([list(set(i)) for i in li])).most_common(10)

假設單詞列表li

import pandas as pd 
import numpy as np 
from string import ascii_lowercase 

li = pd.DataFrame(
    np.random.choice(list(ascii_lowercase), (1000, 10)) 
).sum(1).tolist()

包括Divakar的和OP的功能

def tabulate_occurrences(a): 
    chars = np.asarray(a).view('S1') 
    valid_chars = chars[chars!=''] 
    unqchars, count = np.unique(valid_chars, return_counts=1) 
    return pd.DataFrame({'char':unqchars, 'count':count}) 

def topNchars(a, N = 10): 
    s = np.core.defchararray.lower(a).view('uint8') 
    unq, count = np.unique(s[s!=0], return_counts=1) 
    sidx = count.argsort()[-N:][::-1] 
    h = unq[sidx] 
    return [str(chr(i)) for i in h] 

def mostCommon(li): 
    count = Counter() 
    for words in li: 
      for letters in set(words): 
       count[letters]+=1 
    return count.most_common()[:10]

測試

import pandas as pd 
import numpy as np 
from string import ascii_lowercase 
from timeit import timeit 

results = pd.DataFrame(
    index=pd.RangeIndex(5, 405, 5, name='No. Words'), 
    columns=pd.Index('pir1 pir2 mostCommon topNchars'.split(), name='Method'), 
) 

np.random.seed([3,1415]) 
for i in results.index:  
    li = pd.DataFrame(
     np.random.choice(list(ascii_lowercase), (i, 10)) 
    ).sum(1).tolist() 
    for j in results.columns: 
     v = timeit(
      '{}(li)'.format(j), 
      'from __main__ import {}, li'.format(j), 
      number=100 
     ) 
     results.set_value(i, j, v) 

ax = results.plot(title='Time Testing') 
ax.set_ylabel('Time of 100 iterations')

來源

2017-01-13 18:57:18 piRSquared

所以，最後的O/P將是一個最大的字符，對不對？ – Divakar

@Divakar基於他們的代碼，他們想要前10名。同時評論其他答案「如果我希望下一個最頻繁的？」 – piRSquared

@piRSquared當你吐出來的時候，我正在嘗試這些選項。謝謝！你能按照我能得到前10名字符的方式排序它們嗎（關係不重要）？謝謝。 – ZtoYi

假設你只需要最常見的字符，每個字符只有每字計數一次：

>>> from itertools import chain 
>>> l = ['hello', 'hank'] 
>>> chars = list(chain.from_iterable([list(set(word)) for word in l])) 
>>> max(chars, key=chars.count) 
'h'

使用max與list.count比使用Counter快了很多，由於C級執行。

來源

2017-01-13 19:09:07

如果我想要下一個最頻繁的信件，該怎麼辦？ – ZtoYi

@ZtoYi然後你可能不想使用這種方法;儘管仍然有可能，但可以使用'heapq'或者對'chars'列表進行排序，但是兩者都會增加相當多的開銷 –

這裏是uniqueifies每個字符串一個純Python的解決方案，加入集合，然後計算結果（使用Divakar的例子列表）

>>> li=['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS'] 
>>> Counter(e for sl in map(list, map(set, li)) for e in sl) 
Counter({'I': 3, 'e': 2, 's': 2, 'S': 2, 'r': 2, 'o': 1, 'i': 1, 'u': 1, 'y': 1})

如果你想上限和下限案件算作同一封信：

>>> Counter(e for sl in map(list, map(set, [s.lower() for s in li])) for e in sl) 
Counter({'i': 4, 's': 4, 'e': 2, 'r': 2, 'o': 1, 'u': 1, 'y': 1})

現在，讓我們一次：

from __future__ import print_function 
from collections import Counter 
import numpy as np 
import pandas as pd 

def dawg(li): 
    return Counter(e for sl in map(list, map(set, li)) for e in sl) 

def nump(a): 
    chars = np.asarray(a).view('S1') 
    valid_chars = chars[chars!=''] 
    unqchars, count = np.unique(valid_chars, return_counts=1) 
    return pd.DataFrame({'char':unqchars, 'count':count}) 

if __name__=='__main__': 
    import timeit 
    li=['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS'] 
    for f in (dawg, nump): 
     print(" ",f.__name__, timeit.timeit("f(li)", setup="from __main__ import f, li", number=100))

結果：

dawg 0.00134205818176 
nump 0.0347728729248

Python的解決方案顯著加快在這種情況下

來源

2017-01-13 19:50:18 dawg

迭代字符串列表中的字符的最快對象

回答

相關問題