2015-10-06 95 views
0

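Count occurrences of elements from a list in a string?

I'm trying to count the number of times verbal contractions appear in some speeches I've collected. One particular speech looks like this: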

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

So, in this case, I would like to count four (4) contractions. I have a list of contractions, and here are the first few terms:

contractions = {"ain't": "am not; are not; is not; has not; have not", 
"aren't": "are not; am not", 
"can't": "cannot",...} 

My code looks something like this, to start with:

count = 0 
for word in speech: 
    if word in contractions: 
        count = count + 1
print count 

I'm not getting anywhere with this, however, because the code iterates over every single letter rather than over whole words.

+5

for word in speech.split(' '): – Monkpit

+0

I don't get what the values in your dictionary are for. By the way, you have a dictionary, not a list. –

+0

I've added a lot to my answer that should give you some extras. – colidyre

Answers

5

Use str.split() to split the string on whitespace:

for word in speech.split(): 

This splits on arbitrary whitespace; that means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.
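As a quick illustration of that behaviour, here is a minimal sketch for Python 3. The sample text is made up for the demonstration; it mixes a tab, newlines, repeated spaces, a form feed, and a Unicode em space.

```python
# Sample text mixing several kinds of whitespace (made up for illustration):
# a tab, two newlines, three spaces, a form feed (\x0c) and an em space (\u2003).
text = "I've\tchanged\n\nthe   path\x0cof\u2003the economy"

# With no argument, split() treats any run of whitespace as one separator.
words = text.split()
print(words)
```

Every word comes out as its own list element, with no empty strings in between.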

You may also want to lowercase your words with str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:

from string import punctuation 

count = 0 
for word in speech.lower().split(): 
    word = word.strip(punctuation) 
    if word in contractions: 
        count += 1

I used the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
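Put together, the approach can be sketched as a small function, written here for Python 3. The function name `count_contractions` and the shortened dictionary are assumptions made for this example; only the dictionary keys matter for counting.

```python
from string import punctuation

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

# Hypothetical, shortened dictionary; keys are lowercase to match the lowercased text.
contractions = {
    "i've": "I have",
    "we're": "we are",
    "you've": "you have",
    "can't": "cannot",
}

def count_contractions(text, known):
    """Count the words in text that appear as keys in known."""
    count = 0
    for word in text.lower().split():
        # strip() only touches the ends, so the apostrophe inside "i've" survives.
        word = word.strip(punctuation)
        if word in known:
            count += 1
    return count

result = count_contractions(speech, contractions)
print(result)  # 4
```

Note that the lone "-" in the speech becomes an empty string after stripping, which is simply not found in the dictionary.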

1

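You are iterating over a string, so the items are characters. To get words out of a string you can use a naive method such as str.split(), which builds a list of strings that you can then iterate over (the words, split on the argument passed to str.split(); by default it splits on whitespace). There is also re.split(), which is more powerful, but I don't think you need to split the text with regular expressions.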
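For comparison, this is roughly what the re.split() route looks like in Python 3. The sample sentence and the choice of separators (whitespace, commas, semicolons) are assumptions made for illustration only.

```python
import re

# Made-up sample with commas, a semicolon and a tab between words.
text = "I've changed the path, and I've increased jobs; we're headed\tthe right way"

# Split on any run of whitespace, commas or semicolons.
words = re.split(r"[\s,;]+", text)
print(words)
```

One caveat: if the text starts or ends with a separator, re.split() returns an empty string at that end, so you may want to filter those out.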

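What you do have to do, at the very least, is lowercase the string with str.lower(), or else put every possible spelling (including capitalized forms) into the dictionary. I strongly recommend the first alternative; the latter isn't really practical. Removing punctuation is also a must. All of this is still naive, though. If you need a more sophisticated approach, you have to split the text with a word tokenizer. NLTK is a good starting point; see the nltk tokenizer. But I strongly believe this isn't your main problem and doesn't really affect solving your question. :)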
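As a middle ground between str.split() and a full tokenizer, here is a rough regex-based sketch for Python 3. It is not NLTK: the pattern, the function name `contraction_counts` and the shortened dictionary are all assumptions for illustration. It also returns a per-contraction tally rather than a single number.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, optionally joined by straight apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def contraction_counts(text, known):
    """Return a Counter of the known contractions found in text."""
    tokens = WORD_RE.findall(text.lower())
    return Counter(token for token in tokens if token in known)

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

# Hypothetical, shortened dictionary with lowercase keys.
contractions = {"i've": ["i have"], "we're": ["we are"], "you've": ["you have"]}

counts = contraction_counts(speech, contractions)
total = sum(counts.values())
print(counts["i've"], total)  # 2 4
```

Because the pattern only matches letters and straight apostrophes, punctuation never ends up in a token, but typographic apostrophes (’) would need to be normalized first.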

speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help.""" 
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter. 
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ... 

# With re you can define advanced regexes, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# is still enough for you.
import re 

def abbreviation_counter(input_text, abbreviation_dict): 
    count = 0 
    # What you want is a list of words, and str.split() does this job for you.
    # Note that split(" ") splits on single spaces only; calling split() with no
    # argument splits on any run of whitespace. If you really need better
    # methods (see the answer text above), you have to use a word tokenizer
    # or write your own.
    for word in input_text.split(" "):
        # Also clean the word (remove ',', ';', ...) afterwards. The advantage of
        # using re over `from string import punctuation` is that you have more
        # control over what you want to remove: you can easily add or remove any
        # punctuation mark. That can be very handy, but it can also be
        # overkill. If it is, just stick to Martijn Pieters'
        # solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1

    return count 

print(abbreviation_counter(speech, contractions))
# prints 2 - yeah, it worked: I've added "i've" to your dictionary :)

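It's a little frustrating to answer at the same time as Martijn Pieters did ;), but I hope I've still provided some value for you. That's why I edited my answer to give some hints for future work.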

+0

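Thanks for your input, but I've already moved on from this problem. Your solution did work, though! I just didn't want to go back and reformat my entire `contractions` dictionary :) – blacksite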

+0

Yes, it was only a suggestion. If it helped in any way, I'd be glad to get some credit for my work. :) – colidyre

+0

I've got you :) – blacksite

0

A for loop in Python iterates over all the elements of an iterable. In the case of a string, the elements are characters.
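A tiny sketch (Python 3, with a shortened sample string chosen for illustration) makes the difference visible:

```python
speech = "I've changed"

# Iterating over the string itself yields one character at a time.
chars = [item for item in speech]

# Iterating over the result of split() yields whole words.
words = [item for item in speech.split()]

print(chars[:4])  # ['I', "'", 'v', 'e']
print(words)      # ["I've", 'changed']
```

Since no single character is a key in the contractions dictionary, the original loop never finds a match.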

You need to split the string into a list (or tuple) of strings, one per word. You can use .split(delimiter) for this; it returns a list.

Your problem is quite common, so Python has a shortcut: speech.split() splits on any number of spaces/tabs/newlines, so you get only your words in the list.

So, your code should look like this:

count = 0 
for word in speech.split(): 
    if word in contractions: 
        count = count + 1
print(count) 

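speech.split(" ") works too, but it splits only on spaces, not on tabs or newlines, and if there are double spaces you will get empty elements in your result list.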
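To see the difference side by side, here is a short Python 3 sketch; the sample text with a double space, a tab and a newline is made up for illustration.

```python
# Made-up sample: a double space, a tab and a newline between words.
text = "we're  headed\tin the\nright direction"

by_space = text.split(" ")   # splits on every single space only
by_whitespace = text.split() # splits on any run of whitespace

print(by_space)
print(by_whitespace)
```

With split(" "), the double space produces an empty string, and "headed\tin" and "the\nright" stay glued together; split() returns the six words cleanly.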
