2017-05-01 94 views
-1

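List index out of range when reading a CSV file

I want to do some simple processing of Twitter data: counting the most frequent words in the dataset.

However, I keep getting the following error at line 45: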

IndexError Traceback (most recent call last) <ipython-input 346-f03e745247f4> in <module>() 
43 for line in f: 
44 parts = re.split("^\d+\s", line) 
45 tweet = re.split("\s(Status)", parts[-1])[10] 
46 tweet = tweet.replace("\\n"," ") 
47 terms_all = [term for term in process_tweet(tweet)] 
IndexError: list index out of range 

I have added my complete code below for review. Could someone please advise?

import codecs 
import re 
from collections import Counter 
from nltk.corpus import stopwords 

word_counter = Counter() 

def punctuation_symbols(): 
    return [".", "", "$","%","&",";",":","-","&amp;","?"] 

def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False

def strip_quotes(word):
    if word.endswith(""):
        word = word[0:-1]
    if word.startswith(""):
        word = word[1:]
    return word

def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n"," ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)

        n += 1
        if n == 50:
            break

print(word_counter.most_common(10)) 
+1

The traceback you shared refers to code that differs from what you pasted below it. In particular, 'tweet = re.split("\s(Status)", parts[-1])[10]' versus 'tweet = re.split("\s(Status)", parts[1])[0]'. Can you clarify? – etemple1

+0

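@etemple1: Apologies, it should also be 1, 0. I tried different combinations, and the traceback was generated by an earlier iteration. Any idea why [1], [0] won't work? Also, to clarify: does n = 0 set the index, and does [1] correctly define the start of the line? –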

+0

By the way, '[term for term in process_tweet(tweet)]' is equivalent to 'list(process_tweet(tweet))', which in your case is equivalent to 'process_tweet(tweet)'. – 9000

Answer

-1
parts = re.split("^\d+\s", line) 
tweet = re.split("\s(Status)", parts[1])[0] 

These are most likely the problematic lines.

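You are assuming that parts was actually split and has more than one element. The split may not find the separator pattern in line, in which case parts equals [line] and has a single element. Then parts[1] raises the IndexError.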
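
To illustrate the point, here is a minimal sketch of how re.split behaves with and without a match. The two sample lines are hypothetical, since the actual contents of the CSV file are not shown in the question.

```python
import re

# Hypothetical sample lines; the real file contents are not known.
matching_line = "123 some tweet text Status"
non_matching_line = "some tweet text without a leading number"

# The pattern matches a leading run of digits followed by whitespace.
parts_ok = re.split(r"^\d+\s", matching_line)
parts_bad = re.split(r"^\d+\s", non_matching_line)

print(len(parts_ok), parts_ok)    # two elements: parts_ok[1] exists
print(len(parts_bad), parts_bad)  # one element: parts_bad[1] would raise IndexError
```

When the pattern is not found, re.split returns a list containing only the original string, so index 1 does not exist.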

Add a check before the second line. Print the value of line to get a better idea of what is going on.
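
One possible way to add such a check is sketched below. The helper name extract_tweet is hypothetical and not part of the original code, and skipping non-matching lines is an assumption; depending on the data, a different fallback may be more appropriate.

```python
import re

def extract_tweet(line):
    """Return the tweet text from one line, or None if the line lacks the expected prefix."""
    parts = re.split(r"^\d+\s", line)
    if len(parts) < 2:
        # No leading "<digits><whitespace>" prefix was found:
        # show the offending line and skip it instead of indexing parts[1].
        print("Skipping unexpected line:", repr(line))
        return None
    tweet = re.split(r"\s(Status)", parts[1])[0]
    return tweet.replace("\\n", " ")
```

Inside the loop this could be used as 'tweet = extract_tweet(line)', followed by 'continue' when the result is None. The printed lines should show which rows of the file do not start with a number.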