List index out of range when reading a CSV file

-1

I want to do some simple processing of Twitter data: counting the most frequent words that occur in the dataset.

However, I keep getting the following error on line 45:

IndexError Traceback (most recent call last) <ipython-input 346-f03e745247f4> in <module>() 
43 for line in f: 
44 parts = re.split("^\d+\s", line) 
45 tweet = re.split("\s(Status)", parts[-1])[10] 
46 tweet = tweet.replace("\\n"," ") 
47 terms_all = [term for term in process_tweet(tweet)] 
IndexError: list index out of range

I have added my complete code below for review. Could someone please advise?

import codecs
import re
from collections import Counter
from nltk.corpus import stopwords

word_counter = Counter()

def punctuation_symbols():
    return [".", "", "$","%","&",";",":","-","&amp;","?"]

def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False

def strip_quotes(word):
    # Note: "" matches every string, so both branches always run.
    if word.endswith(""):
        word = word[0:-1]
    if word.startswith(""):
        word = word[1:]
    return word

def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        # Split off the leading numeric prefix, then keep the text before "Status".
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n"," ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)

        # Stop after the first 50 lines.
        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))

Source

2017-05-01 Ankhit Sharma

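The traceback you shared refers to code that differs from what you pasted below it. Specifically, 'tweet = re.split("\s(Status)", parts[-1])[10]' versus 'tweet = re.split("\s(Status)", parts[1])[0]'. Can you clarify? – etemple1

@etemple1: Apologies, it should be 1, 0. I tried different combinations, and the traceback was generated by an earlier iteration. Any idea why [1], [0] won't work? Also, to clarify: does n = 0 set the index, and does [1] correctly define where the line starts? –

By the way, '[term for term in process_tweet(tweet)]' is equivalent to 'list(process_tweet(tweet))', which in your case is equivalent to just 'process_tweet(tweet)'. – 9000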
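To illustrate the equivalence mentioned in the last comment, here is a small sketch. The `process_tweet` below is a simplified stand-in (an assumption, not the function from the question); the only property that matters is that it already returns a list.

```python
def process_tweet(tweet):
    # Simplified stand-in for the question's function: it also returns a list.
    return [word.lower() for word in tweet.split(" ") if word]

tweet = "Hello big World"  # hypothetical sample text

via_comprehension = [term for term in process_tweet(tweet)]
via_list = list(process_tweet(tweet))
direct = process_tweet(tweet)

# All three produce the same list, so the comprehension adds nothing here.
print(via_comprehension == via_list == direct)
```

Since `Counter.update` accepts any iterable, the result of `process_tweet(tweet)` could be passed to it directly.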

-1

parts = re.split("^\d+\s", line) 
tweet = re.split("\s(Status)", parts[1])[0]

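These are most likely the problematic lines.

You are assuming that parts was actually split and has more than one element. The split may fail to find the separator pattern in line, in which case parts is equal to [line]. Then parts[1] raises the IndexError.

Add a check before the second line. Print the value of line to get a better idea of what is going on.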
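The behaviour described above can be reproduced in isolation. The sample lines here are made up for illustration, since the actual CSV contents are not shown in the question.

```python
import re

# Hypothetical sample lines (the real file contents are unknown).
matching = "12 some tweet text Status more"
non_matching = "some tweet text without a leading number"

parts_ok = re.split(r"^\d+\s", matching)
parts_bad = re.split(r"^\d+\s", non_matching)

print(parts_ok)   # two elements: an empty string, then the rest of the line
print(parts_bad)  # one element: the whole, unsplit line

try:
    parts_bad[1]
except IndexError as exc:
    # Same failure as in the question's traceback.
    print("IndexError:", exc)
```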
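One possible shape for such a check is sketched below. The helper name `extract_tweet` is hypothetical, and skipping lines that do not match is an assumption; depending on the data it may be better to handle them differently.

```python
import re

def extract_tweet(line):
    """Return the tweet text from one line, or None if the line has no numeric prefix."""
    parts = re.split(r"^\d+\s", line)
    if len(parts) < 2:
        # The pattern did not match, so parts == [line]; show the line and skip it.
        print("Skipping unexpected line:", repr(line))
        return None
    # Because of the capturing group, re.split also returns "Status" itself;
    # element [0] is the text before it.
    tweet = re.split(r"\s(Status)", parts[1])[0]
    return tweet.replace("\\n", " ")
```

Inside the loop from the question, this would be used roughly as `tweet = extract_tweet(line)`, followed by `continue` when the result is `None`.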

Source

2017-05-01 18:36:04 9000
