I want to do some simple processing of Twitter data and count the most frequent words in the dataset, but I get a "list index out of range" error when reading the CSV file. I keep getting the following error at line 45:
IndexError                                Traceback (most recent call last)
<ipython-input-346-f03e745247f4> in <module>()
     43     for line in f:
     44         parts = re.split("^\d+\s", line)
---> 45         tweet = re.split("\s(Status)", parts[-1])[10]
     46         tweet = tweet.replace("\\n"," ")
     47         terms_all = [term for term in process_tweet(tweet)]
IndexError: list index out of range
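If it helps to see the mechanism, my understanding (I am not sure this is the actual cause) is that re.split returns a one-element list when its pattern never matches, so anything past index 0 is out of range. A tiny sketch with made-up lines, assuming the real file looks like "<number> <tweet text> Status ...":

import re

good_line = "42 b'rt example tweet text' Status 2017-01-01"
bad_line = "a line without a leading number"

print(re.split(r"^\d+\s", good_line))  # ['', "b'rt example tweet text' Status 2017-01-01"]
print(re.split(r"^\d+\s", bad_line))   # ['a line without a leading number'] -- only index 0 exists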
I have added my complete code below for review; could someone please advise?
import codecs
import re
from collections import Counter
from nltk.corpus import stopwords
word_counter = Counter()
def punctuation_symbols():
    return [".", ",", "$", "%", "&", ";", ":", "-", "&", "?"]

def is_rt_marker(word):
    if word == "b\"rt" or word == "b'rt" or word == "rt":
        return True
    return False

def strip_quotes(word):
    # drop a surrounding double-quote character, if any
    if word.endswith("\""):
        word = word[0:-1]
    if word.startswith("\""):
        word = word[1:]
    return word
def process_tweet(tweet):
    keep = []
    for word in tweet.split(" "):
        word = word.lower()
        word = strip_quotes(word)
        if len(word) == 0:
            continue
        if word.startswith("https"):
            continue
        if word in stopwords.words('english'):
            continue
        if word in punctuation_symbols():
            continue
        if is_rt_marker(word):
            continue
        keep.append(word)
    return keep
with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split("^\d+\s", line)
        tweet = re.split("\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n"," ")
        terms_all = [term for term in process_tweet(tweet)]
        word_counter.update(terms_all)
        n += 1
        if n == 50:
            break

print(word_counter.most_common(10))
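A guarded variant of the read loop, just as a sketch (it assumes that lines not starting with an index number can simply be skipped, which may or may not be right for this dataset):

with codecs.open("C:\\Users\\XXXXX\\Desktop\\USA_TWEETS-out.csv", "r", encoding="utf-8") as f:
    n = 0
    for line in f:
        parts = re.split(r"^\d+\s", line)
        if len(parts) < 2:  # the line did not start with "<digits> ", so skip it
            continue
        tweet = re.split(r"\s(Status)", parts[1])[0]
        tweet = tweet.replace("\\n", " ")
        word_counter.update(process_tweet(tweet))
        n += 1
        if n == 50:
            break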
The traceback you shared references different code than what you pasted below it. In particular, 'tweet = re.split("\s(Status)", parts[-1])[10]' versus 'tweet = re.split("\s(Status)", parts[1])[0]'. Can you clarify? – etemple1
@etemple1: Apologies, it should indeed be [1], [0]. I tried different combinations, and that traceback was generated by an earlier iteration. Any idea why [1], [0] doesn't work? Also, to clarify: does n = 0 set the index, and does [1] correctly pick out where the tweet starts on each line? –
By the way, '[term for term in process_tweet(tweet)]' is equivalent to 'list(process_tweet(tweet))', which in your case is simply equivalent to 'process_tweet(tweet)'. – 9000
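A quick illustration of that equivalence, using the process_tweet function from the question (the tweet string is made up, and the NLTK stopwords corpus is assumed to be downloaded):

tweet = "rt Check https://example.com this out"
a = [term for term in process_tweet(tweet)]
b = list(process_tweet(tweet))
c = process_tweet(tweet)
print(a == b == c)  # True: the comprehension and list() add nothing here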