2014-09-12 169 views

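Python: Index out of range error

The following code keeps giving me the error IndexError: list index out of range on line 012, print (aTweet + '~' + timeSource[x] + '~' + keyWord[i]). Does this have to do with the keyWord[i] term? I understand that index out of range usually means supplying an index for which no list element exists. Does that mean the error might actually lie in this section: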

if (len(splitSource) > 20):
    max_range = 19
else:
    max_range = len(splitSource)

Code for reference:

import re 
from re import sub 
import time 
import cookielib 
from cookielib import CookieJar 
import urllib2 
from urllib2 import urlopen 
import difflib 
import sys 

cj = CookieJar() 
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) 
opener.addheaders = [('User-agent', 'Mozilla/5.0')] 

keyWord = ["Scotch"] 

def main(): 
    i=0 
    while i<len(keyWord): 
     startingLink = 'https://twitter.com/search/realtime?q='+keyWord[i] 
     tUrl = startingLink+'&src=hash' 

     oldTwit = [] 
     newTwit = [] 


     howSimAr = [.5,.5,.5,.5,.5] 

     sourceCode = opener.open(tUrl).read() 
     splitSource = re.findall(r'<p class="js-tweet-text tweet-text">(.*?)</p>',sourceCode) 
     timeSource = re.findall(r'js-nav" title="(.*?)"',sourceCode) 

     if (len(splitSource) > 20): 
      max_range = 19 
     else: 
      max_range = len(splitSource) 

     print '' 
     print '' 
     print '' 
     ##print 'Keyword: ' + keyWord[i] 
     print ''    

     for x in range (0, max_range): 
      aTweet = re.sub(r'<.*?>','',splitSource[x]) 
      print (aTweet + '~' + timeSource[x] + '~' + keyWord[i]) 
      #print ';' 
      newTwit.append(aTweet) 

##  comparison = difflib.SequenceMatcher(None, newTwit, oldTwit) 
##  howSim = comparison.ratio() 
##  print ';' 
##  print 'This selection is',howSim,'similar to the past' 
##  howSimAr.append(howSim) 
##  howSimAr.remove(howSimAr[0]) 
## 
##  waitMultiplier = reduce(lambda x, y: x+y, howSimAr)/len(howSimAr) 
## 
##  print '' 
##  print 'The current similarity array:',howSimAr 
##  print 'Our current Multiplier:', waitMultiplier 

     oldTwit = [None] 
     for eachItem in newTwit: 
      oldTwit.append(eachItem) 

     newTwit = [None] 

     time.sleep(2) 
     x = 0 
     i = i + 1 

## except Exception, e: 
##  print str(e) 
##  print 'errored in the main try' 
main() 

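You are indexing 'timeSource' with 'x', but the range of 'x' is determined by the length of 'splitSource' (via 'max_range'). If 'splitSource' is longer than 'timeSource' (contains more elements), this will not work. – 2014-09-12 15:05:15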


@Tom That makes sense. Would it be better to create another variable? – 2014-09-12 15:09:36


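It is not clear to me what the relationship between 'splitSource' and 'timeSource' is, or what your code is trying to do. Both seem to be related to tweets, but I don't know what data you expect. For example, when you search for the keyword "Scotch", how many items do you expect in 'splitSource', and how many in 'timeSource'? – 2014-09-12 15:19:25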

Answer


The source of the Twitter search page contains zero occurrences of js-nav" title=", so the second regular expression finds nothing. In fact, adding

print "len(timeSource) =", len(timeSource) 
print "max_range =", max_range 

for x in range (0, max_range): 

will show:

len(timeSource) = 0 
max_range = 20 
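The empty timeSource reported above is what makes timeSource[x] fail on the first pass through the loop. A minimal sketch of one possible guard follows, written for Python 3 rather than the Python 2 of the question. The helper name format_tweets and the sample lists are hypothetical and only stand in for the two re.findall results. zip stops at the shorter input, so no index can run past the end of either list.

```python
from itertools import islice


def format_tweets(texts, times, keyword, limit=20):
    """Pair texts with times by position, stopping at the shorter list.

    Hypothetical helper: it stands in for the printing loop in the question.
    """
    # zip ends with the shorter input; islice caps the result at `limit`.
    pairs = islice(zip(texts, times), limit)
    return [text + "~" + when + "~" + keyword for text, when in pairs]


# Made-up sample data in place of splitSource and timeSource.
split_source = ["first tweet", "second tweet", "third tweet"]
time_source = []  # the second regex matched nothing

rows = format_tweets(split_source, time_source, "Scotch")
print("rows printed:", len(rows))
```

This only removes the exception. Tweets without a matching time are silently dropped, and pairing by position is still a guess about which time belongs to which tweet.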
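Whatever you want to achieve, you would be better off using HTMLParser or something similar to work with HTML rather than re. That makes it easier to ensure that timeSource[x] and splitSource[x] both belong to the same tweet x.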