Why doesn't my Python script return the page source correctly?

-1

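I just wrote a script meant to step through the alphabet and find all unclaimed four-letter Twitter names (really just for practice, since I'm new to Python). I've written several scripts that use 'urllib2' to fetch a site's HTML from a URL, but this time it doesn't seem to be working. Here is my script: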

import urllib2 

src='' 
url='' 
print "finding four-letter @usernames on twitter..." 
d_one='' 
d_two='' 
d_three='' 
d_four='' 
n_one=0 
n_two=0 
n_three=0 
n_four=0 
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'] 

while (n_one > 26): 
    while(n_two > 26): 
     while (n_three > 26): 
      while (n_four > 26): 
       d_one=letters[n_one] 
       d_two=letters[n_two] 
       d_three=letters[n_three] 
       d_four=letters[n_four] 
       url = "twitter.com/" + d_one + d_two + d_three + d_four 

       src=urllib2.urlopen(url) 
       src=src.read() 
       if (src.find('Sorry, that page doesn’t exist!') >= 0): 
        print "nope" 
        n_four+=1 
       else: 
        print url 
        n_four+=1 
      n_three+=1 
      n_four=0 
     n_two+=1 
     n_three=0 
     n_four=0 
    n_one+=1  
    n_two=0 
    n_three=0 
    n_four=0

Running this code returns the following error:

SyntaxError: Non-ASCII character '\xe2' in file name.py on line 29, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

After visiting that link and doing some additional searching, I added the following line to the top of the file:

# coding: utf-8

Now it no longer returns an error, but nothing seems to happen. I added the line

print src

which should print the HTML of each URL, but nothing happens when I run it. Any suggestions would be greatly appreciated.

Source

2012-08-13 zch

What is line 29? Obviously the code above doesn't represent your real code, otherwise we would see the special character in it. Downvote ... – 2012-08-13 04:12:55

Line 29 is "print 'nope'"... I swear I wrote this script just 5 minutes ago... – zch 2012-08-13 04:14:30
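A small sketch of where the '\xe2' byte may come from, assuming the culprit is the typographic apostrophe in the 'doesn’t' string literal rather than the print line. It is written for Python 3, where source files are UTF-8 by default; the variable names are illustrative only.

```python
# The string literal from the question, including the curly apostrophe (U+2019).
marker = "Sorry, that page doesn’t exist!"

# Collect every character outside the ASCII range together with its UTF-8 bytes.
non_ascii = [(ch, ch.encode("utf-8")) for ch in marker if ord(ch) > 127]

for ch, raw in non_ascii:
    # The first byte of the encoded apostrophe is 0xe2, matching the error message.
    print(repr(ch), raw)
```

Under Python 2, a source file containing that character needs the coding declaration, which is why adding it silenced the SyntaxError.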

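Just FYI, this script will take a very long time to run. There are '26 * 26 * 26 * 26 = 456976' possible four-letter words. Even if you could process two per second, your script would still take 456976 * 0.5 seconds * (1 minute / 60 seconds) * (1 hour / 60 minutes) = about 63.47 hours. – 2012-08-13 04:28:07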
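The estimate in the comment above can be reproduced with a few lines of arithmetic. The rate of two requests per second is the commenter's assumption, not a measured figure, and the constant names are made up for this sketch.

```python
ALPHABET_SIZE = 26
NAME_LENGTH = 4
REQUESTS_PER_SECOND = 2.0  # assumed rate, taken from the comment above

# Number of distinct four-letter lowercase names.
total_names = ALPHABET_SIZE ** NAME_LENGTH

# Time needed to check every name at the assumed rate.
total_seconds = total_names / REQUESTS_PER_SECOND
total_hours = total_seconds / 3600

print(total_names)            # 456976
print(round(total_hours, 2))  # 63.47
```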

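Well, you initialize n_one=0 and then loop with while (n_one > 26). The first time Python reaches it, it evaluates while (0 > 26), which is obviously false, so it skips the entire loop.

As gnibbler's answer tells you, there are cleaner ways to loop anyway.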
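To see the effect in isolation, here is a minimal sketch that runs the same counter loop with the condition from the question and with the comparison flipped. The helper collect and its callback argument are hypothetical names used only for this illustration.

```python
letters = "abcdefghijklmnopqrstuvwxyz"

def collect(condition):
    """Walk a counter from 0 while condition(n) holds; return the letters visited."""
    n = 0
    seen = []
    while condition(n):
        seen.append(letters[n])
        n += 1
    return seen

skipped = collect(lambda n: n > 26)  # condition from the question: false at n = 0
visited = collect(lambda n: n < 26)  # flipped comparison: runs for n = 0..25

print(len(skipped), len(visited))
```

Note that the bound has to be n < 26 rather than n <= 26, since the valid indices into the letter list are 0 through 25.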

Source

2012-08-13 04:16:05 Blair

Wow. You're absolutely right, they should be "<", not ">". Thank you very much for pointing that out and for the quick help. – zch 2012-08-13 04:17:42

You can get rid of the excessive nesting by using itertools.product

from itertools import product 
for d_one, d_two, d_three, d_four in product(letters, repeat=4): 
    ...

Instead of defining the list of letters yourself, you can just use string.ascii_lowercase
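Putting the two suggestions together might look like the sketch below. The generator name candidate_names and its length parameter are assumptions made for illustration; only itertools.product and string.ascii_lowercase come from the answer.

```python
from itertools import islice, product
import string

def candidate_names(length=4):
    """Yield every lowercase name of the given length, in alphabetical order."""
    for combo in product(string.ascii_lowercase, repeat=length):
        yield "".join(combo)

# Peek at the first few names without generating all 456976 of them.
first_three = list(islice(candidate_names(), 3))
print(first_three)  # ['aaaa', 'aaab', 'aaac']
```

Because the names are generated lazily, the four counters and their manual resets from the question are no longer needed.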

You should tell urlopen which protocol you are using (http)

url = "http://twitter.com/" + d_one + d_two + d_three + d_four

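Also, when you request a page that doesn't exist, urlopen raises a 404 error (urllib2.HTTPError), so you should check for that instead of looking at the page text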
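A minimal sketch of that check, written for Python 3, where urllib2 was split into urllib.request and urllib.error. The function name page_exists and the opener parameter are hypothetical; the parameter exists only so the logic can be tried without network access. Whether the site really answers 404 for an unclaimed name is the answer's claim, not something verified here.

```python
import urllib.error
import urllib.request

def page_exists(url, opener=urllib.request.urlopen):
    """Return True if the URL loads, False if the server answers 404.

    opener is an illustrative injection point; it defaults to urlopen.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Other HTTP errors (rate limiting, server errors) do not mean "missing".
        raise
    response.close()
    return True
```

In the loop from the question, one would then print the URL whenever page_exists("http://twitter.com/" + name) returns False, instead of searching the page text for the error message.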

Source

2012-08-13 04:13:10

Great! Thanks for the tips; I'll implement this. – zch 2012-08-13 04:15:43
