Mechanize and BeautifulSoup error: httplib.InvalidURL: nonnumeric port: '' (Python)

I am going through a list of URLs and opening them with Mechanize/BeautifulSoup in my script.

But I get this error:

File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 718, in _set_hostport 
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:]) 
httplib.InvalidURL: nonnumeric port: ''

It happens at this line of code:

page = mechanize.urlopen(req)

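My code is below. Any insight into what I'm doing wrong? Many of the URLs work, and I only get this error message when the script hits certain ones, so I'm not sure why.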

from mechanize import Browser 
from BeautifulSoup import BeautifulSoup 
import re, os 
import shutil 
import mechanize 
import urllib2 
import sys 
reload(sys) 
sys.setdefaultencoding("utf-8") 

mech = Browser() 
linkfile = open ("links.txt") 
urls = [] 
while 1: 
    url = linkfile.readline() 
    urls.append("%s" % linkfile.readline()) 
    if not url: 
        break 

for url in urls: 
    if "http://" or "https://" not in url: 
        url = "http://" + url 
    elif "..." in url: 
        continue 
    elif ".pdf" in url: 
        #print "this is a pdf -- at some point we should save/log these" 
        continue 
    elif len (url) < 8: 
        continue 
    req = mechanize.Request(url) 
    req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8') 
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20100101 Firefox/17.0') 
    req.add_header('Accept-Language', 'Accept-Language en-US,en;q=0.5') 
    try: 
        page = mechanize.urlopen(req) 
    except urllib2.HTTPError, e: 
        print "there was an error opening the URL, logging it" 
        print e.code 
        logfile = open ("log/urlopenlog.txt", "a") 
        logfile.write(url + "," + "couldn't open this page" + "\n") 
        pass


2013-01-01 user1328021

Show us the URL that fails. – Thomas

http://blog.21ic.com/more.asp?id=27916 – user1328021

Works for me... 'http://blog.21ic.com/more.asp?id=27916', that is. – Thomas

I think this piece of code

if "http://" or "https://" not in url:

isn't doing what you want (or what you think it does).

if "http://"

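always evaluates to true, so "http://" is prepended to every URL, including ones that already start with a scheme. A URL such as http://http://blog.21ic.com/more.asp?id=27916 is then parsed with "http:" as its host, and the empty text after that colon is most likely what httplib reports as nonnumeric port: ''. You need to rewrite the test, for example, as: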

if "https://" not in url and "http://" not in url:
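To make the difference concrete, here is a minimal sketch in Python 3 syntax (the question itself uses Python 2). The function names add_scheme_buggy and add_scheme are hypothetical, introduced only for this illustration, and add_scheme uses str.startswith rather than the "not in" test above, which is a slightly stricter variant of the same idea.

```python
from urllib.parse import urlsplit

def add_scheme_buggy(url):
    # Parsed as: "http://" or ("https://" not in url).
    # A non-empty string is truthy, so this branch always runs.
    if "http://" or "https://" not in url:
        url = "http://" + url
    return url

def add_scheme(url):
    # Only add a scheme when the URL does not already start with one.
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url

original = "http://blog.21ic.com/more.asp?id=27916"
doubled = add_scheme_buggy(original)

print(doubled)                   # the URL now carries two prefixes
print(urlsplit(doubled).netloc)  # the host part is just "http:"
print(add_scheme(original))      # left unchanged
print(add_scheme("blog.21ic.com/more.asp?id=27916"))
```

The host "http:" ends in a colon with nothing after it, which matches the empty port shown in the error message.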

Also, now that I've started testing your code:

urls = [] 
while 1: 
    url = linkfile.readline() 
    urls.append("%s" % linkfile.readline()) 
    if not url: 
        break

This actually guarantees that your URL file is read incorrectly: only every second line ends up in the list. You probably want it to read:

urls = [] 
while 1: 
    url = linkfile.readline() 
    if not url: 
        break 
    urls.append("%s" % url)

The reason is that you call linkfile.readline() twice per iteration, which reads two lines and saves only every second one to your list.

Also, you want the if clause before the append, to prevent an empty entry at the end of the list.
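As an alternative to the explicit readline() loop, iterating over the file object reads each line exactly once. The sketch below uses Python 3 syntax; read_urls is a hypothetical helper, and stripping whitespace and skipping blank lines are assumptions added here, not something the original script does.

```python
import io

def read_urls(lines):
    """Return the non-empty lines, stripped of surrounding whitespace."""
    urls = []
    for line in lines:        # a file object yields one line per iteration
        line = line.strip()   # drop the trailing newline
        if line:              # skip blank lines
            urls.append(line)
    return urls

# io.StringIO stands in for open("links.txt") so the sketch is self-contained.
sample = io.StringIO("http://a.example/\nb.example/page\n\nhttp://c.example/\n")
print(read_urls(sample))
```

With a real file, the call would look like read_urls(open("links.txt")).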

But your particular URL example works for me. To help further, I would probably need your links file.
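In the meantime, one way to narrow down which entries trigger the error is to flag URLs whose host part contains a colon that is not followed by digits. This is only a rough sketch in Python 3 syntax: has_bad_port is a hypothetical helper, and the check approximates the host/port split that the traceback points at rather than reproducing httplib's actual code.

```python
from urllib.parse import urlsplit

def has_bad_port(url):
    """Return True if the host part has a ':' not followed by digits."""
    netloc = urlsplit(url).netloc
    host = netloc.rsplit("@", 1)[-1]   # drop any user:password@ part
    if host.endswith("]"):             # bracketed IPv6 literal, no port
        return False
    _, sep, port = host.rpartition(":")
    return bool(sep) and not port.isdigit()

candidates = [
    "http://blog.21ic.com/more.asp?id=27916",
    "http://http://blog.21ic.com/more.asp?id=27916",
    "http://example.com:8080/",
]
for url in candidates:
    if has_bad_port(url):
        print("suspicious:", url)
```

Running the same check over the URLs just before each is opened would show which ones are malformed at that point.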


2013-01-01 16:22:46 favoretti

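I think you're right, but I'm not sure that's what is causing the error... The URLs are prefixed by the time it tries to open them. I added a print statement to confirm that. – user1328021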

See my edit. This particular URL works fine for me, so to help you more I would probably need your links file. – favoretti
