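Error with Mechanize and BeautifulSoup: httplib.InvalidURL: nonnumeric port: '' (Python)
I am going through a list of URLs and opening them with my script using Mechanize/BeautifulSoup.
However, I am getting this error: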
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 718, in _set_hostport
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: ''
It happens on this line of code:
page = mechanize.urlopen(req)
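My code is below. Any insight into what I am doing wrong? Many of the URLs work, and the error only appears when the script reaches certain ones, so I am not sure why.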
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re, os
import shutil
import mechanize
import urllib2
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

mech = Browser()
linkfile = open ("links.txt")
urls = []
while 1:
    url = linkfile.readline()
    urls.append("%s" % linkfile.readline())
    if not url:
        break

for url in urls:
    if "http://" or "https://" not in url:
        url = "http://" + url
    elif "..." in url:
    elif ".pdf" in url:
        #print "this is a pdf -- at some point we should save/log these"
        continue
    elif len (url) < 8:
        continue
    req = mechanize.Request(url)
    req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20100101 Firefox/17.0')
    req.add_header('Accept-Language', 'Accept-Language en-US,en;q=0.5')
    try:
        page = mechanize.urlopen(req)
    except urllib2.HTTPError, e:
        print "there was an error opening the URL, logging it"
        print e.code
        logfile = open ("log/urlopenlog.txt", "a")
        logfile.write(url + "," + "couldn't open this page" + "\n")
        pass
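Two things in the script above look worth checking. The condition `"http://" or "https://" not in url` is always true, because Python evaluates it as `"http://" or ("https://" not in url)` and a non-empty string is truthy. Every URL therefore gets an extra `http://` prefix, and the `elif` branches never run. The read loop also calls `readline()` twice per pass, so only every second line is kept, and each kept line still carries its trailing newline. The following is a minimal sketch of one way to handle both steps, not a drop-in replacement. The helper names `read_urls` and `ensure_scheme` are hypothetical and do not come from the original script.

```python
def read_urls(lines):
    """Return stripped, non-empty URLs from an iterable of lines.

    Iterating over the file object yields each line exactly once,
    which avoids the double readline() in the original loop.
    """
    return [line.strip() for line in lines if line.strip()]


def ensure_scheme(url):
    """Prefix http:// only when the URL does not already have a scheme."""
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url


# Possible usage with the links.txt file from the script above:
# with open("links.txt") as linkfile:
#     urls = [ensure_scheme(u) for u in read_urls(linkfile)]
```

Testing with `startswith` rather than `in` also avoids treating a URL that merely contains `http://` somewhere in its query string as already having a scheme.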
Show us the URL that fails. – Thomas
http://blog.21ic.com/more.asp?id=27916 – user1328021
Works for me... 'http://blog.21ic.com/more.asp?id=27916', that is. – Thomas
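Since that URL opens fine on its own, one plausible explanation is that the script changes it before the request is made. If a second `http://` ends up in front of a URL that already has one, the host part becomes `http:`, which is a host name followed by a colon and an empty port. That would match the `nonnumeric port: ''` message. The sketch below is a quick way to test a URL string for this condition without making a request. The function name `has_empty_port` is hypothetical, and the check only looks at the parsed host, so it is a diagnostic aid rather than a full validator.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2, as used in the question


def has_empty_port(url):
    """Return True when the host part ends in ':' with no port after it."""
    netloc = urlsplit(url.strip()).netloc
    # Drop any "user:password@" prefix so only host[:port] remains.
    host = netloc.rpartition("@")[2]
    return host.endswith(":")


# Possible usage: log the exact string that is about to be requested.
# for url in urls:
#     if has_empty_port(url):
#         print("suspicious URL: %r" % url)
```

Printing the URL with `%r` shows hidden characters such as a trailing newline, which can help confirm what is actually being passed to `mechanize.Request`.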