2017-05-10

I am using the requests package and Beautiful Soup in Python 2.7 to scrape web news. When I debug the code below, I get the error message: InvalidSchema: No connection adapters were found.

#encoding:utf-8 

import re 
import socket 
import requests 
import httplib 
import urllib2 
from bs4 import BeautifulSoup 

#headers = ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0') 
response = requests.get('http://www.mhi.com.my/') 

class Crawler(object): 
    """Crawler""" 
    def __init__(self, url): 
        self.url = url 

    def getNextUrls(self): 
        urls = [] 
        request = urllib2.Request(self.url) 
        request.add_header('User-Agent', 
            'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0') 
        try: 
            html = urllib2.urlopen(request) 
        except socket.timeout, e: 
            pass 
        except urllib2.URLError, ee: 
            pass 
        except httplib.BadStatusLine: 
            pass 
        # analyse the text we have gotten 
        soup = BeautifulSoup(response.text, 'lxml')  # select and return a list 
        pattern = 'http://www\.mhi\.com\.my/.*\.html' 
        links = soup.find_all('a', href=re.compile(pattern)) 
        for link in links: 
            urls.append(link) 
        return urls 

def getNews(url): 
    print url 
    xinwen = '' 
    request = requests.get(url) 
    request.add_header('User-Agent', 
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0') 
    try: 
        html = urllib2.urlopen(request) 
    except urllib2.HTTPError, e: 
        print e.code 

    soup = BeautifulSoup(html, 'html.parser') 
    for news in soup.select('p.para'): 
        xinwen += news.get_text().decode('utf-8') 
    return xinwen 

class News(object): 
    """ 
    source:from where 
    title:title of news 
    time:published time of news 
    content:content of news 
    type:type of news  
    """ 
    def __init__(self, title, time, content, type): 
        self.title = title 
        self.time = time 
        self.content = content 
        self.type = type 

file = open('C:/MyFold/kiki.json', 'a') 
url = "http://www.mhi.com.my" 
print url 
s = Crawler(url) 
for newsUrl in s.getNextUrls(): 
    file.write(getNews(newsUrl)) 
    file.write("\n") 
    print "---------------------------" 

file.close() 

This is the error that is returned.

C:\Python27\python.exe C:/MyFold/CodeTest/file1.py 
http://www.mhi.com.my 
Traceback (most recent call last): 
    File "C:/MyFold/CodeTest/file1.py", line 74, in <module> 
    file.write(getNews(newsUrl)) 
    File "C:/MyFold/CodeTest/file1.py", line 42, in getNews 
    request = requests.get(url) 
    File "C:\Python27\lib\site-packages\requests\api.py", line 70, in get 
    return request('get', url, params=params, **kwargs) 
    File "C:\Python27\lib\site-packages\requests\api.py", line 56, in request 
    return session.request(method=method, url=url, **kwargs) 
    File "C:\Python27\lib\site-packages\requests\sessions.py", line 488, in request 
    resp = self.send(prep, **send_kwargs) 
    File "C:\Python27\lib\site-packages\requests\sessions.py", line 603, in send 
    adapter = self.get_adapter(url=request.url) 
    File "C:\Python27\lib\site-packages\requests\sessions.py", line 685, in get_adapter 
    raise InvalidSchema("No connection adapters were found for '%s'" % url) 
requests.exceptions.InvalidSchema: No connection adapters were found for '<a class="glow" href="http://www.mhi.com.my/akhbar2016.html" style="text-decoration: none;"></a>' 
<a class="glow" href="http://www.mhi.com.my/akhbar2016.html" style="text-decoration: none;"></a> 

Is this a problem with my loop? Can anyone help me?

Answer


In your class `Crawler`, the function `getNextUrls()` returns a list of `<a>` elements:

[<a class="glow" href="http://www.mhi.com.my/akhbar2016.html" style="text-decoration: none;"></a>] 

When you loop over it, the whole `<a>` element is passed to the function `getNews`, but the argument should be a URL. You can change your function `getNextUrls()` from

urls.append(link) 

to

urls.append(link.get('href')) 

so that `getNextUrls` will return a list of URLs instead of a list of `<a>` elements:

['http://www.mhi.com.my/akhbar2016.html'] 
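
The difference is easy to see in isolation. Below is a minimal, stdlib-only sketch of the same idea (using the standard library's HTML parser in place of BeautifulSoup, so it runs without third-party packages): collect the `href` *strings* of the `<a>` tags, because a URL string is what `requests.get()` expects, not a tag object.

```python
try:
    from HTMLParser import HTMLParser    # Python 2
except ImportError:
    from html.parser import HTMLParser   # Python 3

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, mirroring
    soup.find_all('a') followed by link.get('href')."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.hrefs.append(value)

html = '<a class="glow" href="http://www.mhi.com.my/akhbar2016.html"></a>'
collector = HrefCollector()
collector.feed(html)
# collector.hrefs is now a list of URL strings -- the form that
# requests.get() accepts -- rather than a list of <a> elements.
```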

I see, thank you for the clear explanation :). – libolon


Of course, I will mark it. This is the first time I have asked a Python question, and I got a good answer quickly. I will think more about this approach. – libolon


@libolon Welcome to Stack Overflow; this community is built on helping each other. Hope you have a great journey starting here :) –
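
A side note on the question's code: `add_header()` is a method of `urllib2.Request`, not of the object returned by `requests.get()` (a requests `Response` has no such method), so the `getNews` function mixes the two libraries incorrectly. A minimal sketch of attaching the User-Agent the urllib2 way (the try/except import only makes the sketch run under Python 2 or 3; no network access is needed to set headers):

```python
try:
    from urllib2 import Request          # Python 2
except ImportError:
    from urllib.request import Request   # Python 3

ua = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'

# Build the Request first, attach the header, then pass it to urlopen().
request = Request('http://www.mhi.com.my/')
request.add_header('User-Agent', ua)

# add_header() stores header keys in capitalized form, e.g. 'User-agent'.
```

With requests, the equivalent is to pass the header directly, e.g. `requests.get(url, headers={'User-Agent': ua})`, instead of calling a method on the response.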
