2016-01-05
-1

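I don't understand why the code below keeps producing an empty string. I'm trying to have it save a website's content to a "txt" file, but the response always comes back empty. Is there an error in the code?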

import urllib3 
import certifi 


# Function: Convert information within html document to a text file 
# Append information to the file 
def html_to_text(source_html, target_file): 

    http = urllib3.PoolManager(
     cert_reqs='CERT_REQUIRED',  # Force certificate check. 
     ca_certs=certifi.where(),  # Path to the Certifi Bundle 
     headers={'connection': 'keep-alive', 'user-agent': 'Mozilla/5.0', 'accept-encoding': 'gzip, deflate'}, 
    ) 

    r = http.urlopen('GET', source_html) 
    print(source_html) 
    response = r.read().decode('utf-8') 
    # TODO: Find the problem that keeps making the code produce an empty string 
    print(response) 
    temp_file = open(target_file, 'w+') 
    temp_file.write(response) 


source_address = "https://sg.finance.yahoo.com/lookup/all?s=*&t=A&m=SG&r=&b=0" 
target_location = "C:\\Users\\Admin\\PycharmProjects\\TheLastPuff\\Source\\yahoo_ticker_symbols.txt" 

html_to_text(source_address, target_location) 
+1

When you say "produce", do you mean "print", "write to the file", or both? What do 'print(source_html)' and 'print(response)' print? – Kevin

+0

Neither the print nor the write produces anything. 'print(source_html)' does successfully print the value of 'source_address'. – Cloud

+0

The 'r' object appears to have an 'r.data' attribute that holds the response body. http://urllib3.readthedocs.org/en/latest/#usage – Jasper
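
A likely reason 'r.data' has the body while 'r.read()' comes back empty: as far as I know, urllib3 requests default to preload_content=True, which reads the whole body into 'r.data' up front, so a later read finds the stream already drained. The sketch below only imitates that behaviour with the standard library. PreloadedResponse is a hypothetical stand-in, not urllib3's own class, and the HTML bytes are placeholder data.

```python
import io


class PreloadedResponse:
    """Hypothetical stand-in for a response whose body is read eagerly."""

    def __init__(self, raw_body: bytes):
        self._stream = io.BytesIO(raw_body)
        # Eager read: drains the stream and caches the bytes.
        self.data = self._stream.read()

    def read(self) -> bytes:
        # The stream is already at end-of-file, so this returns b"".
        return self._stream.read()


resp = PreloadedResponse(b"<html>example</html>")
print(resp.read())  # b''
print(resp.data)    # b'<html>example</html>'
```

If this assumption holds for the urllib3 version in use, reading from 'r.data' (and decoding it) should give the text the question was expecting from 'r.read()'.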

Answers

0

I get a response with the code below. The only relevant change is using r.data instead of r.read():

import urllib3 
import certifi 


def html_to_text(source_html): 

    http = urllib3.PoolManager(
     cert_reqs='CERT_REQUIRED',  # Force certificate check. 
     ca_certs=certifi.where(),  # Path to the Certifi Bundle 
     headers={'connection': 'keep-alive', 'user-agent': 'Mozilla/5.0', 'accept-encoding': 'gzip, deflate'}, 
    ) 

    r = http.urlopen('GET', source_html) 
    print(source_html) 
    print(r.headers) 
    response = r.data     # instead of read().decode('utf-8') 
    print(response) 


source_address = "https://sg.finance.yahoo.com/lookup/all?s=*&t=A&m=SG&r=&b=0" 

html_to_text(source_address) 
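
The version above drops the file-writing step from the question. A minimal sketch of putting it back, assuming the body is UTF-8 text and is already available as bytes (for example from 'r.data'). The helper name save_body and the demo bytes are hypothetical, not part of the original code. The with block closes the file, which the question's open(...) call never did.

```python
import os
import tempfile


def save_body(body: bytes, target_file: str, encoding: str = "utf-8") -> int:
    """Decode a response body, write it to target_file, return chars written."""
    # errors="replace" is an assumption: it keeps going on undecodable bytes.
    text = body.decode(encoding, errors="replace")
    with open(target_file, "w", encoding=encoding) as out:
        return out.write(text)


# Demo with literal bytes standing in for r.data (no network access needed).
demo_path = os.path.join(tempfile.mkdtemp(), "yahoo_ticker_symbols.txt")
written = save_body(b"<html>example</html>", demo_path)
print(written)  # 20
```

Inside html_to_text, the call would then look like save_body(r.data, target_file).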

Versions used:

>>> certifi.__version__ 
'2015.11.20.1' 
>>> urllib3.__version__ 
'1.14' 
>>> sys.version 
'3.5.1 (default, Dec 7 2015, 12:58:09) \n[GCC 5.2.0]' 
+0

This code seems to work, but now I get a different error: "urllib.error.HTTPError: HTTP Error 502: Server Hangup". I think the website is kicking me out. – Cloud
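
If the 502 is transient rather than a deliberate block (an assumption; the thread does not confirm either), retrying with a pause between attempts may help. A rough sketch using only the standard library: fetch_with_retry and flaky_fetch are hypothetical names, and the fake fetcher stands in for a real request so the example runs offline.

```python
import time
import urllib.error


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on HTTP 5xx errors with a growing pause."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Give up on client errors (4xx) or once attempts run out.
            if err.code < 500 or attempt == attempts:
                raise
            time.sleep(delay * attempt)


# Demo with a fake fetcher that fails twice with 502, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.HTTPError(url, 502, "Server Hangup", None, None)
    return b"ok"


result = fetch_with_retry(flaky_fetch, "https://example.invalid/", delay=0)
print(result, len(calls))  # b'ok' 3
```

A real fetcher could be something like lambda u: urllib.request.urlopen(u).read(). If the site is rate-limiting or blocking the client, retries alone are unlikely to fix it.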