2016-04-06 18 views
1

Characters are not decoded correctly using Jsoup and PhantomJS

I am using PhantomJS and Selenium in Python to render pages. This is the code:

import sys, time 
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 

path_to_chromedriver = 'C:\\..\\chromedriver' 

section = sys.argv[1] 
path = sys.argv[2] 
links = sys.argv[3] 

listOfLinks = [] 
file = open(links, 'r') 
for link in file: 
    listOfLinks.append(link) 

dr = webdriver.Chrome(executable_path = path_to_chromedriver) 

cont = 0 
for link in listOfLinks: 
    try:
        dr.get(link)

        # Wait.
        element = WebDriverWait(dr, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "_img-zoom"))
        )

        time.sleep(1)

        htmlPath = path + section + "_" + str(cont) + ".html"

        # Write HTML.
        file = open(htmlPath, 'w')
        file.write(dr.page_source)
        file.close()

        cont = cont + 1
    except:
        print("Exception")

dr.quit() 

This code saves the HTML of each link in the file passed as an argument.

Each file is then parsed with Jsoup in Java:

Document document = Jsoup.parse(file, "UTF-8"); 

However, special characters such as "€", "á", "é", "í", etc. are not decoded correctly and are replaced by '?'. How can I solve this?
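A small sketch of how this kind of mismatch can arise. It assumes the HTML was written in a single-byte Windows encoding; cp1252 is a guess for illustration, not something confirmed in the question, and the sample string is hypothetical:

```python
# Hypothetical sample text standing in for part of the saved page.
sample = "€ á é í"

# Bytes as a writer using cp1252 (an assumed platform default) would produce them.
raw = sample.encode("cp1252")

# A UTF-8 reader cannot decode these bytes; invalid sequences become U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Decoding with the charset that was actually used recovers the text.
as_cp1252 = raw.decode("cp1252")

print(repr(as_utf8))
print(as_cp1252)
```

If the saved files behave like `raw` here, the charset passed to Jsoup has to match the one used when writing, not the one declared in the page.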

+2

Try Document document = Jsoup.parse(file, "ISO-8859-1"); – Eritrean

+0

@Uzochi Yes, this works! – cuoka

Answers

0

Solution found by Uzochi:

Try Document document = Jsoup.parse(file, "ISO-8859-1");
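An alternative that may also work is to fix the encoding on the Python side, so that Jsoup.parse(file, "UTF-8") matches the file. This is a minimal sketch, assuming the script wrote the files in the platform default encoding because open() was called without one. `save_html` is a hypothetical helper, not part of the original code, and the sample string stands in for dr.page_source:

```python
import io
import os
import tempfile


def save_html(html_path, page_source):
    """Write page_source to html_path as UTF-8, ignoring the platform default."""
    with io.open(html_path, "w", encoding="utf-8") as f:
        f.write(page_source)


# Hypothetical snippet standing in for dr.page_source.
tmp_dir = tempfile.mkdtemp()
html_path = os.path.join(tmp_dir, "section_0.html")
save_html(html_path, u"<p>€ á é í</p>")

# Read the raw bytes back and decode them the way a UTF-8 reader would.
with io.open(html_path, "rb") as f:
    raw = f.read()
print(raw.decode("utf-8"))
```

In the original loop this would replace the open/write/close lines with a call like save_html(htmlPath, dr.page_source).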