Characters are not decoded correctly using Jsoup and PhantomJS
I am using PhantomJS and Selenium in Python to render pages. This is the code:
import sys, time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path_to_chromedriver = 'C:\\..\\chromedriver'
section = sys.argv[1]
path = sys.argv[2]
links = sys.argv[3]

listOfLinks = []
file = open(links, 'r')
for link in file:
    listOfLinks.append(link)

dr = webdriver.Chrome(executable_path = path_to_chromedriver)
cont = 0
for link in listOfLinks:
    try:
        dr.get(link)
        # Wait.
        element = WebDriverWait(dr, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "_img-zoom"))
        )
        time.sleep(1)
        htmlPath = path + section + "_" + str(cont) + ".html"
        # Write HTML.
        file = open(htmlPath, 'w')
        file.write(dr.page_source)
        file.close()
        cont = cont + 1
    except:
        print("Exception")
dr.quit()
This code creates an HTML file for each of the links received as an argument.
Each file is then parsed with Jsoup in Java:
Document document = Jsoup.parse(file, "UTF-8");
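However, special characters such as '€' and accented variants of 'A', 'E', 'I', etc. are not decoded correctly and are replaced by '?'. How can I solve this?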
Try Document document = Jsoup.parse(file, "ISO-8859-1"); – Eritrean
@Uzochi Yes, this works! – cuoka
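That a different charset fixes the parse suggests the files were not written as UTF-8 in the first place. One plausible cause, assuming Python 3 on a platform whose default text encoding is a legacy code page such as cp1252, is that open(htmlPath, 'w') in the script above used that default. The sketch below illustrates the mismatch and an alternative fix on the writing side. The helper names write_html and decode_as_utf8, the sample string, and the choice of cp1252 are all illustrative assumptions, not part of the original script.

```python
import os
import tempfile


def write_html(html_path, page_source):
    """Write the page source as UTF-8, ignoring the platform default."""
    with open(html_path, "w", encoding="utf-8") as out:
        out.write(page_source)


def decode_as_utf8(raw_bytes):
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw_bytes.decode("utf-8", errors="replace")


# Hypothetical sample standing in for dr.page_source.
sample = "<p>Price: 5 € for café</p>"

# Mismatch: bytes produced with a legacy code page (assumed cp1252 here),
# then read back as UTF-8. The special characters cannot be decoded.
legacy_bytes = sample.encode("cp1252")
mismatched = decode_as_utf8(legacy_bytes)

# Match: the file is written as UTF-8 and read back as UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    html_path = os.path.join(tmp, "section_0.html")
    write_html(html_path, sample)
    with open(html_path, "rb") as f:
        round_tripped = decode_as_utf8(f.read())
```

If the files are written this way, Jsoup.parse(file, "UTF-8") should be able to read them without changing the charset on the Java side. Which side to change is a judgment call: writing UTF-8 explicitly keeps the output independent of the machine the script runs on.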