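I have a simple project that scrapes reviews from a travel website and stores them in an Excel file. The reviews may be in Spanish, Japanese, or any other language, and some of them contain special symbols such as "❤❤". How can I store non-English strings in an Excel file with Python 3?
I need to store all of the data (special symbols can be dropped if they cannot be written).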
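If dropping the symbols turns out to be acceptable, one possible fallback is to filter on Unicode category. This is only a sketch: `strip_symbols` is a hypothetical helper name, and treating category `So` ("Symbol, other", which includes U+2764 ❤) as "special symbols" is an assumption that may not match every character the site can produce.

```python
import unicodedata


def strip_symbols(text):
    """Drop characters whose Unicode category is 'So' (Symbol, other)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


# Letters from any script are kept; only the hearts are removed.
cleaned = strip_symbols("Muy bonito ❤❤")
print(ascii(cleaned))
```

Japanese or Spanish letters and ordinary punctuation belong to other categories, so they pass through unchanged.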
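I am able to scrape the data I want and print it to the console as it is (Japanese text, for example), but the problem comes when storing it in a CSV file: I get the error message shown below.
I tried opening the file with UTF-8 encoding (as mentioned in the comments below), but then the data is saved as strange symbols that make no sense, and I could not find an answer to this problem. Any suggestions?
I am using Python 3.5.3.
My Python code: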
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re
file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)
pages = 119
pageNumber = 2
option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)
browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)
while (pages):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div",{"class":"innerBubble"})
    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False
    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1
f.close()
I get the following error:
Traceback (most recent call last):
  File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
    f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
  File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>
Process finished with exit code 1
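The traceback passes through `encodings\cp1252.py`, which suggests that `open(file, "w")` fell back to the Windows locale encoding. Assuming that is the case, the failure can be reproduced without a browser. The sample string below is the Japanese title quoted in the comment thread; everything else is illustrative.

```python
# Reproduce the failure in isolation: cp1252 has no mapping for Japanese text.
sample = "美しい幾何学模様!"

try:
    sample.encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("cannot encode:", exc.reason)

# UTF-8 can represent every Unicode code point, so this encode succeeds.
utf8_bytes = sample.encode("utf-8")
print(len(sample), "characters ->", len(utf8_bytes), "bytes")
```

Printing to the console works in the original script because the console output does not go through the file's codec; only `f.write` does.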
UPDATE
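I am still trying to solve this problem. In the end I need to translate the Japanese reviews into English for my research anyway, so perhaps I could use one of the Google APIs to convert the strings before they are written to the CSV file.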
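Try `f = open(file, "w", encoding='utf-8')`. Note that opening the file with a context manager is saner, and I would separate the different parts of the program into different functions (fetching the content, scraping the content, writing down the results).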
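A minimal sketch of what this comment describes for the writing step, assuming the scraped values arrive as `(rating, title, review)` tuples. `write_reviews` and `read_reviews` are hypothetical helper names and are not part of the original script; the `csv` module is swapped in for the manual string concatenation.

```python
import csv


def write_reviews(path, rows):
    """Write (rating, title, review) tuples to a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["rating", "title", "review"])
        # csv.writer quotes fields that contain commas or newlines, so the
        # replace(",", "|") workaround is not needed here.
        writer.writerows(rows)


def read_reviews(path):
    """Read the file back, e.g. to check that the text round-trips."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))
```

The `with` block closes the file even if an exception is raised part-way through scraping, which the bare `f = open(...)` / `f.close()` pair does not guarantee.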
[Handling non-standard American English characters and symbols in a CSV, using Python](https://stackoverflow.com/questions/12357261/handling-non-standard-american-english-characters-and-symbols-in-a-csv-using-py) – JeffC
@MaartenFabré That removes the error, but it does not actually write the same thing to the file. It should write **"美しい幾何学模様!"** but instead it writes **"美ã-ã」幾佽•å|模æ§~~"**
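Output like this is what UTF-8 bytes tend to look like when a viewer decodes them as cp1252. It is an assumption that the file was opened in Excel or a similar program that guessed the wrong encoding; under that assumption, the sketch below imitates the effect and shows the `utf-8-sig` codec, which prepends a byte order mark that Excel generally uses to detect UTF-8.

```python
import codecs

sample = "美しい"

# Decode the UTF-8 bytes with the wrong codec to imitate a viewer that assumes
# cp1252; errors="replace" covers the byte values cp1252 leaves undefined.
garbled = sample.encode("utf-8").decode("cp1252", errors="replace")
print(ascii(garbled))

# "utf-8-sig" writes a byte order mark (BOM) in front of the UTF-8 data.
with_bom = sample.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" strips the BOM again, so nothing is lost.
restored = with_bom.decode("utf-8-sig")
print(has_bom, restored == sample)
```

In the script this would amount to `open(file, "w", encoding="utf-8-sig")`. The data in the plain UTF-8 file is not corrupted either; it is only displayed with the wrong codec.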