Removing URLs from tweets - UnicodeEncodeError: 'ascii' codec can't encode character

I am trying to remove the URLs from a dataset of tweets using PySpark, but I am getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 58: ordinal not in range(128)

Importing the dataframe from a CSV file:

tweetImport = spark.read.format('com.databricks.spark.csv') \
    .option('delimiter', ';') \
    .option('header', 'true') \
    .option('charset', 'utf-8') \
    .load('./output_got.csv')

Removing the URLs from the tweets:

import re

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, lower

# replace anything that looks like a URL with the token :url:
normalizeTextUDF = udf(lambda text: re.sub(r"(\w+:\/\/\S+)", ":url:",
                                           str(text).encode('ascii', 'ignore')),
                       StringType())

tweetsNormalized = tweetImport.select(
    normalizeTextUDF(lower(tweetImport.text)).alias('text'))
tweetsNormalized.show()

I have already tried:

normalizeTextUDF = udf(lambda text: re.sub(r"(\w+:\/\/\S+)", ":url:",
                                           str(text).encode('utf-8')),
                       StringType())

and:

normalizeTextUDF = udf(lambda text: re.sub(r"(\w+:\/\/\S+)", ":url:",
                                           unicode(str(text), 'utf-8')),
                       StringType())

Neither approach worked.

--- EDIT ---

The traceback:

Py4JJavaError: An error occurred while calling o581.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task
0.0 in stage 10.0 (TID 10, localhost, executor driver): org.apache.spark.api.python.PythonException:
Traceback (most recent call last): 
    File "/home/flav/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/worker.py", line 174, in main 
    process() 
    File "/home/flav/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/worker.py", line 169, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/home/flav/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/worker.py", line 106, in <lambda> 
    func = lambda _, it: map(mapper, it) 
    File "/home/flav/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/worker.py", line 92, in <lambda> 
    mapper = lambda a: udf(*a) 
    File "/home/flav/zeppelin-0.7.1-bin-all/interpreter/spark/pyspark/pyspark.zip/pyspark/worker.py", line 70, in <lambda> 
    return lambda *a: f(*a) 
    File "<stdin>", line 3, in <lambda> 
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 58: ordinal not in range(128) 

We need the full traceback, because it shows which line threw the exception and how Python got there. –


I have added the traceback to the original post ^.~ –


Unfortunately that is not very clear; either the exception is being raised by PySpark itself, or PySpark has managed to hide the actual exception's traceback. –

Answers


I figured out a way to do what I needed: first remove the punctuation, using the following function:

import string
import unicodedata
from pyspark.sql.functions import *

def normalizeData(text):
    # map every punctuation character to a space
    replace_punctuation = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
    # decompose accented characters so the accents can be dropped
    nfkd_form = unicodedata.normalize('NFKD', unicode(text))
    dataContent = nfkd_form.encode('ASCII', 'ignore').translate(replace_punctuation)
    # collapse all runs of whitespace into single spaces
    dataContentSingleLine = ' '.join(dataContent.split())
    return dataContentSingleLine

udfNormalizeData = udf(lambda text: normalizeData(text))
tweetsNorm = tweetImport.select(tweetImport.date,
                                udfNormalizeData(lower(tweetImport.text)).alias('text'))
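
For illustration, here is a minimal sketch of what normalizeData does to a sample tweet (the sample string is invented for this example; assumes Python 2, consistent with the unicode() and string.maketrans calls above):

sample = u'confira o v\xeddeo: http://t.co/abc123!'
print(normalizeData(sample))
# accents are reduced to plain ASCII and all punctuation,
# including the ':' and '/' of the URL, becomes spaces:
# confira o video http t co abc123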

First try decoding the text:

str(text).decode('utf-8-sig') 

and then encoding it:

str(text).encode('utf-8') 
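
Applied to the question's UDF, that suggestion might look like the sketch below (a hypothetical adaptation, untested against the asker's data; assumes Python 2 as in the question, and stripUrls is an illustrative name; the isinstance guard avoids calling str() on a unicode object, which is what triggers the implicit ASCII encode):

import re
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def stripUrls(text):
    # hypothetical helper: work on a unicode object throughout
    if isinstance(text, str):
        text = text.decode('utf-8-sig')
    cleaned = re.sub(r"(\w+:\/\/\S+)", ":url:", text)
    # encode back to UTF-8 bytes only at the very end
    return cleaned.encode('utf-8')

normalizeTextUDF = udf(stripUrls, StringType())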