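I'm trying to pull some Twitter data out of a MySQL database, and I ran into a lot of encoding errors while developing this code. The version below is the only workaround I found that runs, but the strings from the database land in the outfile full of \uXX escape sequences, as you can see here: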
[{..., "lang_tweet": "es", "text_tweet": "Recuerdo un d\u00eda de, *llamada a la 1:45*, \"Micho, me va a dar algo, estoy temblando, me tome un moster y un balium... Que me muero.!!\",...},...]
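I've gone round and round trying different solutions, but the truth is that I'm confused by encodings and by the abstractions around encoding and decoding. What can I do to fix this? Or would it be easier to just take the dirty JSON and 'parse' it, decoding those characters by hand?
In case you want to take a look, this is the code I use to query the database: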
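For context, the `\u00ed` sequences are standard JSON escapes rather than corruption: `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`), and any JSON parser turns them back into the original characters. A minimal sketch, assuming Python 3 and using a shortened, hand-copied excerpt of the output above:

```python
import json

# Shortened excerpt of the escaped output; the raw string keeps the
# backslash sequence literal, exactly as it appears in the outfile.
raw = r'{"lang_tweet": "es", "text_tweet": "Recuerdo un d\u00eda de"}'

# Parsing the JSON decodes the \uXXXX escape back into the real character.
record = json.loads(raw)
text = record["text_tweet"]
print(text)
```

If the file is only ever read back by a JSON parser, the escapes are harmless and no manual decoding step is needed.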
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pymysql
import collections
import json
conn = pymysql.connect(host='localhost', user='sut', passwd='r', db='tweetsjun2016')
cur = conn.cursor()
cur.execute("""
SELECT * FROM 20160607_tweets
WHERE 20160607_tweets.creation_date >= '2016-06-07 10:51'
AND 20160607_tweets.creation_date <= '2016-06-07 11:51'
AND 20160607_tweets.lang_tweet = "es"
AND 20160607_tweets.has_keyword = 1
AND 20160607_tweets.rt = 0
LIMIT 20
""")
objects_list = []
for row in cur:
    d = collections.OrderedDict()
    d['download_date'] = row[1]
    d['creation_date'] = row[2]
    d['id_user'] = row[5]
    d['favorited'] = row[7]
    d['lang_tweet'] = row[10]
    d['text_tweet'] = row[11].decode('latin1')
    d['rt'] = row[12]
    d['rt_count'] = row[13]
    d['has_keyword'] = row[19]
    objects_list.append(d)
    # print(row[11].decode('latin1')) <- looks perfect, it prints with accents and fine
j = json.dumps(objects_list, default=date_handler, encoding='latin1')
objects_file = "test23" + "_dicts"
f = open(objects_file,'w')
print >> f, j
cur.close()
conn.close()
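If the goal is an outfile that shows the accented characters directly, one option is to disable the ASCII escaping and write the file as UTF-8. The sketch below assumes Python 3 (the script above is Python 2, where `print >> f` and the `encoding` argument of `json.dumps` behave differently); `dump_readable` and the sample record are hypothetical names used only for illustration.

```python
import io
import json
import os
import tempfile


def dump_readable(objects, path):
    """Write objects as JSON, keeping non-ASCII characters unescaped."""
    # ensure_ascii=False leaves characters such as 'í' as they are
    # instead of emitting \u00ed.
    payload = json.dumps(objects, ensure_ascii=False)
    # The file encoding is stated explicitly so the platform default
    # (often cp1252 on Windows) is not used by accident.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(payload)
    return payload


# Hypothetical stand-in for objects_list.
sample = [{"lang_tweet": "es", "text_tweet": u"Recuerdo un d\u00eda de"}]
out_path = os.path.join(tempfile.mkdtemp(), "test23_dicts")
written = dump_readable(sample, out_path)
```

Whatever reads the file afterwards then has to open it as UTF-8 too.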
If I remove the .decode('latin1') call everywhere it is applied, I get this error:
Traceback (most recent call last):
  File "test.py", line 51, in <module>
    j = json.dumps(objects_list, default=date_handler)
  File "C:\Users\Vichoko\Anaconda2\lib\json\__init__.py", line 251, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "C:\Users\Vichoko\Anaconda2\lib\json\encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\Vichoko\Anaconda2\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xed in position 13: invalid continuation byte
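The error is consistent with the rows arriving as Latin-1 encoded byte strings: in Python 2, `json.dumps` assumes byte strings are UTF-8 unless told otherwise, and `0xed` is 'í' in Latin-1 but the start of an incomplete multi-byte sequence in UTF-8. A small sketch, assuming Python 3 and a hypothetical byte string built from the start of the sample tweet:

```python
# Hypothetical bytes: the start of the sample tweet, encoded as Latin-1.
raw_bytes = u"Recuerdo un d\u00eda de".encode("latin1")

try:
    # Decoding Latin-1 bytes as UTF-8 fails at the accented character.
    raw_bytes.decode("utf-8")
    failure_position = None
except UnicodeDecodeError as exc:
    failure_position = exc.start

# Decoding with the encoding the bytes were written in succeeds ...
as_latin1 = raw_bytes.decode("latin1")
# ... and the resulting text can then be re-encoded as UTF-8 safely.
as_utf8_bytes = as_latin1.encode("utf-8")
print(failure_position, repr(as_utf8_bytes))
```

This suggests the `.decode('latin1')` step is doing the right thing: it converts the raw bytes into text, and only the escaping on output makes the result look wrong.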
I really don't know in what form the strings travel from the db to my script.
Thanks for reading; any ideas would be much appreciated.
EDIT1: Here you can see how the JSON file gets exported, with the encoding errors in the text_tweet key-value: https://github.com/Vichoko/real-time-twit/blob/master/auto_labeling/json/tweets_sismos/tweetsago20160.json
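I'm starting to think it would be easier to just parse the outfile, find the regex '\u[a-f0-9]+' and replace each match with its corresponding value. What language or tool would you recommend for this kind of thing? Any ideas?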
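A caution about the regex idea, sketched under the assumption of Python 3: `[a-f0-9]+` is greedy, so in `d\u00eda` it also swallows the letters `d` and `a` that follow the escape. Matching exactly four hex digits avoids that, although letting a JSON parser do the decoding is simpler. `unescape_naive` is a hypothetical helper written for this illustration.

```python
import json
import re

# Exactly four hex digits per escape, as the JSON format defines it.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_naive(text):
    """Replace each \\uXXXX escape with the character it encodes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


escaped = r"Recuerdo un d\u00eda de"

# The pattern from the question matches too much: \u00eda, not \u00ed.
greedy = re.findall(r"\\u[a-f0-9]+", escaped)

fixed = unescape_naive(escaped)
# Same result via the JSON parser, by treating the text as a JSON string.
via_json = json.loads('"' + escaped + '"')
print(greedy, fixed, via_json)
```

The naive helper still mishandles escaped backslashes and the surrogate pairs used for emoji, both of which `json.loads` handles correctly.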