2013-11-04 12 views
3

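Parsing a tweet feed into a table in Python

I have a set of tweets saved to a .txt file. I want to put certain attributes into a sqlite table in Python. I created the table successfully.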

import pandas
import sqlite3

conn = sqlite3.connect('twitter.db')
c = conn.cursor()

# The SQL statement must be passed as a string; column names cannot contain hyphens.
c.execute("""
CREATE TABLE Tweet
(
    created_at VARCHAR2(25),
    id VARCHAR2(25),
    text VARCHAR2(25),
    source VARCHAR2(25),
    in_reply_to_user_id VARCHAR2(25),
    retweet_count VARCHAR2(25)
)
""")

Before even trying to add the parsed data to the database, I tried to create a DataFrame, just to look at it.

tweets = pandas.read_table('file.txt', sep=',')

The error I get:

CParserError: Error tokenizing data. C error: Expected 63 fields in line 3, saw 69 

My assumption is that commas not only separate the fields but also appear inside the strings.
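That assumption can be checked on a single line. The sketch below uses a shortened, made-up tweet (not taken from the file) and compares what a plain comma split sees with what the line actually contains once it is parsed as JSON.

```python
import json

# A shortened, made-up tweet line with commas inside the "text" value.
line = '{"created_at":"Fri Oct 11 00:00:03 +0000 2013","text":"one, two, three","retweet_count":0}'

naive_fields = line.split(',')  # what a plain comma separator sees
record = json.loads(line)       # what the line actually contains

print(len(naive_fields))  # number of comma-separated pieces
print(len(record))        # number of top-level fields
```

Because the number of commas depends on each tweet's content, the number of pieces can differ from line to line, which is consistent with the "Expected 63 fields in line 3, saw 69" error.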

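Also, the Twitter data is in a format I have not worked with before. Each field starts with the variable name in quotation marks, then a colon, then the data, which is itself wrapped in more quotation marks. For example: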

"created_at":"Fri Oct 11 00:00:03 +0000 2013", 

So how can I turn this into a standard table format, with the variable names at the top?

A complete example of a tweet looks like this:

{"created_at":"Fri Oct 11 00:00:03 +0000 2013","id":388453908911095800,"id_str":"388453908911095809","text":"LAGI PUN VISITORS DATANG PUKUL 9 AH","source":"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":447800506,"id_str":"447800506","name":"§yazwina·","screen_name":"_SAireen","location":"SSP","url":"http://flavors.me/syazwinaaireen#","description":"Absence makes the heart grow fonder. Stay us x @_DFitri's","protected":false,"followers_count":806,"friends_count":702,"listed_count":2,"created_at":"Tue Dec 27 08:29:53 +0000 2011","favourites_count":7478,"utc_offset":28800,"time_zone":"Beijing","geo_enabled":true,"verified":false,"statuses_count":32558,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http://a0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_tile":true,"profile_image_url":"http://a0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_image_url_https":"https://si0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_banner_url":"https://pbs.twimg.com/profile_banners/447800506/1369969522","profile_link_color":"FA0096","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"it"}
+2

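Unfortunately, you can't directly convert nested JSON into a flat tabular structure such as a table or a pandas DataFrame, because they are inherently different structures. Take a look at [Python's JSON library](http://docs.python.org/2/library/json.html) and [pandas' read_json method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.read_json.html). You will need to do some munging of the Twitter data to get it into a tabular format.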
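One possible form of that munging is to flatten the nested objects into dotted column names. The sketch below is only an illustration: `flatten` is a hypothetical helper (not part of the `json` module or pandas), and the input line is a shortened, made-up stand-in for one line of the tweet file.

```python
import json

def flatten(obj, prefix=''):
    """Turn nested dicts into one flat dict with dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        else:
            flat[name] = value  # lists and scalar values are kept as-is
    return flat

# Shortened, made-up stand-in for one line of the tweet file.
line = ('{"id_str":"388453908911095809",'
        '"user":{"screen_name":"example_user","followers_count":806},'
        '"entities":{"hashtags":[]},"retweet_count":0}')

row = flatten(json.loads(line))
print(sorted(row))
```

A list of such flat dictionaries, one per tweet, can then be passed to `pandas.DataFrame` to get one column per dotted key.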

Answers

0

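I imagine there is already a Python library for this, but I was able to get your tweet string to parse as a dictionary once I replaced these terms, which appear unquoted: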

false to False 
true to True 
null to None 

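I simply assigned the whole braced expression to a variable, which creates a dictionary. You can then print the keys as the headers and each of the values as an entry.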
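A minimal sketch of this approach, assuming a shortened, made-up tweet string in the same format. It uses `ast.literal_eval` in place of pasting the expression into source code, and `csv.writer` to print the header and the row. The plain text replacement is naive and would also change those words inside a tweet's text.

```python
import ast
import csv
import io

# Shortened, made-up tweet string in the same format as the example.
raw = '{"id_str":"388453908911095809","truncated":false,"geo":null,"retweet_count":0,"favorited":true}'

# Swap the JSON literals for their Python spellings (naive textual replacement).
for json_word, py_word in (('false', 'False'), ('true', 'True'), ('null', 'None')):
    raw = raw.replace(json_word, py_word)

tweet = ast.literal_eval(raw)  # accepts only Python literals, unlike eval()

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(tweet.keys())    # header row: the variable names
writer.writerow(tweet.values())  # one data row: the values
print(out.getvalue())
```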

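Fixing or quoting those three values might also make the pandas parser happier, although I think the csv reader may cope better with all the embedded commas and the single and double quotes. I first suspected that a JSON parser would still choke on the URLs containing colons, but colons inside quoted strings are valid JSON, so escaping them should not be necessary if you try the JSON route.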
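For the JSON route, here is a minimal sketch using only the standard library. It assumes each line of the file holds one complete tweet. The line below is a shortened copy of the example tweet, and the `TEXT`/`INTEGER` column types and the in-memory database are illustrative choices, not the schema from the question.

```python
import json
import sqlite3

# Shortened copy of the example tweet; "source" holds a URL with a colon
# and escaped double quotes.
line = r'{"created_at":"Fri Oct 11 00:00:03 +0000 2013","id_str":"388453908911095809","text":"LAGI PUN VISITORS DATANG PUKUL 9 AH","source":"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>","in_reply_to_user_id":null,"retweet_count":0}'

tweet = json.loads(line)  # false/true/null become False/True/None

conn = sqlite3.connect(':memory:')  # in-memory database, for demonstration only
conn.execute("""CREATE TABLE Tweet (created_at TEXT, id TEXT, text TEXT,
                source TEXT, in_reply_to_user_id TEXT, retweet_count INTEGER)""")

# "?" placeholders let sqlite3 handle the commas and quotes inside the values.
conn.execute(
    "INSERT INTO Tweet VALUES (?, ?, ?, ?, ?, ?)",
    (tweet['created_at'], tweet['id_str'], tweet['text'],
     tweet['source'], tweet['in_reply_to_user_id'], tweet['retweet_count']),
)
conn.commit()

print(conn.execute("SELECT id, source FROM Tweet").fetchone())
```

The sketch stores `id_str` instead of `id` because, in the example tweet, the numeric `id` (388453908911095800) no longer matches `id_str` (388453908911095809).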