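Storing JSON dictionaries from the Twitter Streaming API with PyMongo
I currently have a large number of tweets that I am going to store on a server in my lab. However, I have a question to settle about how I plan to do that.
For example, a tweet has the following format: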
{
"contributors": null,
"coordinates": null,
"created_at": "Tue Jul 10 17:09:12 +0000 2012",
"entities": {
"hashtags": [{
"indices": [62, 78],
"text": "thestrongnation"
}],
"urls": [],
"user_mentions": [{
"id": 376483630,
"id_str": "376483630",
"indices": [0, 8],
"name": "SherryHonig",
"screen_name": "sahonig"
}]
},
"favorited": false,
"geo": null,
"id": 222739261219282945,
"id_str": "222739261219282945",
"in_reply_to_screen_name": "sahonig",
"in_reply_to_status_id": 222695060528037889,
"in_reply_to_status_id_str": "222695060528037889",
"in_reply_to_user_id": 376483630,
"in_reply_to_user_id_str": "376483630",
"place": {
"attributes": {},
"bounding_box": {
"coordinates": [
[
[-106.645646, 25.837164000000001],
[-93.508038999999997, 25.837164000000001],
[-93.508038999999997, 36.500703999999999],
[-106.645646, 36.500703999999999]
]
],
"type": "Polygon"
},
"country": "United States",
"country_code": "US",
"full_name": "Texas, US",
"id": "e0060cda70f5f341",
"name": "Texas",
"place_type": "admin",
"url": "http://api.twitter.com/1/geo/id/e0060cda70f5f341.json"
},
"retweet_count": 0,
"retweeted": false,
"source": "web",
"text": "@sahonig BOOM !!!! I feel a 1 coming on!!! Awesome! #thestrongnation",
"truncated": false,
"user": {
"contributors_enabled": false,
"created_at": "Wed Feb 15 14:40:48 +0000 2012",
"default_profile": false,
"default_profile_image": false,
"description": "Living life on 30A & doing it my way. My mind is Stronger than physical challenge. Runner, Crosfit, Fitness Challenges. Proud member of #thestrongnation. ",
"favourites_count": 17,
"follow_request_sent": null,
"followers_count": 215,
"following": null,
"friends_count": 184,
"geo_enabled": true,
"id": 493181025,
"id_str": "493181025",
"is_translator": false,
"lang": "en",
"listed_count": 4,
"location": "Seagrove Beach, FL",
"name": "30A My Way \u2600",
"notifications": null,
"profile_background_color": "c0deed",
"profile_background_image_url": "http://a0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
"profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
"profile_background_tile": true,
"profile_image_url": "http://a0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
"profile_image_url_https": "https://si0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
"profile_link_color": "0084B4",
"profile_sidebar_border_color": "C0DEED",
"profile_sidebar_fill_color": "DDEEF6",
"profile_text_color": "333333",
"profile_use_background_image": true,
"protected": false,
"screen_name": "30A_MyWay",
"show_all_inline_media": false,
"statuses_count": 1731,
"time_zone": "Central Time (US & Canada)",
"url": null,
"utc_offset": -21600,
"verified": false
}
}
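This is, of course, a Python dictionary that happens to follow the JSON format. MongoDB conveniently accepts documents in this format, but the thing is, I don't want all of the information provided. The Streaming API gives me 20 fields, when I really only want to work with the user id, text, and location. I originally planned to parse each tweet and extract the parts I want, but I couldn't find a reliable parser, and given the conditions this is being developed under, I feel that writing one myself would be a waste of time.
However, another solution I'm considering is this: since the tweets are being read into MongoDB anyway, I could store only the parts of each dictionary that I want and discard the rest. The only problem is that the file format received from Twitter puts all of the dictionaries on the same line, so I feel I would have to do some kind of extraction either way.
Does anyone have any other suggestions?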
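For reference, here is a minimal sketch of the kind of extraction I have in mind, using only the standard `json` module. The helper name `extract_fields` and the output key names are made up for illustration, and I am assuming that "location" means the `location` field inside `user` (it could just as well be `place`).

```python
import json


def extract_fields(raw_tweet):
    """Parse one tweet given as JSON text and keep only three fields.

    Hypothetical helper: the output keys are my own naming, and "location"
    is assumed to be user.location rather than the tweet's "place".
    """
    tweet = json.loads(raw_tweet)
    user = tweet.get("user") or {}  # tolerate a missing or null "user"
    return {
        "user_id": user.get("id"),
        "text": tweet.get("text"),
        "location": user.get("location"),
    }


# A cut-down tweet with the same shape as the example above.
sample = (
    '{"text": "hello", "geo": null, '
    '"user": {"id": 493181025, "location": "Seagrove Beach, FL"}}'
)
print(extract_fields(sample))
```

Fields that are absent simply come back as `None`, so nothing breaks on messages that do not have the full tweet layout.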
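To make the "everything on one line" problem concrete, this is roughly what I imagine the extraction would look like: a sketch that walks through concatenated JSON objects with `json.JSONDecoder.raw_decode` and stores only the trimmed dictionaries. The function names are hypothetical, `collection` is assumed to be a PyMongo collection (anything with an `insert_many` method works), and the database and collection names in the trailing comment are placeholders.

```python
import json


def iter_json_objects(blob):
    """Yield each JSON object from a string of back-to-back objects."""
    decoder = json.JSONDecoder()
    pos, length = 0, len(blob)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < length and blob[pos].isspace():
            pos += 1
        if pos >= length:
            break
        obj, pos = decoder.raw_decode(blob, pos)
        yield obj


def slim(tweet):
    """Keep only user id, text and location (assumed to be user.location)."""
    user = tweet.get("user") or {}
    return {
        "user_id": user.get("id"),
        "text": tweet.get("text"),
        "location": user.get("location"),
    }


def store_slim_tweets(collection, blob):
    """Insert the trimmed tweets into `collection`; return how many were stored."""
    # Skip messages without a "text" field (e.g. non-tweet notices).
    docs = [slim(t) for t in iter_json_objects(blob) if "text" in t]
    if docs:  # insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)


# Assumed usage with PyMongo (names are placeholders):
#   from pymongo import MongoClient
#   collection = MongoClient()["twitter"]["tweets"]
#   with open("tweets.json") as f:
#       store_slim_tweets(collection, f.read())
```

This reads the whole file into memory at once, which is fine for a sketch but probably not for a very large dump.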
The Python sample code using pymongo here http://stackoverflow.com/questions/10855518/optimization-dumping-json-from-a-streaming-api-to-mongo/10865813#10865813 should help a lot – 2012-07-11 22:43:09