2012-11-26 58 views
2

我使用disco爲許多不同目的運行數十個地圖縮減作業。我的數據變得非常龐大,我想我會嘗試使用DDFS而不是標準的txt文件進行更改。從DDFS讀取數據ValueError:沒有可以解碼的JSON對象

我跟着DISCO map/reduce example Counting Words as a map/reduce job,沒有太大困難,在別人的幫助下,Reading JSON specific data into DISCO我已經過去了我最近的一個問題。

我試圖讀取數據進出ddfs以更好的塊和分發它,但我有點麻煩。

下面是一個例子文件:

# ddfs chunk data:test1 ./file.txt 
created: disco://localhost/ddfs/vol0/blob/44/file_txt-0$549-db27b-125e1 

我測試該文件確實裝入直接數字頻率合成與:

# ddfs xcat data:test1 
{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "I'll call him back tomorrow I guess", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016843603968", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/305726905/FASHION-3.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1818996723/image_normal.jpg", "profile_sidebar_fill_color": "292727", "is_translator": false, "id": 113532729, "profile_text_color": "000000", "followers_count": 78, "protected": false, "location": "With My Niggas In Paris!", "default_profile_image": false, "listed_count": 0, "utc_offset": -21600, "statuses_count": 6733, "description": "Made in CHINA., Educated && Making My Own $$. Fear GOD && Put Him 1st. #TeamFollowBack #TeamiPhone\n", "friends_count": 74, "profile_link_color": "b03f3f", "profile_image_url": "http://a2.twimg.com/profile_images/1818996723/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "1f9199", "id_str": "113532729", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/305726905/FASHION-3.png", "name": "Bee'Jay", "lang": "en", "profile_background_tile": true, "favourites_count": 19, "screen_name": "OohMyBEEsNice", "url": "http://www.bitchimpaid.org", "created_at": "Fri Feb 12 03:32:54 +0000 2010", "contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "000000", "default_profile": false, "following": null}, "in_reply_to_screen_name": null, "retweet_count": 0, "geo": null, "id": 168931016843603968, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"} 
{"favorited": false, "in_reply_to_user_id": 50940453, "contributors": null, "truncated": false, "text": "@LegaMrvica @MimozaBand makasi om artis :D kadoo kadoo", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": "168653037894770688", "coordinates": null, "in_reply_to_user_id_str": "50940453", "entities": {"user_mentions": [{"indices": [0, 11], "screen_name": "LegaMrvica", "id": 50940453, "name": "Lega_thePianis", "id_str": "50940453"}, {"indices": [12, 23], "screen_name": "MimozaBand", "id": 375128905, "name": "Mimoza", "id_str": "375128905"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": 168653037894770688, "id_str": "168931016868761600", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "profile_sidebar_fill_color": "DDFFCC", "is_translator": false, "id": 48293450, "profile_text_color": "333333", "followers_count": 182, "protected": false, "location": "\u00dcT: -6.906799,107.622383", "default_profile_image": false, "listed_count": 0, "utc_offset": -28800, "statuses_count": 3052, "description": "Fashion design maranatha '11 // traditional dancer (bali) at sanggar tampak siring & Natya Nataraja", "friends_count": 206, "profile_link_color": "0084B4", "profile_image_url": "http://a3.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "9AE4E8", "id_str": "48293450", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "name": "nana afiff", "lang": "en", "profile_background_tile": true, "favourites_count": 2, "screen_name": "hasnfebria", "url": null, "created_at": "Thu Jun 18 08:50:29 +0000 2009", "contributors_enabled": false, "time_zone": "Pacific Time (US & Canada)", "profile_sidebar_border_color": "BDDCAD", "default_profile": false, "following": null}, "in_reply_to_screen_name": "LegaMrvica", "retweet_count": 0, "geo": null, "id": 168931016868761600, "source": "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry\u00ae</a>"} 
{"favorited": false, "in_reply_to_user_id": 27260086, "contributors": null, "truncated": false, "text": "@justinbieber u were born to be somebody, and u're super important in beliebers' life. thanks for all biebs. I love u. follow me? 84", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": "27260086", "entities": {"user_mentions": [{"indices": [0, 13], "screen_name": "justinbieber", "id": 27260086, "name": "Justin Bieber", "id_str": "27260086"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016856178688", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/416005864/Captura.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "profile_sidebar_fill_color": "f5e7f3", "is_translator": false, "id": 406750700, "profile_text_color": "333333", "followers_count": 1122, "protected": false, "location": "Adentro de una supra.", "default_profile_image": false, "listed_count": 0, "utc_offset": -14400, "statuses_count": 20966, "description": "Mi \u00eddolo es @justinbieber , si te gusta \u00a1genial!, si no, solo respetalo. El cambi\u00f3 mi vida completamente y mi sue\u00f1o es conocerlo #TrueBelieber . ", "friends_count": 1015, "profile_link_color": "9404b8", "profile_image_url": "http://a1.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "f9fcfa", "id_str": "406750700", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/416005864/Captura.JPG", "name": "neversaynever,right?", "lang": "es", "profile_background_tile": false, "favourites_count": 22, "screen_name": "True_Belieebers", "url": "http://www.wehavebieber-fever.tumblr.com", "created_at": "Mon Nov 07 04:17:40 +0000 2011", "contributors_enabled": false, "time_zone": "Santiago", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "in_reply_to_screen_name": "justinbieber", "retweet_count": 0, "geo": null, "id": 168931016856178688, "source": "<a href=\"http://yfrog.com\" rel=\"nofollow\">Yfrog</a> 
file.txt的

{"favorited": false, "in_reply_to_user_id": null, "contributors": null, "truncated": false, "text": "I'll call him back tomorrow I guess", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": null, "entities": {"user_mentions": [], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016843603968", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/305726905/FASHION-3.png", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1818996723/image_normal.jpg", "profile_sidebar_fill_color": "292727", "is_translator": false, "id": 113532729, "profile_text_color": "000000", "followers_count": 78, "protected": false, "location": "With My Niggas In Paris!", "default_profile_image": false, "listed_count": 0, "utc_offset": -21600, "statuses_count": 6733, "description": "Made in CHINA., Educated && Making My Own $$. Fear GOD && Put Him 1st. #TeamFollowBack #TeamiPhone\n", "friends_count": 74, "profile_link_color": "b03f3f", "profile_image_url": "http://a2.twimg.com/profile_images/1818996723/image_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "1f9199", "id_str": "113532729", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/305726905/FASHION-3.png", "name": "Bee'Jay", "lang": "en", "profile_background_tile": true, "favourites_count": 19, "screen_name": "OohMyBEEsNice", "url": "http://www.bitchimpaid.org", "created_at": "Fri Feb 12 03:32:54 +0000 2010", "contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "profile_sidebar_border_color": "000000", "default_profile": false, "following": null}, "in_reply_to_screen_name": null, "retweet_count": 0, "geo": null, "id": 168931016843603968, "source": "<a href=\"http://twitter.com/#!/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"} 
{"favorited": false, "in_reply_to_user_id": 50940453, "contributors": null, "truncated": false, "text": "@LegaMrvica @MimozaBand makasi om artis :D kadoo kadoo", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": "168653037894770688", "coordinates": null, "in_reply_to_user_id_str": "50940453", "entities": {"user_mentions": [{"indices": [0, 11], "screen_name": "LegaMrvica", "id": 50940453, "name": "Lega_thePianis", "id_str": "50940453"}, {"indices": [12, 23], "screen_name": "MimozaBand", "id": 375128905, "name": "Mimoza", "id_str": "375128905"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": 168653037894770688, "id_str": "168931016868761600", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "profile_sidebar_fill_color": "DDFFCC", "is_translator": false, "id": 48293450, "profile_text_color": "333333", "followers_count": 182, "protected": false, "location": "\u00dcT: -6.906799,107.622383", "default_profile_image": false, "listed_count": 0, "utc_offset": -28800, "statuses_count": 3052, "description": "Fashion design maranatha '11 // traditional dancer (bali) at sanggar tampak siring & Natya Nataraja", "friends_count": 206, "profile_link_color": "0084B4", "profile_image_url": "http://a3.twimg.com/profile_images/1803845596/Picture_20124_normal.jpg", "notifications": null, "show_all_inline_media": false, "geo_enabled": true, "profile_background_color": "9AE4E8", "id_str": "48293450", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/347686061/Galungan_dan_Kuningan.jpg", "name": "nana afiff", "lang": "en", "profile_background_tile": true, "favourites_count": 2, "screen_name": "hasnfebria", "url": null, "created_at": "Thu Jun 18 08:50:29 +0000 2009", "contributors_enabled": false, "time_zone": "Pacific Time (US & Canada)", "profile_sidebar_border_color": "BDDCAD", "default_profile": false, "following": null}, "in_reply_to_screen_name": "LegaMrvica", "retweet_count": 0, "geo": null, "id": 168931016868761600, "source": "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry\u00ae</a>"} 
{"favorited": false, "in_reply_to_user_id": 27260086, "contributors": null, "truncated": false, "text": "@justinbieber u were born to be somebody, and u're super important in beliebers' life. thanks for all biebs. I love u. follow me? 84", "created_at": "Mon Feb 13 05:34:27 +0000 2012", "retweeted": false, "in_reply_to_status_id_str": null, "coordinates": null, "in_reply_to_user_id_str": "27260086", "entities": {"user_mentions": [{"indices": [0, 13], "screen_name": "justinbieber", "id": 27260086, "name": "Justin Bieber", "id_str": "27260086"}], "hashtags": [], "urls": []}, "in_reply_to_status_id": null, "id_str": "168931016856178688", "place": null, "user": {"follow_request_sent": null, "profile_use_background_image": true, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/416005864/Captura.JPG", "verified": false, "profile_image_url_https": "https://si0.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "profile_sidebar_fill_color": "f5e7f3", "is_translator": false, "id": 406750700, "profile_text_color": "333333", "followers_count": 1122, "protected": false, "location": "Adentro de una supra.", "default_profile_image": false, "listed_count": 0, "utc_offset": -14400, "statuses_count": 20966, "description": "Mi \u00eddolo es @justinbieber , si te gusta \u00a1genial!, si no, solo respetalo. El cambi\u00f3 mi vida completamente y mi sue\u00f1o es conocerlo #TrueBelieber . ", "friends_count": 1015, "profile_link_color": "9404b8", "profile_image_url": "http://a1.twimg.com/profile_images/1808883280/Captura6_normal.JPG", "notifications": null, "show_all_inline_media": false, "geo_enabled": false, "profile_background_color": "f9fcfa", "id_str": "406750700", "profile_background_image_url": "http://a3.twimg.com/profile_background_images/416005864/Captura.JPG", "name": "neversaynever,right?", "lang": "es", "profile_background_tile": false, "favourites_count": 22, "screen_name": "True_Belieebers", "url": "http://www.wehavebieber-fever.tumblr.com", "created_at": "Mon Nov 07 04:17:40 +0000 2011", "contributors_enabled": false, "time_zone": "Santiago", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": null}, "in_reply_to_screen_name": "justinbieber", "retweet_count": 0, "geo": null, "id": 168931016856178688, "source": "<a href=\"http://yfrog.com\" rel=\"nofollow\">Yfrog</a>"} 

我用它加載到直接數字頻率合成

在這一點上一切都很好,我加載了以前的Stack Post產生的腳本:

from disco.core import Job, result_iterator 
import gzip 

def map(line, params): 
    import unicodedata 
    import json 
    r = json.loads(line).get('text') 
    s = unicodedata.normalize('NFD', r).encode('ascii', 'ignore') 
    for word in s.split(): 
     yield word, 1 

def reduce(iter, params): 
    from disco.util import kvgroup 
    for word, counts in kvgroup(sorted(iter)): 
     yield word, sum(counts) 

if __name__ == '__main__': 
    job = Job().run(input=["tag://data:test1"], 
        map=map, 
        reduce=reduce) 
    for word, count in result_iterator(job.wait(show=True)): 
     print word, count 

注:,該腳本運行文件,如果input=["file.txt"],但是當我用"tag://data:test1"運行它,我得到以下錯誤:

# DISCO_EVENTS=1 python count_normal_words.py 
[email protected]:db30e:25bd8: 
Status: [map] 0 waiting, 1 running, 0 done, 0 failed 
2012/11/25 21:43:26 master  New job initialized! 
2012/11/25 21:43:26 master  Starting job 
2012/11/25 21:43:26 master  Starting map phase 
2012/11/25 21:43:26 master  map:0 assigned to solice 
2012/11/25 21:43:26 master  ERROR: Job failed: Worker at 'solice' died: Traceback (most recent call last): 
    File "/home/DISCO/data/solice/01/[email protected]:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 329, in main                              
    job.worker.start(task, job, **jobargs)                      
    File "/home/DISCO/data/solice/01/[email protected]:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/__init__.py", line 290, in start                              
    self.run(task, job, **jobargs)                        
    File "/home/DISCO/data/solice/01/[email protected]:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 286, in run                             
    getattr(self, task.mode)(task, params)                      
    File "/home/DISCO/data/solice/01/[email protected]:db30e:25bd8/usr/local/lib/python2.7/site-packages/disco/worker/classic/worker.py", line 299, in map                             
    for key, val in self['map'](entry, params):                     
    File "count_normal_words.py", line 12, in map                     
    File "/usr/lib64/python2.7/json/__init__.py", line 326, in loads                
    return _default_decoder.decode(s)                       
    File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode                
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())                   
    File "/usr/lib64/python2.7/json/decoder.py", line 384, in raw_decode               
    raise ValueError("No JSON object could be decoded")                   
ValueError: No JSON object could be decoded                      

2012/11/25 21:43:26 master  WARN: Job killed 
Status: [map] 1 waiting, 0 running, 0 done, 1 failed 
Traceback (most recent call last): 
    File "count_normal_words.py", line 28, in <module> 
    for word, count in result_iterator(job.wait(show=True)): 
    File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 348, in wait 
    timeout, poll_interval * 1000) 
    File "/usr/local/lib/python2.7/site-packages/disco/core.py", line 309, in check_results 
    raise JobError(Job(name=jobname, master=self), "Status %s" % status) 
disco.error.JobError: Job [email protected]:db30e:25bd8 failed: Status dead 

的錯誤狀態:ValueError: No JSON object could be decoded。再次,這工作正常使用文本文件作爲輸入,但現在DDFS。

任何想法,我接受建議?

回答

1

看起來數據在加載到DDFS時是不完整的。加載的最後一行沒有作爲最終字符。因爲這樣的工作失敗了。我想我需要嘗試繼續處理非JSON行。

相關問題