2015-12-10

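How to store nested MongoDB documents in pandas without duplication

I am reading data from MongoDB and storing it in a pandas DataFrame for further exploratory analysis and machine learning. My MongoDB documents look like this: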

{ 
    "user_id" : "user_9", 
    "order_id" : "order_9", 
    "meals"  : 5, 
    "order_area" : "London", 

    "dish" : [ 
     { 
     "dish_id"   : "012" , 
     "dish_name"  : "ABC", 
     "dish_type"  : "Non-Veg",     
     "dish_price"  : 135, 
     "dish_quantity" : 2, 
     "ratings"   : 4, 
     "reviews"   : "blah blah blah", 
     "coupon_type"  : "Rs 20 off" 
     }, 
     { 
     "dish_id"   : "013" , 
     "dish_name"  : "XYZ", 
     "dish_type"  : "Non-Veg",     
     "dish_price"  : 125, 
     "dish_quantity" : 3, 
     "ratings"   : 4, 
     "reviews"   : "blah blah blah", 
     "coupon_type"  : "Rs 20 off" 
     }, 
    ], 
} 

Once I have the data in Python, I use json_normalize to load it into a DataFrame:

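from pandas import json_normalize  # older pandas: from pandas.io.json import json_normalize
df = json_normalize(list(db.dataset2.find()), 'dish',
                    ['_id', 'user_id', 'order_id', 'order_time', 'meals', 'order_area'])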

This gives me the following in pandas, with the dish-related attributes split into separate columns:

coupon_type  dish_id dish_name dish_price dish_quantity 
0  Rs 20 off  012  ABC  135   2 
1  Rs 20 off  013  XYZ  125   3 

    ratings reviews  coupon_type user_id order_id meals order_area 
0 4  blah blah blah Rs 20 off  9  9   5  London 
1 4  blah blah blah Rs 20 off  9  9   5  London 

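The problem is that the order-level data (user_id, order_id, meals, _id and order_area) is duplicated on every dish row. Is there another way to store this data in a DataFrame without the duplication?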


I had not heard of the 'pandas' library before, so the title of this question caught my interest :)

Answer


You may be looking for a MultiIndex, which at least avoids the appearance of duplication (see docs):

df = json_normalize(data, 'dish', ['user_id', 'order_id', 'meals', 'order_area']) 
df = df.set_index(['user_id','order_id', 'meals', 'order_area']) 

            coupon_type dish_id dish_name dish_price \ 
user_id order_id meals order_area            
user_9 order_9 5  London  Rs 20 off  012  ABC   135 
            Rs 20 off  013  XYZ   125 

            dish_quantity dish_type ratings \ 
user_id order_id meals order_area          
user_9 order_9 5  London     2 Non-Veg  4 
               3 Non-Veg  4 

              reviews 
user_id order_id meals order_area     
user_9 order_9 5  London  blah blah blah 
            blah blah blah 
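
For anyone who wants to reproduce this without a MongoDB instance, here is a self-contained sketch. It assumes pandas 1.0 or later (where json_normalize is exposed as pd.json_normalize) and uses an in-memory list named data as a stand-in for the query result; the dish entries are abridged from the document in the question.

```python
import pandas as pd

# Stand-in for list(db.dataset2.find()): one order with two nested dishes.
data = [{
    "user_id": "user_9",
    "order_id": "order_9",
    "meals": 5,
    "order_area": "London",
    "dish": [
        {"dish_id": "012", "dish_name": "ABC", "dish_price": 135, "dish_quantity": 2},
        {"dish_id": "013", "dish_name": "XYZ", "dish_price": 125, "dish_quantity": 3},
    ],
}]

# Order-level fields that json_normalize copies onto every dish row.
meta = ["user_id", "order_id", "meals", "order_area"]

# One row per dish, with the meta fields appended as extra columns.
df = pd.json_normalize(data, record_path="dish", meta=meta)

# Moving the order-level fields into the index groups the dish rows under them.
df = df.set_index(meta)
print(df)
```

Note that the index still has one entry per dish row; pandas simply omits repeated index labels when printing, which is why the output looks deduplicated.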

Thank you very much, this is what I wanted. But will I be able to fit machine learning models on 'MultiIndex' data? – Neil
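
On the follow-up about machine learning: estimators generally work on the column values and do not treat the row index as features, so any index level that should act as a feature has to be moved back into a column first. A minimal sketch of that step, assuming pandas and a small hand-built frame shaped like the output above (the frame and the choice of feature columns are illustrative, not from the thread):

```python
import pandas as pd

# Small illustrative frame shaped like the MultiIndex output in the answer.
index = pd.MultiIndex.from_tuples(
    [("user_9", "order_9", 5, "London"), ("user_9", "order_9", 5, "London")],
    names=["user_id", "order_id", "meals", "order_area"],
)
df = pd.DataFrame(
    {"dish_price": [135, 125], "dish_quantity": [2, 3], "ratings": [4, 4]},
    index=index,
)

# reset_index() turns every index level back into an ordinary column.
flat = df.reset_index()

# Hypothetical feature selection: pick numeric columns and convert them
# to a 2-D NumPy array, a form most estimators accept.
features = flat[["meals", "dish_price", "dish_quantity"]].to_numpy()
print(flat.columns.tolist())
print(features.shape)
```

The MultiIndex view is convenient for exploration, while the flattened frame is the one to hand to a model; df.reset_index() converts from the first to the second at any point.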