2017-03-04 124 views
0

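Extracting data from a CSV file using numpy

I'm working with numpy and trying to find which platform sold the most copies in the NA region.

I have a CSV file holding a large amount of data that looks like this: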

Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales 
1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74 
2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24 
3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82 
4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33 
5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37 
6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26 
7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01 
8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02 
9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62 
10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31 
11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76 

I want to print the platform with the highest sales in the NA region, along with its NA sales total. How can I do this?

+0

What have you tried so far? – fodma1

+0

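I hard-coded all the different platforms as masks, e.g.: maskNES = (data[:,2] == 'NES'). Then I assigned the sum to a variable: pfNES = data[maskNES][:,6].sum(). Finally I compared all the platforms to find the one with the highest value. It just seems like a silly way to do it. What if I had thousands of different platforms? Oh, and I put the csv data into a matrix called 'data'. – Rainoa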

Answers

1

With pandas, this is fairly straightforward.

Code:

import pandas as pd

# read csv data into a dataframe ('data' is the StringIO defined further down)
df = pd.read_csv(data, skipinitialspace=True)

# total NA sales per platform
platform_roll_up = df.groupby('Platform')['NA_Sales'].sum()

# find the platform with the highest total
idx_max = platform_roll_up.idxmax()

# show platform and sales for max
print(idx_max, platform_roll_up[idx_max])

Result:

Wii 101.71 

Test data:

from io import StringIO

data = StringIO(u"""
    Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales 
    1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74 
    2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24 
    3,Mario Kart Wii,Wii,2008,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82 
    4,Wii Sports Resort,Wii,2009,Sports,Nintendo,15.75,11.01,3.28,2.96,33 
    5,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,Nintendo,11.27,8.89,10.22,1,31.37 
    6,Tetris,GB,1989,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26 
    7,New Super Mario Bros.,DS,2006,Platform,Nintendo,11.38,9.23,6.5,2.9,30.01 
    8,Wii Play,Wii,2006,Misc,Nintendo,14.03,9.2,2.93,2.85,29.02 
    9,New Super Mario Bros. Wii,Wii,2009,Platform,Nintendo,14.59,7.06,4.7,2.26,28.62 
    10,Duck Hunt,NES,1984,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31 
    11,Nintendogs,DS,2005,Simulation,Nintendo,9.07,11,1.93,2.75,24.76 
""") 
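The roll-up in this answer is a Series indexed by platform, so the total for any single platform can be looked up by label. The following is only a minimal sketch: the four-row sample is trimmed from the question's data, and the names `totals` and `wii_total` are illustrative, not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Small sample trimmed from the question's data (only two columns kept).
data = StringIO("""Platform,NA_Sales
Wii,41.49
NES,29.08
Wii,15.85
GB,11.27
""")

df = pd.read_csv(data)

# Total NA sales per platform, largest first.
totals = df.groupby("Platform")["NA_Sales"].sum().sort_values(ascending=False)
print(totals)

# Total for one specific platform, looked up by its label.
wii_total = totals["Wii"]
print("Wii", wii_total)
```

Sorting is optional; it just makes the ranking of every platform visible at once instead of only the maximum.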
+0

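Thanks for the quick answer! I'm trying to use a solution that works with numpy.ndarray, which has no iloc attribute. Should I stay away from ndarray in this case? Also, I'm trying to find the total NA_Sales across all products for platform X, rather than the single highest value. By the way, I'm new to python :) – Rainoa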

+0

Thank you! I really appreciate the answer; your edited version is exactly what I was looking for. – Rainoa

1

Loading this with genfromtxt is straightforward:

In [280]: data=np.genfromtxt('stack42602390.csv',delimiter=',',names=True, dtype=None) 

In [281]: data 
Out[281]: 
array([ (1, b'Wii Sports', b'Wii', 2006, b'Sports', b'Nintendo', 41.49, 29.02, 3.77, 8.46, 82.74), 
     (2, b'Super Mario Bros.', b'NES', 1985, b'Platform', b'Nintendo', 29.08, 3.58, 6.81, 0.77, 40.24), 
     (3, b'Mario Kart Wii', b'Wii', 2008, b'Racing', b'Nintendo', 15.85, 12.88, 3.79, 3.31, 35.82), 
.... 
     (11, b'Nintendogs', b'DS', 2005, b'Simulation', b'Nintendo', 9.07, 11. , 1.93, 2.75, 24.76)], 
     dtype=[('Rank', '<i4'), ('Name', 'S25'), ('Platform', 'S3'), ('Year', '<i4'), ('Genre', 'S12'), ('Publisher', 'S8'), ('NA_Sales', '<f8'), ('EU_Sales', '<f8'), ('JP_Sales', '<f8'), ('Other_Sales', '<f8'), ('Global_Sales', '<f8')]) 

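The b'string' notation is just the Python 3 way of displaying bytestrings, which is the default string format from genfromtxt. The b prefix does not show up in Py2.

The result is a structured array, with distinct field names and types. It is not a 2d array with rows and columns.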
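As a small illustration of what the bytestring fields mean in practice: comparisons against such a field need a bytes literal, or the field can be decoded to ordinary strings first. This is only a sketch; the `platform` array below is a hand-built stand-in for a field like data['Platform'], not output from the file.

```python
import numpy as np

# Stand-in for a bytes field such as data['Platform'] (dtype 'S3').
platform = np.array([b"Wii", b"NES", b"Wii"])

# Comparing against a bytes literal gives the expected boolean mask.
mask = platform == b"Wii"
print(mask)

# Decoding converts the field to regular (unicode) strings.
decoded = np.char.decode(platform, "utf-8")
print(decoded)
```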

The NA_Sales data:

In [282]: data['NA_Sales'] 
Out[282]: 
array([ 41.49, 29.08, 15.85, 15.75, 11.27, 23.2 , 11.38, 14.03, 
     14.59, 26.93, 9.07]) 

And the index of the largest of these:

In [283]: np.argmax(data['NA_Sales']) 
Out[283]: 0 

And the corresponding record:

In [284]: data[0] 
Out[284]: (1, b'Wii Sports', b'Wii', 2006, b'Sports', b'Nintendo', 41.49, 29.02, 3.77, 8.46, 82.74) 

To make the most of this array you'll have to read up on structured arrays.
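The argmax above picks the single best-selling record. For the per-platform totals the question asks about, one possible numpy-only sketch groups the rows with np.unique and sums them with np.bincount. The sample is trimmed to four columns of the question's data, and passing encoding="utf-8" (an assumption that requires a reasonably recent numpy) yields str fields instead of bytes.

```python
import io

import numpy as np

# The question's rows, trimmed to the columns needed here.
csv_text = io.StringIO(
    "Rank,Name,Platform,NA_Sales\n"
    "1,Wii Sports,Wii,41.49\n"
    "2,Super Mario Bros.,NES,29.08\n"
    "3,Mario Kart Wii,Wii,15.85\n"
    "4,Wii Sports Resort,Wii,15.75\n"
    "5,Pokemon Red/Pokemon Blue,GB,11.27\n"
    "6,Tetris,GB,23.2\n"
    "7,New Super Mario Bros.,DS,11.38\n"
    "8,Wii Play,Wii,14.03\n"
    "9,New Super Mario Bros. Wii,Wii,14.59\n"
    "10,Duck Hunt,NES,26.93\n"
    "11,Nintendogs,DS,9.07\n"
)

data = np.genfromtxt(csv_text, delimiter=",", names=True,
                     dtype=None, encoding="utf-8")

# Unique platform names, plus for each row the index of its platform.
platforms, inverse = np.unique(data["Platform"], return_inverse=True)

# Sum NA_Sales within each platform group.
totals = np.bincount(inverse, weights=data["NA_Sales"])

# Platform with the largest summed NA sales.
best = np.argmax(totals)
print(platforms[best], totals[best])
```

This replaces the one-mask-per-platform approach with a single pass, so it scales to any number of distinct platforms.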

+0

I tried this solution but ran into a problem: further down, my CSV file has titles with commas inside them, and I can't add quotechar='"' to np.genfromtxt – Rainoa

+0

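The 'csv' package handles quotes, but the 'numpy' readers don't. 'genfromtxt' accepts input from anything that feeds it lines, so you can preprocess the lines, cleaning them up so they can be parsed with a simple delimiter. This has been discussed in many previous SO questions. – hpaulj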

+0

A recent example of 'genfromtxt' with filtered input: http://stackoverflow.com/a/42593389/901925 – hpaulj
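A rough sketch of the preprocessing idea from the comments above, not the code from the linked answer: let the csv module deal with the quoted fields, re-join each row with a delimiter that does not occur in the data (a tab is assumed here), and hand the cleaned lines to genfromtxt. The sample rows and the helper name `retab` are hypothetical, and encoding="utf-8" assumes a reasonably recent numpy.

```python
import csv
import io

import numpy as np

# Hypothetical sample: the second data row has a quoted title containing a comma.
raw = io.StringIO(
    'Rank,Name,Platform,NA_Sales\n'
    '1,Wii Sports,Wii,41.49\n'
    '2,"Example Game, The",NES,29.08\n'
)

def retab(lines):
    """Parse quoted CSV with the csv module and re-emit tab-separated lines."""
    for row in csv.reader(lines):
        yield "\t".join(row)

# genfromtxt accepts a list of lines, so pass the cleaned-up rows directly.
data = np.genfromtxt(list(retab(raw)), delimiter="\t", names=True,
                     dtype=None, encoding="utf-8")

print(data["Name"])
print(data["NA_Sales"].sum())
```

This only works if no field contains a tab; any other character that is absent from the data could serve as the delimiter instead.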