How to use datasets.fetch_mldata() in sklearn?

I am trying to run the code below for a short machine learning algorithm:

import re 
import argparse 
import csv 
from collections import Counter 
from sklearn import datasets 
import sklearn 
from sklearn.datasets import fetch_mldata 

dataDict = datasets.fetch_mldata('MNIST Original')

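In this piece of code, I am trying to read the dataset 'MNIST Original', hosted at mldata.org, via sklearn. This results in the following error (there are more lines of code, but I am getting the error at this particular line):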

Traceback (most recent call last): 
    File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1481, in <module> 
    debugger.run(setup['file'], None, None) 
    File "C:\Program Files (x86)\JetBrains\PyCharm 2.7.3\helpers\pydev\pydevd.py", line 1124, in run 
    pydev_imports.execfile(file, globals, locals) #execute the script 
    File "C:/Users/sony/PycharmProjects/Machine_Learning_Homework1/zeroR.py", line 131, in <module> 
    dataDict = datasets.fetch_mldata('MNIST Original') 
    File "C:\Anaconda\lib\site-packages\sklearn\datasets\mldata.py", line 157, in fetch_mldata 
    matlab_dict = io.loadmat(matlab_file, struct_as_record=True) 
    File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio.py", line 176, in loadmat 
    matfile_dict = MR.get_variables(variable_names) 
    File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 294, in get_variables 
    res = self.read_var_array(hdr, process) 
    File "C:\Anaconda\lib\site-packages\scipy\io\matlab\mio5.py", line 257, in read_var_array 
    return self._matrix_reader.array_from_header(header, process) 
    File "mio5_utils.pyx", line 624, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5717) 
    File "mio5_utils.pyx", line 653, in scipy.io.matlab.mio5_utils.VarReader5.array_from_header (scipy\io\matlab\mio5_utils.c:5147) 
    File "mio5_utils.pyx", line 721, in scipy.io.matlab.mio5_utils.VarReader5.read_real_complex (scipy\io\matlab\mio5_utils.c:6134) 
    File "mio5_utils.pyx", line 424, in scipy.io.matlab.mio5_utils.VarReader5.read_numeric (scipy\io\matlab\mio5_utils.c:3704) 
    File "mio5_utils.pyx", line 360, in scipy.io.matlab.mio5_utils.VarReader5.read_element (scipy\io\matlab\mio5_utils.c:3429) 
    File "streams.pyx", line 181, in scipy.io.matlab.streams.FileStream.read_string (scipy\io\matlab\streams.c:2711) 
IOError: could not read bytes

I have tried searching online, but there is hardly any help available. Any expert help in resolving this error would be much appreciated.

TIA。

2013-10-22 Patthebug

It's 'MNIST original'. Use a lowercase 'o'.

2013-10-23 00:01:17

Hi, thanks for the reply. I tried the lowercase 'o' as well, still the same error. – Patthebug

Using a lowercase or an uppercase 'o' makes no difference. Internally, sklearn [lowercases everything](https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/datasets/mldata.py#L33): `dataname.lower().replace(' ', '-')`.
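
To illustrate the point about capitalization, here is a minimal sketch of that normalization step. The helper name `normalize_dataset_name` is made up for this example; only the `lower().replace(' ', '-')` expression comes from the linked sklearn source.

```python
def normalize_dataset_name(dataname):
    """Mimic the lowercasing step quoted above (hypothetical helper name)."""
    return dataname.lower().replace(' ', '-')


# Both spellings collapse to the same identifier.
print(normalize_dataset_name('MNIST Original'))  # mnist-original
print(normalize_dataset_name('MNIST original'))  # mnist-original
```

So the capitalization of the name passed in is unlikely to be what causes the IOError.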

Try it like this:

dataDict = fetch_mldata('MNIST original')

This worked for me. Since you are using the from ... import ... syntax, you should not prefix the call with datasets.

2014-03-14 21:44:49 Brent

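It looks like the cached data is corrupted. Try deleting it and downloading again (this takes a while). Unless you specified a different location, the data for 'MNIST original' should be at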

~/scikit_learn_data/mldata/mnist-original.mat
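
Deleting the cached file can also be scripted. The following is a minimal sketch, assuming the default cache layout shown above (`<data_home>/mldata/mnist-original.mat`); the function name `remove_cached_mat` and its parameters are hypothetical and not part of sklearn.

```python
import os


def remove_cached_mat(data_home='~/scikit_learn_data', name='mnist-original'):
    """Delete <data_home>/mldata/<name>.mat if present (hypothetical helper).

    Returns True if a file was removed, False if nothing was cached.
    """
    path = os.path.join(os.path.expanduser(data_home), 'mldata', name + '.mat')
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False
```

After calling `remove_cached_mat()`, running `fetch_mldata` again should trigger a fresh download rather than re-reading the damaged file.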

2014-11-04 20:41:01

Here is some sample code showing how to get the MNIST data ready for use with sklearn:

def get_data():
    """
    Get MNIST data ready to learn with.

    Returns
    -------
    dict
        With keys 'train' and 'test'. Both have the keys 'X' (features)
        and 'y' (labels).
    """
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')

    x = mnist.data
    y = mnist.target

    # Scale data to [-1, 1] - This is of major importance!!!
    x = x/255.0*2 - 1

    # Note: from scikit-learn 0.18 on, train_test_split lives in
    # sklearn.model_selection; import it from there on newer versions.
    from sklearn.cross_validation import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)
    data = {'train': {'X': x_train,
                      'y': y_train},
            'test': {'X': x_test,
                     'y': y_test}}
    return data
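
The scaling line in the function maps pixel values from [0, 255] to [-1, 1]. Here is a small worked sketch of the same arithmetic on plain Python numbers. The helper name `scale_pixel` is only for illustration; on a NumPy array the same expression is applied elementwise.

```python
def scale_pixel(value):
    """Map a pixel intensity in [0, 255] to [-1, 1] (illustrative helper)."""
    return value / 255.0 * 2 - 1


print(scale_pixel(0))      # -1.0
print(scale_pixel(255))    # 1.0
print(scale_pixel(127.5))  # 0.0
```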

2016-01-13 22:48:55

I also got the fetch_mldata() "IOError: could not read bytes" error. Here is the solution; the relevant lines of code are

from sklearn.datasets.mldata import fetch_mldata 
mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/')

... be sure to change 'data_home' to your preferred location (directory).

Here is a script:

#!/usr/bin/python 
# coding: utf-8 

# Source: 
# https://stackoverflow.com/questions/19530383/how-to-use-datasets-fetch-mldata-in-sklearn 
# ... modified, below, by Victoria 

""" 
pers. comm. (Jan 27, 2016) from MLdata.org MNIST dataset contactee "Cheng Ong:" 

    The MNIST data is called 'mnist-original'. The string you pass to sklearn 
    has to match the name of the URL: 

    from sklearn.datasets.mldata import fetch_mldata 
    data = fetch_mldata('mnist-original') 
""" 

def get_data(): 

    """ 
    Get MNIST data; returns a dict with keys 'train' and 'test'. 
    Both have the keys 'X' (features) and 'y' (labels) 
    """ 

    from sklearn.datasets.mldata import fetch_mldata 

    mnist = fetch_mldata('mnist-original', data_home='/media/Vancouver/apps/mnist_dataset/') 

    x = mnist.data 
    y = mnist.target 

    # Scale data to [-1, 1] 
    x = x/255.0*2 - 1 

    from sklearn.cross_validation import train_test_split 

    x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                        test_size=0.33,
                                                        random_state=42)

    data = {'train': {'X': x_train, 'y': y_train},
            'test': {'X': x_test, 'y': y_test}}

    return data 

data = get_data() 
print '\n', data, '\n'

2016-01-27 21:24:34

If you do not pass data_home, the program looks for ${yourprojectpath}/mldata/mnist-original.mat. You can download the file yourself and put it at the correct path.
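
Placing a manually downloaded file could look roughly like the sketch below. It assumes the `<data_home>/mldata/<name>.mat` layout mentioned in this thread; the function name `install_downloaded_mat` and its parameters are hypothetical, not an sklearn API.

```python
import os
import shutil


def install_downloaded_mat(src, data_home, name='mnist-original'):
    """Copy a manually downloaded .mat file into <data_home>/mldata/.

    Hypothetical helper; returns the path the file was copied to.
    """
    target_dir = os.path.join(os.path.expanduser(data_home), 'mldata')
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    target = os.path.join(target_dir, name + '.mat')
    shutil.copyfile(src, target)
    return target
```

Passing the same `data_home` to `fetch_mldata` afterwards should make it pick up the copied file instead of downloading it.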

2017-04-06 08:15:19 mcolak

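I ran into the same problem and found that mnist-original.mat had a different file size at different times while I was on my poor Wi-Fi. I switched to a wired LAN and it worked fine. It may be a network problem.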
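
One way to check whether two download attempts produced the same file is to compare their size and checksum. This is a minimal sketch using only the standard library; the function name `file_fingerprint` is hypothetical, and no reference checksum for mnist-original.mat is assumed here.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

If two downloads of the same file give different fingerprints, at least one of them was probably truncated or corrupted in transit.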

2017-07-09 22:58:11