How to Download the MNIST Dataset Using Python
If you are interested in learning machine learning, especially deep learning, you might have heard of the MNIST dataset. It is one of the most popular and widely used datasets for training and testing various image processing and recognition systems. In this article, you will learn what the MNIST dataset is, why it is useful for machine learning, and how to download it using three different Python libraries: NumPy, Scikit-learn, and TensorFlow Datasets.
What is the MNIST dataset?
MNIST stands for Modified National Institute of Standards and Technology database. It is a large database of handwritten digits that contains 60,000 training images and 10,000 testing images. Each image is a 28 x 28 pixel grayscale image of a single digit from 0 to 9. The dataset was created by re-mixing samples from NIST's original datasets, which were collected from American Census Bureau employees and high school students. MNIST is widely used as a benchmark for evaluating machine learning models on image classification tasks.
Why is it useful for machine learning?
The MNIST dataset is useful for machine learning because it provides a simple and easy way to get started with image processing and recognition. The dataset is small enough to fit in memory, but large enough to represent a variety of handwriting styles and patterns. The dataset is also well-balanced, meaning that each digit has approximately the same number of images in both the training and testing sets. The dataset is ideal for beginners who want to learn how to build, train, and test machine learning models using Python libraries.
How to access it using Python libraries?
There are many Python libraries that can help you access and manipulate the MNIST dataset. In this article, we will focus on three of them: NumPy, Scikit-learn, and TensorFlow Datasets. These libraries are widely used for scientific computing, machine learning, and deep learning respectively. They provide different ways to load, explore, and visualize the MNIST dataset. Let's see how they work in detail.
Downloading the MNIST Dataset with NumPy
Installing NumPy
NumPy is an open-source numerical library that can be used to perform various mathematical operations on arrays and matrices. It is one of the most used scientific computing libraries, and it is often used by data scientists for data analysis. To install NumPy, you can use the pip command in your terminal or command prompt:
pip install numpy
If you already have NumPy installed, you can skip this step.
Loading the MNIST dataset with NumPy
NumPy does not download MNIST for you. To load the dataset with NumPy, you first need to download the following four files from the MNIST database:
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
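After downloading these files, place them in a folder called "mnist" in your current working directory. You do not need to unzip them: the code below opens the .gz files directly with Python's gzip module. Each file starts with a short header (16 bytes for the image files, 8 bytes for the label files), which the code skips with the offset argument. You can use the following Python code to load the MNIST dataset into NumPy arrays: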
import gzip

import numpy as np

# Define a function to read the data from the files
def read_data(images_file, labels_file):
    # Open the files and read the bytes
    with gzip.open(images_file, 'rb') as f:
        images_data = f.read()
    with gzip.open(labels_file, 'rb') as f:
        labels_data = f.read()
    # Convert the bytes to numpy arrays, skipping the file headers
    images = np.frombuffer(images_data, dtype=np.uint8, offset=16)
    labels = np.frombuffer(labels_data, dtype=np.uint8, offset=8)
    # Reshape the arrays to the desired shape
    images = images.reshape(-1, 28, 28)
    labels = labels.reshape(-1)
    # Return the images and labels arrays
    return images, labels

# Define the file names
train_images_file = 'mnist/train-images-idx3-ubyte.gz'
train_labels_file = 'mnist/train-labels-idx1-ubyte.gz'
test_images_file = 'mnist/t10k-images-idx3-ubyte.gz'
test_labels_file = 'mnist/t10k-labels-idx1-ubyte.gz'

# Load the data using the function
train_images, train_labels = read_data(train_images_file, train_labels_file)
test_images, test_labels = read_data(test_images_file, test_labels_file)
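The fixed offsets of 16 and 8 bytes come from the IDX file format: a 4-byte magic number (two zero bytes, a data-type code, and the number of dimensions) followed by one big-endian 32-bit size per dimension. As a minimal sketch, assuming the files follow that layout with unsigned-byte data, the header can be parsed instead of hard-coded. The helper names read_idx and parse_idx are hypothetical and not part of NumPy.

```python
import gzip
import struct

import numpy as np


def parse_idx(raw):
    """Parse raw IDX bytes into a NumPy array (unsigned-byte data only)."""
    # Magic number: two zero bytes, a data-type code, and the dimension count
    zeros, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zeros != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian 32-bit integer per dimension follows the magic number
    header_size = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, raw[4:header_size])
    return np.frombuffer(raw, dtype=np.uint8, offset=header_size).reshape(shape)


def read_idx(path):
    """Read an IDX file from disk, decompressing it if it ends in .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        return parse_idx(f.read())
```

With the files in place, read_idx('mnist/train-images-idx3-ubyte.gz') should return the same array as the read_data function, with the shape taken from the file itself rather than assumed.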
Exploring the data structure and shape
Now that we have loaded the MNIST dataset from NumPy, we can explore its structure and shape. We can use the shape attribute of numpy arrays to check the dimensions of the data. We can also use the type function to check the data type of the arrays. Here is an example:
# Print the shape and type of the train images array
print(train_images.shape)
print(type(train_images))

# Print the shape and type of the train labels array
print(train_labels.shape)
print(type(train_labels))

# Print the shape and type of the test images array
print(test_images.shape)
print(type(test_images))

# Print the shape and type of the test labels array
print(test_labels.shape)
print(type(test_labels))
The output should look like this:
(60000, 28, 28)
<class 'numpy.ndarray'>
(60000,)
<class 'numpy.ndarray'>
(10000, 28, 28)
<class 'numpy.ndarray'>
(10000,)
<class 'numpy.ndarray'>
As we can see, the train images array has a shape of (60000, 28, 28), meaning that it contains 60,000 images of size 28 x 28 pixels. The train labels array has a shape of (60000,), meaning that it contains one label for each of the 60,000 images. The same applies to the test images and labels arrays, except that they contain 10,000 samples each. All four objects are numpy.ndarray instances, which are multidimensional arrays that store numerical data efficiently. Their element type (dtype) is uint8, so pixel values range from 0 to 255.
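Beyond shapes and types, it can be useful to check how many images there are per digit. The following is a minimal sketch using np.bincount. The helper name label_counts is hypothetical, and the short example array stands in for train_labels so that the snippet runs on its own.

```python
import numpy as np


def label_counts(labels):
    """Return a length-10 array with the number of samples for each digit."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.bincount(labels, minlength=10)


# A tiny stand-in for train_labels (assumed data, for illustration only)
example_labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=np.uint8)
counts = label_counts(example_labels)
print(counts)  # [1 3 1 1 2 1 0 0 0 1]
```

Calling label_counts(train_labels) on the real data gives the per-digit counts for the training set.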
Visualizing some sample images
To visualize some sample images from the MNIST dataset, we can use the pyplot module of the matplotlib library. This module provides various functions for plotting and displaying graphs and images. To install matplotlib, you can use the pip command in your terminal or command prompt:
pip install matplotlib
If you already have matplotlib installed, you can skip this step.
To plot some sample images from the MNIST dataset, we can use the imshow function from matplotlib.pyplot. This function takes an array as an input and displays it as an image. We can also use the subplots function to create multiple plots in one figure. Here is an example:
import matplotlib.pyplot as plt

# Define a function to plot some sample images
def plot_images(images, labels):
    # Create a figure with 2 rows and 5 columns
    fig, axes = plt.subplots(2, 5)
    # Loop through each subplot and plot an image with its label
    for i in range(2):
        for j in range(5):
            # Get a random index from the images array
            index = np.random.randint(len(images))
            # Plot the image at (i,j) position
            axes[i, j].imshow(images[index], cmap='gray')
            # Set the title as the label of the image
            axes[i, j].set_title(labels[index])
            # Remove the ticks and labels from the axes
            axes[i, j].set_xticks([])
            axes[i, j].set_yticks([])
    # Show the figure
    plt.show()

# Plot some sample images from the train set
plot_images(train_images, train_labels)

# Plot some sample images from the test set
plot_images(test_images, test_labels)
The output is a 2 x 5 grid of randomly chosen images for each call, so it will differ from run to run. As we can see, the images are grayscale and show different handwritten digits from 0 to 9. The labels are also displayed as the titles of each subplot. We can observe that some digits are easy to recognize, while others are more difficult or ambiguous.
Downloading the MNIST Dataset with Scikit-learn
Installing Scikit-learn
Scikit-learn is an open-source machine learning library that provides various tools and algorithms for data analysis, preprocessing, feature extraction, classification, regression, clustering, and more. It is one of the most used machine learning libraries, and it is often used by data scientists for building and evaluating machine learning models. To install Scikit-learn, you can use the pip command in your terminal or command prompt:
pip install scikit-learn
If you already have Scikit-learn installed, you can skip this step.
Loading the MNIST dataset from Scikit-learn
To load the MNIST dataset from Scikit-learn, you can use the fetch_openml function from sklearn.datasets module. This function can fetch any dataset from the OpenML repository, which is an online platform for sharing and collaborating on datasets and machine learning tasks. The MNIST dataset is available on OpenML with the name "mnist_784". You can use the following Python code to load the MNIST dataset from Scikit-learn:
from sklearn.datasets import fetch_openml

# Load the MNIST dataset from OpenML
mnist = fetch_openml('mnist_784')

# Print the type and keys of the mnist object
print(type(mnist))
print(mnist.keys())
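The output should look similar to this (the exact class path and the set of keys vary between Scikit-learn versions):
<class 'sklearn.utils._bunch.Bunch'>
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'details', 'url'])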
As we can see, the mnist object is a Bunch object, which is a dictionary-like object that contains various attributes related to the dataset. The most important ones are:
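data: a table of shape (70000, 784) that contains the pixel values of each image as a flattened vector. In recent Scikit-learn versions this is a pandas dataframe by default; pass as_frame=False to fetch_openml to get a numpy array instead.
target: a column of shape (70000,) that contains the label of each image as a string. It is a pandas series by default in recent versions, or a numpy array with as_frame=False.
frame: a pandas dataframe of shape (70000, 785) that contains both the data and target as columns.
categories: a dictionary that maps each categorical feature name to a list of possible values. It is None when the data is returned as a dataframe.
feature_names: a list of strings that contains the names of each feature (pixel).
target_names: a list of strings that contains the names of the target columns (for this dataset, a single column named "class").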
details: a dictionary that contains metadata about the dataset, such as name, description, citation, etc.
url: a string that contains the URL of the dataset on OpenML.
Exploring the data structure and shape
Now that we have loaded the MNIST dataset from Scikit-learn, we can explore its structure and shape. We can use the shape attribute of numpy arrays and pandas dataframes to check the dimensions of the data. We can also use the type function to check the data type of the arrays and dataframes. Here is an example:
# Print the shape and type of the data array
print(mnist.data.shape)
print(type(mnist.data))

# Print the shape and type of the target array
print(mnist.target.shape)
print(type(mnist.target))

# Print the shape and type of the frame dataframe
print(mnist.frame.shape)
print(type(mnist.frame))
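With a recent Scikit-learn version and the default settings, the output should look like this:
(70000, 784)
<class 'pandas.core.frame.DataFrame'>
(70000,)
<class 'pandas.core.series.Series'>
(70000, 785)
<class 'pandas.core.frame.DataFrame'>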
As we can see, the data has a shape of (70000, 784), meaning that it contains 70,000 images (the 60,000 training and 10,000 testing images combined), each flattened from 28 x 28 pixels into a vector of 784 elements. The target has a shape of (70000,), meaning that it contains one label for each of the 70,000 images. The frame dataframe has a shape of (70000, 785), meaning that it contains both the data and the target as columns: 784 pixel columns plus one column for the label. Depending on the Scikit-learn version and the as_frame argument, data and target are either pandas objects (DataFrame and Series) or numpy.ndarray objects.
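Because the labels arrive as strings and the 70,000 rows are not pre-split, a common next step is to convert the labels to integers and separate the rows into training and testing parts. The sketch below assumes the widely used convention that the first 60,000 rows of "mnist_784" are the original training set and the last 10,000 are the test set. The helper name to_train_test is hypothetical.

```python
import numpy as np


def to_train_test(data, target, n_train=60000):
    """Convert to NumPy arrays and split rows into train and test parts.

    Assumes the first n_train rows are the training set (an assumption
    about row order, not something this function can verify).
    """
    X = np.asarray(data, dtype=np.uint8)      # pixel values fit in 0..255
    y = np.asarray(target).astype(np.int64)   # string labels such as '5' become 5
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```

With the object loaded earlier, (X_train, y_train), (X_test, y_test) = to_train_test(mnist.data, mnist.target) would give arrays of shapes (60000, 784), (60000,), (10000, 784), and (10000,).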
Visualizing some sample images
To visualize some sample images from the MNIST dataset, we can use matplotlib.pyplot again. However, this time we need to reshape each flattened row back to its original image shape of 28 x 28 pixels before plotting it, using the reshape method of numpy arrays. Because recent Scikit-learn versions return pandas objects, the function first converts its inputs to numpy arrays. Here is an example:
import matplotlib.pyplot as plt
import numpy as np

# Define a function to plot some sample images
def plot_images(data, target):
    # Convert pandas objects (if any) to numpy arrays so rows can be indexed by position
    data = np.asarray(data)
    target = np.asarray(target)
    # Create a figure with 2 rows and 5 columns
    fig, axes = plt.subplots(2, 5)
    # Loop through each subplot and plot an image with its label
    for i in range(2):
        for j in range(5):
            # Get a random index from the data array
            index = np.random.randint(len(data))
            # Reshape the flattened row to its original image shape
            image = data[index].reshape(28, 28)
            # Plot the image at (i,j) position
            axes[i, j].imshow(image, cmap='gray')
            # Set the title as the label of the image
            axes[i, j].set_title(target[index])
            # Remove the ticks and labels from the axes
            axes[i, j].set_xticks([])
            axes[i, j].set_yticks([])
    # Show the figure
    plt.show()

# Plot some sample images from the data and target arrays
plot_images(mnist.data, mnist.target)
The output is again a 2 x 5 grid of randomly chosen images. As we can see, the images are grayscale and show different handwritten digits from 0 to 9. The labels are also displayed as the titles of each subplot. We can observe that some digits are easy to recognize, while others are more difficult or ambiguous.
Downloading the MNIST Dataset with TensorFlow Datasets
Installing TensorFlow Datasets
TensorFlow Datasets is an open-source library that provides ready-to-use datasets for TensorFlow and other machine learning frameworks. It is commonly used to build input pipelines for training neural networks. The examples below return tf.data.Dataset objects, so TensorFlow itself must be installed as well. To install both, you can use the pip command in your terminal or command prompt:
pip install tensorflow tensorflow-datasets
If you already have TensorFlow and TensorFlow Datasets installed, you can skip this step.
Loading the MNIST dataset from TensorFlow Datasets
To load the MNIST dataset from TensorFlow Datasets, you can use the load function from tfds module. This function can load any dataset from the TensorFlow Datasets catalog, which is a collection of ready-to-use datasets for machine learning. The MNIST dataset is available on TensorFlow Datasets with the name "mnist". You can use the following Python code to load the MNIST dataset from TensorFlow Datasets:
import tensorflow_datasets as tfds

# Load the MNIST dataset from TensorFlow Datasets
mnist = tfds.load('mnist')

# Print the type and keys of the mnist object
print(type(mnist))
print(mnist.keys())
The output should look similar to this (the order of the keys may differ between versions):
<class 'dict'>
dict_keys(['test', 'train'])
As we can see, the mnist object is a dictionary that contains two keys: test and train. These keys correspond to two tf.data.Dataset objects that contain the test and train sets of the MNIST dataset respectively. A tf.data.Dataset object is a high-level abstraction that represents a sequence of elements, where each element consists of one or more components (features). A tf.data.Dataset object can be used to create efficient input pipelines for machine learning models.
Exploring the data structure and shape
Now that we have loaded the MNIST dataset from TensorFlow Datasets, we can explore its structure and shape. We can use the take function from tf.data.Dataset to get a sample element from each set. We can also use the type function to check the data type of each element. Here is an example:
# Get a sample element from the test set
test_element = next(iter(mnist['test'].take(1)))

# Print the type and keys of the test element
print(type(test_element))
print(test_element.keys())

# Get a sample element from the train set
train_element = next(iter(mnist['train'].take(1)))

# Print the type and keys of the train element
print(type(train_element))
print(train_element.keys())
The output should look like this:
<class 'dict'>
dict_keys(['image', 'label'])
<class 'dict'>
dict_keys(['image', 'label'])
As we can see, each element is a dictionary that contains two keys: image and label. These keys correspond to two tensorflow.Tensor objects that contain the pixel values and the label of each image respectively. A tensorflow.Tensor object is a multi-dimensional array that can store numerical data of various types and shapes. A tensorflow.Tensor object can be used to perform various operations and computations for machine learning models.
# Print the shape and type of the image tensor
print(test_element['image'].shape)
print(type(test_element['image']))

# Print the shape and type of the label tensor
print(test_element['label'].shape)
print(type(test_element['label']))
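The output should look like this:
(28, 28, 1)
<class 'tensorflow.python.framework.ops.EagerTensor'>
()
<class 'tensorflow.python.framework.ops.EagerTensor'>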
As we can see, the image tensor has a shape of (28, 28, 1), meaning that it contains one image of size 28 x 28 pixels with one channel (grayscale). The label tensor has a shape of (), meaning that it contains one scalar value (the digit). The data type of both tensors is tensorflow.python.framework.ops.EagerTensor, which is a tensor object that can be evaluated immediately without building a computational graph.
Visualizing some sample images
To visualize some sample images from the MNIST dataset, we can use matplotlib.pyplot again. However, this time we need to convert each image tensor to a numpy array before plotting it, using the tensor's numpy() method. Note that the function takes ten elements in a single pass; calling take(1) inside the loop would typically return the same first element every time. Here is an example:
import matplotlib.pyplot as plt

# Define a function to plot some sample images
def plot_images(data):
    # Create a figure with 2 rows and 5 columns
    fig, axes = plt.subplots(2, 5)
    # Pair each subplot with one of the first 10 elements of the dataset
    for ax, element in zip(axes.flat, data.take(10)):
        # Convert the image tensor to a numpy array
        image = element['image'].numpy()
        # Plot the image, dropping the single channel dimension
        ax.imshow(image.squeeze(), cmap='gray')
        # Set the title as the label of the image
        ax.set_title(int(element['label'].numpy()))
        # Remove the ticks and labels from the axes
        ax.set_xticks([])
        ax.set_yticks([])
    # Show the figure
    plt.show()

# Plot some sample images from the test set
plot_images(mnist['test'])

# Plot some sample images from the train set
plot_images(mnist['train'])
The output is a 2 x 5 grid of images for each set. As we can see, the images are grayscale and show different handwritten digits from 0 to 9. The labels are also displayed as the titles of each subplot. We can observe that some digits are easy to recognize, while others are more difficult or ambiguous.
Conclusion
In this article, you learned how to download the MNIST dataset using three different Python libraries: NumPy, Scikit-learn, and TensorFlow Datasets. You also learned how to explore the structure and shape of the data each library returns and how to visualize sample images. You can use any of these libraries to access and manipulate the MNIST dataset for your machine learning projects. However, each library has its own advantages and disadvantages, depending on your needs and preferences. Here are some recommendations for further learning:
If you want to learn more about NumPy, look for a tutorial that covers the basics of NumPy arrays and operations.
If you want to learn more about Scikit-learn, look for a tutorial that covers the basics of Scikit-learn machine learning models and pipelines.
If you want to learn more about TensorFlow Datasets, look for a course that covers the fundamentals of machine learning and deep learning with Python.
We hope you enjoyed this article and learned something new. If you have any questions or feedback, please feel free to leave a comment below. Happy learning!
FAQs
What is the difference between the MNIST dataset and the Fashion MNIST dataset?
The Fashion MNIST dataset is a variant of the MNIST dataset that contains images of clothing items instead of handwritten digits. It has the same structure and size as the MNIST dataset, but it is a more challenging image classification task. You can load it with the same Python libraries, but the name depends on the source: in TensorFlow Datasets it is "fashion_mnist", while on OpenML it is listed as "Fashion-MNIST".
How can I split the MNIST dataset into training, validation, and testing sets?
There are different ways to split the MNIST dataset into training, validation, and testing sets, depending on the Python library you use. For example, with NumPy, you can use the np.split function or array slicing to split the image and label arrays into different subsets. With Scikit-learn, you can use the train_test_split function from the sklearn.model_selection module to split the data and target arrays into random subsets. With TensorFlow Datasets, you can pass the split argument to tfds.load (for example, split='train[:90%]' and split='train[90%:]') to get different subsets of the data as tf.data.Dataset objects.
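As a minimal NumPy-only sketch of the idea, the function below holds out a random fraction of the training data as a validation set. The name train_val_split, the 10% default, and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np


def train_val_split(images, labels, val_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_val = int(len(images) * val_fraction)
    # The first n_val shuffled indices form the validation set
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])
```

Applied to the 60,000 training images with the default fraction, this would leave 54,000 images for training and 6,000 for validation, while the 10,000 test images stay untouched.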
How can I normalize or standardize the MNIST dataset?
Normalizing or standardizing the MNIST dataset means scaling the pixel values of each image to a certain range or distribution. This can help improve the performance and convergence of machine learning models. There are different ways to normalize or standardize the MNIST dataset, depending on the Python library you use. For example, with NumPy, you can use the np.divide function to divide each pixel value by 255, which will scale them to the range [0, 1]. With Scikit-learn, you can use the StandardScaler or MinMaxScaler classes from sklearn.preprocessing module to transform each pixel value to have zero mean and unit variance or to a given range respectively. With TensorFlow Datasets, you can use the map function with a custom lambda function to apply any scaling operation to each image tensor.
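The two NumPy variants can be sketched as follows. The function names are hypothetical, and the small epsilon added to the standard deviation is an assumption to avoid division by zero on constant inputs.

```python
import numpy as np


def scale_to_unit(images):
    """Scale uint8 pixel values from [0, 255] to floats in [0, 1]."""
    return images.astype(np.float32) / 255.0


def standardize(images, mean=None, std=None):
    """Shift and scale pixels to zero mean and unit variance.

    Pass the mean and std computed on the training set when transforming
    validation or test images, so all sets share the same scaling.
    """
    x = images.astype(np.float32)
    if mean is None:
        mean = x.mean()
    if std is None:
        std = x.std()
    return (x - mean) / (std + 1e-8), mean, std
```

A typical use would be to call standardize on the training images first and then reuse the returned mean and std for the test images.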
How can I augment or transform the MNIST dataset?
Augmenting or transforming the MNIST dataset means applying some random or deterministic changes to each image, such as rotation, translation, scaling, flipping, cropping, or added noise. This can help increase the diversity of the training data and the robustness of machine learning models. There are different ways to augment or transform the MNIST dataset, depending on the Python library you use. For example, with NumPy, you can use the np.rot90, np.flip, np.pad, and np.random.normal functions to rotate, flip, pad, or add noise to each image array respectively. Scikit-learn itself does not include image augmentation utilities, but related libraries do: for example, scipy.ndimage provides rotate and shift functions, and scikit-image provides skimage.transform.rotate. With TensorFlow Datasets, you can use the map function with a custom lambda function or a predefined function from the tf.image module to apply any augmentation or transformation operation to each image tensor. Keep in mind that some transformations, such as flips and large rotations, can change the meaning of a digit (a rotated 6 looks like a 9).
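Here is a minimal NumPy-only sketch of two label-preserving augmentations: a small random translation (implemented with np.pad and slicing) and additive Gaussian noise. The function names and parameter choices are assumptions for illustration.

```python
import numpy as np


def random_shift(image, max_shift, rng):
    """Translate a 2-D image by up to max_shift pixels, filling with zeros."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Pad with zeros on every side, then cut a window offset by (dy, dx)
    padded = np.pad(image, max_shift, mode="constant")
    top, left = max_shift - dy, max_shift - dx
    h, w = image.shape
    return padded[top:top + h, left:left + w]


def add_noise(image, sigma, rng):
    """Add Gaussian noise and clip back to the valid uint8 pixel range."""
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
image = np.zeros((28, 28), dtype=np.uint8)
image[14, 14] = 255  # a single bright pixel as a stand-in for a digit
augmented = add_noise(random_shift(image, 2, rng), sigma=10.0, rng=rng)
```

Both functions keep the 28 x 28 shape, so the augmented images can be fed to the same model as the originals.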
How can I save or export the MNIST dataset?
Saving or exporting the MNIST dataset means storing it in a file format that can be used by other applications or platforms. There are different ways to save or export the MNIST dataset, depending on the Python library you use. For example, with NumPy, you can use the np.save or np.savetxt functions to save each array as a binary or text file respectively. With Scikit-learn, you can use the dump and load functions from the joblib library to write each array to disk and read it back (older Scikit-learn versions exposed this as sklearn.externals.joblib, which has since been removed). With TensorFlow Datasets, you can use the tfds.as_numpy function to iterate over a tf.data.Dataset object as numpy arrays, and then use any of the above methods to save them. Alternatively, you can use the tf.data.experimental.save function (tf.data.Dataset.save in newer TensorFlow releases) to write a tf.data.Dataset object to disk in TensorFlow's own dataset format, which is not the same as the SavedModel format used for models.