Machine Learning

Azure NVIDIA VM for PyTorch and TensorFlow in an MSDN Subscription


Reading Time: 9 minutes

Overview:

Unless you’ve been living under a rock, you know that Machine Learning (ML) and Artificial Intelligence (AI) are all the rage right now, and will be for the foreseeable future. In most cases I will first recommend that customers use offerings such as Azure Cognitive Services, Azure OpenAI, Azure OpenAI (Use Your Data), or Azure Machine Learning. In some cases though, customers want to roll their own ML/AI, or simply want to work with some Open-Source projects which require deployment on a VM with the CUDA toolkit. If you need to do this, I recommend looking at the GPU optimized VMs on Azure (GPU Optimized Virtual Machine Sizes) and choosing a modern VM SKU with a modern GPU.

That said, many people have MSDN entitlements that often go unused; these are designated for personal sandboxes and learning, and come with $50, $100, or $150 per month in Azure credits. The purpose of this blog is to show you how you can deploy a GPU VM in an MSDN subscription for less than $0.50 per hour (an entire day of lab work for less than $5 of your credits). Let’s get started!

VM Deployment:

To kick things off, we will need to deploy a VM. There are many “gotchas” to deploying a GPU VM in an MSDN subscription (mainly because the SKUs are typically restricted and reserved for paying customer subscriptions).

We are going to use the following configuration:

  • Ubuntu 20.04 VM
  • East US Region
  • No infrastructure Redundancy
  • Standard Security Type
  • Configure the VM Generation for Generation 1 (you will need to change this from the Generation 2 Default)
  • Standard_NC6_Promo SKU

You can go look at the other sizes offered, but as noted, if you’re in an MSDN subscription all other sizes will likely say they are not available. This NC6_Promo size is set to be retired at the end of August 2023, so I will update this blog post after that time. You will note that the VM size doesn’t support Premium SSDs so on the “Disks” tab of deploying the VM it will select Standard SSD by default. You can change this to Standard HDD but I would not recommend doing so for this type of work.
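If you prefer the Azure CLI to the portal wizard, a rough equivalent of the configuration above looks like the following (the resource group, VM name, and admin username are placeholders; the image URN shown is the Generation 1 Ubuntu 20.04 image):

az vm create \
  --resource-group <rg-name> \
  --name <vm-name> \
  --location eastus \
  --image Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest \
  --size Standard_NC6_Promo \
  --admin-username <admin-username> \
  --generate-ssh-keys \
  --storage-sku StandardSSD_LRS \
  --public-ip-sku Standard \
  --nsg-rule NONE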

Since this is just a lab environment, I am going to use Azure Network Security Groups (NSGs) to control my SSH access to the VM rather than something like Azure Bastion. I will configure the NSG after the VM deployment is complete, so for now I’m going to create a Public IP but not allow any inbound ports.

Next, just in case I get sidetracked with other work I want to make sure this VM shuts down automatically and doesn’t keep burning credits, so I’m going to set the automated shutdown for 7PM.
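The same schedule can also be set from the Azure CLI, something like this (note the --time value is interpreted as UTC in HHMM format, so convert 7PM local time accordingly; names are placeholders):

az vm auto-shutdown --resource-group <rg-name> --name <vm-name> --time 1900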

 

After all of the validation has passed you can create the VM. You will note here that the compute cost is less than $0.40 per hour.

 

After the VM is done deploying, you can go in and under “Networking” add an Inbound Port rule for the SSH service using “My IP Address” as the source, which will use your current Public IP Address.
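If you’d rather add that rule from the CLI, a rough equivalent is below (the resource group and NSG names are placeholders; substitute your own public IP as the source):

az network nsg rule create \
  --resource-group <rg-name> \
  --nsg-name <vm-name>-nsg \
  --name AllowSSHFromMyIP \
  --priority 300 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes <your-public-ip> \
  --destination-port-ranges 22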

 

After the addition of the NSG rule finishes, you can SSH into the VM.
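For example (using whatever admin username you chose at deployment):

ssh <admin-username>@<vm-public-ip>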

Note: By default the Ubuntu 20.04 image is provisioned with 30GB of OS drive space. The drivers and packages we’re downloading are substantial and you may run out of space. If you try to install both PyTorch and TensorFlow you will fill up the VM’s usable space, so at this point you may want to resize the disk. To do that, stop the VM, change the disk size, and start it again. The Ubuntu image in Azure uses a cloud-init package that will automatically resize the / partition, so you don’t need to do anything in the OS.
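If you do resize, a rough Azure CLI equivalent of the stop / resize / start sequence looks like this (the resource group, VM, and OS disk names are placeholders, and 64 GB is just an example size):

az vm deallocate --resource-group <rg-name> --name <vm-name>
az disk update --resource-group <rg-name> --name <os-disk-name> --size-gb 64
az vm start --resource-group <rg-name> --name <vm-name>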

Prerequisite Installation:

Now that the VM is deployed and you’re SSH’d into it, you can look at the hardware of the VM to verify the GPU, and you will see there is a Tesla K80 in the NC6 VM.
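One quick way to confirm the GPU is present from inside the VM:

lspci | grep -i nvidia    # should list the Tesla K80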

While I don’t show it below, I always recommend running "sudo apt update; sudo apt upgrade" on each Linux VM to make sure everything is up-to-date before you begin. After that, go ahead and install gcc and make as shown below, which are required by the NVIDIA driver installer.
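For reference, that looks like:

sudo apt update && sudo apt upgrade -y
sudo apt install -y gcc make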

Note: both of the following are very large run files and can take a while to install, anywhere between 5-10 minutes each. 

Next, as noted in the documentation (https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup), the latest supported NVIDIA driver for the K80 card is 470.82.01. You can get the download link on this page (https://www.nvidia.com/Download/driverResults.aspx/182617/en-us/) and wget the file to the VM.

Once you’ve downloaded the .run file, you will run sudo bash NVIDIA*.run to execute the file with bash (you can execute it with sh if you want as well). The installer will go through a few screens in the terminal, and since it’s an older driver version there are a couple of warnings, but nothing that stops us from doing what we need. Feel free to take note of them for your own documentation though.
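For reference, the download and install look roughly like this (copy the actual .run link from the NVIDIA download page; the filename should correspond to the 470.82.01 release):

wget <direct .run link copied from the NVIDIA driver page>
sudo bash NVIDIA-Linux-x86_64-470.82.01.run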

After the installation is complete, you can run nvidia-smi, which comes with the driver installation and shows you the NVIDIA GPU information you would expect. If it looks like it does below, the driver install completed successfully.

Now that the NVIDIA drivers are installed, we need the CUDA toolkit so that later on we can leverage CUDA for any AI/ML workloads. To do that, you can go to this page (https://developer.nvidia.com/cuda-11-8-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) to get the run file for the VM architecture and OS that we’re using. In the screenshot below I will run the commands the webpage prompts you to run.
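At the time of writing, the runfile (local) commands shown on that page look like the following; copy the current commands from the page itself in case they have changed:

wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run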

As the installer runs it will prompt you with another screen; you can leave the default selections or install all of the categories if you’d like.

Now both the NVIDIA drivers and the CUDA Toolkit are installed. At this point the prerequisites are complete. After driver installations I always like to reboot, so I do at this point, but it shouldn’t be required.

Environment Installation:

Two of the most common frameworks in the Open-Source community for this type of work are PyTorch and TensorFlow. While there are certainly others, these are the ones I wanted to test, since most of the projects on GitHub seem to use one or the other. Let’s get them both installed and validate that they can leverage the GPU. The notes below are a combination of information gathered from many other blogs, YouTube videos, and my own research, but this is what works for the particular setup in this environment.

Anaconda Installation:

While there are other ways, I am going to use Anaconda as the installation vehicle for both PyTorch and TensorFlow. To install Anaconda, go to their website (https://www.anaconda.com/download#downloads) to get the link for the installer, and install it with bash similar to how we ran the other run files.
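For reference, downloading and running the installer looks roughly like this (grab the current installer link from the downloads page; the filename includes the release version):

wget <Anaconda3-<version>-Linux-x86_64.sh link from the downloads page>
bash Anaconda3-*-Linux-x86_64.sh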

The installer documentation on their website says to accept all of the defaults, but I recommend letting the installer do the initialization at the end of the install so you don’t have to do it yourself.

Once the install is complete, you will need to reboot, and then you can elevate your privileges and see the conda environment prefix noted at the start of your shell prompt.

 

PyTorch Installation:

Note: If you don’t need PyTorch you can skip this part.

With Anaconda installed, we’ll use it to install PyTorch. From this point on, these installers print out much more than I can feasibly capture in screenshots, so note that you will see additional content in your terminal.
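The exact commands are in the screenshots, but for reference, the standard conda command published on pytorch.org for a CUDA 11.x build looks roughly like this (the channels and the pytorch-cuda pin are my assumptions; run it inside the conda environment you intend to use, which we create in the next couple of steps):

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia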

Next, we want to verify the version of Python that’s running in our environment.
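For example:

python --version    # should report Python 3.10.x in this environment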

Noting Python 3.10, we can move on to creating a new Anaconda environment that uses the packages we just installed.

After we create that environment we can activate it, and see that it switches from base to pytorch.
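For reference, those two steps look something like this (I’m guessing at the exact create flags; the environment name pytorch and Python 3.10 come from the surrounding text):

conda create -n pytorch python=3.10
conda activate pytorch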

At this point we have PyTorch at our disposal, so let’s test it out by importing it, running a quick torch.rand function to test the package, and then running the torch.cuda.is_available function to validate that the CUDA toolkit and associated GPU are available for use.
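That test can be run interactively in the python REPL, or as a quick one-liner like this:

python -c "import torch; print(torch.rand(5, 3)); print(torch.cuda.is_available())"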

Note: If that function returns that it’s not available, run nvidia-smi again to validate that the driver didn’t fail. If it did, you can reboot, and in some cases I found I needed to re-run the NVIDIA driver install file for some reason. Linux and GPU drivers are a tricky thing sometimes.

Wonderful! If the function reports that CUDA is available, then everything we’ve done to this point was successful. Now, if the work you’re going to be doing or the project you’re running only requires PyTorch, you can skip the next part and go straight to the conclusion at the end of the blog.

TensorFlow Installation:

If you need TensorFlow, let’s get that installed here. First we will need python3-pip; after that’s installed you will need to pip install tensorflow. If all goes well, you will see that it has been successfully installed.
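For example:

sudo apt install -y python3-pip
pip3 install tensorflow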

If you haven’t rebooted after the Anaconda installation earlier, go ahead and do that now. After rebooting, you can elevate your privileges and see the conda environment prefix at the start of your shell prompt. We’ll run a quick test to verify the Python version.

Now that we know we’re running Python 3.10, we can use Anaconda to create an environment for TensorFlow.

Similar to how we did it with the PyTorch environment, you will now activate the environment by running conda activate tf and you will see the active environment notation switch in your terminal. After you’re in the TensorFlow environment, you can install the nvidia-cudnn-11 package which we will need to interact with the CUDA toolkit.
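Putting the last two paragraphs together, a minimal sketch of those steps looks like this (the cuDNN package on PyPI is published as nvidia-cudnn-cu11, which I’m assuming is the package being referenced; you may want to pin the version listed in the TensorFlow install docs):

conda create -n tf python=3.10
conda activate tf
pip install nvidia-cudnn-cu11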

Now this gets a bit messy, because we’ll need to run all of the following commands to set up the environment variables and other path information TensorFlow needs.

CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib

mkdir -p $CONDA_PREFIX/etc/conda/activate.d

echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

Now that those are all set, we’ll pip install TensorFlow into the environment.
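For example, from inside the tf environment:

python3 -m pip install tensorflow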

Once that’s done there is only one more step, which is to run the command below to validate TensorFlow’s ability to communicate with the GPU and the CUDA Toolkit.

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

I like to first run nvidia-smi to verify I can still communicate with the GPU through the driver like normal. If it fails, you can reboot, and in some cases I found I needed to re-run the NVIDIA driver install file for some reason. Linux and GPU drivers are a tricky thing sometimes.

You can see above that I rebooted, activated my conda environment, and ran that command. The last line of output shows that TensorFlow has access to the physical GPU device, which means that everything we’ve done up to this point has worked!

 

Conclusion:

The first thing I will say is that, as I noted, Linux and GPU drivers don’t have a stellar reputation for working the first time, or every time. It may take a bit of patience, but I’ve run through this somewhere between 5 and 10 times now and the instructions I’ve captured here seem to work pretty well. If you find something else notable, please leave a comment here or on any of the social posts.

In the end, if you are fast at copy & paste, you can set up this environment end-to-end in about 30 minutes. Your first time through will likely take 1-3 hours, depending on your previous experience. But the ability to have a GPU VM for testing, learning, or playing with GitHub AI/ML projects at such a low cost is great to have in your back pocket.

If you have any questions, comments, or suggestions for future blog posts please feel free to comment below, or reach out on LinkedIn or Twitter. I hope I’ve made your day a little bit easier!