Month: May 2023
Data Box Disk Overview
I have written in the past about the considerations for choosing between Data Box for offline data transfers into Azure and online methods; that post focused primarily on Data Box Heavy. Here I am going to walk through the process of obtaining a Data Box, specifically a Data Box Disk (see the Data Box Family of offerings here). The ordering process for all Data Box devices is largely the same, so this can be used as a reference for any of them. However, the primary focus of this post will be on the setup and usage of Data Box Disk.
If you’ve read my previous post, in which I argued the merit of using an online method of transfer in many cases, you may find it odd that I am now promoting the Data Box Disk, which is only suitable for transferring a few TB of data. I maintain my position that in most cases online transfer is optimal, especially for the type of data that would be in scope for a Data Box Disk. However, as I have noted, there are some cases where offline data transfer is needed.
Order and Setup
Ordering a Data Box is straightforward through the Azure Portal.
After you’ve selected the initial configuration items, you will choose the device type.
You will name the order and select the destination storage in Azure.
After confirming whether you’re using a Microsoft-Managed Key or Customer-Managed Key (in this case I’m using a Microsoft-Managed Key), you will enter shipping information and the order will be submitted. At each step of the process, you will receive an email with the status. For example, here is the notification that my order was created, and then another when it was delivered.
When you create the job in Azure, it creates a Data Box resource, which has all of the information about the device and order including a timeline showing where the device is in the process.
The Disk arrived with the SATA to USB cable, and I hooked it up to my Intel NUC (excuse the dust!).
Note in the image above that both the USB adapter and the ports on my device are marked with “SS”, meaning they’re USB 3.0. This is important because the Data Box Disk is an SSD, which is very performant, and a slower port would hold it back. You will also note from the email stating the device was delivered that I have a certain period of time to ship it back before I start incurring additional cost.
Most enterprise servers have USB ports only to support peripherals, so vendors often do not invest in USB 3.0 or 3.1, leaving you with the 2.0 standard. The maximum theoretical throughput of USB 2.0 is 480 Mbps, or 60 MBps. The maximum theoretical throughput of USB 3.0, however, is 5 Gbps, or 625 MBps. This matters: if the servers only have USB 2.0 ports, it may be faster to attach the disk to a laptop that has Gigabit network connectivity to wherever the source data is held.
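To make that comparison concrete, here is a minimal Python sketch of the arithmetic. The link rates are the theoretical maximums quoted above plus Gigabit Ethernet at 1,000 Mbps; the 4 TB payload is just an assumed example size, and real-world throughput will be lower than any of these figures.

```python
def mbps_to_mbytes_per_s(megabits_per_second: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_second / 8


def hours_to_copy(terabytes: float, mbytes_per_second: float) -> float:
    """Hours needed to move `terabytes` (decimal TB) at a sustained rate."""
    megabytes = terabytes * 1_000_000
    return megabytes / mbytes_per_second / 3600


# Theoretical link rates in Mbps; these are upper bounds, not measurements.
LINKS = {"USB 2.0": 480, "Gigabit Ethernet": 1000, "USB 3.0": 5000}

# 4 TB is an assumed example payload, not a figure from my tests.
EXAMPLE_TB = 4

for name, mbps in LINKS.items():
    rate = mbps_to_mbytes_per_s(mbps)
    hours = hours_to_copy(EXAMPLE_TB, rate)
    print(f"{name}: {rate:.0f} MBps, about {hours:.1f} hours for {EXAMPLE_TB} TB")
```

Even on paper, Gigabit Ethernet (125 MBps) is roughly twice as fast as USB 2.0 (60 MBps), which is why pulling data over the network to a USB 3.0 laptop can beat a direct USB 2.0 connection on the server.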
*Note:* I am doing this in Windows, but you can do all of the following in Linux as well.
If I look in Windows Explorer when I attach the drive I can see a volume, but it is encrypted and locked. That is intentional and a part of the security process with Azure Data Box.
The process for allowing access to each device in the Data Box family is different, but with Data Box Disk there is an unlock utility which, in combination with the passkey available under the Data Box resource in Azure, will unlock the device.
At the root of the filesystem, you will see a folder for each supported storage type, such as BlockBlob, PageBlob, AzureFile, and ManagedDisk; what you copy into each folder will get copied to the respective storage type at the destination.
If you have a lot of small files, one thing to note is the impact of antivirus. This is especially true if you’re pulling TBs worth of small files across the network to a laptop where the drive is attached: since it’s writing those files locally, your antivirus will likely do in-line scanning. Depending on the data and whether your policies allow it, adding an exception on your antivirus for the folder where you’re copying the data, e.g. “F:\BlockBlob”, may speed up your copy performance.
To test performance, I devised two tests, one with large files and one with small files. For the large files, I copied a bit over 50 GB of .iso files of various Linux distributions. The copy below is simply Ctrl+C, Ctrl+V of that folder from my machine’s SSD to the Data Box Disk using Windows Explorer. In addition to the copy operation, I took a screenshot of the disk throughput and activity in Task Manager (which shows how much of the disk’s capable performance is being used, by way of disk operation queuing metrics).
You can see that with a single copy job I’m getting over 300 MBps for those large files. I then also wanted to try small files, which is a much more likely use case for Data Box Disk. For this I used a PowerShell script, part of another project I’m working on that will be posted soon on my GitHub, to create 10,000 x 1 MB files. I again first copied them using Windows Explorer.
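If you want to build a similar small-file test set yourself, something along these lines would work. To be clear, this is not the PowerShell script I used; it is a minimal Python sketch, and the function name, file naming pattern, and choice of random content are all assumptions for illustration.

```python
import os
from pathlib import Path


def create_test_files(target_dir: str, count: int = 10_000,
                      size_bytes: int = 1024 * 1024) -> int:
    """Create `count` files of `size_bytes` random bytes each.

    Returns the total number of bytes written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    total = 0
    for i in range(count):
        # Random content so compression or dedup cannot flatter the results.
        data = os.urandom(size_bytes)
        (target / f"testfile_{i:05d}.bin").write_bytes(data)
        total += len(data)
    return total
```

Calling `create_test_files(r"C:\Temp\SmallFiles")` (a hypothetical path) would produce 10,000 files of 1 MiB each, close to the data set described above.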
I was able to get just over 50MBps in write speeds, which is good considering the file sizes, but given there were no constraints on my source disk, destination disk, or CPU, this led me to believe that the bottleneck was with the copy operation itself. Next, I wanted to run a test with a multi-threaded copy operation, so I first set a baseline with a single-threaded robocopy job.
You can see this took about 3 and a half minutes and copied at roughly the same speed as Windows Explorer. Now that I have my baseline, here’s the real performance test using the multi-threading flag on robocopy.
With that flag I was able to push more than 3x the throughput, increasing from ~50 MBps to ~190 MBps and reducing the copy time from 3 minutes and 33 seconds to just 58 seconds, at which point my hardware was fully utilized.
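The reason multi-threading helps so much with small files is that each file carries per-file overhead (open, create, close, metadata), and a single-threaded copy pays that cost one file at a time. The following Python sketch illustrates the same idea with a thread pool. It is not what robocopy does internally, just a simplified stand-in; the function name and the default of 8 threads are my own assumptions.

```python
import shutil
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree_threaded(src: str, dst: str, threads: int = 8) -> tuple[int, float]:
    """Copy every file under `src` to `dst` using a pool of worker threads.

    Returns (bytes_copied, elapsed_seconds).
    """
    src_root, dst_root = Path(src), Path(dst)
    files = [p for p in src_root.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> int:
        # Recreate the relative folder structure under the destination.
        target = dst_root / path.relative_to(src_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        return path.stat().st_size

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        copied = sum(pool.map(copy_one, files))
    elapsed = time.perf_counter() - start
    return copied, elapsed
```

Running it once with `threads=1` and once with a higher value gives a rough baseline-versus-parallel comparison, similar in spirit to the two robocopy runs above.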
I also went back and tried the same multi-threaded copy operation with my large files and was able to increase the throughput from 334MBps to 522MBps which fully utilized my hardware as well.
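As a quick sanity check on those numbers, here is the arithmetic in a few lines of Python. It assumes the small-file set is 10,000 MB in total (10,000 files at 1 MB each); the timings and throughput figures are the ones reported above.

```python
def ratio(a: float, b: float) -> float:
    """Return a divided by b, used here for speedup factors."""
    return a / b


# Small files: copy times from the two robocopy runs.
single_seconds = 3 * 60 + 33  # 3 minutes 33 seconds
multi_seconds = 58
total_mb = 10_000  # assumption: 10,000 files x 1 MB each

time_speedup = ratio(single_seconds, multi_seconds)
avg_single = total_mb / single_seconds
avg_multi = total_mb / multi_seconds

# Large files: throughput before and after multi-threading.
large_gain = ratio(522, 334)

print(f"Small files: {time_speedup:.2f}x faster "
      f"({avg_single:.1f} -> {avg_multi:.1f} MBps average)")
print(f"Large files: {large_gain:.2f}x throughput")
```

The averages computed from total time come out a little below the ~50 MBps and ~190 MBps figures read off Task Manager, which is what I would expect since a copy does not run at its peak rate for the whole duration.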
I finished loading my data onto the disk and ran the data validation utility, which comes in the same download as the tool that unlocks and decrypts the drive. It generates checksums of my data on the device, which I can use later to validate data integrity once the data is copied into the Storage Account. After that I unmounted the device, packaged it back up, and dropped it off at my local UPS store. The box already had a return label on it.
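If you want your own independent record of what was on the disk before shipping, you can build a simple checksum manifest as well. The sketch below is not the Data Box validation utility and makes no claim about the algorithm or file format it uses; SHA-256 and the function names here are my own choices for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 hex digest."""
    root_path = Path(root)
    return {
        p.relative_to(root_path).as_posix(): sha256_of_file(p)
        for p in sorted(root_path.rglob("*"))
        if p.is_file()
    }
```

Building the manifest once against the source folder and once against the disk, then comparing the two dictionaries, is a cheap way to confirm the local copy was complete before the device leaves your hands.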
Similar to when the device was being shipped to me, I got email notifications for each step of the way including when the data copy started, and when it finished. The process is then marked as complete and all of the details are available in the portal.
You can see the data is now loaded into the Storage Account, and you will see a “databoxcopylog” folder as well. It contains the logs for the copy jobs, including the final checksums of the files, which you can use to validate the copy.
Lastly, you will see a one-time charge for the device on your invoice. You can see here the $90 fee for the Data Box Disk in Azure Cost Management.
*Note*: You will still be charged for any transactions that take place when loading the data into your storage account.
The data is now all loaded, and I get a confirmation via email (which is also shown in the portal screenshot above) that the device has been erased in accordance with NIST 800-88r1 standards. As I noted above, the process for ordering the device is largely similar for the Data Box or the Data Box Heavy.