I frequently work with organizations who are migrating data from an on-premises datacenter into Azure. Undoubtably the question will come up “should we use an Azure Data Dox to ship that much data?”, and most of the time the room echoes in a resounding “Yes!”.
I’ve been working in Azure for many years and have seen a lot of data migrations, and while data box is a wonderful service and is yet another way that Microsoft enables and empowers customers to do what’s best for them, it’s just that, an option for what might be the best fit. I write the same thing almost every month to various people, and figured it was time to post it to use as a reference.
Note: My thoughts here are in no way indented to conflict with the official product documentation and are rather a more experience-based thought experiment to accelerate time-to-value in regard to data migration, at the bottom of this post there are links to two great official pieces of documentation that are more technically focused, please give those a read as well.
Most of the time when we think about uploading or downloading data to and from the internet, we think in terms of gigabytes – typically single-digit gigabytes at that. Even what the home-ISPs like to reference as the bandwidth heavy services like movie streaming, will typically use less than 7GB/hr. With that in mind, when we think of the amount of data that is used in an enterprise, we’re typically talking Terabytes or even for those very large organizations – Petabytes. When we talk about migrating that amount of data to a different physical location (for example, Azure) it seems outlandish to think about moving it online – or is it?
Azure Data Box
If you haven’t taken a look at the Azure Data Box Family of offerings, I highly suggest it. There are 4 different offerings of Data Box:
- Data Box Disk: 8 TB SSD for offline transfer
- Data Box: 100 TB appliance for offline transfer
- Data Box Heavy: 1 PB appliance for offline transfer
- Data Box Gateway: A VM appliance storage gateway used for managed online data transfer.
These devices ship to your location for a nominal fee, you load up the data, then ship it back, and Microsoft loads the data into the destination you choose. The idea is that up to a 40Gbps network connection on your local network is going to be much faster than it would be to send this data over the Internet, VPN, or ExpressRoute connection and is a great option.
Offline Transfer Considerations
I challenge everyone to think through this process though when considering an offline migration. Specifically, we need to think about how long it will take to get the process approved (among other factors) to move your company’s data using a shipping carrier. I’ve worked with organizations where the policy for this type of process requires a private courier, active GPS, and someone following the truck along the entire route (I’ve even seen requirements for armed guards or police escort), among many other requirements from various departments within the organization.
Let’s look at the most common components of this process that might influence the timeline of your data migration.
- Privacy & Legal Team Approvals: Depending on the data, privacy and legal may need to be involved to inspect the process for data device handling, determine who has visibility into the data, how it is destroyed upon completion of the ingestion, and potentially even determine insurance implications.
- Security Approval: From a technical controls perspective they will want to make sure proper encryption is used at the data level and hardware level, determine who controls the keys for encryption, ensure device attestation, and even certify these devices to be plugged into the datacenter based on the controls in place for certain hardware vendors.
- Ordering & Shipping: The process of receiving your Data Box takes up to 10 business days, depending on availability and other factors.
- Loading the Data: There are two points that are important here, the first is how fast can the data be retrieved (e.g., is the data passed through a source that only has a 1 Gb link, are there disk throughput limitations, do you need to limit the transfer rate to not impact other workloads, etc.). The second point to consider is write throughput on the Data Box itself, while there is ample network connectivity with each device, the larger devices are designed for capacity rather than performance and while there is good throughput, they are not designed for high I/O which is important for datasets with smaller file sizes.
- Shipping to Microsoft: Standard shipping time applies to shipping the device back to Microsoft, typically a few days.
- Microsoft transferring the data: After the device is received it is inspected for damage, then setup to copy the data to the destination you selected when you requested the Data Box – this could be a few hours to a few days depending on availability, data size, I/O size, and both the type of Data Box itself and the target storage location.
(Time to Legal Approval) + (Time to Privacy Approval) + (Time to Security Approval) + (Ordering & Shipping Time) + (Time to Load the Data) + (Shipping Time) + (Time to Unload the Data)
When thinking about these lead times it’s important to be honest with yourself. How long after you send the email, or meeting invite, will it take to get full approval from Legal, Security, and Privacy? In most cases, this is typically a few weeks and depends on the organizational processes and sensitivity of the data, sometimes up to a few months.
For example, let’s say it takes 1 month for full approval to ship the data, which is certainly a reasonable timeframe. Let’s also assume it takes 2 days to get the Data Box hooked up in the datacenter, and that you’re copying 50 TB at 5 Gbps over the LAN. With a generalized timeline, this operation would roughly look like the following:
Example: 50 TB, 5 Gbps LAN Offline Transfer with Data Box
1 Month for approval + 8 Days for shipping + 2 Days for setup + 2 Days for data copy (~26 hr. for actual data movement) + 2 days to prep for shipping + 3 days for shipping + 1 day for receiving + 1 day for copying data (likely less)
30 days + 8 days + 2 days + 2 days + 2 days + 3 days + 1 day + 1 day = ~49 days
Now let’s assume that same data was copied “online” (Internet, ExpressRoute, VPN, etc.) at even just 100 Mbps averaged across the day. In most cases organizations would be able to leverage more bandwidth than this, but it makes for easy calculations. If you copied 50 TB online, at 100Mbps, it would take ~53.5 days. In this scenario the time to copy the data online vs offline is very close, and without any of the fuss of approvals and shipping. If you assume you can use 125 Mbps of bandwidth you’re looking at ~42.5 days which is even faster than the offline mode.
At this point I’m sure there are a few people saying “yes, but what if I had a LOT of data, say 1 PB!”. I’ve done many multi-PB data migrations to Azure and have seen them go both online and offline, let’s do the calculation and see how it looks. While it may not be the case for everyone, in my experience with the increase of the dataset size comes longer approval lead times for various reasons. Additionally, these types of organizations typically also have more bandwidth capacity – again, these are generalized numbers, but in my personal experience they are realistic.
NOTE: Data Box Heavy requires a QSFP+ compatible cable, which I find is not as common in most datacenters, make sure you have one on-hand prior to receiving the device.
For this calculation let’s assume 2 PB of data that can be copied on the LAN at 10 Gbps. Keep in mind that if there was actually 2PB of data you’d need 3 Data Boxes because you get 770 TB of usable space after overhead per Data Box Heavy. Take note though, that I’m not taking the multiple Data boxes into account in the calculation, which would realistically extend the timeline.
Example: 2 PB, 10 Gbps LAN Offline Transfer with Data Box Heavy
2.5 Months for approval + 8 Days for shipping + 2 Days for setup + 22 Days for data copy + 2 days to prep for shipping + 3 days for shipping + 1 day for receiving + 4 days for copying data
75 days + 8 days + 2 days + 22 days + 2 days + 3 days + 1 day + 4 days = ~117 days (~3.9 months)
Like I said earlier, typically if an organization has this much data they have much more bandwidth – 2 Gbps for this operation would not be unreasonable to assume as a generalization. Given 2 Gbps bandwidth, it would take ~107 days to copy this data online compared to ~117 days copying it offline.
However, I will say that I’ve been in situations where an organization had other limitations such as the total available capacity on a firewall or edge router, and they would have to upgrade at significant expense to be able to handle an extra 2 Gbps so they could only do something like 250 Mbps. At that speed it would take 874 days to copy and at those speeds with that much data it certainly does not make sense to move the data online, and using a Data Box would be much more efficient to copy the data offline.
NOTE: Data Box will not ship across international borders (except countries within the European Union), please see the FAQ reference link if that is a requirement for your data transfer.
If you are going to copy the data online, there are various ways to accomplish this task. In general, I see AzCopy, Azure Data Factory, Azure Data Box Gateway, or depending on the target storage location any number of other tools used for online data movement.
There are some considerations when choosing your tooling such as cost (of the tool only, ingress bandwidth to Azure is free), performance, manageability and whether there is data churn that needs to continuously be uploaded after the initial import. Keep in mind that you can also control your bandwidth with online copies and for example use less bandwidth during business hours and more at night, and some of these tools will help facilitate that for you.
I won’t go into depth on this decision process but let me know if I should write another blog on that topic.
The two reference links below have wonderful information about choosing a data transfer solution, and as noted earlier I HIGHLY suggest reviewing them as well. The purpose of this blog was to talk about some of the processes and procedures that’s typically not addressed when looking purely at the technology.
- Choose an Azure solution for data transfer
- Data transfer for large datasets with moderate to high network bandwidth
I hope going through these scenarios was helpful when considering methods for data transfer into Azure. My goal here was not to go in depth on anything in particular, but more think through the process. As takeaways, here are a few points to keep in mind about transferring large amounts of data into Azure.
- Be honest with yourself about approval timelines for shipping your company’s (and/or customer’s) data.
- Use a file transfer calculator to see how long it would actually take to transfer X data at Y speeds – it’s probably not as long as you think.
- For good reason, there will likely be a lot of meetings, documentation, email threads, and other time-consuming activities for shipping data physically – and that should also count for something in terms of cost.
- There will likely also be some of the aforementioned procedural work for online data migration, but in most cases not nearly as much.
- Online is not always going to work out, sometimes Data Box is going to be the best fit.