Spyglass MTG Blog

Azure vs. AWS Round 2: The Modern Data Warehouse

Written by William Richard | Dec 20, 2021 5:29:57 PM

The world has changed. Leveraging data more often, and at larger scales, to drive business goals is no longer viewed as an option and is rarely not within an organization’s near-term strategic vision plan. Enterprises have been deriving meaningful insights from their vast amounts of data with increasing use-cases in the past 5 years. Gartner provided some stats a while back which puts this into perspective, “as of 2017, 75% Enterprises have already invested in technology that facilitates data platform, analytics, and app development.” The problem? These investments, prior to 2017, don’t meet the needs of a modern world. They’re already outdated!! The Netezza’s, Hadoop platforms, APS and Exadata’s are off to the pasture. Whether it’s their lack of integration with Machine Learning solutions, collaboration between all personas of the data lifecycle, or their ability to scale with growing data volumes, they often don’t meet the mark alone.

Modern data and analytics platforms built on cloud infrastructure are increasingly popular. These platforms provide enhanced security, disaster recovery and availability, cost management, collaboration, and integration over those pre 2017 investments.

Amongst the many cloud vendors available, Microsoft Azure and Amazon AWS are the top Cloud Platforms that Enterprises are utilizing to build scalable data and analytics solutions. The next logical question is always which platform should we adopt?! Should we leverage one, two, or more? What’s a hybrid or multi-cloud strategy for Analytics look like?

Well, some of those questions are lengthy….i.e. MORE blog posts….so consider this Part II in the battle towards cloud centric analytics and what might provide the best analytics platform for you. 

As with Part I of this series (The Managed Database), I want to provide my perspective on some core platforms available on AWS and Azure for your general satisfaction. While we are a Microsoft Partner first, I have deep knowledge of both AWS and Azure data solutions and often get asked on the benefits over one or another. Not an easy question, by the way, so take this blog for what it is, “an assessment from my experience” and research…but also not the end all be all of answers or information. Honestly, this blog could consume all my time if I let it! There is a lot to consider in an enterprise’s decision of one vs. the other.    

 

Cloud Data Warehouses

 

Cost Management

At first glance, AWS Redshift is an appealing option as the pricing chart does not climb as quickly as the equivalent pricing in Azure Synapse. Additionally, comparing apples to apples here is challenging as Synapse uses a DWU concept for performance and scale where Redshift is more traditional with cores, storage, and memory.

If you do a quick read you may get to something like this, Redshift tops at $13.04/hr while Synapse SQL can climb to $453/hr with the pay as you go option. Both platforms provide reserved pricing as well. For example, if you’re running $453/hr we highly recommend the reserved capacity which can reduce that to $158/hr. Still a hefty price tag compared to what Redshift is presenting……..or is it?!

While Synapse SQL presents in DWU’s, Redshift is a node-based cluster. Much like other platforms such as Databricks, if you need more horsepower, you can throw more nodes at it. The $13.04 pricing is per instance/node. So, the equivalent of $453/hr for Synapse SQL is about a 34 node Redshift cluster. With a maximum number of nodes per cluster being consistent at 128 nodes. The maximum service level for Synapse SQL is DW30000c, which has 60 Compute nodes. Respectively, that means Redshift maxes out with 49k GiB when running 128 ra3.16xlarge of memory and Synapse SQL tops with 18k.

Amazon Redshift clusters - Amazon Redshift

Memory and concurrency limits - Azure Synapse Analytics | Microsoft Docs

 

Security

Azure Synapse and Redshift provide a similar foundation for security. Including the typical capabilities around authentication, encryption of data in transit and within the database include file level encryption when data is within S3 or Azure Blob. In Redshift, permissions apply to table and column levels. Azure Synapse Analytics supports granular permissions on schemas, tables, views, individual columns, rows, procedures, and other objects.

AWS provides Column-Level access controls out of the box, but row level security, while possible, appears to be challenging to find a clear approach. Synapse provides column level access controls, and it does well to also make row-level security a bit easier to implement.

Achieve finer-grained data security with column-level access control in Amazon Redshift | AWS Big Data Blog

Column-level security for dedicated SQL pool - Azure Synapse Analytics | Microsoft Docs

Row-Level Security - SQL Server | Microsoft Docs

For authentication, AWS allows federated user access via AWS Directory Service, while Azure Synapse Analytics can integrate with Azure Active Directory. Both support multifactor authentication (MFA). Azure Synapse offers OAuth 2 for authorized account access without sharing or storing user login credentials; Redshift lacks OAuth support.

The integration with Azure Active Directory is a key differentiator for Synapse and the advanced security capabilities such as Vulnerability assessments, Dynamic Data Masking, Column level encryption and auditing give Synapse an edge in this space overall. Synapse Analytics also provides role-based access beyond database access controls to improve collaboration and integration of personas and services.  

 

Disaster Recovery and Availability

Synapse provides global redundant geo backups for the database platform. Out of the box this provides an RPO of 24 hours and with a standby regions RTO can be within the hour. Restoring is accomplished as simply as a few clicks within the portal.

Redshift also provides globally redundant snapshots that can be restored in any AZ in that region or transferred automatically to other regions for disaster recovery. Amazon Redshift can even prioritize data being restored from Amazon S3 based on the queries running against a cluster that is still being restored. AWS would usually fall within the 8-hour RPO by default and restoration is also quick.

AWS, however, does well to provide a method for scheduling more frequent backups and performs automated backups more regularly when more data is being processed. Call that a win for AWS just for ease of management, with Synapse you can implement your own process to take more frequent user defined restore points which can be restored to your DR region and will reduce the RPO.

What does this mean? Backup and restore is the primary method for recovery within both platforms. Which I guess means if you know how to recover on one platform then you understand the same on the other. AWS is a bit easier to manage a reduced RPO.

 

Collaboration and Integration

Neither solution is designed just to be data storage platform. Synapse and Redshift integrate with BI, data science and machine learning platforms to facilitate advanced analytics. Synapse offers a native PREDICT statement using which users can score machine learning models trained in Azure Machine Learning, Apache Spark, or other frameworks, and derive insights within Synapse without having to perform massive data movement. Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply machine learning models using familiar SQL commands as well and it leveraged AWS Sage maker for training models.

Synapse with the invent of Synapse Workspace takes this integration a step further. Synapse Workspaces integrate all the data services you need within a modern data landscape. Including your data lake storage, Data Engineering, Azure ML Services, Power BI for data visualizations, SQL dedicated and SQL server-less warehouse capabilities and even big data Spark applications, Kusto clusters and Python notebooks.

But don’t think Redshift is out. Redshift also works to integrate all the data services you need within a modern data landscape. Including your S3 storage, Data Engineering, AWS ML Services, Quicksight for data visualizations, Redshift and server-less warehouse capabilities in Amazon Redshift Spectrum.

Both of these platforms enable all your personas to collaborate, the data scientist can build MLOps pipelines that are fed by data engineered within many compute frameworks by the data engineer and BI analysts can consume containerized ML models and build data visualizations from certified curated data.

So where’s the win? While Redshift has an ‘OK’ interface for working with all the services, Synapse Studio, a single UI for all services within the workspace and the same look and feel of Azure ML Studio, means Synapse Analytics has been heavily engineered on collaboration between personas (Data Scientists, Analysts and Engineers) in a user-friendly way. The studio experience is a much better way to work within these platforms and gives the platform a leg up.

 

Thoughts

So, which suits your needs?

I believe the clear differentiator is the integration for both security as well as user interfaces between the two. One of the largest challenges in enterprise analytics is optimizing the workflow between all personas. Data scientists often struggle to implement their models and may not have a good communication path to the engineers and analysts. A Synapse Workspace with integration to the other tools, Azure ML Studio/Power BI/Azure Purview, goes a long way.

Regarding scale and performance, the two platforms perform to extremely large data warehousing workloads. If you have a use case for running DW30000c or 128 ra3.16xlarge nodes, we’ll be more than happy to do a bake off between the two. Contact us today, and we’ll be right there to help!