NoSQL on Azure, AWS and Google Cloud Part 1
The need for NoSQL is on the rise. Let's have a look at the differences between relational and non-relational data, where each is applicable, and what services are available on each cloud platform.
What is NoSQL?
What is a NoSQL Database and when do we use it?
This is one of the common questions floating around many development projects today. In this post we are going to look through our options for hosting a NoSQL database in the cloud. It doesn't matter which platform you are using; every major platform has its own service for NoSQL. The service itself is not the concern, though. The major concern is deciding whether to store your data in a relational or non-relational database.
This post will talk through each service independently, along with a video explanation. Over the next couple of weeks we will look at a walkthrough of the setup on each cloud platform.
Relational (SQL) vs. Non-Relational (NoSQL)
A relational database is your typical CRUD database, used for large, complex sets of data. Looking back decades, the database era began with relational data: multiple tables, with keys relating the tables together. As databases grew and querying became more complex, so did the SQL. We moved forward with complex querying capabilities using JOINs, building DATASETs, data partitioning, sharding, and much more.
What did this mean?
More requirements for database administrators, programmers specialising in SQL, partitioning and sharding of databases, increased need for server capacity, master-slave replication approaches, and worst of all, increases in cost, time and resources.
Does a non-relational database solve these issues?
Some, but not all. A non-relational database also has different requirements than a relational one, meaning each must be applied in the correct scenario.
A non-relational (or NoSQL) database is essentially a table containing a potentially unlimited number of items. Each item is referenced by some sort of key-value access pattern that determines data distribution, very similar to big-data storage solutions.
MongoDB was one of the first NoSQL frameworks available on the market, and we now have the ability to host NoSQL datasets under the MongoDB framework.
As our tables grow larger, so does the need to partition the data.
What is data partitioning?
Data partitioning is the process of breaking up data into segments, making the data more easily maintained for performant access (reads). We partition data in a NoSQL table by partition keys, these are typically mandatory.
Fig 1. Partitioning (vertical) involves splitting one entity into multiple entities
What about sharding? What is the difference between partitioning and sharding?
We can actually refer to these processes as VERTICAL PARTITIONING (data partitioning) and HORIZONTAL PARTITIONING (sharding). When you scale a database horizontally, you replicate the schema and divide which data is stored in each shard based on a key. For example, take a user database containing an attribute UserId: this will be the shard key, meaning ids from 0-1000 are stored in one shard, and ids from 1001-2000 are stored in a different shard. When choosing a shard key, the DBA will typically look at data-access patterns and space issues to ensure that load and storage are distributed evenly across shards. Another point to note is that many modern databases support native sharding, because it can keep performance steady as the data grows.
Fig 2. Sharding (horizontal) involves replicating the schema and dividing data based on a shard key
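The range-based shard key described above can be sketched in a few lines of Python. The range size, shard count and function name are illustrative, not taken from any particular database:

```python
# A sketch of range-based sharding: UserIds 0-1000 fall in the first
# range, 1001-2001 in the next, and so on, wrapping around a fixed
# number of shards.

RANGE_SIZE = 1001  # ids 0-1000 inclusive fall in the first range

def shard_for_user(user_id: int, shard_count: int) -> int:
    """Map a UserId to a shard index using contiguous id ranges."""
    return (user_id // RANGE_SIZE) % shard_count

shard_for_user(500, 4)   # -> 0
shard_for_user(1500, 4)  # -> 1
```

A real DBA would also check that the chosen ranges spread both load and storage evenly, as noted above.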
What else is good about NoSQL?
How often does your data change? Multiple times a day? Yikes, imagine having to rewrite relations between tables in a CRUD database. Sound familiar? It's horrible. With a NoSQL database, we don't need to worry about data structure: whatever the item, just throw it in the table. Think of each item in a table as a blob of JSON, i.e. simple text representing an object. For each item, we don't care about its attributes; we just insert each blob into the table as required.
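The "blob of JSON" idea above can be shown in a few lines. A plain list stands in for the NoSQL table here, and the attribute names are made up for illustration:

```python
import json

# Each item is just a blob of JSON, so items in the same table need not
# share a schema: a user record and an image record can sit side by side.

table = []
table.append(json.dumps({"id": 1, "name": "Alice", "email": "a@example.com"}))
table.append(json.dumps({"id": 2, "type": "image", "format": "png", "size": 2048}))

# Reading items back: no schema migration was needed when the shape changed.
items = [json.loads(blob) for blob in table]
```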
Could you give me an example?
Ever had the need to store images? Use a NoSQL database.
Images tend to be unstructured datatypes. There are multiple formats (PNG, JPEG, etc.), meaning attributes will always differ. In any application, never force the user to use just one type (i.e. only JPEG), because users need all the major image types. For context, think about uploading profile images to Facebook: how annoying would it be for the user experience if every image had to be converted to JPEG?
Let’s look at the different options we have available for hosting NoSQL in the cloud.
Azure Blob Storage
Let's talk about a non-relational store on Microsoft Azure using Blob Storage. Blob Storage works through the use of CONTAINERS (like tables), and inside containers we have blobs. Inside a single instance of the Blob Storage service we can have multiple containers, and each container can hold different data types. A Blob Storage instance must also be tied to a storage account, and storage accounts are granted read/write permissions on specific containers.
Fig 3. Blob storage on Azure
There are three types of blobs:
- Block blobs store text and binary data, up to about 4.7 TB, and can be managed individually, e.g. images, videos and documents. Use block blobs in scenarios such as streaming, where you need to upload or download large volumes of data quickly.
- Append blobs are made up of blocks like block blobs, but are optimised for append operations. These are ideal for logging scenarios.
- Page blobs store random access files up to 8 TB in size. Page blobs are ideal for storing index-based and sparse data structures like OS and data disks for Virtual Machines and Databases. Use page blobs for applications that require random rather than serial access to parts of the data.
Fig 4. Visual perspective on the link between storage accounts, containers and blobs.
How does Blob Storage manage partitioning?
Azure Blob Storage uses partition keys for each blob. A partition key is built from the storage account name + container name + blob name. This means each blob can have its own partition if load on the blob demands it. Blobs can be distributed across many servers in order to scale out access to them, but a single blob can only be served by a single server.
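The partition key composition above can be sketched as a simple function. The "/" separator and the sample names are illustrative; the article only specifies the three components:

```python
# A blob's partition key is built from the storage account name, the
# container name and the blob name, so every blob gets a unique key
# and can be partitioned independently.

def blob_partition_key(account: str, container: str, blob: str) -> str:
    """Compose a blob's partition key from its three naming components."""
    return f"{account}/{container}/{blob}"

key = blob_partition_key("mystorageacct", "images", "profile/alice.png")
```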
Let's have a look at an architecture around the image scenario. The architecture below shows an example of a toll-booth application; it incorporates reactive functionality using SERVERLESS solutions with Azure Functions (we will go through serverless in another article). Pay attention to the circled items: these are two Blob Storage instances used for storing the images displayed on the website. The CSV exports produced by the serverless function are also stored in their own Blob Storage instance. Both implementations provide huge scalability and partitioning for faster read access.
Fig 5. Blob storage used for images and CSV exports
Now let's flip over to AWS. Here we have DynamoDB as the non-relational database service. DynamoDB takes a similar approach: instead of CONTAINERS, we have tables. Tables contain items, and items contain attributes. It's easy for developers to get confused between S3 and DynamoDB. In Azure we use Blob Storage; in AWS we have both S3 and DynamoDB, which serve different purposes.
Fig 6. A look at DynamoDB in the AWS portal.
What are the differences between DynamoDB and an S3 bucket, and when should I use each?
The main difference between the two is that S3 is file storage while DynamoDB is a database. We use DynamoDB for high throughput and low latency: it is typically useful for storing a large number of small records with single-digit-millisecond latency. S3, on the other hand, provides practically unlimited storage at a cheaper cost than DynamoDB, but with much slower read operations. If we want to replicate the image scenario mentioned above, we need an S3 bucket plus a CloudFront distribution to provide large file storage and fast read access. Ideally we would also put Azure CDN in front of Blob Storage for items like images and videos. Each setup provides a similar solution. Just remember: with AWS, we require S3 and CloudFront for images, videos and documents that need large file sizes and fast read access. In Azure, we only need Blob Storage, but we should add a CDN on top if we are going to serve images and videos on a public website; we always want a cache for fast read access.
Have a look at the figure below; the architecture is a very simple website setup for a chat app. The S3 bucket serves static content (images, video) for the website, and the DynamoDB instance stores the chat messages.
Fig 7. S3 and DynamoDB used for an online chat app.
Primary & Sort Keys
Primary keys uniquely identify each item in a DynamoDB table. We also have the option of a sort key for more efficient querying. A primary key can be simple (a partition key only) or composite (a partition key combined with a sort key); either way, you should follow best practices for designing partition keys.
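A composite primary key can be modelled with a plain dict to show how the two parts work together: the partition key groups items, and the sort key orders items within a partition. The chat-style table and attribute names below are illustrative, not DynamoDB's actual API:

```python
# Sketch of a composite primary key: items live under
# (partition_key, sort_key) pairs.

table = {}  # (partition_key, sort_key) -> item

def put_item(conversation_id: str, timestamp: int, body: str) -> None:
    """Insert a chat message keyed by (ConversationId, Timestamp)."""
    table[(conversation_id, timestamp)] = {
        "ConversationId": conversation_id,
        "Timestamp": timestamp,
        "Body": body,
    }

def query(conversation_id: str) -> list:
    """Return every item in one partition, ordered by the sort key."""
    return [item for (pk, sk), item in sorted(table.items())
            if pk == conversation_id]

put_item("conv-1", 2, "hi again")
put_item("conv-1", 1, "hello")
put_item("conv-2", 1, "other chat")
# query("conv-1") returns the two conv-1 messages ordered by Timestamp
```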
How does DynamoDB manage partitioning?
DynamoDB stores data in partitions which are managed via partition keys. It will allocate additional partitions to a table in the following situations:
- If you increase the table’s provisioned throughput settings beyond what the existing partitions can support.
- If an existing partition fills to capacity and more storage space is required.
Jumping to an even lower level, partitions are provisioned storage on solid-state drives (SSDs), and they are automatically replicated across Availability Zones.
Fig 8. DynamoDB used for images and CSV exports
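The idea that the partition key alone determines where an item lives can be sketched with a hash-modulo function. This is a simplified stand-in: DynamoDB's real internal hash function is not public, and the key and partition count below are made up:

```python
import hashlib

# Hash the partition key and take it modulo the number of partitions;
# the same key always lands in the same partition, so placement is
# deterministic without any lookup table.

def partition_for_key(partition_key: str, partition_count: int) -> int:
    """Deterministically map a partition key to one of N partitions."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count

partition_for_key("user-42", 8)  # always the same partition for this key
```

When more partitions are allocated (for throughput or capacity, as listed above), data is redistributed across the new partition count.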
Wait! What are availability zones?
We will touch on this in more detail later on, but briefly: an Availability Zone is an isolated location within a region. Each region is made up of multiple Availability Zones, meaning we always have high data availability.
That should give us good clarity between the two NoSQL services by Microsoft and Amazon. Now, what about Google Cloud?
Google Cloud provides Datastore, a NoSQL document database service. It offers the ability to scale, but it scales based on your query results, not your data set. This keeps scaling to a minimum, because we only need the extra capacity to serve queries; we don't care about the size of the data that just sits in the datastore.
Fig 9. A look at Datastore in the Google Cloud portal.
Where should I use Google Cloud Datastore?
Cloud Datastore is ideal for applications that rely on highly available structured data at scale. For example, product catalogs that provide real-time inventory and product details for retailers.
One key feature that differentiates Google Cloud Datastore is the ability to select the regions where your data is stored. If we wanted a similar feature on Azure, we would look at using Cosmos DB instead of Blob Storage. Cosmos DB is another option for a NoSQL database; it's more expensive, but it provides a wider range of options (we will look at this in the next part). AWS doesn't provide this type of feature for its NoSQL services, but its data replication is handled automatically; developers don't have the option to customise the selection. This can be a good thing though: it's one less item we have to manage.
Fig 10. Region select options for your data within Datastore
Ever heard of GQL?
Using Google's Datastore gives developers the option to run GQL queries. GQL maps roughly to SQL: you can think of a GQL kind as a SQL table, a GQL entity as a SQL row, and a GQL property as a SQL column. However, a SQL row-column lookup is limited to a single value, whereas in GQL a property can be a multi-valued property [extracted from the Google docs].
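The multi-valued property difference above can be illustrated with an entity modelled as a plain dict. The kind and property names are made up for illustration:

```python
# In SQL, a row-column lookup holds a single value; in GQL, a property
# can hold multiple values at once.

entity = {
    "kind": "Product",           # a GQL kind maps roughly to a SQL table
    "name": "Wireless Mouse",
    "tags": ["electronics", "sale", "new"],  # one property, many values
}

# A GQL query along the lines of: SELECT * FROM Product WHERE tags = 'sale'
# matches this entity when any one of the tags values equals 'sale'.
matches = "sale" in entity["tags"]
```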
What about the structure?
Google Cloud Datastore works on the basis of entities: each entity has a KIND, and each entity contains one or more properties, like our other NoSQL services above. One disadvantage of assigning a KIND to each entity is that it couples data objects tightly to an entity. We start drifting away from a loosely structured data store; don't fall into the trap of making a KIND for every data object, because then we start moving back towards a relational database.
Now let's look at a similar architecture to the image scenario above. In this case, we have Cloud Storage for storing static content for our sites, such as images and videos, like S3 on AWS. We also have Cloud SQL, which could be swapped for a Datastore service for dynamic content such as user accounts, user sessions, and other dynamic entities of a specific KIND:
Fig 11. Serving dynamic and static content to an app or website on Google Cloud.
Which one is the right choice?
This is always a tricky question across any cloud services; you should make the decision that best fits your current architecture. The following will help clarify the decision:
- All cloud platforms offer NoSQL services, so stick with your current platform. Don't split your architecture across cloud platforms just to host a NoSQL database.
- Every cloud platform can trigger serverless functions on add/remove events.
- Each NoSQL database handles high availability and data partitioning for you.
Some developers think multi-cloud is a good strategy for single architectures. What do we think?
Unless you have a good strategy for DevOps, i.e. Terraform templates for infrastructure across all cloud platforms, multi-cloud strategies can create confusing architectures and unnecessary custom integration points, resulting in increased latency for communication between cloud platforms.
Don't forget the following key takeaways from the above:
- Non-relational databases are not designed to replace relational databases; determine whether your data is:
- structured vs unstructured
- relational vs non-relational
- complex data vs non-complex data
- Vertical scaling involves increasing the capacity of your servers (scaling up)
- Horizontal scaling involves adding servers and sharding data (scaling out)
- Avoid multi-cloud architecture where possible; it creates complications and increased latency
Tune in next week for part two as we delve into a pricing comparison, Cosmos DB, and where each service is most applicable.