NoSQL Part 2 Azure Cosmos DB and Blob Storage Access

In part 1 we discussed the NoSQL database solutions on all cloud platforms except Azure Cosmos DB. Cosmos DB is a document database (rebranded from DocumentDB, with more features) with selective global replication. It provides auto-scaling, key-value storage, five consistency models, and a choice of database APIs.

 

Cosmos DB gives developers a wide range of APIs:

  • SQL and JavaScript
  • Gremlin
  • MongoDB
  • Azure Table storage
  • Apache® Cassandra
  • Apache® Spark

Take your pick. It provides a great range of flexibility, whatever the framework. We believe it is the best NoSQL option on any cloud platform.

Why?

Apart from the common features that all other NoSQL services provide:

  • key-value pattern
  • partitioning
  • elastic scale

Cosmos DB has all the above plus extra ingredients around customization:

  • native API selection
  • selective region replication
  • performance advantages
    • scale throughput and storage independently
    • deploy to any number of regions – high availability

Pretty cool, right?
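To make the key-value and partitioning points a little more concrete, here is a minimal sketch of how a partition key can spread items across partitions. It is a toy in-memory model, not how Cosmos DB is implemented: the `PartitionedStore` class, the `partition_for` helper, and the device names are all hypothetical and exist only for illustration.

```python
import hashlib
from collections import defaultdict


def partition_for(partition_key: str, partition_count: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class PartitionedStore:
    """Toy key-value store split across a fixed number of partitions (hypothetical)."""

    def __init__(self, partition_count: int) -> None:
        self.partition_count = partition_count
        # partition index -> {(partition key, item id): item}
        self.partitions = defaultdict(dict)

    def upsert(self, partition_key: str, item_id: str, item: dict) -> None:
        index = partition_for(partition_key, self.partition_count)
        self.partitions[index][(partition_key, item_id)] = item

    def read(self, partition_key: str, item_id: str):
        # A point read only touches the single partition that owns the key.
        index = partition_for(partition_key, self.partition_count)
        return self.partitions[index].get((partition_key, item_id))


store = PartitionedStore(partition_count=4)
store.upsert("device-42", "reading-1", {"temperature": 21.5})
print(store.read("device-42", "reading-1"))  # {'temperature': 21.5}
```

Because every item with the same partition key lands in the same partition, a read by key never has to scan the others. That is the property that lets a partitioned store add capacity by adding partitions.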

Since we have this flexibility, we can design database-agnostic applications. There isn’t another service out there that does the same. It combines the goodness of Datastore, DynamoDB, and Blob Storage. We can leave S3 out because we don’t care about file storage here; Cosmos DB is a non-relational database.

Let’s focus on the performance side. You can select which regions to replicate to, so you can pinpoint where your traffic comes from and replicate as needed. Latency is guaranteed by the SLA:

For a typical 1KB item, Cosmos DB guarantees end-to-end latency of reads under 10 ms and indexed writes under 15 ms at the 99th percentile, within the same Azure region. The median latencies are significantly lower (under 5 ms).
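If you want to check your own measurements against figures like these, a 99th percentile is easy to compute. The sketch below uses the nearest-rank method and compares the result with the 10 ms read target quoted above. The latency samples are made up for illustration, and `percentile` and `meets_latency_target` are hypothetical helpers, not part of any Azure SDK.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(samples_ms: list[float], target_ms: float, pct: float = 99) -> bool:
    """True when the chosen percentile is strictly under the target."""
    return percentile(samples_ms, pct) < target_ms


READ_TARGET_MS = 10    # read latency figure from the quote above
WRITE_TARGET_MS = 15   # indexed write latency figure from the quote above

# Made-up measurements: 95 fast reads, 4 slower ones, and a single outlier.
read_latencies_ms = [3.0] * 95 + [8.0] * 4 + [40.0]

print(percentile(read_latencies_ms, 50))   # 3.0
print(percentile(read_latencies_ms, 99))   # 8.0
print(meets_latency_target(read_latencies_ms, READ_TARGET_MS))  # True
```

Note that the single 40 ms outlier does not break the target: a 99th percentile guarantee allows one request in a hundred to be slower.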

I don’t care who you are, that is super fast. We can guarantee performance and high availability for mission-critical applications.

So what’s the catch? 

Price. Unfortunately, this flexibility costs more. It’s easy to say, “why don’t we just use Cosmos DB for all NoSQL databases?” If we did, we would be racking up dollars for flexibility where simple Blob Storage should be utilised.

This applies to all cloud services: only pay for what you need.

We see this mistake a lot: architectures consuming resources that are not required. This is why it is important to review your architectures regularly and set up monitoring on your resources to investigate usage closely. As applications grow and change, so do the needs of their backend services.
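As a rough illustration of that kind of review, the sketch below flags resources whose peak usage is a small fraction of what has been provisioned. Everything in it is an assumption made for the example: the `ResourceUsage` shape, the resource names, the numbers, and the 30% threshold. In practice the figures would come from whatever monitoring you have set up.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    name: str
    provisioned: float  # capacity we pay for
    peak_used: float    # highest usage seen in the review window, same units


def over_provisioned(resources: list[ResourceUsage], threshold: float = 0.3) -> list[str]:
    """Names of resources whose peak usage is below `threshold` of provisioned capacity."""
    return [
        r.name
        for r in resources
        if r.provisioned > 0 and r.peak_used / r.provisioned < threshold
    ]


# Hypothetical numbers for two databases.
usage = [
    ResourceUsage("telemetry-db", provisioned=10_000, peak_used=1_200),
    ResourceUsage("device-state-db", provisioned=4_000, peak_used=3_100),
]

print(over_provisioned(usage))  # ['telemetry-db']
```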

Now the big question: where do we use it? And when do we pick this service over Blob Storage?

Have a look at the following architecture:

Fig 1. Azure Cosmos DB + HDInsight to deliver a highly available and globally distributed experience.

This is a common scenario for an IoT system that requires archival and high throughput for predictive analytics. We use IoT Hub to ingest telemetry from all devices (IoT Hub is a service that can ingest data from millions of IoT devices; we will look into this further in another article). HDInsight is used for big data storage solutions. With IoT devices we are going to be collecting large amounts of data, so we need to be able to store all the raw data and apply MapReduce and partitioning to it. Also, take note of the Blob Storage attachment to HDInsight: we use an instance of this storage to archive all the raw device data that our HDInsight service will use.

Once we have large amounts of data in our HDInsight store, we pass it into Cosmos DB to give readings on device state and telemetry. The Azure Web Job is attached to Cosmos DB. It listens for data additions and runs on each insertion to perform the data transform that our Logic App will use in some workflow (this is our first look at Serverless; we will look further into Serverless with Logic Apps and Functions on Azure in another article).

The key point here is the difference between Blob Storage and Cosmos DB. Blob Storage is used as an archive for raw data. We still want to store the raw data before it is transformed (after a few days, once the data is transformed, we may throw it away). We don’t care about data structure; we just want to throw it in storage. Since we are not going to be reading from this data store, we pay no attention to throughput.

Cosmos DB, on the other hand, is used to store the transformed data, and is kept for high availability and high throughput. Our web job pulls data regularly, every time our Logic App workflow runs. The reason for this regular data retrieval is to investigate our device data and uncover patterns in a device event stream. Think of it as a predictive analytics scenario.
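The split between the two stores can be sketched in a few lines. This is a deliberately simplified, single-process model: a plain list stands in for Blob Storage and a dictionary stands in for Cosmos DB, and the message fields (`deviceId`, `tempF`, `ts`) are hypothetical. It collapses IoT Hub, HDInsight, and the web job into one `ingest` function purely to show which data goes where.

```python
import json

raw_archive: list[str] = []          # stand-in for Blob Storage: raw payloads, no schema
device_state: dict[str, dict] = {}   # stand-in for Cosmos DB: one document per device


def archive_raw(payload: str) -> None:
    """Store the payload untouched; nothing on the read path depends on it."""
    raw_archive.append(payload)


def transform(payload: str) -> dict:
    """Turn a raw telemetry message into the document shape the workflow reads."""
    message = json.loads(payload)
    return {
        "id": message["deviceId"],
        "temperatureC": round((message["tempF"] - 32) * 5 / 9, 1),
        "updatedAt": message["ts"],
    }


def ingest(payload: str) -> None:
    archive_raw(payload)                      # write-once, rarely read
    document = transform(payload)
    device_state[document["id"]] = document   # upsert keyed by device id, read often


ingest('{"deviceId": "pump-7", "tempF": 212, "ts": "2018-01-01T00:00:00Z"}')
print(device_state["pump-7"]["temperatureC"])  # 100.0
print(len(raw_archive))                        # 1
```

The archive only ever receives appends, so its throughput does not matter much. The document store is the one that gets hit on every workflow run, which is why it is the one worth paying for.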

Blob Storage – Hot, cool, and archive storage tiers

Let’s not forget about the flexibility around Blob Storage access. We can apply different tiers based on throughput requirements:

Hot Access Tier

  • Data that is in active use or expected to be accessed (read from and written to) frequently.
  • Data that is staged for processing and eventual migration to the cool storage tier.

Cool Access Tier

  • Short-term backup and disaster recovery datasets.
  • Large data sets that need to be stored cost effectively while more data is being gathered for future processing.

Archive Access Tier

  • Long-term backup, secondary backup, and archival datasets.
  • Compliance and archival data that needs to be stored for a long time and is hardly ever accessed.

Which one do I choose?

Simple. We have to investigate. Let’s look at the predictive analytics example above: our Blob Storage archive holds infrequently accessed data that may be thrown away after a certain time period, so we will use the COOL ACCESS TIER. If we aren’t going to throw this data away, we would look at the ARCHIVE ACCESS TIER because these are long-term datasets. A HOT ACCESS TIER example might be serving media (videos, images). Media streaming is a good example, but we would require a CDN service in front for global distribution and caching purposes.
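The reasoning above can be written down as a small decision function. This is only a sketch: the two thresholds are illustrative assumptions rather than Azure guidance, and the `Dataset` shape and dataset names are hypothetical. The point is that access frequency is checked first and retention second.

```python
from dataclasses import dataclass

# Both thresholds are illustrative assumptions; tune them to your own workload.
FREQUENT_READS_PER_MONTH = 100
LONG_TERM_RETENTION_DAYS = 180


@dataclass
class Dataset:
    name: str
    reads_per_month: int   # expected number of reads
    retention_days: int    # how long the data must be kept


def choose_tier(dataset: Dataset) -> str:
    """Pick 'hot', 'cool', or 'archive' from access frequency, then retention."""
    if dataset.reads_per_month >= FREQUENT_READS_PER_MONTH:
        return "hot"
    if dataset.retention_days >= LONG_TERM_RETENTION_DAYS:
        return "archive"
    return "cool"


for dataset in (
    Dataset("raw-telemetry", reads_per_month=1, retention_days=7),
    Dataset("compliance-logs", reads_per_month=0, retention_days=3650),
    Dataset("video-catalogue", reads_per_month=50_000, retention_days=365),
):
    print(dataset.name, "->", choose_tier(dataset))
# raw-telemetry -> cool
# compliance-logs -> archive
# video-catalogue -> hot
```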

Fig 2. Streaming video through Blob Storage using Hot Tier Access and Media Services. Media utilises CDN for global distribution.

Wait! Both can provide hot access solutions. If we combine Blob Storage + CDN + Hot Access Tier, doesn’t this mean we have a Cosmos DB equivalent?

No. Here are some reasons why:

  • Cosmos DB offers much more around performance, as it can scale throughput and storage independently
  • With a CDN, you can’t customise which regions your data is replicated to
  • Cosmos DB provides multiple API access

Remember to investigate your dataset requirements. Throughput is always the best place to start: determine whether we require hot, cool, or archive access.

In our next article, we will delve into pricing and compare differences between all NoSQL services on each cloud platform.

 
