Modeling and partitioning a real-world example on Azure Cosmos DB

Whatever database you are using, modeling your data against the unique strengths and constraints of that database is key to achieving the best performance, maintainability and scalability. For many, performing this exercise on Cosmos DB presents new challenges because of its particular characteristics, notably being a non-relational and horizontally scalable data store. This article will guide you through the complete process of modeling a real-world use case on Cosmos DB, from document modeling to entity co-location and container partitioning.

The domain

I usually take the example of a blogging platform to illustrate my articles, and this one is no exception! But this time, we'll add some real-world features to make it less naive and more relevant for our exercise.

So let's imagine a platform like Medium, where users can create posts. Users can also like and add comments to posts. A front page displays a feed of recently created posts. When displayed as lists, posts only present a truncated summary of the content, together with the author's username and a count of likes and comments. Comments also contain the username of their authors.

Identifying the expected access patterns

To start, we list all the operations that users will be able to perform against our platform. In order to make the overall process easier to follow, we categorize those different requests as either commands or queries, borrowing some vocabulary from CQRS where commands are write 📝 requests (i.e. intents to update the system) and queries are read-only 🔍 requests.

Here is the list of requests that our platform will have to expose:

  • [C1] 📝 Create/edit a user's profile
  • [Q1] 🔍 Retrieve a user's profile
  • [C2] 📝 Create/edit a post
  • [Q2] 🔍 Retrieve a post
  • [Q3] 🔍 List a user's posts in short form (truncated content)
  • [C3] 📝 Create a comment
  • [Q4] 🔍 List a post's comments
  • [C4] 📝 Like a post
  • [Q5] 🔍 List a post's likes
  • [Q6] 🔍 List the x most recent posts created in short form (feed)

Looks like a decent set of features to work on. An additional, overall requirement is that we expect our platform to accommodate a very large number of users creating a lot of posts, and heavy traffic overall. There will be more (potentially anonymous) readers than post creators, so we expect more traffic on the reading than the writing side.

Note that at this stage, we haven't really thought about the details of what each entity will contain. This step is usually among the first ones to be tackled when designing against a relational store, because we have to figure out how those entities will translate in terms of tables, columns, foreign keys etc. This is much less of a concern with a document database that doesn't enforce any schema on write.

So if you're coming from a relational background, try to resist the urge to dive into the details at this point; what we really need to understand first is:

  • the different entities that compose our data (users, posts etc.),
  • the possible relationships between these entities (a comment is created by a user and relates to a post etc.),
  • the access patterns in terms of requests and the entities involved with each request.
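To make these entities and relationships concrete, here is a minimal sketch of them as plain objects; the field names (userId, postId, etc.) are illustrative assumptions at this point, not a final schema:

```javascript
// Hypothetical entity shapes; relationships are plain reference fields
const user = { id: "u1", username: "alice" };
const post = { id: "p1", userId: "u1", content: "Hello world" };
const comment = { id: "c1", postId: "p1", userId: "u2", content: "Nice!" };
const like = { id: "l1", postId: "p1", userId: "u3" };

// post.userId → user.id, comment.postId → post.id, like.postId → post.id
```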

The rules of the game

Before we dive in, I should recap the different Cosmos DB characteristics that we have to take into account while designing our data model. These characteristics basically define the constraints we have to follow, but also the strengths that we have to build upon in order to maximize the performance and scalability of our solution. If you're already familiar with Cosmos DB, you can probably skip to the next section.

In Cosmos DB, documents are stored in containers (sometimes also called collections). Now if your instinctive reaction to that last sentence is "oh yeah, like rows in tables", well... yes and no, but mostly no. Cosmos DB containers are first and foremost units of scale, as you can scale each container's storage and throughput independently of the others. Developers familiar with relational databases naturally assume that a container should only contain documents of a single type, but this approach is sub-optimal in many cases. Instead, reasons for grouping documents in the same container may be:

  • because they are related to each other (a post and all its comments and likes, for example),
  • because they belong to similar access patterns (a container for documents accessed frequently, another one for documents accessed rarely).

To appreciate the rationale behind these guidelines, we have to understand how Cosmos DB achieves horizontal scalability. It does so by splitting the containers into logical partitions and shuffling those logical partitions across physical partitions (that roughly translate to servers). When you create a container, you have to specify a partition key, which is the name of the field that Cosmos DB will look up in your documents to decide which logical partition each document belongs to. So if you set userId as the partition key, all documents with the same value of userId will be part of the same logical partition, and so will be physically co-located.
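The co-location guarantee described above boils down to "same partition key value, same logical partition". Here is a small in-memory sketch of that grouping (Cosmos DB actually hashes the value to assign physical partitions, which this sketch deliberately ignores):

```javascript
// Group documents by partition key value, the way logical partitions do
function groupByPartitionKey(docs, keyPath) {
  const partitions = new Map();
  for (const doc of docs) {
    const value = doc[keyPath];
    if (!partitions.has(value)) partitions.set(value, []);
    partitions.get(value).push(doc);
  }
  return partitions;
}

const docs = [
  { id: "1", userId: "alice" },
  { id: "2", userId: "alice" },
  { id: "3", userId: "bob" },
];
const partitions = groupByPartitionKey(docs, "userId");
// documents 1 and 2 share the same userId, so they are co-located
```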

Why is that important? Because our aim is to end up with well-balanced partitions, both in terms of storage and throughput. Some key elements to keep in mind are:

  • a logical partition has a maximum size of 10 GB (so that's 10 GB of documents with the same partition key value),
  • write 📝 requests should be evenly spread across logical partitions to avoid the bottleneck of "hot" partitions,
  • most frequent read 🔍 requests should get their results from only one logical partition for best performance.

If you feel overwhelmed by all that, don't be! It turns out that designing a data model that takes these elements into account is much easier than it looks. Once you get familiar and comfortable with these rules, it's actually pretty fun (hence the title of this section!).


A first shot

Our approach is going to be iterative: designing containers, figuring out how requests translate against the design, reviewing how well we follow the "rules", optimizing the design, rinse and repeat. So it doesn't really matter where we start, we just have to start somewhere.

Looking at the 4 types of "things" we have to deal with (users, posts, comments and likes), I would intuitively spread them first in 2 containers: users in a users container partitioned by id and all the rest in a posts container partitioned by postId. This is the type of intuition that you will build after just a couple of such exercises, and it's not really important how "off" it is as you'll iteratively test and fine-tune your design.

A couple of notes about this first version:

  • Denormalizing users' usernames in the posts container seems the way to go because (1) users are in a different container so usernames can't be joined and (2) users rarely update their usernames, which makes this potentially expensive - but rare - update operation acceptable.
  • Comment and like counts will change frequently, but denormalizing them in posts will benefit from the fact that posts, comments and likes sit in the same container and so can be updated in the same (transactional) operation - more on that below.
  • The decision of having comments and likes in separate documents is certainly questionable, as a common modeling technique is to embed related entities within the same document when dealing with one-to-many relationships. Let's say that in our current use-case, we may have to handle very large posts that could trigger a high number of comments that aren't limited in size either, so we prefer to spread posts, comments and likes in different documents to avoid hitting the 2 MB size limit that Cosmos DB currently enforces on documents.
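Putting those notes together, here are hypothetical document shapes for the posts container; a type field tells the different kinds of documents apart since they share the same container and the same postId partition key. Field names are assumptions for illustration:

```javascript
// A post and one of its comments, co-located under the same partition key
const post = {
  id: "p1",
  postId: "p1",           // partition key of the posts container
  type: "post",
  userId: "u1",
  userUsername: "alice",  // denormalized from the users container
  content: "...",
  commentCount: 1,        // denormalized counts, updated transactionally
  likeCount: 0,
};
const comment = {
  id: "c1",
  postId: "p1",           // same value, so same logical partition as the post
  type: "comment",
  userId: "u2",
  userUsername: "bob",
  content: "...",
};
```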

Now time to check how well our model performs for each of the requests we have to support!

[C1] 📝 Create/edit a user's profile

This one's pretty straightforward as we just write to the users container. The requests will nicely spread across all partitions by virtue of the id partition key. Performance: ✅

[Q1] 🔍 Retrieve a user's profile

Just as easy, the query SELECT * FROM u WHERE u.id = '<user-id>' issued against the users container retrieves the user document, and will do so very efficiently as it directly filters on the partition key. Performance: ✅

[C2] 📝 Create/edit a post

Similarly to [C1], we just write to the posts container. Performance: ✅

[Q2] 🔍 Retrieve a post

Similarly to [Q1], the query SELECT * FROM p WHERE p.postId = '<post-id>' AND p.type = 'post' issued against the posts container efficiently retrieves the post document. Performance: ✅

[Q3] 🔍 List a user's posts in short form (truncated content)

Our posts being in the posts container, the query we would issue here is SELECT * FROM p WHERE p.userId = '<user-id>' AND p.type = 'post' ORDER BY p.creationDate DESC, and we would truncate the content in our application from the returned results. But note that this query does not filter on the partition key of the posts container, which will result in a costly partition scan. This can be totally acceptable for an operation that happens rarely, but fetching the list of a particular user's posts may be a frequent request. The performance and cost of this request will degrade as the number of posts increases, so there's room for improvement here. Performance: ❌

[C3] 📝 Create a comment

Here you would think that we just have to write the corresponding document to the posts container, but remember that posts contain a comment count that has to be updated at the same time. A post and its related comments not only sit in the same container, but also in the same partition, which makes it possible to (1) insert the new comment and (2) increment the post's comment count atomically with a stored procedure that takes the postId and the body of the new comment as parameters:

function createComment(postId, comment) {
  var collection = getContext().getCollection();

  // 1. Read the post document the comment belongs to
  collection.readDocument(
    `${collection.getAltLink()}/docs/${postId}`,
    function (err, post) {
      if (err) throw err;

      // 2. Increment its comment count and replace it
      post.commentCount++;
      collection.replaceDocument(
        post._self,
        post,
        function (err) {
          if (err) throw err;

          // 3. Create the comment document in the same partition
          comment.postId = postId;
          collection.createDocument(
            collection.getSelfLink(),
            comment,
            function (err) {
              if (err) throw err;
            }
          );
        }
      );
    });
}

Performance: ✅
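To see what this atomicity buys us, here is an in-memory sketch of the guarantee the stored procedure gives: within one logical partition, the comment insert and the commentCount increment succeed or fail together. The array and the createCommentTxn name are illustrative stand-ins, not Cosmos DB APIs:

```javascript
// Simulate the transactional comment creation against one partition's docs
function createCommentTxn(docs, postId, comment) {
  const post = docs.find(d => d.id === postId && d.type === "post");
  if (!post) throw new Error("post not found"); // aborts before any write
  post.commentCount++;
  docs.push({ ...comment, postId, type: "comment" });
}

const docs = [{ id: "p1", postId: "p1", type: "post", commentCount: 0 }];
createCommentTxn(docs, "p1", { id: "c1", userId: "u2", content: "Nice!" });
// the count and the comment document stay consistent with each other
```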

[Q4] 🔍 List a post's comments

The query is SELECT * FROM p WHERE p.postId = '<post-id>' AND p.type = 'comment' ORDER BY p.creationDate DESC, issued against the posts container. No performance issue here as the partition key is used in the WHERE clause. Performance: ✅

[C4] 📝 Like a post

Just like [C3], we insert the like and increment the post's like count with a stored procedure:

function createLike(postId, like) {
  var collection = getContext().getCollection();

  // 1. Read the post document the like belongs to
  collection.readDocument(
    `${collection.getAltLink()}/docs/${postId}`,
    function (err, post) {
      if (err) throw err;

      // 2. Increment its like count and replace it
      post.likeCount++;
      collection.replaceDocument(
        post._self,
        post,
        function (err) {
          if (err) throw err;

          // 3. Create the like document in the same partition
          like.postId = postId;
          collection.createDocument(
            collection.getSelfLink(),
            like,
            function (err) {
              if (err) throw err;
            }
          );
        }
      );
    });
}

Performance: ✅

[Q5] 🔍 List a post's likes

Just like [Q4], the query is SELECT * FROM p WHERE p.postId = '<post-id>' AND p.type = 'like' ORDER BY p.creationDate DESC, issued against the posts container. Performance: ✅

[Q6] 🔍 List the x most recent posts created in short form (feed)

The query here would be SELECT TOP <x> * FROM p WHERE p.type = 'post' ORDER BY p.creationDate DESC, issued against the posts container. Like [Q3], this query does not filter on the partition key and will result in a partition scan. But that's even more problematic than [Q3], as this one returns our "feed", displayed on the home page of our blogging platform! So this operation definitely needs some improvement. Performance: ❌

Keeping usernames in sync

We've detailed all the requests that we have to cover, but let's not forget that our data model denormalizes users' usernames in posts, comments and likes, so we have to make sure that any username update is correctly reflected. We achieve that by leveraging the change feed of the users container to trigger an operation each time a document is updated there. I've covered this technique in a previous article so I'll let you read the details there if necessary. In short, we will call another stored procedure on the posts container that will update the username in all the documents matching the userId:

function updateUsernames(userId, username) {
  var collection = getContext().getCollection();

  // Find every document authored by this user in the current partition
  collection.queryDocuments(
    collection.getSelfLink(),
    `SELECT * FROM p WHERE p.userId = '${userId}'`,
    function (err, results) {
      if (err) throw err;

      for (var i = 0; i < results.length; i++) {
        var doc = results[i];
        doc.userUsername = username;

        collection.upsertDocument(
          collection.getSelfLink(),
          doc,
          function (err) {
            if (err) throw err;
          });
      }
    });
}

As the posts container is partitioned, stored procedure calls have to target a specific partition which means that we have to pass a partition key value when calling them. So we first have to retrieve the list of all postIds with the following query: SELECT p.postId FROM p WHERE p.type = 'post', then call the stored procedure for each returned value. This is a costly operation as the query retrieving postIds has to scan all partitions of the posts container, but we assume that most users choose a suitable username during sign-up and won't ever change it.
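The fan-out just described can be sketched locally: one cross-partition query collects the postIds, then the update runs once per partition key value. Arrays stand in for the container here, and fanOutUsernameUpdate is an illustrative name, not the article's stored procedure:

```javascript
// Simulate: cross-partition postId query, then per-partition username updates
function fanOutUsernameUpdate(docs, userId, username) {
  // Step 1: SELECT p.postId FROM p WHERE p.type = 'post' (cross-partition)
  const postIds = [...new Set(
    docs.filter(d => d.type === "post").map(d => d.postId)
  )];
  // Step 2: one partition-scoped "stored procedure call" per postId
  for (const postId of postIds) {
    docs
      .filter(d => d.postId === postId && d.userId === userId)
      .forEach(d => { d.userUsername = username; });
  }
  return postIds.length; // number of partition-scoped calls issued
}

const docs = [
  { id: "p1", postId: "p1", type: "post", userId: "u1", userUsername: "alice" },
  { id: "c1", postId: "p1", type: "comment", userId: "u2", userUsername: "bob" },
  { id: "p2", postId: "p2", type: "post", userId: "u1", userUsername: "alice" },
];
const calls = fanOutUsernameUpdate(docs, "u1", "alicia");
// only u1's documents are touched, across both partitions
```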


Round two: making sure all requests are scalable

Our first version is already pretty good! By storing users in the users container and posts, comments and likes in the posts container, we've made sure that there won't be any "hot" partition as data and load should be evenly spread. Also, most of the queries - except [Q3] and [Q6] - will yield good performance as they filter on the partition key of the container they target. So let's review those 2 remaining queries and figure out how we can modify our design to better accommodate them.

[Q3] 🔍 List a user's posts in short form (truncated content)

The reason why this request is sub-optimal is that it retrieves all posts for a particular user from a container that is not partitioned by userId. And so, if I may state the obvious, this request should rather hit a container that is partitioned by userId.

It turns out that our design already has such a container: users. So we can apply another denormalization approach, using the change feed of the posts container this time, to write a truncated copy of each user's posts into the users container. As we truncate the content (probably to something like 50 characters), it is safe to assume that the 10 GB of logical partition space will be enough to store each user's profile and all her truncated posts.

We now have 2 kinds of entities in the users container (users and truncated posts), so we add a userId field to its documents and make this field our new partition key. We also add a type field to distinguish users from posts.

The query itself doesn't change (SELECT * FROM p WHERE p.userId = '<user-id>' AND p.type = 'post' ORDER BY p.creationDate DESC), but we'll now issue it against the users container. Performance: ✅
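The change feed handler's projection could look like the sketch below: truncate the post and re-home it under the author's partition. The 50-character limit matches the assumption above, but the toShortForm name and exact field set are ours, not the article's code:

```javascript
// Project a full post into its truncated form for the users container
function toShortForm(post) {
  return {
    id: post.id,
    userId: post.userId,   // partition key of the users container
    type: "post",          // distinguishes posts from user profiles
    userUsername: post.userUsername,
    creationDate: post.creationDate,
    content: post.content.length > 50
      ? post.content.slice(0, 50) + "..."
      : post.content,
  };
}

const short = toShortForm({
  id: "p1",
  userId: "u1",
  userUsername: "alice",
  creationDate: "2019-01-01",
  content: "x".repeat(200),
});
// short.content now holds the first 50 characters plus an ellipsis
```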

[Q6] 🔍 List the x most recent posts created in short form (feed)

Our last optimization involves the "feed" query that fetches recently created posts. The performance issue is exactly the same as for [Q3]: we query the posts container without filtering on its partition key. But we can't fix it the way we fixed [Q3], as this query doesn't filter on userId either. Instead, we'll take advantage of the facts that (1) the feed contains truncated posts and (2) it only has to keep a fixed amount of the x most recent posts, which means that all the feed data can fit in a single logical partition.

So we introduce a third container named feed that is totally dedicated to serving feed queries (which will be among our most frequent requests). We force all documents to end up in the same partition by partitioning this container by type and giving all the documents the same type value of feed.

To insert documents in this new container, we just extend the change feed processing introduced above (the one that denormalizes posts into the users container) to also write new posts to the feed container.

We also have to make sure that we keep a fixed maximum number of documents within that container. To achieve that, we use a post-trigger, which is a JavaScript function that Cosmos DB executes after each document operation. We configure this trigger to fire only when new documents are created (other operations don't grow the container, so there's nothing to truncate) and give it the following body:

function truncateFeed() {
  const maxDocs = 500;
  var context = getContext();
  var collection = context.getCollection();

  collection.queryDocuments(
    collection.getSelfLink(),
    "SELECT VALUE COUNT(1) FROM f",
    function (err, results) {
      if (err) throw err;
      processCountResults(results);
    });

  function processCountResults(results) {
    // + 1 because the query didn't count the newly inserted doc
    if ((results[0] + 1) > maxDocs) {
      var docsToRemove = results[0] + 1 - maxDocs;
      collection.queryDocuments(
        collection.getSelfLink(),
        `SELECT TOP ${docsToRemove} * FROM f ORDER BY f.creationDate`,
        function (err, results) {
          if (err) throw err;
          processDocsToRemove(results, 0);
        });
    }
  }

  function processDocsToRemove(results, index) {
    var doc = results[index];
    if (doc) {
      collection.deleteDocument(
        doc._self,
        function (err) {
          if (err) throw err;
          processDocsToRemove(results, index + 1);
        });
    }
  }
}

The feed is now fetched from the feed container with the query SELECT * FROM f WHERE f.type = 'feed' ORDER BY f.creationDate DESC. Performance: ✅
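The invariant the post-trigger maintains can be stated compactly: after each insert, the feed keeps at most maxDocs documents, evicting the oldest first (the same ORDER BY f.creationDate logic as in the trigger above). A local sketch, with capFeed as an illustrative name:

```javascript
// Keep only the maxDocs most recent documents, oldest evicted first
function capFeed(feed, maxDocs) {
  const sorted = [...feed].sort((a, b) => a.creationDate - b.creationDate);
  return sorted.slice(Math.max(0, sorted.length - maxDocs));
}

const feed = [1, 2, 3, 4, 5].map(n => ({ id: `p${n}`, creationDate: n }));
const capped = capFeed(feed, 3);
// only the 3 most recent documents survive
```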


Our final design

Here is the final set of containers and their corresponding documents:

And here is a diagram showing all the operations that are triggered in reaction to change feed updates:


Closing remarks

Denormalization can be added incrementally

The scalability improvements we've explored in this article involve denormalization and duplication of data from one container to others. It should be noted that these optimizations don't have to be put in place on day one. Queries that filter on partition keys and don't scan partitions perform better at scale, but cross-partition queries do work and can be totally acceptable if they are called rarely or against a limited dataset. If you're just building a prototype, or launching a product with a small and controlled user base, you can definitely save those improvements for later; what's important then is to monitor your model's performance so you can decide when it's time to bring them in.

Adding denormalized data to your model can be done anytime and is a pretty painless process. The change feed that we use to distribute updates to other containers stores all those updates persistently. This makes it possible to request all updates since the creation of a container and bootstrap your denormalized views even if your system already holds a lot of data.

Balancing read vs. write performance

You have probably noticed that we have concentrated our efforts towards improving the performance of read 🔍 requests (queries) at the expense of write 📝 requests (commands). In many cases, writes trigger subsequent denormalization through change feeds, which makes them more computationally expensive and longer to materialize.

This is justified by the fact that a blogging platform (like most social apps) is read-heavy, which means that the amount of read requests is usually orders of magnitude higher than the amount of write requests. So it makes sense to make write requests more expensive to execute in order to let read requests be cheaper and better performing.

Designing a data model for a write-heavy app (like an IoT platform that ingests high volumes of device telemetry but only has to extract analytics once in a while) would obviously call for a different approach.

New design techniques for a new kind of database

Now if you're coming from a relational background, you might think that the process we've described doesn't feel straightforward. Maybe it looks like a lot of hoops to jump through in order to get a proper data model, and certainly more effort than required when working with a relational database. I would object!

Imagine for a moment that you know nothing about relational modeling. You've no idea what a table, a relationship or a foreign key are... What would be the effort required to achieve a decent relational model starting from there? My point is that it's a different kind of exercise, but it involves roughly the same kind of effort. It may feel harder if you're not familiar with the specifics of Cosmos DB, but just as you learned how to build a relational model, it's something that you can learn and get better at!

The payoff of this exercise is a solution that is incredibly scalable. Because it correctly leverages the partitioning capabilities of Cosmos DB, the data model we've built throughout this article can scale from hundreds to millions of users while delivering a performance level that would be very challenging to achieve with a traditional data store.
