Cool NoSQL on Azure with DocumentDB

DocumentDB is the latest storage option available on Azure. During the last weeks I had the opportunity to play with it and give some presentations about it. This article will introduce you to this new service.

What Microsoft says about DocumentDB and what that really means

Microsoft says that DocumentDB is a “fully managed, scalable, queryable, schema-free JSON document service for modern application”. What does that mean? Let us quickly talk about the different points.

Fully managed: Because DocumentDB is provided as a service you can work on JSON data without managing VMs or the appropriate cluster infrastructure.
Scalable: It runs on Azure and can scale through something called Capacity Units.
Queryable: DocumentDB offers a query interface where you can use JavaScript as a modern T-SQL.
Schema-free: Simple … it is a document store.

What you need to know about document stores

Document stores are a part of the NoSQL movement. The most important parts are collections and documents. A database can contain multiple collections which are containers for a bunch of documents. A document itself is schema-free, which means that you can put any kind of document into one collection. As an example, you can put a document with your cook recipes together with a document containing your financial data.

In most cases collections are compared to tables in the relational world. But here comes my warning: Don’t try to use your “relational mind” (what you know about relational database systems) in the NoSQL world. If you try to compare the different terms used in either world you can miss the important points, especially how things are handled. In some cases you can decrease performance and availability of your application dramatically when trying to map relational techniques onto document stores.

What you also need to know about DocumentDB

DocumentDB is special, even in the world of document stores, when it comes to different features.

One of them is the transactional support with ACID semantics. That means that transactions are running under the cover which are based on ACID. Other systems, such as MongoDB, has a similar support, but only on the document level.

The next feature is that the API is exposed as REST over HTTP. You can uniquely address all entities in a DocumentDB by a logical URI. That also offers the capability to use DocumentDB from each client you want to.

One of the most important feature is tunable consistency. What does that mean? With tunable consistency you can tune and trade off consistency through well defined levels to suit your application scenario and your performance needs. Additionally to that you can weaken the consistency level per read or query request.

Tunable consistency

As mentioned, in DocumentDb several consistency levels can be used. Compared to other database systems that is a major advantage. The different consistency levels and their behavior on writes and reads are as follows:

Level Writes Reads
Strong Sync quorum writes Quorum reads
Bounded Async replication Quorum reads
Session Async replication Session bound replica
Eventual Async replication Any replica

Strong consistency means that all writes are visible to all readers. Writes are synchronously committed by a majority quorum of replicas and reads are acknowledged by the majority read quorum. This consistency level is mostly used in relational database systems.
Bounded staleness guarantees the order of writes. Reads adhere to the minimum freshness. Writes are propagated asynchronously.
Session consistency lets you read your own writes. Writes are propagated asynchronously while reads for a session are issued against the replica that can serve the requested version. This consistency level gives the best tradeoff for for most NoSQL application scenarios related to consistency and performance.
Eventual consistency means that reads are eventually converge with writes. Writes are propagated asynchronously while reads can be acknowledged by any replica. Readers may view older data then previously observed.

An example: Imagine a scenario where you are changing your age from 26 to 27 and your hometown location from Berlin to Magdeburg on a site. What happens if your refresh your browser?

– Strong: After you have submitted the form and you refresh your browser it can take some time until the writes are written to all replicas. As a result you (and every other person) will see 27 as your age and Magdeburg as your hometown.
– Bounded: After changing your age and hometown the writes are propagated to all replicas in the order how you posted them (in this case the age is the first one written). When hitting refresh a majority quorum of the replicas have to acknowledge the new version. In a 5 replica cluster at least 3 replicas must already have the new version to display it. As seen in the example writes are done in the way how eventual consistency would handle it and reads are handled similar to reads in a system with strong consistency.
– Session: Same story, change the data and hit the refresh button. What happens now that you, as the writer of the data, will see the new values (27 and Magdeburg). Other persons connected to the database hitting the refresh button would see 26 or 27 as the age and Magdeburg or Berlin as the hometown location, depending on which replica handles the read (does the replica already have the new data or not?).
– Eventual: Eventual consistency handles reads different as the session consistency. After hitting refresh, you as the writer could end up on a replica which has the new data (27 and Magdeburg), a replica which has only a part of the new data (26 and Magdeburg or 27 and Berlin) or a replica which still has the old data (26 and Berlin).

What does quorum mean? Quorum refers to the majority of nodes in a cluster (replication_factor / 2 + 1).

What is important for a developer?

DocumentDB is special compared to other document stores.

In DocumentDB you can use User-Defined Functions (UDFs), Triggers and Stored Procedures … written in JavaScript! As said before DocumentDB uses JavaScript as a modern T-SQL. All mentioned components are used in the same way as in the relational world, but instead of using T-SQL JavaScript is used for the definition. As mentioned before DocumentDB does support transactions. These transactions are integrated in the language. Under the cover the entire procedure is wrapped in an implicit ACID transaction and if a JavaScript exception occurs it results into aborting the transaction. We now know that DocumentDB supports several components common relational database systems support. Another similarity is that you can use a “document oriented” SQL grammar to query the database. How this one looks like will be shown later in this article. As mentioned before DocumentDB is exposed as REST over HTTP. At the time of writing there are some client SDK wrappers around that REST interface, such as for .NET, Node.js, JavaScript and Python. SDKs for C++ and Java are planned. An important point to notice for developers is the asynchronous support for all operations.

What is the resource model like?

src: http://azure.microsoft.com/en-us/documentation/articles/documentdb-interactions-with-resources/

As mentioned before, every entity in DocumentDB can be uniquely addressed by a URI.

The root entity is the database account which is associated with one or more capacity units representing provisioned document storage and throughput, a set of databases and blob storage. The database account has two child entities: media and database.

database is a logical container of document storage partitioned across collections. It is also a container for the users.

The users are the logical namespace for scoping and partitioning permissions, which provide an authorization token associated with a user for authorized access to a specific resource.

Another child entity of the database is the collection: A collection is simply a container of JSON documents and JavaScript application logic associated with it. A collection can hold the documents, the stored procedures, the triggers and the UDFs associated with it.

stored procedure is application logic written in JavaScript which is registered with a collection and is transactionally executed within the database engine.

trigger is application logic written in JavaScript modeling side effects associated with an insert, replace or delete operation.

UDF is a side effect free, application logic written in JavaScript. It enables you to extend the core DocumentDB query language.

document is just that … user defined JSON content. By default, no schema needs to be defined or secondary indices need to be provided for all the documents added to a collection.
Within a document you can also store attachments. They are special documents containing references and associated metadata to an external blob/media. We as developers can choose to have the blob be managed by DocumentDB or to have it stored with an external blob service provider, such as OneDrive, Dropbox, etc.

Two examples for addressing a resource in DocumentDB:

https://<database account>.documents.azure.com is the API base URL which gives you access to your database account.
To access a document you would use <root>/dbs/<database id>/colls/<collection id>/docs/<document id>.

Current quotas

At the time of writing DocumentDB has some limitations. The most important are:

1. The number of stored procedures, triggers and UDFs per collection are limited to 25.
2. The number of AND clauses per query is 5.
3. The number of OR clauses per query is 5.
4. The maximum request size of a document is 256Kb.
5. The maximum request size of a stored procedure, trigger and UDF is also 256Kb.

For the current limitations see – http://azure.microsoft.com/en-us/documentation/articles/documentdb-limits/

What is more important for a C# developer?

The following points are the most important features for C# developers:

– Support for gateways and direct connectivity
– Async APIs for all operations
– HTTP and TCP transports available
– POCOs, inherited document types and dynamics

… and most important …

– LINQ, LINQ, LINQ

The small print

DocumentDB is a sandboxed environment so there are some limitations.

1. No imports are allowed.
2. The eval() function is disallowed.
3. The execution of operations is time boxed.
4. Resources are governed for CPU, IO and memory.

Especially the last two points are important. If your operation is a big one and needs a lot of time (#3), a lot of CPU, IO or memory (#4) the system can abort your operation. This behavior makes sure that no other tenants in your environment are influenced.

Simple operations in .NET

Let us now have a look on some DocumentDB operations in C#. But before we start let me point out that for most operations you will need the endpoint to your database account and the authorization key, which can both be accessed through the portal. To keep things simple these are stored in a private field.

private const string Endpoint = "https://.documents.azure.com:443/";
private const string AuthKey = "";

Creating a database

Next we want to create a new database for our beer registry which contains some references to breweries in Belgium and Germany.

// usage: Database database = await CreateDatabase();

private static async Task CreateDatabase()
{
  // Create a connection to our database account
  using (var client = new DocumentClient(new Uri(Endpoint), AuthKey))
  {
    // Give our database a name
    const string DatabaseName = "BeerRegistry";

    // Check if database already exists
    Database database = client.CreateDatabaseQuery().Where(db => db.Id == DatabaseName).AsEnumerable().FirstOrDefault();

    if (database == null)
    {
      // Create the database
      database = await client.CreateDatabaseAsync(new Database { Id = DatabaseName });
    }

    return database;
  }
}

An important point to mention: When creating resources (such as database, collections, etc.) in DocumentDB you should check if they already exist. If you do not an exception will be thrown. Currently there is no way, like in other document stores, to say that a resource should be used when it already exists and when not create this resource.

Creating a collection

After we have created our database we want to create a collection in this database. Similar to the creation of the database we have to check whether the collection is already present. If not, we have to create it.

// usage: DocumentCollection collection = await CreateCollection(database);

private static async Task CreateCollection(Database database)
{
  using (var client = new DocumentClient(new Uri(Endpoint), AuthKey))
  {
    const string CollectionName = "Beers";

    // Check if collection already exists
    DocumentCollection collection =
      client.CreateDocumentCollectionQuery(database.CollectionsLink)
        .Where(c => c.Id == CollectionName)
        .AsEnumerable()
        .FirstOrDefault();

    if (collection == null)
    {
      // Create collection
      collection = await client.CreateDocumentCollectionAsync(
        database.CollectionsLink,
        new DocumentCollection { Id = CollectionName });
    }

    return collection;
  }
}

SQL queries

Before we start querying our database we have to load some data into our collection. The following two JSON snippets contain some information about the Affligem and Diebels brewery and arbatary JSON.

Affligem.json

{
  "id": "Affligem",
  "location": "Belgium",
  "beers": [
    {
      "name": "Postel Blond",
      "level": 7
    },
    {
      "name": "Affligem Tripel",
      "level": 9.5
    }
  ]
}

Diebels.json

{
  "id": "Diebels",
  "location": "Germany",
  "beers": [
    {
      "name": "Diebels Pils",
      "level": 4.9
    }
  ]
}

To load the JSON files into the collection we first load the text from the appropriate file and deserialize the object into a dynamic. This dynamic type will be used to persist the document into our collection.
An interesting point of DocumentDB is that you can use normal SQL statements to query the database. In the following code snippet we use this capability to query the database for the Affligem brewery to print out the beers brewed in this brewery.

// usage: await SqlQueries(collection.DocumentsLink);

private static async Task SqlQueries(string documentsLink)
{
  using (var client = new DocumentClient(new Uri(Endpoint), AuthKey))
  {
    dynamic affligem = JsonConvert.DeserializeObject(File.ReadAllText(@".DataAffligem.json"));
    dynamic diebels = JsonConvert.DeserializeObject(File.ReadAllText(@".DataDiebels.json"));

    // Persist the documents
    await client.CreateDocumentAsync(documentsLink, affligem);
    await client.CreateDocumentAsync(documentsLink, diebels);

    IQueryable query = client.CreateDocumentQuery(
      documentsLink,
      "SELECT * FROM Beers b WHERE b.id = 'Affligem'");
    var company = query.AsEnumerable().FirstOrDefault();

    Console.WriteLine("The Affligem brewery have the following beers:");

    if (company != null)
    {
      foreach (var beer in company.beers)
      {
        Console.WriteLine(beer.name);
      }
    }
  }
}

On important point to notice is that you have to be careful when working with dynamics instead of POCOs. When iterating over the beers of the company you have to use lower case for beers and the name property, because it is represented in this way in the document. If you don’t use the lower case you will get an exception from the JSON serializer.

LINQ queries

Instead of using SQL statements you can also use LINQ to query the database. To demonstrate that the JSON structure we used before are expressed in two POCOs: Company and Beer.

Company.cs

public class Company
{
  [JsonProperty(PropertyName = "id")]
  public string Id { get; set; }

  public string Location { get; set; }

  public Beer[] Beers { get; set; }
}

Beer.cs

public class Beer
{
  public string Name { get; set; }

  public double Level { get; set; }
}

Similar to the collection and the database we first check if the document is already in our database. If not, we will use our POCOs to create the appropriate object and save it to the collection. After storing the document is finished we use LINQ to query the beers in our database to print their name and their level of alcohol. At this point you should notice that we use something like collection links, self links and document links during the code snippets. Because DocumentDB is based on REST you will use resource representations to query documents, to create databases etc. A database object contains the link to itself (to its resource representation), called self link. It also contains a “pointer” to its URI representation of its collections, called collection link. The collection among others contains a “pointer” to itself and to the resource where documents are stored.

The document link in this example is used to store new documents. The self link of a document is used in the last part of the snippet to delete the document, where the self link points to.

// usage: await LinqQueries(collection.DocumentsLink);

private static async Task LinqQueries(string documentsLink)
{
  using (var client = new DocumentClient(new Uri(Endpoint), AuthKey))
  {
    const string Frankenheim = "Frankenheim";

    // Check if the document already exists
    Document document = client.CreateDocumentQuery(documentsLink)
      .Where(d => d.Id == Frankenheim)
      .AsEnumerable()
      .FirstOrDefault();

    if (document == null)
    {
      // Create the document
      var frankenheim = new Company
      {
        Id = Frankenheim,
        Location = "Germany",
        Beers = new[] { new Beer { Name = "Frankenheim blue", Level = 2.9 } }
      };

      document = await client.CreateDocumentAsync(documentsLink, frankenheim);
    }

    IQueryable beers = client.CreateDocumentQuery(documentsLink).SelectMany(c => c.Beers);

    foreach (Beer beer in beers.ToList())
    {
      Console.WriteLine(beer.Name + " " + beer.Level);
    }

    document = await client.DeleteDocumentAsync(document.SelfLink);

    if (document == null)
    {
      Console.WriteLine("Document deleted");
    }
  }
}

Using stored procedures

An interesting point of DocumentDB is the usage of stored procedures. Stored procedures are written as normal JavaScript. If you have worked with Microsoft Azure Mobile Services this should look familiar to you. The following stored procedure is used to swap the locations of two breweries.

SwapLocations.js

function SwapLocations(company1Id, company2Id) {
  var context = getContext();
  var collection = context.getCollection();
  var response = context.getResponse();

  collection.queryDocuments(collection.getSelfLink(), 'SELECT * FROM Beers b where b.id  = "' + company1Id + '"', {},
    function (err, documents, responseOptions) {
    var beer1 = documents[0];

    collection.queryDocuments(collection.getSelfLink(), 'SELECT * FROM Beers b where b.id = "' + company2Id + '"', {},
      function (err2, documents2, responseOptions2) {
      var beer2 = documents2[0];

      var itemSave = beer1.location;
      beer1.location = beer2.location;
      beer2.location = itemSave;

      collection.replaceDocument(beer1._self, beer1,
        function (err, docReplaced) {
        collection.replaceDocument(beer2._self, beer2, {});
      });

      response.setBody(true);
    });
  });
}

Before we can use this stored procedure you have to make it available in your database. Like always, we check if the stored procedure is already present. If not, we will define the stored procedure with its ID and the body of the stored procedure, which is our JavaScript seen above. When we create the the stored procedure we use the stored procedures link of the collection. This is the next type of link you can keep in mind. To execute the stored procedure we will use its self link and the parameters which should be passed into the stored procedure. In this case we use our breweries created before. To check if it works we will again query our Affligem brewery to see if it is now located in Germany and to query the Diebels brewery, which should now be in Belgium.

// usage: await UseSps(collection.StoredProceduresLink, collection.DocumentsLink, "Affligem", "Diebels");

private static async Task UseSps(string storedProceduresLink, string documentsLink, string company1, string company2)
{
  using (var client = new DocumentClient(new Uri(Endpoint), AuthKey))
  {
    const string StoredProcedureName = "SwapLocations";

    StoredProcedure storedProcedure = client.CreateStoredProcedureQuery(storedProceduresLink).Where(sp => sp.Id == StoredProcedureName).AsEnumerable().FirstOrDefault();

    if (storedProcedure == null)
    {
      // Register a stored procedure
      storedProcedure = new StoredProcedure
      {
        Id = StoredProcedureName,
        Body = File.ReadAllText(@".StoredProceduresSwapLocations.js")
      };
      storedProcedure = await client.CreateStoredProcedureAsync(storedProceduresLink, storedProcedure);
    }

    await client.ExecuteStoredProcedureAsync(storedProcedure.SelfLink, company1, company2);

    IQueryable query = client.CreateDocumentQuery(
      documentsLink,
      "SELECT * FROM Beers b WHERE b.id = 'Affligem'");
    var company = query.AsEnumerable().FirstOrDefault();

    Console.WriteLine("The Affligem brewery is located in:");

    if (company != null)
    {
      Console.WriteLine(company.location);
    }

    query = client.CreateDocumentQuery(
      documentsLink,
      "SELECT * FROM Beers b WHERE b.id = 'Diebels'");
    company = query.AsEnumerable().FirstOrDefault();

    Console.WriteLine("The Diebels brewery is located in:");

    if (company != null)
    {
      Console.WriteLine(company.location);
    }
  }
}

Using User Defined Functions (UDFs)

Beside stored procedures you can also use UDFs. Similar to the stored procedures UDFs are also defined in JavaScript.

Sqrt.js

function sqrt(number) {
  return Math.sqrt(number);
}

The creation of a UDFs is similar to the creation of a stored procedure (except you now use the link to the UDFs in the collection). A difference is that you can use the UDF in your SQL query after the creation was successful.

// usage: await UseUdf(collection.UserDefinedFunctionsLink, collection.SelfLink);

private static async Task UseUdf(string userDefinedFunctionsLink, string collectionLink)
{
  using (var client = new DocumentClient(new Uri(Endpoint), AuthKey))
  {
    const string UdfMaxName = "Sqrt";

    UserDefinedFunction userDefinedFunction =
      client.CreateUserDefinedFunctionQuery(userDefinedFunctionsLink)
        .Where(udf => udf.Id == UdfMaxName)
        .AsEnumerable()
        .FirstOrDefault();

    if (userDefinedFunction == null)
    {
      // Register User Defined Function "Max"
      userDefinedFunction = new UserDefinedFunction
      {
        Id = UdfMaxName,
        Body = File.ReadAllText(@".UDFsSqrt.js")
      };
      await client.CreateUserDefinedFunctionAsync(userDefinedFunctionsLink, userDefinedFunction);
    }

    var query = client.CreateDocumentQuery(collectionLink, "SELECT Sqrt(b.level) FROM b IN Beers.beers");
    var result = query.AsEnumerable().FirstOrDefault();

    Console.WriteLine("Sqrt: " + result);
  }
}

Delete a database

After all the work is done we want to delete our database. This is done via the self link of the database.

// usage: await DeleteDatabase(database);

private static async Task DeleteDatabase(Database database)
{
  using (var client = new DocumentClient(new Uri(Endpoint), AuthKey))
  {
    await client.DeleteDatabaseAsync(database.SelfLink);
  }
}

When should you use DocumentDB

After all the talk and some samples on how DocumentDB can be used.

In General

– You don’t want to do replication and scale-out by yourself.
– You want to have tunable consistency.
– You want to do rapid development.

Compared to relational databases

– You don’t need predefined columns.

Compared to other document stores

– You want to use a SQL-like grammar.

Last words

Remember that DocumentDB is still in preview …

… expect some things to change
… give feedback
… SDKs are open sourced through GitHub

The complete sample used in this article is available on GitHub (see the Resources section).

Have fun!

Jan (@Horizon_Net)

Resources

– Sample on GitHub
– Presentation on SlideShare
– Introducing DocumentDB blog post
– Channel9 video
– DocumentDB documentation

Advertisements

2 thoughts on “Cool NoSQL on Azure with DocumentDB

  1. Pingback: Dew Drop – October 6, 2014 (#1870) | Morning Dew

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s