ACID and Apache Druid®

What is ACID?

ACID refers to a set of properties that a transactional database system must guarantee.

Atomicity – Transactions complete fully or fail without any partial changes. If one bank account is debited, another account must be credited. Atomicity means that either both operations succeed or both fail; there is no scenario in which one succeeds and the other does not.
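As a hedged illustration in standard SQL (the accounts table and ids here are hypothetical), a relational database wraps both operations in one transaction so that they commit or roll back together:

    -- A minimal sketch of an atomic transfer; table and ids are illustrative.
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 'A';  -- debit
    UPDATE accounts SET balance = balance + 100 WHERE id = 'B';  -- credit
    COMMIT;  -- either both updates take effect, or a ROLLBACK/failure applies neither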

Consistency – The database should permit only those transactions that comply with its rules. For instance, if the rule is that an account balance must not be negative and a transaction tries to debit an amount greater than the available balance, then that transaction should be rejected. The database must be in a consistent state before and after every transaction.
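In a conventional relational database such a rule can be stated declaratively; a minimal sketch, again with hypothetical names:

    -- A consistency rule expressed as a CHECK constraint.
    CREATE TABLE accounts (
      id      VARCHAR(32) PRIMARY KEY,
      balance DECIMAL(12, 2) NOT NULL CHECK (balance >= 0)
    );
    -- Any debit that would leave the balance negative is rejected and rolled back.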

Isolation – One transaction should not affect another. If multiple transactions occur concurrently, the end result is as if they had occurred in sequence, one after another. If a bank account has a balance of 500 dollars and two transactions execute concurrently – one debiting 100 dollars and the other crediting 300 dollars – then the result is a balance of 700 dollars, as if the transactions had occurred one after another.
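A hedged sketch of the same scenario in standard SQL (hypothetical table and id): under a serializable isolation level, any interleaving of the two transactions behaves like some serial order.

    -- T1 and T2 execute concurrently against a balance of 500.
    -- T1:
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
    COMMIT;
    -- T2:
    BEGIN;
    UPDATE accounts SET balance = balance + 300 WHERE id = 'A';
    COMMIT;
    -- Either serial order yields 500 - 100 + 300 = 700.
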
Durability – This means that all committed transactions must be retained even if the system faces a hardware or software failure.

ACID and Druid

Druid is a database for building modern analytics applications. Druid is not meant to be a system of record and does not support transactions. Given this focus on analytics, ACID guarantees are not fully relevant to Druid. However, it is interesting to consider some of Druid's capabilities in this light.

Atomicity – There are two ingestion scenarios in which everything is committed or nothing is committed:

  1. Batch ingestion – A batch ingestion succeeds and commits data to deep storage only if every row in the batch ingests successfully. Otherwise the entire ingestion fails and no rows are committed.
  2. Real-time ingestion – Real-time ingestion proceeds from a checkpoint for a configurable period (one hour), then creates another checkpoint and commits the data up to that checkpoint. However, if any of the rows fail, the ingestion fails in its entirety and begins again from the previous checkpoint.

The above provides a sufficient atomicity guarantee to ensure that Druid does not present an incorrect picture and that users do not have to track which data went through and which did not. Taking the example of a debit and a credit between bank accounts: these records will carry the same timestamp in the transaction system, and when they are ingested into Druid, either both will be ingested or neither will be (with the caveat that both records must be part of the same batch ingestion or the same Kafka topic). This ensures that the analytics results are correct.
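As an illustrative sketch, here is a batch ingestion expressed in Druid's SQL-based (multi-stage) INSERT syntax; the table name, source URI, and column layout are assumptions, not taken from the article:

    -- All-or-nothing batch ingestion: if any row fails, no segments are published.
    INSERT INTO bank_transactions
    SELECT
      TIME_PARSE("timestamp") AS __time,
      account_id,
      amount
    FROM TABLE(
      EXTERN(
        '{"type": "http", "uris": ["https://example.com/transactions.json"]}',
        '{"type": "json"}',
        '[{"name": "timestamp", "type": "string"}, {"name": "account_id", "type": "string"}, {"name": "amount", "type": "double"}]'
      )
    )
    PARTITIONED BY DAY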

Consistency – Druid does not enforce many rules on the data: there are no primary keys and no foreign keys. If data types for columns are specified in the ingestion spec, any data that does not conform to the spec is rejected. When an entity is updated, Druid holds multiple records for that entity, since each update is an event in time. Druid's LATEST aggregation function in SQL provides a consistent picture of the latest state of the entity.
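For example, a minimal sketch (the datasource and column names are assumptions) of reading the latest state of each entity with the LATEST aggregation:

    -- Latest known balance per account. Each update is a separate timestamped
    -- event; LATEST() picks the value from the row with the most recent __time.
    SELECT
      account_id,
      LATEST(balance) AS current_balance
    FROM account_updates
    GROUP BY account_id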

Isolation – Isolation can be seen in Druid's ability to ingest millions of rows per second while handling hundreds of queries per second at the same time. Historicals handle queries on the data that is already in the system, while the indexers respond to queries using the data that is still being indexed. Indexers publish the indexed data to deep storage as segments, Historicals download the segments from deep storage, and the indexing task completes only once a Historical has downloaded the segment. Any query on data still being indexed is handled by the indexer until that handoff occurs. This ensures that there are no read-write conflicts (consider the alternative of indexers writing data directly to the Historicals, which would have made read-write conflicts very difficult to avoid). Additionally, segments (which contain the indexed data stored in a columnar format) are immutable, which further isolates reads from writes.
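This handoff is visible through Druid's system tables. A minimal sketch (the datasource name is an assumption) of checking which segments are still served by indexing tasks and which have been published and picked up by Historicals:

    -- is_realtime = 1: segment still served by an indexing task.
    -- is_published = 1 and is_available = 1: segment committed to deep storage
    -- and downloaded by a Historical.
    SELECT segment_id, is_realtime, is_published, is_available
    FROM sys.segments
    WHERE datasource = 'bank_transactions'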

Durability – The primary persistent store is deep storage, which is usually a blob store. Any data that has been committed to deep storage remains available even if the entire cluster goes down; a new cluster can be brought up using the data in deep storage. Batch ingestion succeeds only if all the data in the batch is committed to deep storage, and real-time ingestion succeeds only when all data since the previous checkpoint is committed to deep storage.

Conclusion

Druid provides sufficient ACID guarantees from an analytics perspective; full transactional ACID is not relevant to Druid. The guarantees Druid does provide enable building a modern analytics application with large data volumes, high-throughput real-time ingestion, and high query concurrency.

Links

  1. Apache Druid – druid.apache.org
