Counting Distinct Values Accurately in Druid SQL

· 3 min read
Aidan Mulgrew
Software Engineer

When working with Druid SQL, it's easy to fall into a common trap when counting distinct values: using COUNT(DISTINCT ...) directly can sometimes return unexpected results. Recently, I hit a case where COUNT(DISTINCT) returned a different value than selecting the DISTINCT rows manually - and this post explains why that happens, and how to fix it.

The Problem

I was running a simple query like this:

SELECT COUNT(DISTINCT "callID")
FROM "source-table"
WHERE __time > TIMESTAMP '2025-04-01 00:00:00'
AND __time < TIMESTAMP '2025-04-02 00:00:00'

The result was 56.

However, when I rewrote it to select the distinct IDs manually:

SELECT DISTINCT "callID"
FROM "source-table"
WHERE __time > TIMESTAMP '2025-04-01 00:00:00'
AND __time < TIMESTAMP '2025-04-02 00:00:00'

I only got 55 rows.

Why the difference?

Why This Happens

In Druid, COUNT(DISTINCT ...) by default uses approximate counting behind the scenes. It relies on a probabilistic algorithm (typically HyperLogLog), which trades perfect accuracy for speed and efficiency - a good choice for very large datasets, but a problem when you need an exact answer.

Druid is optimized for analytics over massive datasets, where perfect precision is often less important than query speed. Approximate algorithms like HyperLogLog dramatically reduce memory usage and processing time compared to exact deduplication.
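
To see this in action, here is a small sketch (assuming your cluster keeps Druid's default useApproximateCountDistinct = true setting): under that default the planner treats COUNT(DISTINCT ...) as the approximate aggregator, so both columns below should come back with the same HyperLogLog-based estimate.

SELECT
  COUNT(DISTINCT "callID")        AS default_count,   -- approximate under the default setting
  APPROX_COUNT_DISTINCT("callID") AS explicit_approx  -- always approximate
FROM "source-table"
WHERE __time > TIMESTAMP '2025-04-01 00:00:00'
AND __time < TIMESTAMP '2025-04-02 00:00:00'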

On the other hand, SELECT DISTINCT always materializes the actual rows and returns an exact result.

In this case, the 56 was a slight overestimate caused by the approximation.

How to Get an Exact COUNT(DISTINCT)

Unfortunately, Druid SQL does not have a simple USING EXACT option like some SQL engines.
Instead, the best workaround is to manually group the distinct values first, and then count them:

SELECT COUNT(*)
FROM (
  SELECT "callID"
  FROM "source-table"
  WHERE __time > TIMESTAMP '2025-04-01 00:00:00'
  AND __time < TIMESTAMP '2025-04-02 00:00:00'
  GROUP BY "callID"
)

Or, if your Druid version supports SELECT DISTINCT inside subqueries:

SELECT COUNT(*)
FROM (
  SELECT DISTINCT "callID"
  FROM "source-table"
  WHERE __time > TIMESTAMP '2025-04-01 00:00:00'
  AND __time < TIMESTAMP '2025-04-02 00:00:00'
)

Either version forces Druid to materialize the distinct IDs first, ensuring an exact count.

Approximate Counting When It's OK

If you intend to use approximate counts, Druid provides an explicit APPROX_COUNT_DISTINCT(column) function. Using it makes your queries' behavior clearer to others (and to your future self!).

Example:

SELECT APPROX_COUNT_DISTINCT("callID")
FROM "source-table"

In dashboards, ad-hoc analytics, or monitoring systems where a tiny margin of error is acceptable, using APPROX_COUNT_DISTINCT can dramatically speed up queries and reduce cluster load.

For critical use cases like billing, audit logs, or regulatory reporting, always use an exact counting method.

Newer Druid Versions

If you're using Druid 27.0 or later, there are additional options for controlling the behavior of approximate aggregators, such as choosing between HyperLogLog and Theta Sketch algorithms manually. Always consult the documentation of your Druid version to see what options are available if precision matters to you.
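
As a rough illustration (assuming the druid-datasketches extension is loaded on your cluster), you can pick the sketch type explicitly with the documented APPROX_COUNT_DISTINCT_DS_HLL and APPROX_COUNT_DISTINCT_DS_THETA functions rather than relying on the default:

SELECT
  APPROX_COUNT_DISTINCT_DS_HLL("callID")   AS hll_estimate,   -- HLL sketch
  APPROX_COUNT_DISTINCT_DS_THETA("callID") AS theta_estimate  -- Theta sketch
FROM "source-table"
WHERE __time > TIMESTAMP '2025-04-01 00:00:00'
AND __time < TIMESTAMP '2025-04-02 00:00:00'

Both functions accept optional tuning arguments that trade accuracy against sketch size; check the documentation for the version you are running.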

Conclusion

When you need an accurate count of distinct values in Druid SQL, remember that COUNT(DISTINCT ...) may be approximate unless you force an exact materialization step.
Using a GROUP BY inside a subquery is the safest and clearest approach.

And if you're happy to accept a little fuzziness, make it explicit with APPROX_COUNT_DISTINCT.

Unlock NetSuite Sales and Orders with Sheetloom

· 3 min read
Gerry Conaghan
Business Development Manager

Filling business decision models with essential data captured in NetSuite™ can be tiring, confusing, repetitive and error prone. The good news for the brave souls doing all that manual labour is Sheetloom, the game-changing SaaS solution that seamlessly automates the injection of NetSuite™ data straight into dynamic Excel decision models, pivots, dashboards and reports.

Credit Control

· One min read
Calum Miller
Director

Credit Control is a challenging and time-consuming affair which, if managed incorrectly, can lead to cash flow problems and even business failure. Keeping track of customer payment history to indicate potential problems is an important task, but it is often labour intensive, complex and error prone.

Finance - Credit Check Automation

· One min read
Calum Miller
Director

A finance company used a mix of the latest digital tech and a human touch to bring a different approach to small business lending.

A crucial part of the loan review process required the company to credit score each applicant. Testing the credit score process using traditional methods proved very time consuming: the finance company would manually select applicants, collect individual responses and then consolidate results from a credit checking service.

Information Technology must serve decision makers

· 3 min read
Calum Miller
Director

All Business Intelligence (BI) vendors have a dirty little secret. It’s hidden under a weighty digital rock, keeping all those electronic worms company. It’s shared by Looker, Pentaho, Power BI, Tableau, SAP, Quicksight, et al.

It’s obvious, but seldom noticed. Take a deep breath, here it is:

All BI clients use Excel, way more than the BI vendors care to admit.