Skip to main content

One post tagged with "Druid"

View All Tags

Counting Distinct Values Accurately in Druid SQL

· 3 min read
Aidan Mulgrew
Software Engineer

When working with Druid SQL, it's easy to fall into a common trap when counting distinct values: using COUNT(DISTINCT ...) directly can sometimes return unexpected results. Recently, I hit a case where COUNT(DISTINCT) returned a different value than selecting the DISTINCT rows manually - and this post explains why that happens, and how to fix it.

The Problem

I was running a simple query like this:

SELECT COUNT(DISTINCT "callID") 
FROM "source-table"
WHERE __time > '2025-04-01 00:00:00'
AND __time < '2025-04-02 00:00:00'

The result was 56.

However, when I rewrote it to select the distinct IDs manually:

SELECT DISTINCT "callID" 
FROM "source-table"
WHERE __time > '2025-04-01 00:00:00'
AND __time < '2025-04-02 00:00:00'

I only got 55 rows.

Why the difference?

Why This Happens

In Druid, COUNT(DISTINCT ...) by default uses approximate counting behind the scenes. It relies on a probabilistic algorithm (typically HyperLogLog) which trades perfect accuracy for speed and efficiency - a good choice for very large datasets, but occasionally inaccurate when you need precision.

Druid is optimized for analytics over massive datasets, where perfect precision is often less important than query speed. Approximate algorithms like HyperLogLog dramatically reduce memory usage and processing time compared to exact deduplication.

On the other hand, SELECT DISTINCT always materializes the actual rows and returns an exact result.

In this case, the 56 was a slight overestimate caused by the approximation.

How to Get an Exact COUNT(DISTINCT)

Unfortunately, Druid SQL does not have a simple USING EXACT option like some SQL engines.
Instead, the best workaround is to manually group the distinct values first, and then count them:

SELECT COUNT(*) 
FROM (
SELECT "callID"
FROM "source-table"
WHERE __time > '2025-04-01 00:00:00'
AND __time < '2025-04-02 00:00:00'
GROUP BY "callID"
)

Or, depending on your Druid SQL support:

SELECT COUNT(*) 
FROM (
SELECT DISTINCT "callID"
FROM "source-table"
WHERE __time > '2025-04-01 00:00:00'
AND __time < '2025-04-02 00:00:00'
)

Either version forces Druid to materialize the distinct IDs first, ensuring an exact count.

Approximate Counting When It's OK

If you intend to use approximate counts, Druid provides the APPROX_COUNT_DISTINCT(column) function explicitly. This makes your queries' behavior clearer to others (and to your future self!).

Example:

SELECT APPROX_COUNT_DISTINCT("callID") 
FROM "source_table"

In dashboards, ad-hoc analytics, or monitoring systems where a tiny margin of error is acceptable, using APPROX_COUNT_DISTINCT can dramatically speed up queries and reduce cluster load.

For critical use cases like billing, audit logs, or regulatory reporting, always use an exact counting method.

Newer Druid Versions

If you're using Druid 27.0 or later, there are additional options for controlling the behavior of approximate aggregators, such as choosing between HyperLogLog and Theta Sketch algorithms manually. Always consult the documentation of your Druid version to see what options are available if precision matters to you.

Conclusion

When you need an accurate count of distinct values in Druid SQL, remember that COUNT(DISTINCT ...) may be approximate unless you force an exact materialization step.
Using a GROUP BY inside a subquery is the safest and clearest approach.

And if you're happy to accept a little fuzziness, make it explicit with APPROX_COUNT_DISTINCT.