Since Oracle 12.1, users have had the ability to use the APPROX_COUNT_DISTINCT() function.

This function is supposed to be significantly faster than a COUNT(DISTINCT …). Does it really offer that big of a payout? Let’s see:

First, let’s make a table with 10 million rows:

create table drop_me
as
select rownum unique_col
     , mod(rownum, 20) low_card_col
     , b.*
from dba_objects a
     cross join dba_objects b
where rownum <= 10000000;

Next, let’s issue a traditional COUNT(DISTINCT …)

-- On my system runs in about 1.6 seconds
select count(distinct low_card_col)
from drop_me;

Now, let’s issue an approx_count_distinct(…)

-- On my system, runs in about 1.4 seconds
select approx_count_distinct(low_card_col)
from drop_me;

So….there wasn’t that much difference in time. A mere 0.2 seconds? And you still probably get slightly incorrect data??

Bleh, this approximate count business kinda looks like it’s not even worth using! Why not always make sure your data is 100% correct if your time difference is just 0.2 seconds, right?

The truth is, for extremely low cardinalities, it’s probably not worth using.

APPROX_COUNT_DISTINCT Algorithm

Here’s (the very simplified version of) how the APPROX_COUNT_DISTINCT algorithm works: As it encounters each value in the table, it puts the value through a very random-seeming hash, and then notes the approximate number of leading zeros the hash has.

There’s more to it than that: There are actually multiple counters, and each counter is keeping track of the number of leading zeros for a group of records. Oracle distributes the rows evenly among the buckets. At the end of the calculation Oracle throws away the outlier buckets. So you have a bunch of buckets, each bucket holding a number, which represents the maximum number of leading zeros found in any of its hashes.

Oracle can average all these numbers and find out that the approximate average number of leading zeros was <some value> and then Oracle takes 2 to the power of <some value> and uses that as the estimate for the approximate number of distinct values.

So the really cool part about this is it takes an extreeeeeemely small amount of memory to perform this algorithm–and that’s where the optimization comes in.

The biggest advantage for APPROX_COUNT_DISTINCT is that it can be performed using a very, very small amount of memory.

Which is why there wasn’t much of an effect on our query run time.

Why Wasn’t There That Much Time Difference?

If I run the first query (select count(distinct low_card_col) from drop_me) this is what I get for the query’s plan:

----------------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost (%CPU)|
----------------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |     1 |    13 | 45077   (2)|
|   1 |  SORT AGGREGATE      |          |     1 |    13 |            |
|   2 |   VIEW               | VW_DAG_0 |    20 |   260 | 45077   (2)|
|   3 |    HASH GROUP BY     |          |    20 |    60 | 45077   (2)|
|   4 |     TABLE ACCESS FULL| DROP_ME  |    10M|    28M| 44722   (1)|
----------------------------------------------------------------------

Oracle scans the DROP_ME table, and uses a HASH GROUP BY operation to get a distinct listing of our column. Then, Oracle uses a SORT AGGREGATE to count the “uniquified” list. So Oracle has to write/compare every value it finds with other values in the HASH GROUP BY hash.

Since there were only 20 unique values, Oracle wrote very few records to our HASH GROUP BY hash (we would only have to write a maximum of 20 entries to the hash, and then read 20 entries back out at the end). Plus, counting the 20 values wouldn’t take any time at all. So scanning the table took the majority of the time in this query.

But what happens if we use the same table, but do a distinct count on a different column….suppose we try a column with a much higher cardinality?

-- On my system, this runs for about 1.5 seconds
select approx_count_distinct(unique_col)
from drop_me;

Hmmm…still not much change for the APPROX_COUNT_DISTINCT function.

What about the traditional COUNT(DISTINCT …) way of doing things?

-- On my system, this runs for about 4.8 seconds
select count(distinct unique_col)
from drop_me;

WOW! Same table with the same number of rows as before, but significantly different timings.

That’s because this time we had to write 10 million unique records to our HASH GROUP BY hash, not just 20 unique records….and then we had to count 10 million unique hash entries instead of 20 unique hash entries.

So the higher the distinct cardinality, the bigger the payout as far as APPROX_COUNT_DISTINCT

When Do you Get the Biggest Payout From Using APPROX_COUNT_DISTINCT?

When would you expect to get a REALLY massive payout from using APPROX_COUNT_DISTINCT?

Well, if a COUNT(DISTINCT …) is using a HASH GROUP BY, and the table is has a large number of rows, and the column has an extremely high cardinality, then it’s possible that the hash table won’t even fit in memory.

The de-duplication hash table could very will spill over into temp space, so you would be both writing records to temp, and reading records to temp if there wasn’t enough space in the PGA.

Here, I would expect a COUNT(DISTINCT…) operation to perform EXTREMELY slowly due to all the physical IO of both reading and writing, and an APPROX_COUNT_DISTINCT operation to perform extremely quickly.

So! You’ll get the biggest APPROX_COUNT_DISTINCT benefit when:

  • Your table is large, with lots of rows
  • The column has a very high cardinality (high number of distinct values, or nearly unique)
  • A corresponding HASH GROUP BY operation would spill to disk

Free Oracle SQL Tuning Guide

Checkout my FREE guide, 7 SQL Tuning Secrets You Can Use Immediately, Even If You’ve Never Tuned a Query In Your Life!

Get it here: tuningsql.com/secrets

7 SQL Tuning Secrets You can Use Immediately, Even If You've Never Tuned A Query In Your Life

Leave a Reply

Your email address will not be published. Required fields are marked *