I remember being at an Oracle conference a few years ago, and I attended a session that was related to SQL Tuning.

One of the things that was discussed in this lecture was Oracle Single Table Hash Clusters, and I remember one of the things that the speaker said that shocked me.

The speaker said “I’ve run my own tests, and unique indexes are actually faster than single table hash clusters”

That blew my mind!

So is it true? Because if unique indexes are faster than hash clusters, and you have a unique data set, then why would you ever bother with all the limitations involved with hash clusters?

Unfortunately, I don’t have a copy of the speaker’s test case that they ran. But! I’m guessing that I can probably drum up something very similar.

The (Possible) Test Case That The Speaker Used

First, let’s just create a table with a bunch of random data. It’ll have 10 million rows, each row will have a random number between 0 and 1000.

create table bunch_of_random_data
  pctfree 0
  as
with generator 
as
(
    select null
    from dual
    connect by level <= 1000000
)
select round(dbms_random.value * 1000) random_number
from generator a
     cross join generator b
where rownum <= 10000000;

Next, we’ll make a table with a unique index (from the primary key).

create table unique_indexed
( 
    id constraint pk_unique_indexed primary key
)
as
select cast(rownum as number(*,0))
from dual
connect by level <= 1000;

Then we’ll make a table that houses the same data…except it uses a hash cluster rather than a primary key.

create cluster my_hash_cluster (id number(*,0))
size 32 single table hashkeys 1000 hash is id;

create table single_table_hc
(
  id
)
  cluster my_hash_cluster (id)
as
select cast(rownum as number(*,0))
from dual
connect by level <= 1000;

Next, we’ll analyze all of the tables.

exec dbms_stats.gather_table_stats(user, 'bunch_of_random_data');
exec dbms_stats.gather_table_stats(user, 'unique_indexed');
exec dbms_stats.gather_table_stats(user, 'single_table_hc');

From here, let’s run some code that causes our random data table to join (using nested loops) using the index or cluster hash key, and time each and see how it works….

declare
  v_start timestamp with time zone;
  v_end timestamp with time zone;
  v_unique_index_duration interval day to second;
  v_hash_cluster_duration interval day to second;
  CONST_QUERY_ITERATIONS constant pls_integer := 10;
  v_dummy number;
begin
  -- Get all the tables cached, 
  -- so that physical IO isn't slowing
  -- us down.
  select count(*) into v_dummy from bunch_of_random_data;
  select count(*) into v_dummy from unique_indexed;
  select count(*) into v_dummy from single_table_hc;
  
  
  -- Do multiple tests on joining using the primary key index 
  v_start := systimestamp;
  for v_index in 1 .. CONST_QUERY_ITERATIONS loop
    select /*+ leading(a) use_nl(b) */ count(*) into v_dummy
    from bunch_of_random_data a
         inner join unique_indexed b on (a.random_number = b.id);
  end loop;
  v_end := systimestamp;
  
  v_unique_index_duration := v_end - v_start;
  
  -- Now do multiple tests on joining using a hash cluster
  v_start := systimestamp;
  for v_index in 1 .. CONST_QUERY_ITERATIONS loop
    select /*+ leading(a) use_nl(b) */ count(*) into v_dummy
    from bunch_of_random_data a
         inner join unique_indexed b on (a.random_number = b.id);
  end loop;
  v_end := systimestamp;
  
  v_hash_cluster_duration := v_end - v_start;
  
  dbms_output.put_line('Unique Index Demo Duration: ' || v_unique_index_duration);
  dbms_output.put_line('Hash Cluster Demo Duration: ' || v_hash_cluster_duration);
end;
/

When I run this on my system, it yields the following results:

Unique Index Demo Duration: +00 00:00:53.883220
Hash Cluster Demo Duration: +00 00:01:03.864171

So the question is…what’s wrong with this test? It looks perfectly valid, and it looks like unique indexes won “fair and square.”

If we see how many limitations there are on a hash cluster, surely it doesn’t even make sense to use hash clusters if a unique index is an option?

What Factors Does the Test Not Consider

Firstly–index height. We’re dealing with a very small number of records in the table, so the unique index only had a height of 2 (or a blevel of 1) in my database. But if the table grows in size over time, you could end up with a height of 3 or more, and you would start to see the gap narrow as you started dealing with larger data sets.

Secondly (and this is the main thing) the two queries that are issues are not an apples-to-apples comparison.

Let’s look at the plan for the query with the unique index.

Plan hash value: 3057063937
 
-----------------------------------------------------------------
| Id  | Operation           | Name                 | Cost (%CPU)|
-----------------------------------------------------------------
|   0 | SELECT STATEMENT    |                      |  4451  (16)|
|   1 |  SORT AGGREGATE     |                      |            |
|   2 |   NESTED LOOPS      |                      |  4451  (16)|
|   3 |    TABLE ACCESS FULL| BUNCH_OF_RANDOM_DATA |  3809   (2)|
|*  4 |    INDEX UNIQUE SCAN| PK_UNIQUE_INDEXED    |     0   (0)|
-----------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("A"."RANDOM_NUMBER"="B"."ID")

Notice anything that’s missing? No??

Let’s look at the other plan for the hash cluster:

Plan hash value: 3756711336
 
-----------------------------------------------------------------
| Id  | Operation           | Name                 | Cost (%CPU)|
-----------------------------------------------------------------
|   0 | SELECT STATEMENT    |                      |    10M  (1)|
|   1 |  SORT AGGREGATE     |                      |            |
|   2 |   NESTED LOOPS      |                      |    10M  (1)|
|   3 |    TABLE ACCESS FULL| BUNCH_OF_RANDOM_DATA |  3809   (2)|
|*  4 |    TABLE ACCESS HASH| SINGLE_TABLE_HC      |     1   (0)|
-----------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("A"."RANDOM_NUMBER"="B"."ID")

How about now–can you tell me what’s missing from the first plan?

The answer is….

….

…..

ACCESS TO THE TABLE!!! (see line 4 of either plan)

That’s right. The first query only touches the index. The second query actually goes directly to the row in the table.

So if you have a query where you’re ONLY needing columns found in a unique index, then yes…it’s very likely that a unique index will be faster than a hash cluster.

But suppose that once you found the index entry into the index, you needed a column that was *NOT* on the index?

You would have to probably do TABLE ACCESS BY INDEX ROWID.

And *THIS* is going to slow down your unique query considerably.

The nice thing about hash clusters? There is no subsequent TABLE ACCESS BY ROWID.

You hash the value to the block where the row is located…and suddenly, you have full, wide-open access to the entire row!

There’s no extra overhead for visiting columns that are not part of the hash key.

A Better Test

SO! Let’s try a more apples-to-apples test…let’s re-create our SINGLE_TABLE_HC table and UNIQUE_INDEXED plans with an additional column.

drop table unique_indexed;

create table unique_indexed
( 
    id constraint pk_unique_indexed primary key
  , data
)
as
select cast(rownum as number(*,0)), 'X'
from dual
connect by level <= 1000;
drop cluster my_hash_cluster including tables;

create cluster my_hash_cluster (id number(*,0))
size 32 single table hashkeys 1000 hash is id;

create table single_table_hc
(
    id
  , data
)
  cluster my_hash_cluster (id)
as
select cast(rownum as number(*,0)), 'X'
from dual
connect by level <= 1000;

NOW let’s try doing some table access.

Instead of this:

select /*+ leading(a) use_nl(b) */ count(*)
from bunch_of_random_data a
     inner join unique_indexed b on (a.random_number = b.id);

…we’ll change count(*) to count(data)…

select /*+ leading(a) use_nl(b) */ count(data)
from bunch_of_random_data a
     inner join unique_indexed b on (a.random_number = b.id);

…and re-run our test:

set serveroutput on
declare
  v_start timestamp with time zone;
  v_end timestamp with time zone;
  v_unique_index_duration interval day to second;
  v_hash_cluster_duration interval day to second;
  CONST_QUERY_ITERATIONS constant pls_integer := 10;
  v_dummy number;
begin
  -- Get all the tables cached, 
  -- so that physical IO isn't slowing
  -- us down.
  select count(*) into v_dummy from bunch_of_random_data;
  select count(*) into v_dummy from unique_indexed;
  select count(*) into v_dummy from single_table_hc;
  
  
  -- Do multiple tests on joining using the primary key index 
  v_start := systimestamp;
  for v_index in 1 .. CONST_QUERY_ITERATIONS loop
    select /*+ leading(a) use_nl(b) */ count(data) into v_dummy
    from bunch_of_random_data a
         inner join unique_indexed b on (a.random_number = b.id);
  end loop;
  v_end := systimestamp;
  
  v_unique_index_duration := v_end - v_start;
  
  -- Now do multiple tests on joining using a hash cluster
  v_start := systimestamp;
  for v_index in 1 .. CONST_QUERY_ITERATIONS loop
    select /*+ leading(a) use_nl(b) */ count(data) into v_dummy
    from bunch_of_random_data a
         inner join single_table_hc b on (a.random_number = b.id);
  end loop;
  v_end := systimestamp;
  
  v_hash_cluster_duration := v_end - v_start;
  
  dbms_output.put_line('Unique Index Demo Duration: ' || v_unique_index_duration);
  dbms_output.put_line('Hash Cluster Demo Duration: ' || v_hash_cluster_duration);
end;
/

This is going to more closely resemble a real-life scenario where you’re doing a record lookup by a primary key index.

The result?

Unique Index Demo Duration: +00 00:01:30.725755
Hash Cluster Demo Duration: +00 00:01:03.075208


PL/SQL procedure successfully completed.

Hash clusters blow unique indexes out of the water!

Here are the new execution plans…first for the hash cluster table:

Plan hash value: 3756711336
 
-----------------------------------------------------------------
| Id  | Operation           | Name                 | Cost (%CPU)|
-----------------------------------------------------------------
|   0 | SELECT STATEMENT    |                      |  3809   (2)|
|   1 |  SORT AGGREGATE     |                      |            |
|   2 |   NESTED LOOPS      |                      |  3809   (2)|
|   3 |    TABLE ACCESS FULL| BUNCH_OF_RANDOM_DATA |  3809   (2)|
|*  4 |    TABLE ACCESS HASH| SINGLE_TABLE_HC      |            |
-----------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("A"."RANDOM_NUMBER"="B"."ID")

Still the same plan shape as before!

Now, let’s compare that to the unique index plan:

Plan hash value: 1508104960
 
---------------------------------------------------------------------------
| Id  | Operation                     | Name                 | Cost (%CPU)|
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                      |  9996K  (1)|
|   1 |  SORT AGGREGATE               |                      |            |
|   2 |   NESTED LOOPS                |                      |  9996K  (1)|
|   3 |    NESTED LOOPS               |                      |  9996K  (1)|
|   4 |     TABLE ACCESS FULL         | BUNCH_OF_RANDOM_DATA |  3809   (2)|
|*  5 |     INDEX UNIQUE SCAN         | PK_UNIQUE_INDEXED    |     0   (0)|
|   6 |    TABLE ACCESS BY INDEX ROWID| UNIQUE_INDEXED       |     1   (0)|
---------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("A"."RANDOM_NUMBER"="B"."ID")

Aha…now we have a bunch of extra steps!

So! If all you need is access to columns that are on a unique index, then yes…there’s a good chance it’s especially not worth it to fiddle with a hash cluster.

BUT!

If you want a query to be as mean and lean as possible and you need to do a subsequent TABLE ACCESS BY INDEX ROWID, (and if your data allows for it) then hash clusters are a pretty nifty option for you.

Leave a Reply

Your email address will not be published. Required fields are marked *