Friday, March 11, 2022

How to create a partial index on a table in PostgreSQL

I was asked by one of my customers to advise on how to create an index for a new column called "sk_type", of datatype varchar, with only 3 distinct values in addition to NULL. Let's call them STRING1, STRING2, STRING3.

The table is already partitioned on column refyear(date). Here is the table DDL:
CREATE TABLE event
(
    id bigint NOT NULL DEFAULT nextval('hendelselager.hendelse_id_seq'::regclass),
    skpl character varying(8)  COLLATE pg_catalog."default",
    refyear integer NOT NULL,
    pnum bigint NOT NULL,
    ksystem character varying(128) COLLATE pg_catalog."default" NOT NULL,
    category character varying(128) COLLATE pg_catalog."default" NOT NULL,
    event character varying(128) COLLATE pg_catalog."default" NOT NULL,
    tstamp timestamp without time zone NOT NULL,
    ip_id character varying(128) COLLATE pg_catalog."default" NOT NULL,
    details jsonb,
    CONSTRAINT e_pkey PRIMARY KEY (refyear, event, id),
    CONSTRAINT uc_e01 UNIQUE (refyear, pnum, ksystem, category, event, ip_id)
) PARTITION BY RANGE (refyear);
The distribution of values was expected to be very scew right from the start:
* STRING1 - 95%
* STRING2 - 5%
* STRING3 < 0,5%
* NULL < 0,1%

In the documentation I found that perhaps a partial index would be ideal in this situation. A partial index is defined as

an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate.

According to the documentation, a partial index could be useful in cases where you want to avoid indexing common values:

Since a query searching for a common value (one that accounts for more than a few percent of all the table rows) will not use the index anyway, there is no point in keeping those rows in the index at all. This reduces the size of the index, which will speed up those queries that [actually] do use the index. It will also speed up many table update operations because the index does not need to be updated in all cases

In other words, we would only index rows which have column value different from 'STRING1'.

First, add the new column:
ALTER TABLE event
ADD sk_type character varying(8);
Next, create the partial index. Here is the syntax I used in a sandbox environment:
CREATE INDEX sk_type_idx ON event (refyear,sk_type)
 WHERE NOT (sk_type = 'STRING1');
This approach comes with a caviat, which may or may not be acceptable:

Observe that this kind of partial index requires that the common values be predetermined, so such partial indexes are best used for data distributions that do not change. The indexes can be recreated occasionally to adjust for new data distributions, but this adds maintenance effort.

The offical PostgreSQL 11 documentation can be found here

No comments:

Post a Comment