Re: Hash Index on Partitioned Table

From: Marc Millas <marc(dot)millas(at)mokadb(dot)com>
To: "peter(dot)borissow(at)kartographia(dot)com" <peter(dot)borissow(at)kartographia(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Hash Index on Partitioned Table
Date: 2023-06-01 10:42:40
Message-ID: CADX_1aaJ4KLJcnwZ9fr3sTRHx1n5FgXH=TMzV0ad6jCDA67SNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Peter,

in postgres 13, create index should be, by default, parallelized.
so albeit for specific values of the parallelization parameters in
postgresql.conf, your machine should use more than one core while creating
the indexes.
also you can set the maintenance_workmem parameter to the max for such a
job, as you have some RAM.

In my own experience of indexing big partitioned tables, I did create a few
scripts to create each index (there is one index for each partition) and
when finished create the 'global' index which, as the job is already done,
is fast.
(check the 'only' parameter in create index doc).
doing this it was easy to somewhat optimize the process according to number
of available core/RAM/storage.

hf

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Wed, May 31, 2023 at 7:53 PM peter(dot)borissow(at)kartographia(dot)com <
peter(dot)borissow(at)kartographia(dot)com> wrote:

> Hi Tom,
>
> Thanks so much for your quick response. As luck would have it, the
> index FINALLY finished about an hour ago. For a size comparison:
>
>
>
> BTree: 6,433 GB
>
> Hash: 8,647 GB
>
>
>
> Although I don't have a proper benchmark to compare performance, I can say
> the hash is working as good as if not faster than the BTree for my use case
> (web application).
>
>
>
> I guess I was getting a little nervous waiting for the index to complete
> and seeing such a huge difference in file size but I'm ok now :-)
>
>
>
> Thanks again,
>
> Peter
>
>
>
>
>
> -----Original Message-----
> From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Sent: Wednesday, May 31, 2023 10:07am
> To: "peter(dot)borissow(at)kartographia(dot)com" <peter(dot)borissow(at)kartographia(dot)com>
> Cc: pgsql-general(at)lists(dot)postgresql(dot)org
> Subject: Re: Hash Index on Partitioned Table
>
> "peter(dot)borissow(at)kartographia(dot)com" <peter(dot)borissow(at)kartographia(dot)com>
> writes:
> > I have a rather large database with ~250 billion records in a
> partitioned table. The database has been running and ingesting data
> continuously for about 3 years.
>
> > I had a "regular" BTree index on one of the fields (a unique bigint
> column) but it was getting too big for the disk it was on. The index was
> consuming 6.4 TB of disk space.
>
> That's ... really about par for the course. Each row requires an 8-byte
> index entry, plus 12 bytes index overhead. If I'm doing the math right
> then the index is physically about 78% full which is typical to good for
> a b-tree. Reindexing would remove most of the extra space, but only
> temporarily.
>
> > After doing some research I decided to try to create a hash index
> instead of a BTree. For my purposes, the index is only used to find
> specific numbers ("=" and "IN" queries). From what I read, the hash index
> should run a little faster than btree for my use case and should use less
> disk space.
>
> I'm skeptical. The thing to bear in mind is that btree is the mainstream
> use-case and has been refined and optimized far more than the hash index
> logic.
>
> > (1) Why is the hash index consuming more disk space than the btree
> index? Is it because the hash of the bigint values larger than the storing
> the bigints in the btree?
>
> From memory, the index entries will be the same size in this case,
> but hash might have more wasted space.
>
> > (4) Is there any way to estimate when the index process will complete?
>
> An index on a partitioned table isn't a single object, it's one index per
> partition. So you should be able to look at how many partitions have
> indexes so far. You might have to drill down to the point of counting how
> many files in the database's directory, if the individual indexes aren't
> showing up as committed catalog entries yet.
>
> regards, tom lane
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2023-06-01 13:32:17 Re: Dynamic creation of list partitions in highly concurrent write environment
Previous Message Jim Vanns 2023-06-01 10:14:17 Dynamic creation of list partitions in highly concurrent write environment