Re: Proposal: Global Index for PostgreSQL

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Langote <amitlangote09(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nikita Malakhov <hukutoc(at)gmail(dot)com>, wenhui qiu <qiuwenhuifx(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: Global Index for PostgreSQL
Date: 2025-07-03 04:23:55
Message-ID: CAFiTN-utcXBksH0H2suGN3R2oAaGbznRyO0CH8msP+0qr1ZsOg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 2, 2025 at 7:18 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:

> Just to clarify -- I was hoping that, at least for SELECTs, we
> wouldn’t need to lock all leaf partitions up front.
>
> One of the potential selling points of global indexes (compared to
> partitioned indexes) is that we can avoid expand_partitioned_rtentry()
> and the full scan path setup per partition, though that's admittedly
> quite an undertaking. So I was imagining we could just lock the
> parent and the global index during planning, and only lock individual
> heap relations at execution time -- once we know which partition the
> returned tuple belongs to.
>
> Locking isn’t cheap -- and in workloads with thousands of partitions,
> it becomes a major source of overhead, especially when the query
> doesn't contain pruning quals and ends up touching only a few
> partitions in practice. So I think it’s worth seeing if we can avoid
> planning-time locking of all partitions in at least the SELECT case,
> even if INSERTs may require broader locking due to uniqueness checks,
> but see below...
>
> > > I understand that this is currently necessary, given that a global
> > > index scan is a single node without per-partition awareness. But it
> > > might be worth considering whether the scan could opportunistically
> > > defer heap relation locking until it returns a tuple that actually
> > > belongs to a particular partition -- similar to how inserts into
> > > partitioned tables only lock the target partition at execution time.
> > > Or did I miss that inserts also need to lock all partitions up front
> > > when global indexes are present, due to cross-partition uniqueness
> > > checks?
> > >
> > > Let me know if I’ve misunderstood the design.
> >
> > So the difference is in the cases where we are operating directly
> > on a leaf table. For example, if you insert directly into a leaf
> > relation, currently we just need to lock that partition, but if
> > there is a global index we need to lock the other siblings as well
> > (in short, all the leaves under the parent that has the global
> > index), because if the global index is unique we might need to
> > check for unique conflicts in the other leaves too. I believe that
> > when the table is partitioned, operating directly on a leaf may
> > not be the preferred way to work, and with a global index only
> > this case will be impacted, where we are doing DML directly on the
> > leaf. I am not sure how much we can delay the locking in this
> > case, because e.g. for insert we will only identify which
> > partition has a duplicate key while inserting into the btree.
>
> Hmm, it’s my understanding that with a true global index, meaning a
> single btree structure spanning all partitions, uniqueness conflicts
> are detected by probing the index after inserting the tuple into the
> heap. So unless we find a matching key in the index, there is no need
> to consult any other partitions.
>
> Even if a match is found, we only need to access the heap page for
> that specific TID to check visibility, and that would involve just one
> partition.
>
> Why then do we need to lock all leaf partitions in advance? It seems
> like we could defer locking until the uniqueness check identifies a
> partition that actually contains a conflicting tuple, and only then
> lock that one heap.
>
> I understand that in some earlier floated ideas for enforcing global
> uniqueness (perhaps only briefly mentioned in past discussions), the
> approach was to scan all per-partition indexes, which would have
> required locking everything up front. But with a unified global index,
> that overhead seems avoidable.
>
> Is there something subtle that I am missing that still requires
> locking all heaps in advance?

Thanks Amit for raising this question. I understand your point, and I
think this is mainly about whether to lock all the tables upfront or
to delay locking until we actually need it. In the very first version
of the patch, at least for unique checking, I was locking the table on
the fly whenever we found a duplicate tuple, and I had a discussion
with Robert about exactly this.

Here's what I remember our thinking was at the time, though Robert
might recall it differently. We concluded that the code was designed
so the planner would decide upfront which tables were needed for the
plan, lock them, and keep them in the range table. If it was a
prepared plan, AcquireExecutorLocks() would handle locking all of them
before execution began. So I changed the logic from locking on the
fly to locking everything upfront.

I'm now inclined to agree with your point. When dealing with many
partitions, especially when it's uncertain whether all of them will
even be needed, the traditional approach of locking every potential
partition upfront isn't the most efficient strategy. But one problem
is that we only discover which partitions to lock (whether during a
scan, or while checking uniqueness when inserting into the index) once
we are already in the btree code, so as of now I am not convinced it
is strategically correct to lock and build the relation descriptor
right then and there. I don't have a concrete reason why it would be
bad, but one concern is that this could be the first code to do it,
and another is that if we have to lock millions of partitions (in case
the scan finds matching tuples from many partitions), then locking all
of them on the fly from index AM code might not be the best idea.
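The deferred variant being discussed can be sketched as another toy model (again plain Python, illustrative only; the function and names are hypothetical, not PostgreSQL APIs). A global index scan takes each partition's lock the first time it returns a tuple from that partition, instead of locking every leaf up front:

```python
# Toy sketch of deferred locking during a global index scan: the lock
# callback is invoked lazily, once per distinct partition actually seen.

def scan_with_deferred_locks(index_entries, lock):
    """index_entries: iterable of (part_no, tid); lock(part_no) acquires a lock."""
    locked = set()
    for part_no, tid in index_entries:
        if part_no not in locked:   # first tuple from this partition
            lock(part_no)
            locked.add(part_no)
        yield part_no, tid

acquired = []
entries = [(3, 1), (3, 2), (7, 1), (3, 3)]
rows = list(scan_with_deferred_locks(entries, acquired.append))
# acquired == [3, 7]: each partition locked once, on first use.
```

The concern above maps directly onto this sketch: if the index entries span millions of distinct partitions, the lazy path degenerates into locking everything anyway, just from deeper inside the index AM code.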

What are your thoughts on this? I would appreciate it if Robert could
also share his thoughts.

--
Regards,
Dilip Kumar
Google
