Re: Transactions and indexes

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Chris Cleveland <ccleveland(at)dieselpoint(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Transactions and indexes
Date: 2021-07-20 02:37:46
Message-ID: CAH2-Wzk4vCda9-7gB-MUw_2N-BevRXcL9uMc6hLQg6rtfa7PJw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 19, 2021 at 7:20 PM Chris Cleveland
<ccleveland(at)dieselpoint(dot)com> wrote:
> Thank you. Does this mean I can implement the index AM and return TIDs without having to worry about transactions at all?

Yes. That's the upside of the design -- it makes it easy to add new
transactional index AMs, which is one reason why Postgres has so many.
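
To illustrate the division of labor (this is a standalone toy sketch with
made-up types, not the real API from access/amapi.h): the index AM's scan
routine just hands back candidate TIDs, and the executor/table AM applies
the MVCC visibility check afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a PostgreSQL item pointer. */
typedef unsigned TID;

/* Hypothetical index scan state: the index just walks its stored TIDs. */
typedef struct { const TID *tids; size_t ntids; size_t pos; } IndexScan;

/* The index AM returns the next matching TID -- no visibility logic here. */
static bool index_getnext(IndexScan *scan, TID *out)
{
    if (scan->pos >= scan->ntids)
        return false;
    *out = scan->tids[scan->pos++];
    return true;
}

/* Visibility is the table AM's job; here it's just a lookup, standing in
 * for a real snapshot check against the heap tuple's xmin/xmax. */
static bool tuple_is_visible(TID tid, const bool *visible)
{
    return visible[tid];
}

/* The executor combines the two: fetch candidate TIDs from the index,
 * then filter them through the table AM's visibility check. */
static size_t scan_visible(IndexScan *scan, const bool *visible,
                           TID *result, size_t cap)
{
    TID tid;
    size_t n = 0;

    while (n < cap && index_getnext(scan, &tid))
        if (tuple_is_visible(tid, visible))
            result[n++] = tid;
    return n;
}
```

The point of the sketch is that index_getnext() is completely oblivious to
transactions; nothing in the index AM ever needs to know a snapshot exists.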

> Also, as far as I can tell, the only way that TIDs are removed from the index is in ambulkdelete(). Is this accurate?

It doesn't have to be the only way, but in practice it can be. Depends
on the index AM. The core code relies on ambulkdelete() to make sure
that all TIDs dead in the table are gone from the index. This allows
VACUUM to finally physically recycle the previously referenced TIDs in
the table structure, without risk of index scans finding the wrong
thing.
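
The contract can be sketched like this (again with toy types -- the real
callback is IndexBulkDeleteCallback, and the real entry point takes an
IndexVacuumInfo): VACUUM hands the index AM a predicate that reports which
table TIDs are dead, and after the call returns, no dead TID may remain in
the index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned TID;

/* VACUUM supplies a callback that says whether a table TID is dead;
 * this mirrors the IndexBulkDeleteCallback idea, with toy types. */
typedef bool (*DeadCallback)(TID tid, const void *state);

/* Hypothetical in-memory "index": a compactable array of TIDs. */
typedef struct { TID *tids; size_t ntids; } ToyIndex;

/* ambulkdelete's job: once this returns, no dead TID remains in the
 * index, so VACUUM may safely recycle those line pointers in the table. */
static size_t toy_bulkdelete(ToyIndex *idx, DeadCallback is_dead,
                             const void *state)
{
    size_t kept = 0, removed = 0;

    for (size_t i = 0; i < idx->ntids; i++) {
        if (is_dead(idx->tids[i], state))
            removed++;
        else
            idx->tids[kept++] = idx->tids[i];
    }
    idx->ntids = kept;
    return removed;
}
```

Note the ordering: the index is scrubbed first, and only then does the
table recycle the line pointers -- that ordering is what makes stale index
TIDs impossible.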

> Does that mean that my index will be returning TIDs for deleted items and I don't have to worry about that either?

If you assume that you're using heapam (the standard table AM), then
yes. Otherwise I don't know -- it's ambiguous.

> Don't TIDs get reused? What happens when my index returns an old TID which is now pointing to a new record?

This can't happen because, as I said, the table cannot recycle
TIDs/line pointers until it is known to be safe to do so -- that is,
until VACUUM has already removed every garbage index tuple that
references them.

> This is going to make it really hard to implement Top X queries of the type you get from a search engine. A search engine will normally maintain an internal buffer (usually a priority queue) of a fixed size, X, and add tuples to it along with their relevance score. The buffer only remembers the Top X tuples with the highest score. In this way the search engine can iterate over millions of entries and retain only the best ones without having an unbounded buffer. For this to work, though, you need to know how many tuples to keep in the buffer in advance. If my index can't know, in advance, which TIDs are invisible or deleted, then it can't keep them out of the buffer, and this whole scheme fails.
>
> This is not going to work unless the system gives the index a clear picture of transactions, visibility, and deletes as they happen. Is this information available?

Are you implementing a new index AM or a new table AM? Discarding data
based on something like a relevance score doesn't seem like something
that either API provides for. Indexes in Postgres can be lossy, but
that in itself doesn't change the result of queries.
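
That said, one workaround for the top-X problem (my suggestion, not
something either API promises): keep a bounded min-heap of candidates that
is deliberately oversized relative to X, and let the visibility recheck
discard dead candidates afterwards, the way the executor would. A
standalone sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { unsigned tid; double score; } Candidate;

/* Bounded min-heap on score: keeps the cap highest-scoring candidates. */
typedef struct { Candidate *items; size_t n, cap; } TopHeap;

static void sift_down(TopHeap *h, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;

        if (l < h->n && h->items[l].score < h->items[m].score) m = l;
        if (r < h->n && h->items[r].score < h->items[m].score) m = r;
        if (m == i) return;
        Candidate tmp = h->items[i]; h->items[i] = h->items[m]; h->items[m] = tmp;
        i = m;
    }
}

static void heap_offer(TopHeap *h, Candidate c)
{
    if (h->n < h->cap) {
        /* Insert and sift up. */
        size_t i = h->n++;
        h->items[i] = c;
        while (i > 0) {
            size_t p = (i - 1) / 2;
            if (h->items[p].score <= h->items[i].score) break;
            Candidate tmp = h->items[i]; h->items[i] = h->items[p]; h->items[p] = tmp;
            i = p;
        }
    } else if (c.score > h->items[0].score) {
        /* Evict the current minimum. */
        h->items[0] = c;
        sift_down(h, 0);
    }
}

/* Collect an oversized heap of candidates (cap > x), then apply the
 * visibility recheck and keep the best x survivors. */
static size_t top_visible(const Candidate *cands, size_t n,
                          bool (*visible)(unsigned tid),
                          Candidate *buf, size_t cap,
                          Candidate *out, size_t x)
{
    TopHeap h = {buf, 0, cap};
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        heap_offer(&h, cands[i]);

    /* Pop survivors in ascending score order; keep only visible ones. */
    while (h.n > 0) {
        Candidate min = h.items[0];
        h.items[0] = h.items[--h.n];
        sift_down(&h, 0);
        if (visible(min.tid))
            out[kept++] = min;
    }

    /* If more than x survived, keep the highest-scoring x (the tail). */
    if (kept > x) {
        for (size_t i = 0; i < x; i++)
            out[i] = out[kept - x + i];
        kept = x;
    }
    return kept;
}
```

How much to oversize the buffer is a guess about how many candidates turn
out to be dead; if too few survive the recheck, you have to restart the
scan with a bigger buffer, which is exactly why the AM alone can't promise
an exact top-X.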

--
Peter Geoghegan
