Re: Looking for an index that supports top-n searches by enforcing a max-n automatically

From: Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Looking for an index that supports top-n searches by enforcing a max-n automatically
Date: 2024-04-05 08:46:40
Message-ID: CAKqncchWauxByJLtrAVKL9X9syrZrVTPSu38Q7bpGYvPrLhEiQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Just about as soon as I sent the above, I realized that it's unlikely to
make sense in the real world in a row-store. If the goal is to keep the
top-25 results and trim the rest, what happens when values are
added/modified/deleted? You now *have to go look at all of the data you
aren't caching in the index. *Unless you can *guarantee* that data is
entered in perfect order, and/or never changes, I don't think what I'm
looking for is likely to make sense.

On Fri, Apr 5, 2024 at 11:27 AM Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
wrote:

> Looking for an index to support top-n searches, were n has a fixed maximum
>
> Recently, I've been looking at strategies to handle top-n queries in
> Postgres. In my current cases, we've got definition tables, and very large
> related tables. Here's a stripped-down example, the real tables are much
> wider.
>
> CREATE TABLE IF NOT EXISTS data.inventory (
> id uuid NOT NULL DEFAULT NULL PRIMARY KEY,
> inv_id uuid NOT NULL DEFAULT NULL
> );
>
>
> CREATE TABLE IF NOT EXISTS data.scan (
> id uuid NOT NULL DEFAULT NULL PRIMARY KEY,
> inv_id uuid NOT NULL DEFAULT NULL
> scan_dts_utc timestamp NOT NULL DEFAULT NOW(); -- We run out
> servers on UTC
> );
>
> Every item in inventory is scanned when it passes through various steps in
> a clean-dispatch-arrive-use-clean sort of a work cycle. The ratio between
> inventory and scan is 1:0-n, where n can be virtually any number. In
> another table pair like this, the average is 1:1,000. In the inventory
> example, it's roughly 0:200,000. The distribution of related row counts is
> all over the map. The reasons behind these ratios sometimes map to valid
> processes, and sometimes are a consequence of some low-quality data leaking
> into the system. In the case of inventory, the results make sense. In our
> case:
>
> * The goal value for n is often 1, and not more than up to 25.
>
> * Depending on the tables, the % of rows that are discarded because
> they're past the 25th most recent record is 97% or more.
>
> * Partial indexes do not work as well on huge tables as I hoped. The same
> data copied via a STATEMENT trigger into a thin, subsetted table is much
> faster. I think this has to do with the increase in random disk access
> required for a table and/or index with more pages spread around on the disk.
>
> * We can't filter in advance by *any* reasonable date range. Could get 25
> scans on one item in an hour, or a year, or five years, or never.
>
> We're finding that we need the top-n records more and more often, returned
> quickly. This gets harder to do as the table(s) grow.
>
> SELECT id, scan_dts_utc
> FROM data.scan
> WHERE inv_id = 'b7db5d06-8275-224d-a38a-ac263dc1c767' curve.
> ORDER BY scan_dts_utc DESC
> LIMIT 25; -- Full search product might be 0, 200K, or anything
> in-between. Not on a bell curve.
>
> A compound index works really well to optimize these kinds of searches:
>
> CREATE INDEX scan_inv_id_scan_time_utc_dts_idx
> ON ascendco.analytic_scan (inv_id, scan_time_utc_dts DESC);
>
> What I'm wondering is if there is some index option, likely not with a
> B-tree, that can *automatically* enforce a maximum-length list of top
> values, based on a defined sort
>
> CREATE INDEX scan_inv_id_scan_time_utc_dts_idx
> ON ascendco.analytic_scan (inv_id, scan_time_utc_dts DESC) --
> This defines the ordering
> LIMIT 25; --
> This sets the hard max for n
>
> The goal is to have an automatically maintained list of the top values
> *in* the index itself. In the right situations (like ours), this reduces
> the index size by 20x or more. Smaller index, faster results. And, since
> the index is on the source table, the row references are already there.
> (Something I lose when maintaining this by hand in a side/shadow/top table.)
>
> I've looked at a ton of plans, and Postgres *clearly* goes to a lot of
> effort to recognize and optimize top-n searches already. That's
> encouraging, as it suggests that the planner takes LIMIT into account.
> (I've picked up already that maintaining the purity of the planner and
> executor abstractions is a core value to the project.)
>
> And, for sure, I can build and maintain my own custom, ordered list in
> various ways. None of them seem like they can possibly rival the trimming
> behavior being handled by an index.
>
> I'm out over my skis here, but I'm intuiting that this might be a job for
> one of the multi-value/inverted index types/frameworks. I tried some
> experiments, but only got worse results.
>
> Hope that reads as understandable...grateful for any suggestions.
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Stefan Fercot 2024-04-05 09:45:18 Recovery of .partial WAL segments
Previous Message Alvaro Herrera 2024-04-05 08:41:01 Re: LogwrtResult contended spinlock