Re: Improving Performance of Query ~ Filter by A, Sort by B

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Lincoln Swaine-Moore <lswainemoore(at)gmail(dot)com>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Improving Performance of Query ~ Filter by A, Sort by B
Date: 2018-07-15 03:25:21
Message-ID: CAMkU=1zfX5A8yPZ0BVansbF_AKUKtXSS+M9MXTESod59mwog2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, Jul 10, 2018 at 11:07 AM, Lincoln Swaine-Moore <
lswainemoore(at)gmail(dot)com> wrote:

>
>
>
> Something about the estimated row counts (this problem persisted after I
> tried ANALYZEing)
>

What is your default_statistics_target? What can you tell us about the
distribution of parent_id? (exponential, power law, etc?). Can you show
the results for select * from pg_stats where tablename='a' and
attname='parent_id' \x\g\x ?

> forces usage of the tmstmp index and Merge Append (which seems wise) but
> also a filter condition on parent_id over an index condition, which is
> apparently prohibitively slow.
>
> I tried creating a multicolumn index like:
>
> CREATE INDEX "a_partition1_parent_and_tmstmp_idx" on "a_partition2" USING
> btree ("parent_id", "tmstmp" DESC);
>
> But this didn't help (it wasn't used).
>

You could try reversing the order and adding a column to be (tmstmp,
parent_id, id) and keeping the table well vacuumed. This would allow the
slow plan to still walk the indexes in tmstmp order but do it as an
index-only scan, so it could omit the extra trip to the table. That trip to
the table must be awfully slow to explain the numbers you show later in the
thread.

...

> This query plan (which is the same as when LIMIT is removed) has been a
> good short term solution when the number of "parent_id"s I'm using is still
> relatively small, but unfortunately queries grow untenably slow as the
> number of "parent_id"s involved increases:
>

What happens when you remove that extra order by phrase that you added?
The original slow plan should become much faster when the number of
parent_ids is large (it has to dig through fewer index entries before
accumulating 20 good ones), so you should try going back to that.

...

> I'd be very grateful for help with one or both of these questions:
> 1) Why is adding an unnecessary (from the perspective of result
> correctness) ORDER BY valuable for forcing the parent_id index usage, and
> does that indicate that there is something wrong with my
> table/indexes/statistics/etc.?
>

It finds the indexes on tmstmp to be falsely attractive, as it can walk in
tmstmp order and so avoid the sort. (Really not the sort itself, but the
fact that sort has to first read every row to be sorted, while walking an
index can abort once the LIMIT is satisfied). Adding an extra phrase to
the ORDER BY means the index is no longer capable of delivering rows in the
needed order, so it no longer looks falsely attractive. The same thing
could be obtained by doing a dummy operation, such as ORDER BY tmstmp + '0
seconds' DESC. I prefer that method, as it is more obviously a tuning
trick. Adding in "id" looks more like a legitimate desire to break any
ties that might occasionally occur in tmstmp.

As Tom pointed out, there clearly is something wrong with your statistics,
although we don't know what is causing it to go wrong. Fixing the
statistics isn't guaranteed to fix the problem, but it would be a good
start.

> 2) Is there any way I can improve my query time when there are many
> "parent_id"s involved? I seem to only be able to get the query plan to use
> at most one of the parent_id index and the tmstmp index at a time. Perhaps
> the correct multicolumn index would help?
>

A few things mentioned above might help.

But if they don't, is there any chance you could redesign your partitioning
so that all parent_id queries together will always be in the same
partition? And if not, could you just get rid of the partitioning
altogether? 1e7 row is not all that many and doesn't generally need
partitioning. Unless it is serving a specific purpose, it is probably
costing you more than you are getting.

Finally, could you rewrite it as a join to a VALUES list, rather than as an
in-list?

Cheers,

Jeff

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Lincoln Swaine-Moore 2018-07-16 21:29:00 Re: Improving Performance of Query ~ Filter by A, Sort by B
Previous Message Julien Rouhaud 2018-07-13 07:55:14 Re: performance statistics monitoring without spamming logs