Re:

From: Yevhenii Kurtov <yevhenii(dot)kurtov(at)gmail(dot)com>
To: Brad DeJong <Brad(dot)Dejong(at)infor(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re:
Date: 2017-06-29 05:17:44
Message-ID: CAJhrTGyckvhYsN3Y3jzZWX-PtW1R1iewP=q9nToke4taantqjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello folks,

Thank you very much for analysis and suggested - there is a lot to learn
here. I just tried UNION queries and got following error:

ERROR: FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT

I made a table dump for anyone who wants to give it a spin
https://app.box.com/s/464b12glmlk5o4gvzz7krc4c8s2fxlwr
and here is the gist for the original commands
https://gist.github.com/lessless/33215d0c147645db721e74e07498ac53

On Wed, Jun 28, 2017 at 8:10 PM, Brad DeJong <Brad(dot)Dejong(at)infor(dot)com> wrote:

>
>
> On 2017-06-28, Pavel Stehule wrote ...
> > On 2017-06-28, Yevhenii Kurtov wrote ...
> >> On 2017-06-28, Pavel Stehule wrote ...
> >>> On 2017-06-28, Yevhenii Kurtov wrote ...
> >>>> We have a query that is run almost each second and it's very
> important to squeeze every other ms out of it. The query is:
> >>>> ...
> >>>> I added following index: CREATE INDEX ON campaign_jobs(id, status,
> failed_at, started_at, priority DESC, times_failed);
> >>>> ...
> >>> There are few issues
> >>> a) parametrized LIMIT
> >>> b) complex predicate with lot of OR
> >>> c) slow external sort
> >>>
> >>> b) signalize maybe some strange in design .. try to replace "OR" by
> "UNION" query
> >>> c) if you can and you have good enough memory .. try to increase
> work_mem .. maybe 20MB
> >>>
> >>> if you change query to union queries, then you can use conditional
> indexes
> >>>
> >>> create index(id) where status = 0;
> >>> create index(failed_at) where status = 2;
> >>> create index(started_at) where status = 1;
> >>
> >> Can you please give a tip how to rewrite the query with UNION clause?
> >
> > SELECT c0."id" FROM "campaign_jobs" AS c0
> > WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
> > UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> > WHERE ((c0."status" = $3) AND (c0."failed_at" > $4))
> > UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> > WHERE ((c0."status" = $5) AND (c0."started_at" < $6))
> > ORDER BY c0."priority" DESC, c0."times_failed"
> > LIMIT $7
> > FOR UPDATE SKIP LOCKED
>
>
> Normally (at least for developers I've worked with), that kind of query
> structure is used when the "status" values don't overlap and don't change
> from query to query. Judging from Pavel's suggested conditional indexes
> (i.e. "where status = <constant>"), he also thinks that is likely.
>
> Give the optimizer that information so that it can use it. Assuming $1 = 0
> and $3 = 2 and $5 = 1, substitute literals. Substitute literal for $7 in
> limit. Push order by and limit to each branch of the union all (or does
> Postgres figure that out automatically?) Replace union with union all (not
> sure about Postgres, but allows other dbms to avoid sorting and merging
> result sets to eliminate duplicates). (Use of UNION ALL assumes that "id"
> is unique across rows as implied by only "id" being selected with FOR
> UPDATE. If multiple rows can have the same "id", then use UNION to
> eliminate the duplicates.)
>
> SELECT "id" FROM "campaign_jobs" WHERE "status" = 0 AND NOT "id" = ANY($1)
> UNION ALL
> SELECT "id" FROM "campaign_jobs" WHERE "status" = 2 AND "failed_at" > $2
> UNION ALL
> SELECT "id" FROM "campaign_jobs" WHERE "status" = 1 AND "started_at" < $3
> ORDER BY "priority" DESC, "times_failed"
> LIMIT 100
> FOR UPDATE SKIP LOCKED
>
>
> Another thing that you could try is to push the ORDER BY and LIMIT to the
> branches of the UNION (or does Postgres figure that out automatically?) and
> use slightly different indexes. This may not make sense for all the
> branches but one nice thing about UNION is that each branch can be tweaked
> independently. Also, there are probably unmentioned functional dependencies
> that you can use to reduce the index size and/or improve your match rate.
> Example - if status = 1 means that the campaign_job has started but not
> failed or completed, then you may know that started_at is set, but
> failed_at and ended_at are null. The < comparison in and of itself implies
> that only rows where "started_at" is not null will match the condition.
>
> SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE (((c0."status" = 0) AND
> NOT (c0."id" = ANY($1)))) ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT 100
> UNION ALL
> SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 2) AND
> (c0."failed_at" > $2)) ORDER BY c0."priority" DESC, c0."times_failed" LIMIT
> 100
> UNION ALL
> SELECT c0."id" FROM "campaign_jobs" AS c0 WHERE ((c0."status" = 1) AND
> (c0."started_at" < $3)) ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT 100
> ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT 100
> FOR UPDATE SKIP LOCKED
>
> Including the "priority", "times_failed" and "id" columns in the indexes
> along with "failed_at"/"started_at" allows the optimizer to do index only
> scans. (May still have to do random I/O to the data page to determine tuple
> version visibility but I don't think that can be eliminated.)
>
> create index ... ("priority" desc, "times_failed", "id")
> where "status" = 0;
> create index ... ("priority" desc, "times_failed", "id", "failed_at")
> where "status" = 2 and "failed_at" is not null;
> create index ... ("priority" desc, "times_failed", "id", "started_at")
> where "status" = 1 and "started_at" is not null; -- and ended_at is null
> and ...
>
>
> I'm assuming that the optimizer knows that "where status = 1 and
> started_at < $3" implies "and started_at is not null" and will consider the
> conditional index. If not, then the "and started_at is not null" needs to
> be explicit.
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

In response to

  • Re: at 2017-06-28 13:10:13 from Brad DeJong

Responses

  • Re: at 2017-06-29 15:50:27 from Pavel Stehule

Browse pgsql-performance by date

  From Date Subject
Next Message Pavel Stehule 2017-06-29 15:50:27 Re:
Previous Message Ulf Lohbrügge 2017-06-28 14:25:15 Re: Performance of information_schema with many schemata and tables