Re: [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sam Saffron <sam(dot)saffron(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Performance issue with libpq prepared queries on 9.3 and 9.4
Date: 2014-11-14 00:34:39
Message-ID: 26882.1415925279@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Sam Saffron <sam(dot)saffron(at)gmail(dot)com> writes:
> I have hit a rather odd issue with prepared queries on both pg 9.3 and 9.4 beta.
> I have this table (copy at http://samsaffron.com/testing.db.gz) with a
> very odd performance profile:

Interesting case. The issue seems to be that your statistics look like
this:

select * from pg_stats where tablename = 'topics' and attname = 'archetype';
-[ RECORD 1 ]----------+--------------------------
schemaname             | public
tablename              | topics
attname                | archetype
inherited              | f
null_frac              | 0
avg_width              | 12
n_distinct             | 2
most_common_vals       | {private_message,regular}
most_common_freqs      | {0.604957,0.395043}
histogram_bounds       |
correlation            | 0.612985
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

That is, archetype consists of 60% 'private_message', 40% 'regular', and
absolutely nothing else. So the condition archetype = 'banner' is very
selective, and a plan built knowing that that is the parameter value will
use the index:

# explain select * from topics where archetype = 'banner' limit 1;
QUERY PLAN
------------------------------------------------------------------------------
Limit (cost=0.29..651.49 rows=1 width=520)
-> Index Scan using idx11 on topics (cost=0.29..651.49 rows=1 width=520)
Index Cond: ((archetype)::text = 'banner'::text)
(3 rows)

However, that's still a pretty darn expensive indexscan, mainly because
archetype is not the leading key ... if you care about the performance
of this query, why don't you have an index to match?

# create index on topics(archetype);
CREATE INDEX
# explain select * from topics where archetype = 'banner' limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.29..6.80 rows=1 width=520)
-> Index Scan using topics_archetype_idx on topics (cost=0.29..6.80 rows=1 width=520)
Index Cond: ((archetype)::text = 'banner'::text)
(3 rows)

However, just fixing the index availability actually makes the performance
ratio even worse, because the prepared query still doesn't use the index:

# explain execute foo('banner');
QUERY PLAN
---------------------------------------------------------------------
Limit (cost=0.00..0.11 rows=1 width=520)
-> Seq Scan on topics (cost=0.00..1158.19 rows=10088 width=520)
Filter: ((archetype)::text = $1)
(3 rows)

(Yes, you can get this result in psql, you just need to repeat the EXECUTE
half a dozen times until it shifts to a generic plan.)
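For anyone wanting to reproduce this at the psql prompt, here is a sketch of the PREPARE/EXECUTE sequence. The statement name "foo" appears above, but its exact definition is my inference from the EXPLAIN output, not quoted from the thread:

```sql
-- Assumed shape of the prepared statement; body inferred from the
-- "Filter: ((archetype)::text = $1)" and "Limit" nodes shown above.
PREPARE foo(text) AS
  SELECT * FROM topics WHERE archetype = $1 LIMIT 1;

-- The first five executions are planned with the actual parameter
-- value (custom plans).  After that, the plan cache may switch to a
-- cached generic plan if its estimated cost looks no worse -- which
-- is when the seqscan plan shows up.
EXECUTE foo('banner');
EXECUTE foo('banner');
EXECUTE foo('banner');
EXECUTE foo('banner');
EXECUTE foo('banner');

EXPLAIN EXECUTE foo('banner');  -- sixth use: generic plan possible here
```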

The problem here is that without knowledge of the comparison value, the
planner assumes that it will probably be one of the two values that make
up the table content. (After all, why would you query for something
else?) On that basis, a seqscan will probably hit a matching row in no
time, and so (with the LIMIT 1) it looks like a better deal than the
indexscan.
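Working that out from the numbers in the generic plan above (my arithmetic, not from the original message): the seqscan is estimated to return 10088 matching rows at a total cost of 1158.19, so LIMIT 1 prorates that to

```latex
% Standard prorated LIMIT costing; figures taken from the plan above.
\text{limit cost} \approx 1158.19 \times \frac{1}{10088} \approx 0.11
```

which is exactly the 0.00..0.11 shown, and far below the indexscan's estimated cost of 6.80.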

We've talked about this type of problem before. Part of the issue is
that costing of LIMIT doesn't apply any penalty for a bad worst-case
scenario, and part of it is that the heuristics for choosing between
custom and generic plans don't consider the possibility that the generic
plan's estimated cost is way wrong for lack of knowledge of the comparison
value. It's not real obvious how to improve either heuristic without
probably making some cases worse.

One thing that occurs to me is that if the generic plan estimate comes
out much cheaper than the custom one, maybe we should assume that the
generic's cost estimate is bogus. Right offhand I can't think of a reason
for a custom plan to look worse than a generic one, unless there's a
statistical quirk like this one.

In the meantime, I assume that your real data contains a small percentage
of values other than these two? If so, maybe cranking up the statistics
target would help. If the planner knows that there are more than two
values in the column, I think it would be less optimistic about assuming
that the comparison value is one of the big two.
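Concretely, the statistics target can be raised on just that column; a sketch (the value 1000 is an arbitrary choice, not a recommendation from the thread):

```sql
-- Sample more rows during ANALYZE so that rare archetype values have
-- a chance to show up in most_common_vals / the histogram.
ALTER TABLE topics ALTER COLUMN archetype SET STATISTICS 1000;
ANALYZE topics;
```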

regards, tom lane
