Re: strange buildfarm failure on lionfish

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Stefan Kaltenbrunner" <stefan(at)kaltenbrunner(dot)cc>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange buildfarm failure on lionfish
Date: 2007-07-24 17:44:14
Message-ID: 13477.1185299054@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> I saw what I think was the identical failure last night on my own
>> machine, but it wasn't repeatable. Evidently the planner is changing to
>> a different plan for those queries, but why has this only started
>> recently? Maybe the recent changes to autovacuum defaults are causing
>> autovac to hit these tables when it never did before?

> Indeed the only alternate plan I can imagine for this is to do the join the
> other way around. And given the large difference in sizes between the two
> tables the only way I could get that to happen was by obliterating the
> statistics entirely for one table but having stats for the other.
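(For anyone wanting to replicate that by hand: one way to wipe the optimizer stats for a single table while leaving the other's intact is to delete its pg_statistic rows directly. A sketch, superuser only, against the regression database:)

```sql
-- Sketch: obliterate the stats for timestamp_tbl only, so the planner
-- falls back to default estimates for it while interval_tbl keeps real stats.
DELETE FROM pg_statistic
 WHERE starelid = 'timestamp_tbl'::regclass;
```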

Yeah, I turned off autovac and let the tests run through 'horology',
then stopped and looked at the plans for these queries. For the first
one, what you get with no ANALYZE having been done is

explain SELECT t.d1 + i.f1 AS "102" FROM TIMESTAMP_TBL t, INTERVAL_TBL i
WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
AND i.f1 BETWEEN '00:00' AND '23:00';
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=36.56..80.89 rows=99 width=24)
   ->  Seq Scan on timestamp_tbl t  (cost=0.00..42.10 rows=11 width=8)
         Filter: ((d1 >= '1990-01-01 00:00:00'::timestamp without time zone) AND (d1 <= '2001-01-01 00:00:00'::timestamp without time zone))
   ->  Materialize  (cost=36.56..36.65 rows=9 width=16)
         ->  Seq Scan on interval_tbl i  (cost=0.00..36.55 rows=9 width=16)
               Filter: ((f1 >= '00:00:00'::interval) AND (f1 <= '23:00:00'::interval))

If timestamp_tbl is then ANALYZEd, the plan flips around to put i on the
outside of the nestloop (because the estimate of the number of matching
rows rises to 49, which is pretty good because the actual is 52).

OTOH, if interval_tbl is ANALYZEd, the estimate for it drops to 2 rows
(again a much better match to reality) and we go back to preferring
i on the inside, with or without timestamp_tbl having been analyzed.
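(The flip is easy to reproduce by hand in a psql session against the regression database, analyzing one table at a time and re-running the EXPLAIN above; a sketch:)

```sql
-- Sketch: reproduce the plan flip by analyzing one table at a time.

-- Analyzing only timestamp_tbl raises its estimate from 11 to ~49 rows,
-- and the planner flips to put interval_tbl on the outside:
ANALYZE timestamp_tbl;
EXPLAIN SELECT t.d1 + i.f1 AS "102" FROM timestamp_tbl t, interval_tbl i
 WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
   AND i.f1 BETWEEN '00:00' AND '23:00';

-- Analyzing interval_tbl drops its estimate to 2 rows, and the plan
-- goes back to preferring interval_tbl on the inside:
ANALYZE interval_tbl;
EXPLAIN SELECT t.d1 + i.f1 AS "102" FROM timestamp_tbl t, interval_tbl i
 WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
   AND i.f1 BETWEEN '00:00' AND '23:00';
```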

And, at least in the serial-schedule case, the stats at this point
look like

 relid  | schemaname |    relname    | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze
--------+------------+---------------+----------+--------------+----------+---------------+-----------+-----------+-----------+------------+------------+-------------+-----------------+--------------+------------------
 132885 | public     | interval_tbl  |       22 |          210 |          |               |        10 |         0 |         0 |         10 |          0 |             |                 |              |
 132879 | public     | timestamp_tbl |       45 |         2444 |          |               |        74 |         0 |         8 |         66 |          8 |             |                 |              |
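(That output presumably came from the cumulative-stats view, which carries exactly those columns; something like:)

```sql
-- Sketch: pull the stats shown above for just the two tables of interest.
SELECT * FROM pg_stat_user_tables
 WHERE relname IN ('interval_tbl', 'timestamp_tbl');
```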

So yesterday's change to reduce the analyze threshold to 50 means that
timestamp_tbl is now vulnerable to being asynchronously analyzed while
the tests run.
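If I'm reading the autovacuum math right (tuples changed since the last analyze, compared against analyze_threshold + analyze_scale_factor * reltuples, with the scale factor at its default 0.1), timestamp_tbl is now over the line while interval_tbl is not:

```sql
-- Back-of-envelope check against the stats above (assumes the default
-- autovacuum_analyze_scale_factor of 0.1 and the new threshold of 50):
SELECT 74 + 8        AS timestamp_tbl_changed,   -- n_tup_ins + n_tup_del = 82
       50 + 0.1 * 66 AS timestamp_tbl_trigger,   -- 56.6: 82 exceeds it
       10 + 0        AS interval_tbl_changed,    -- only 10 changes
       50 + 0.1 * 10 AS interval_tbl_trigger;    -- 51.0: 10 does not
```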

While I don't have any very strong objection to putting an ORDER BY
on these particular queries, I'm worried about how many other regression
tests will now start showing random failures. We have an awful lot
of small tables in the tests ...
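(For the record, the ORDER BY fix would look something like this; with the output order pinned, either join order produces the same expected file:)

```sql
-- Sketch of the ORDER BY fix for the first query: pin the output order
-- so the result is stable regardless of which plan the optimizer picks.
SELECT t.d1 + i.f1 AS "102"
  FROM timestamp_tbl t, interval_tbl i
 WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
   AND i.f1 BETWEEN '00:00' AND '23:00'
 ORDER BY 1;
```

The corresponding expected-output file would of course have to be regenerated to match the sorted order.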

regards, tom lane
