Re: Performance query about large tables, lots of concurrent access

From: Karl Wright <kwright(at)metacarta(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance query about large tables, lots of concurrent access
Date: 2007-06-19 15:28:23
Message-ID: 4677F617.5000105@metacarta.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Tom Lane wrote:
> Karl Wright <kwright(at)metacarta(dot)com> writes:
>> [2007-06-18 09:39:49,797]ERROR Plan: Index Scan using i1181764142395 on
>> intrinsiclink (cost=0.00..14177.29 rows=5 width=253)
>> [2007-06-18 09:39:49,797]ERROR Plan: Index Cond: ((jobid = $2) AND
>> ((childidhash)::text = ($3)::text))
>> [2007-06-18 09:39:49,797]ERROR Plan: Filter: ((childid = ($4)::text)
>> AND ((isnew = ($5)::bpchar) OR (isnew = ($6)::bpchar)))
>
>>> In this case it looks like the planner is afraid that that's exactly
>>> what will happen --- a cost of 14177 suggests that several thousand row
>>> fetches are expected to happen, and yet it's only predicting 5 rows out
>>> after the filter.
>
>> Well, that's odd, because the hash in question that it is using is the
>> SHA-1 hash of a URL. There's essentially one row per URL in this table.
>
> What about isnew?

Isnew is simply a flag which I want to set for all rows that belong to
this particular child, but only if it's one of two particular values.

>
> Also, how many rows do *you* expect out of the query? The planner is
> not going to be aware of the hashed relationship between childidhash
> and childid --- it'll think those are independent conditions which they
> evidently aren't. So it may be that the query really does retrieve
> thousands of rows, and the rows=5 estimate is bogus because it's
> double-counting the selectivity of the childid condition.
>

This can vary, but I expect there to be at on average a few dozen rows
returned from the overall query. The only way the index-condition part
of the query can be returning thousands of rows would be if: (a) there
is really a lot of data of this kind, or (b) the hash function is
basically not doing its job and there are thousands of collisions occurring.

In fact, that's not the case. In psql I just did the following analysis:

>>>>>>
metacarta=> explain select count(*) from intrinsiclink where
jobid=1181766706097 and
childidhash='7E130F3B688687757187F1638D8776ECEF3009E0';
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=14992.23..14992.24 rows=1 width=0)
-> Index Scan using i1181764142395 on intrinsiclink
(cost=0.00..14971.81 rows=8167 width=0)
Index Cond: ((jobid = 1181766706097::bigint) AND
((childidhash)::text = '7E130F3B688687757187F1638D8776ECEF3009E0'::text))
(3 rows)

metacarta=> select count(*) from intrinsiclink where jobid=1181766706097
and childidhash='7E130F3B688687757187F1638D8776ECEF3009E0';
count
-------
0
(1 row)
<<<<<<

Granted this is well after-the-fact, but you can see that the cost
estimate is wildly wrong in this case.

I did an ANALYZE on that table and repeated the explain, and got this:

>>>>>>
metacarta=> analyze intrinsiclink;
ANALYZE
metacarta=> explain select count(*) from intrinsiclink where
jobid=1181766706097 and
childidhash='7E130F3B688687757187F1638D8776ECEF3009E0';
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=15276.36..15276.37 rows=1 width=0)
-> Index Scan using i1181764142395 on intrinsiclink
(cost=0.00..15255.53 rows=8333 width=0)
Index Cond: ((jobid = 1181766706097::bigint) AND
((childidhash)::text = '7E130F3B688687757187F1638D8776ECEF3009E0'::text))
(3 rows)
<<<<<<

... even more wildly wrong.

Karl

> regards, tom lane
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Francisco Reyes 2007-06-19 15:37:07 Re: Hardware suggestions
Previous Message Chris Browne 2007-06-19 15:22:17 Re: [PERFORM] Postgres VS Oracle