Re: Index-only scan performance regression

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scan performance regression
Date: 2012-01-31 19:15:43
Message-ID: CAEZATCX5=_Vjff1SDfEOxUWwO3T41zUgtebbUp2RcQqo59E2fA@mail.gmail.com
Lists: pgsql-hackers

On 31 January 2012 17:35, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Jan 29, 2012 at 2:47 AM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> Given a freshly created table (not vacuumed), and a query that uses an
>> index-only scan, for example:
>>
>> CREATE TABLE foo(a int PRIMARY KEY);
>> INSERT INTO foo SELECT * FROM generate_series(1,1000000);
>> ANALYSE foo;
>>
>> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>>
>>                                                            QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
>> time=23.646..23.646 rows=1 loops=1)
>>   ->  Index Only Scan using foo_pkey on foo  (cost=0.00..300.42
>> rows=8975 width=0) (actual time=0.027..22.291 rows=10000 loops=1)
>>         Index Cond: (a <= 10000)
>>         Heap Fetches: 10000
>>  Total runtime: 23.673 ms
>> (5 rows)
>>
>>
>> the actual performance is much worse than the equivalent index scan,
>> as used in 9.1 and earlier:
>>
>> SET enable_indexonlyscan = off;
>> EXPLAIN ANALYSE SELECT count(*) FROM foo WHERE a <= 10000;
>>
>>                                                         QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=322.86..322.87 rows=1 width=0) (actual
>> time=3.219..3.219 rows=1 loops=1)
>>   ->  Index Scan using foo_pkey on foo  (cost=0.00..300.42 rows=8975
>> width=0) (actual time=0.014..2.302 rows=10000 loops=1)
>>         Index Cond: (a <= 10000)
>>  Total runtime: 3.240 ms
>> (4 rows)
>>
>>
>> Obviously this is the worst case for an index-only scan, since there
>> is no visibility map, and it has to fetch each tuple from the heap,
>> but ideally this should perform around the same as an ordinary index
>> scan, since it's doing pretty much the same work.
>
> Yeah.
>
>> Digging around, it looks like the additional cost is coming from
>> visibilitymap_test(), which is calling smgrexists() for each tuple, to
>> see if the visibility map file has been created. So it's doing a file
>> access check for each row, while the visibility map doesn't exist.
>>
>> I'm not really familiar with this code, but a possible fix seems to be
>> to send an invalidation message in vm_extend() when it creates or
>> extends the visibility map, so that vm_readbuf() only has to re-check
>> the visibility map file if smgr_vm_nblocks has been invalidated. With
>> this change (attached) the worst-case index-only scan time becomes
>> around the same as the index scan time.
>
> I think that the additional sinval message might not do much good,
> because it won't necessarily be seen until the next time any given
> other backend does AcceptInvalidationMessages(), which might be a
> while.  That's not an issue for truncation because it requires
> AccessExclusiveLock, so we know that any subsequent access to the
> table will involve taking a heavyweight lock, which will
> AcceptInvalidationMessages().  Making vismap extension require
> AccessExclusiveLock would be a cure worse than the disease.
>
> In the case of an index-only scan, it wouldn't be the end of the world
> if we failed to notice that the relation had been extended.  The worst
> case would be that we wouldn't read the visibility map buffer and, as
> you say, that shouldn't be any worse than a normal index scan.  The
> only time we had better not fail to notice that the visibility map has
> been extended is when we're asked to clear a bit, because failure to
> notice that the new page is there - with a set bit that must now be
> cleared - could result in queries returning wrong answers.  So one
> possibility would be to tweak the relcache stuff to have a flag
> indicating whether the value is cached, and make that separate from
> the cached value.  Then, if the value is cached, and we're doing
> anything other than clearing a bit, we just say, eh, sorry, page isn't
> there.  If we're clearing a bit then we go to the trouble of
> revalidating.
>
> Thoughts?
>

In the case where we're asked to clear a bit, the caller would first
try to pin the relevant page, which goes through vm_readbuf() with
extend set to true. vm_extend() would then notice that the visibility
map had already been extended, and would read in the new page with the
set bit. So this case would continue to work, wouldn't it?
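To make the caching scheme concrete, here's a standalone sketch of the
"cached value plus cached flag" idea, where plain reads trust the cache
(a stale answer only costs a heap fetch) but the bit-clearing path
revalidates. The names smgr_vm_nblocks_probe and RelCacheEntry are made
up for illustration and only loosely follow visibilitymap.c; this is not
actual backend code:

```c
#include <stdbool.h>
#include <stddef.h>

static int smgr_calls = 0;           /* counts simulated filesystem probes */
static size_t actual_vm_nblocks = 0; /* "on disk" size of the map fork */

/* Simulated smgrnblocks()-style probe: expensive, hits the filesystem. */
static size_t
smgr_vm_nblocks_probe(void)
{
    smgr_calls++;
    return actual_vm_nblocks;
}

typedef struct
{
    size_t cached_vm_nblocks;
    bool   vm_nblocks_cached;    /* separate "is cached" flag */
} RelCacheEntry;

/*
 * Return the visibility map size.  For plain reads (clearing == false)
 * trust the cache: a stale answer just means we skip the VM page and
 * fall back to a heap fetch.  When clearing a bit we must not miss a
 * page extended by another backend, so revalidate.
 */
static size_t
vm_nblocks(RelCacheEntry *rel, bool clearing)
{
    if (!rel->vm_nblocks_cached || clearing)
    {
        rel->cached_vm_nblocks = smgr_vm_nblocks_probe();
        rel->vm_nblocks_cached = true;
    }
    return rel->cached_vm_nblocks;
}
```

With this shape, an index-only scan over a map-less table probes the
filesystem once rather than once per tuple, while the clearing path
still sees an extension made by another backend.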

Regards,
Dean
