From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ashwin Agrawal <ashwinstar(at)gmail(dot)com>
Subject: Re: brininsert optimization opportunity
Date: 2023-07-05 10:16:48
Message-ID: 35e7be37-0d35-d8ea-a4d5-eff84aefcc39@enterprisedb.com
Lists: pgsql-hackers

On 7/5/23 02:35, Soumyadeep Chakraborty wrote:
> On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>> I'm not sure memory context callbacks are the right thing to rely on
>> for this purpose. The primary purpose of memory contexts is to track
>> memory, so using them for this seems a bit weird.
>
> Yeah, this just kept getting dirtier and dirtier.
>
>> There are cases that do something similar, like expandedrecord.c where
>> we track a refcounted tuple slot, but IMHO there's a big difference
>> between tracking one slot allocated right there, and an unknown number
>> of buffers allocated much later.
>
> Yeah, the following code in ER_mc_callback is there, I think, to prevent
> double-freeing the tupdesc (since it might be freed in
> ResourceOwnerReleaseInternal()) (the part about: /* Ditto for tupdesc
> references */).
>
> if (tupdesc->tdrefcount > 0)
> {
>     if (--tupdesc->tdrefcount == 0)
>         FreeTupleDesc(tupdesc);
> }
> Plus, the above code doesn't touch the resource owner machinery at all,
> whereas releasing a buffer means:
>
>     ReleaseBuffer() -> UnpinBuffer() ->
>         ResourceOwnerForgetBuffer(CurrentResourceOwner, b);
>

Well, my understanding is the expandedrecord.c code increments the
refcount exactly to prevent the resource owner from releasing the slot
too early. My assumption is we'd need to do something similar for the
revmap buffers, by calling IncrBufferRefCount or something like that.
But that's going to be messy, because the buffers are read elsewhere.
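
Something like the following is what I had in mind - only a sketch, the
helper names are made up, and since BrinRevmap is private to
brin_revmap.c these would have to live there:

/* take an extra pin when stashing the revmap in the insert cache */
static void
brinRevmapPinForCache(BrinRevmap *revmap)
{
    if (BufferIsValid(revmap->rm_metaBuf))
        IncrBufferRefCount(revmap->rm_metaBuf);
    if (BufferIsValid(revmap->rm_currBuf))
        IncrBufferRefCount(revmap->rm_currBuf);
}

/* drop the extra pins again from the cleanup path */
static void
brinRevmapUnpinFromCache(BrinRevmap *revmap)
{
    if (BufferIsValid(revmap->rm_metaBuf))
        ReleaseBuffer(revmap->rm_metaBuf);
    if (BufferIsValid(revmap->rm_currBuf))
        ReleaseBuffer(revmap->rm_currBuf);
}

And even then IncrBufferRefCount() records the extra pin with the
current resource owner, so it doesn't really get us out of the resowner
business - which illustrates why I think this is going to be messy.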

>> The fact that even with the extra context it still doesn't handle query
>> cancellations is another argument against that approach (I wonder how
>> expandedrecord.c handles that, but I haven't checked).
>>
>>>
>>> Maybe there is a better way of doing our cleanup? I'm not sure. Would
>>> love your input!
>>>
>>> The other alternative for all this is to introduce new AM callbacks for
>>> insert_begin and insert_end. That might be a tougher sell?
>>>
>>
>> That's the approach I wanted to suggest, more or less - to do the
>> cleanup from ExecCloseIndices() before index_close(). I wonder if it's
>> even correct to do that later, once we release the locks etc.
>
> I'll try this out and introduce a couple of new index AM callbacks. I
> think it's best to do it before releasing the locks - otherwise it might
> be weird to manipulate buffers of an index relation without having some
> sort of lock on it. I'll think about it some more.
>

I don't understand why this would need more than just a callback to
release the cache.
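
Roughly what I have in mind - a single optional callback on
IndexAmRoutine, invoked from ExecCloseIndices() before index_close(),
i.e. while we still hold the lock. Just a sketch, the "aminsertcleanup"
name is made up:

void
ExecCloseIndices(ResultRelInfo *resultRelInfo)
{
    int         i;

    for (i = 0; i < resultRelInfo->ri_NumIndices; i++)
    {
        Relation    indexRel = resultRelInfo->ri_IndexRelationDescs[i];
        IndexInfo  *indexInfo = resultRelInfo->ri_IndexRelationInfo[i];

        if (indexRel == NULL)
            continue;

        /* give the AM a chance to release buffers etc. in ii_AmCache */
        if (indexRel->rd_indam->aminsertcleanup != NULL &&
            indexInfo->ii_AmCache != NULL)
            indexRel->rd_indam->aminsertcleanup(indexInfo);

        index_close(indexRel, RowExclusiveLock);
    }
}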

>> I don't think ii_AmCache was intended for stuff like this - GIN and GiST
>> only use it to cache stuff that can be just pfree-d, but for buffers
>> that's not enough. It's not surprising we need to improve this.
>
> Hmmm, yes, although the docs state:
>
> "If the index AM wishes to cache data across successive index insertions
> within an SQL statement, it can allocate space in indexInfo->ii_Context
> and store a pointer to the data in indexInfo->ii_AmCache (which will be
> NULL initially)."
>
> They don't mention anything about buffer usage. Well, we will fix it!
>
> PS: It should be possible to make GIN and GiST use the new index AM APIs
> as well.
>

Why should GIN/GiST use the new API? I think it's perfectly sensible to
only require the "cleanup callback" when just pfree() is not enough.
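
To illustrate, roughly what BRIN might cache - the struct and field
names are made up, and I've elided the exact initialization calls:

/* hypothetical state, stored in indexInfo->ii_AmCache */
typedef struct BrinInsertState
{
    BrinRevmap *bis_rmAccess;   /* holds pinned buffer(s), so a plain
                                 * pfree() would leak the pins - hence
                                 * the cleanup callback */
    BrinDesc   *bis_desc;
} BrinInsertState;

/* in brininsert(), on the first call within the statement */
if (indexInfo->ii_AmCache == NULL)
{
    MemoryContext oldcxt = MemoryContextSwitchTo(indexInfo->ii_Context);
    BrinInsertState *bistate = palloc0(sizeof(BrinInsertState));

    /* set up the revmap access struct here too (call elided) */
    bistate->bis_desc = brin_build_desc(idxRel);

    indexInfo->ii_AmCache = (void *) bistate;
    MemoryContextSwitchTo(oldcxt);
}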

>> FWIW while debugging this (breakpoint on MemoryContextDelete) I was
>> rather annoyed the COPY keeps dropping and recreating the two BRIN
>> contexts - brininsert cxt / brin dtuple. I wonder if we could keep and
>> reuse those too, but I don't know how much it'd help.
>>
>
> Interesting, I will investigate that.
>
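
If we do end up reusing them, presumably we'd stash the contexts in the
cached state and just reset them between calls - a reset is much cheaper
than delete + create. A sketch, with bis_context being another made-up
field on the BrinInsertState above:

if (bistate->bis_context == NULL)
    bistate->bis_context = AllocSetContextCreate(indexInfo->ii_Context,
                                                 "brininsert cxt",
                                                 ALLOCSET_DEFAULT_SIZES);
else
    MemoryContextReset(bistate->bis_context);   /* cheaper than recreate */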
>>> Now, to finally answer your question about the speedup without
>>> generate_series(). We do see an even higher speedup!
>>>
>>> seq 1 200000000 > /tmp/data.csv
>>> \timing
>>> DROP TABLE heap;
>>> CREATE TABLE heap(i int);
>>> CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
>>> COPY heap FROM '/tmp/data.csv';
>>>
>>> -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
>>> COPY 200000000
>>> Time: 205072.444 ms (03:25.072)
>>> Time: 215380.369 ms (03:35.380)
>>> Time: 203492.347 ms (03:23.492)
>>>
>>> -- 3 runs (branch v2)
>>>
>>> COPY 200000000
>>> Time: 135052.752 ms (02:15.053)
>>> Time: 135093.131 ms (02:15.093)
>>> Time: 138737.048 ms (02:18.737)
>>>
>>
>> That's nice, but it still doesn't say how much of that is reading the
>> data. If you do just copy into a table without any indexes, how long
>> does it take?
>
> So, I loaded the same heap table without any indexes and at the same
> scale. I got:
>
> postgres=# COPY heap FROM '/tmp/data.csv';
> COPY 200000000
> Time: 116161.545 ms (01:56.162)
> Time: 114182.745 ms (01:54.183)
> Time: 114975.368 ms (01:54.975)
>

OK, so the baseline is 115 seconds. With the current code, an index
means +100 seconds. With the optimization, it's just +20. Nice.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
