From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
16.09.2019 21:58, Peter Geoghegan wrote:
> On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
> <a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>> I tested patch with nbtree_wal_test, and found out that the real issue is
>> not the dedup WAL records themselves, but the full page writes that they trigger.
>> Here are the test results (config is standard, except fsync=off to speed up tests):
>> 'FPW on' and 'FPW off' are tests on v14.
>> NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
> I think that it makes sense to focus on synthetic cases without
> FPWs/FPIs from checkpoints. At least for now.
>> With random insertions into a btree, it's highly likely that deduplication will often be
>> the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed.
> I think that the problem here is that you didn't copy this old code
> from _bt_split() over to _bt_dedup_one_page():
> * Copy the original page's LSN into leftpage, which will become the
> * updated version of the page. We need this because XLogInsert will
> * examine the LSN and possibly dump it in a page image.
> PageSetLSN(leftpage, PageGetLSN(origpage));
> isleaf = P_ISLEAF(oopaque);
> Note that this happens at the start of _bt_split() -- the temp page
> buffer based on origpage starts out with the same LSN as origpage.
> This is an important step of the WAL volume optimization used by
That's it. I suspected that such an enormous amount of FPWs reflected some bug.
>> That's why there is no significant difference with log_newpage_buffer() approach.
>> And that's why "lazy" deduplication doesn't help to decrease amount of WAL.
My point was that the problem is the extra FPWs, so it doesn't matter
whether we deduplicate just several entries to free enough space or all
of them.
> The term "lazy deduplication" is seriously overloaded here. I think
> that this could cause miscommunications. Let me list the possible
> meanings of that term here:
> 1. First of all, the basic approach to deduplication is already lazy,
> unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
> a page split. I'm 100% sure that we both think that that works well
> compared to an eager approach (like GIN's).
> 2. Second of all, there is the need to incrementally WAL log. It looks
> like v14 does that well, in that it doesn't create
> "xlrec_dedup.n_intervals" space when it isn't truly needed. That's
In v12-v15 I mostly concentrated on this feature.
The last version looks good to me.
> 3. Third, there is incremental writing of the page itself -- avoiding
> using a temp buffer. Not sure where I stand on this.
I think it's a good idea. memmove must be much faster than copying
items tuple by tuple.
I'll send next patch by the end of the week.
> 4. Finally, there is the possibility that we could make deduplication
> incremental, in order to avoid work that won't be needed altogether --
> this would probably be combined with 3. Not sure where I stand on
> this, either.
> We should try to be careful when using these terms, as there is a very
> real danger of talking past each other.
>> Another, more realistic approach is to make deduplication less intensive:
>> if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record.
> I see that v14 uses the "dedupInterval" struct, which provides a
> logical description of a deduplicated set of tuples. That general
> approach is at least 95% of what I wanted from the
> _bt_dedup_one_page() WAL-logging.
>> Probably that was the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions:
>> not because deduplication itself is CPU bound or something, but because the WAL load decreased.
> I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
> faster as things are. I am not against bringing back
> BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
> think that it's a distraction. It may hide problems that we want to
> fix. Like the PageSetLSN() problem I mentioned just now, and maybe
> We will definitely need to have page space accounting that's a bit
> similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
> full (or has 4 bytes left, or something). That happens regularly now.
> That must start with teaching _bt_dedup_one_page() about how much
> space it will free. Basing it on the number of items on the page or
> whatever is not going to work that well.
> I think that it would be possible to have something like
> BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
> deduplication incremental, in the sense that it can give up on
> deduplication when it frees enough space (i.e. something like v13's
> 0002-* patch). I said that these two things are closely related, which
> is true, but it's also true that they don't overlap.
> Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
> made a handful of specific indexes (mostly from TPC-H) significantly
> smaller. I never tried to debug the problem. It's possible that we
> could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
> but not use it on rightmost pages, and get the best of both worlds.
> IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
> column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
> debug that when the time comes.
Now that the extra FPWs are proven to be a bug, I agree that giving up on
deduplication early is not necessary.
My previous considerations were based on the idea that deduplication
always adds considerable overhead, which is not true after the recent
optimizations.
>> So I propose to develop this idea. The question is how to choose threshold.
>> I wouldn't like to introduce new user settings. Any ideas?
> I think that there should be a target fill factor that sometimes makes
> deduplication leave a small amount of free space. Maybe that means
> that the last posting list on the page is made a bit smaller than the
> other ones. It should be "goal orientated".
> The loop within _bt_dedup_one_page() is very confusing in both v13 and
> v14 -- I couldn't figure out why the accounting worked like this:
>> + /*
>> + * Project size of new posting list that would result from merging
>> + * current tup with pending posting list (could just be prev item
>> + * that's "pending").
>> + *
>> + * This accounting looks odd, but it's correct because ...
>> + */
>> + projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
>> + (dedupState->ntuples + itup_ntuples + 1) *
>> + sizeof(ItemPointerData));
> Why the "+1" here?
I'll look at it.
> I have significantly refactored the _bt_dedup_one_page() loop in a way
> that seems like a big improvement. It allowed me to remove all of the
> small palloc() calls inside the loop, apart from the
> BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
> to have shaved about 2 seconds off the "land" unlogged table test,
> which was originally about 1 minute 2 seconds with v13's 0001-* patch
> (and without v13's 0002-* patch).
> It seems like it can easily be integrated with the approach to WAL
> logging taken in v14, so everything can be integrated soon. I'll work
> on that.
New version is attached.
It is v14 (with PageSetLSN fix) merged with v13.
I also fixed a bug in btree_xlog_dedup() that was previously masked by FPWs.
v15 passes make installcheck.
I haven't tested it with land test yet. Will do it later this week.
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company