Re: Freeze avoidance of very large table.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Freeze avoidance of very large table.
Date: 2015-10-02 17:24:46
Message-ID: CAD21AoCCq_R-fbdXSXXEV8jAM9C2CD4R=jGu0Z-Y7K7bbFt=xw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Oct 3, 2015 at 12:23 AM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Masahiko Sawada wrote:
>

Thank you for taking time to review this feature.
Attached the latest version patch (v15).

>> @@ -2972,10 +2981,15 @@ l1:
>> */
>> PageSetPrunable(page, xid);
>>
>> + /* clear PD_ALL_VISIBLE and PD_ALL_FORZEN flags */
>
> Typo "FORZEN".

Fixed.

>
>> if (PageIsAllVisible(page))
>> {
>> all_visible_cleared = true;
>> +
>> + /* all-frozen information is also cleared at the same time */
>> PageClearAllVisible(page);
>> + PageClearAllFrozen(page);
>
> I wonder if it makes sense to have a macro to clear both in unison,
> which seems a very common pattern.
>

Fixed.

>
>> diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
>> index 7c38772..a284b85 100644
>> --- a/src/backend/access/heap/visibilitymap.c
>> +++ b/src/backend/access/heap/visibilitymap.c
>> @@ -21,33 +21,45 @@
>> *
>> * NOTES
>> *
>> - * The visibility map is a bitmap with one bit per heap page. A set bit means
>> - * that all tuples on the page are known visible to all transactions, and
>> - * therefore the page doesn't need to be vacuumed. The map is conservative in
>> - * the sense that we make sure that whenever a bit is set, we know the
>> - * condition is true, but if a bit is not set, it might or might not be true.
>> + * The visibility map is a bitmap with two bits (all-visible and all-frozen)
>> + * per heap page. A set all-visible bit means that all tuples on the page are
>> + * known visible to all transactions, and therefore the page doesn't need to
>> + * be vacuumed. A set all-frozen bit means that all tuples on the page are
>> + * completely frozen, and therefore the page doesn't need to be vacuumed even
>> + * if whole table scanning vacuum is required (e.g. anti-wraparound vacuum).
>> + * A all-frozen bit must be set only when the page is already all-visible.
>> + * That is, all-frozen bit is always set with all-visible bit.
>
> "A all-frozen" -> "The all-frozen" (but "A set all-xyz" is correct).

Fixed.

>
>> * When we *set* a visibility map during VACUUM, we must write WAL. This may
>> * seem counterintuitive, since the bit is basically a hint: if it is clear,
>> - * it may still be the case that every tuple on the page is visible to all
>> - * transactions; we just don't know that for certain. The difficulty is that
>> - * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
>> - * on the page itself, and the visibility map bit. If a crash occurs after the
>> - * visibility map page makes it to disk and before the updated heap page makes
>> - * it to disk, redo must set the bit on the heap page. Otherwise, the next
>> - * insert, update, or delete on the heap page will fail to realize that the
>> - * visibility map bit must be cleared, possibly causing index-only scans to
>> - * return wrong answers.
>> + * it may still be the case that every tuple on the page is visible or frozen
>> + * to all transactions; we just don't know that for certain. The difficulty is
>> + * that there are two bits which are typically set together: the PD_ALL_VISIBLE
>> + * or PD_ALL_FROZEN bit on the page itself, and the visibility map bit. If a
>> + * crash occurs after the visibility map page makes it to disk and before the
>> + * updated heap page makes it to disk, redo must set the bit on the heap page.
>> + * Otherwise, the next insert, update, or delete on the heap page will fail to
>> + * realize that the visibility map bit must be cleared, possibly causing index-only
>> + * scans to return wrong answers.
>
> In the "The difficulty ..." para, I would add the word "corresponding" before
> "visibility". Otherwise, it is not clear what the plural means exactly.

Fixed.

>> * VACUUM will normally skip pages for which the visibility map bit is set;
>> * such pages can't contain any dead tuples and therefore don't need vacuuming.
>> - * The visibility map is not used for anti-wraparound vacuums, because
>> + * The visibility map is not used for anti-wraparound vacuums before 9.5, because
>> * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid
>> * present in the table, even on pages that don't have any dead tuples.
>> + * 9.6 or later, the visibility map has a additional bit which indicates all tuple
>> + * on single page has been completely forzen, so the visibility map is also used for
>> + * anti-wraparound vacuums.
>
> This should not mention database versions. Just explain how the code
> behaves today, not how it behaved in the past. Those who want to
> understand how it behaved in 9.5 can read the 9.5 code. (Again typo
> "forzen".)

Changed these comment.
Sorry for the same typo frequently..

>> @@ -1115,6 +1187,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
>> tups_vacuumed, vacuumed_pages)));
>>
>> /*
>> + * This information would be effective for how much effect all-frozen bit
>> + * of VM had for freezing tuples.
>> + */
>> + ereport(elevel,
>> + (errmsg("Skipped %d frozen pages acoording to visibility map",
>> + vacrelstats->vmskipped_frozen_pages)));
>
> Message must start on lowercase letter. I don't understand what the
> comment means. Can you rephrase it?

Fixed.

>> @@ -1779,10 +1873,12 @@ vac_cmp_itemptr(const void *left, const void *right)
>> /*
>> * Check if every tuple in the given page is visible to all current and future
>> * transactions. Also return the visibility_cutoff_xid which is the highest
>> - * xmin amongst the visible tuples.
>> + * xmin amongst the visible tuples, and all_forzen which implies that all tuples
>> + * of this page are frozen.
>
> Typo "forzen" here again.

Fixed.

>> @@ -201,6 +239,110 @@ copy_file(const char *srcfile, const char *dstfile, bool force)
>> #endif
>>
>>
>> +/*
>> + * rewriteVisibilitymap()
>> + *
>> + * A additional bit which indicates that all tuples on page is completely
>> + * frozen is added into visibility map at PG 9.6. So the format of visibiilty
>> + * map has been changed.
>> + * Copies a visibility map file while adding all-frozen bit(0) into each bit.
>> + */
>> +static const char *
>> +rewriteVisibilitymap(const char *fromfile, const char *tofile, bool force)
>> +{
>> +#define REWRITE_BUF_SIZE (50 * BLCKSZ)
>> +#define BITS_PER_HEAPBLOCK 2
>> +
>> + int src_fd, dst_fd;
>> + uint16 vm_bits;
>> + ssize_t nbytes;
>> + char *buffer;
>> + int ret = 0;
>> + int save_errno = 0;
>> +
>> + if ((fromfile == NULL) || (tofile == NULL))
>> + {
>> + errno = EINVAL;
>> + return getErrorText(errno);
>> + }
>> +
>> + if ((src_fd = open(fromfile, O_RDONLY, 0)) < 0)
>> + return getErrorText(errno);
>> +
>> + if ((dst_fd = open(tofile, O_RDWR | O_CREAT | (force ? 0 : O_EXCL), S_IRUSR | S_IWUSR)) < 0)
>> + {
>> + save_errno = errno;
>> + if (src_fd != 0)
>> + close(src_fd);
>> +
>> + errno = save_errno;
>> + return getErrorText(errno);
>> + }
>> +
>> + buffer = (char *) pg_malloc(REWRITE_BUF_SIZE);
>> +
>> + /* Copy page header data in advance */
>> + if ((nbytes = read(src_fd, buffer, MAXALIGN(SizeOfPageHeaderData))) <= 0)
>> + {
>> + save_errno = errno;
>> + return getErrorText(errno);
>> + }
>
> Not clear why you bother with save_errno in this path. Forgot to
> close()? (Though I wonder why you bother to close() if the program is
> going to exit shortly thereafter anyway.)

Fixed.

>> diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
>> index 13aa891..fc92a5f 100644
>> --- a/src/bin/pg_upgrade/pg_upgrade.h
>> +++ b/src/bin/pg_upgrade/pg_upgrade.h
>> @@ -112,6 +112,11 @@ extern char *output_files[];
>> #define VISIBILITY_MAP_CRASHSAFE_CAT_VER 201107031
>>
>> /*
>> + * The format of visibility map changed with this 9.6 commit,
>> + *
>> + */
>> +#define VISIBILITY_MAP_FROZEN_BIT_CAT_VER 201509181
>
> Useless empty line in comment.

Fixed.

>> diff --git a/src/common/relpath.c b/src/common/relpath.c
>> index 66dfef1..52ff14e 100644
>> --- a/src/common/relpath.c
>> +++ b/src/common/relpath.c
>> @@ -30,6 +30,9 @@
>> * If you add a new entry, remember to update the errhint in
>> * forkname_to_number() below, and update the SGML documentation for
>> * pg_relation_size().
>> + * 9.6 or later, the visibility map fork name is changed from "vm" to
>> + * "vfm" bacause visibility map has not only information about all-visible
>> + * but also information about all-frozen.
>> */
>> const char *const forkNames[] = {
>> "main", /* MAIN_FORKNUM */
>
> Drop the change in comment? There's no "vfm" in this version of the
> patch, is there?

Fixed.

Regards,

--
Masahiko Sawada

Attachment Content-Type Size
000_add_frozen_bit_into_visibilitymap_v15.patch application/octet-stream 66.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2015-10-02 17:53:03 Re: No Issue Tracker - Say it Ain't So!
Previous Message Joshua D. Drake 2015-10-02 16:50:37 Re: Request for dogfood volunteers (was No Issue Tracker - Say it Ain't So!)