Re: Freeze avoidance of very large table.

From: Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: Freeze avoidance of very large table.
Date: 2015-07-09 18:05:26
Message-ID: CAD21AoCerYNMAWOQp1GaEvKO40-X3szKGiAQi_e9HCH8GwUetw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 7, 2015 at 8:49 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Jul 2, 2015 at 9:00 PM, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>
> wrote:
>>
>>
>> Thank you for bug report, and comments.
>>
>> Fixed version is attached, and source code comment is also updated.
>> Please review it.
>>
>
> I am looking into this patch and would like to share my findings with
> you:

Thank you for comment.
I appreciate you taking time to review this patch.

>
> 1.
> @@ -2131,8 +2133,9 @@ heap_insert(Relation relation, HeapTuple tup,
> CommandId cid,
>
> CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
>
> /*
> - * Find buffer to insert this
> tuple into. If the page is all visible,
> - * this will also pin the requisite visibility map page.
> +
> * Find buffer to insert this tuple into. If the page is all visible
> + * of all frozen, this will also pin
> the requisite visibility map and
> + * frozen map page.
> */
>
> typo in comments.
>
> /of all frozen/or all frozen

Fixed.

> 2.
> visibilitymap.c
> + * The visibility map is a bitmap with two bits (all-visible and all-frozen
> + * per heap page.
>
> /and all-frozen/and all-frozen)
> closing round bracket is missing.

Fixed.

> 3.
> visibilitymap.c
> -/*#define TRACE_VISIBILITYMAP */
> +#define TRACE_VISIBILITYMAP
>
> why is this hash define opened?

Fixed.

> 4.
> -visibilitymap_count(Relation rel)
> +visibilitymap_count(Relation rel, bool for_visible)
>
> This API needs to count set bits for either visibility info, frozen info
> or both (if required), it seems better to have second parameter as
> uint8 flags rather than bool. Also, if it is required to be called at most
> places for both visibility and frozen bits count, why not get them
> in one call?

Fixed.

> 5.
> Clearing visibility and frozen bit separately for the dml
> operations would lead locking/unlocking the corresponding buffer
> twice, can we do it as a one operation. I think this is suggested
> by Simon as well.

Latest patch clears bits in one operation, and set all-frozen with
all-visible in one operation.
We can judge the page is all-frozen in two places: first scanning the
page(lazy_scan_heap), and after cleaning garbage(lazy_vacuum_page).

> 6.
> - * Before locking the buffer, pin the visibility map page if it appears to
> - * be necessary.
> Since we haven't got the lock yet, someone else might be
> + * Before locking the buffer, pin the
> visibility map if it appears to be
> + * necessary. Since we haven't got the lock yet, someone else might
> be
>
> Why you have deleted 'page' in above comment?

Fixed.

> 7.
> @@ -3490,21 +3532,23 @@ l2:
> UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
>
> if (vmbuffer != InvalidBuffer)
> ReleaseBuffer(vmbuffer);
> +
> bms_free
> (hot_attrs);
>
> Seems unnecessary change.

Fixed.

> 8.
> @@ -1919,11 +1919,18 @@ index_update_stats(Relation rel,
> {
> BlockNumber relpages =
> RelationGetNumberOfBlocks(rel);
> BlockNumber relallvisible;
> + BlockNumber
> relallfrozen;
>
> if (rd_rel->relkind != RELKIND_INDEX)
> - relallvisible =
> visibilitymap_count(rel);
> + {
> + relallvisible = visibilitymap_count(rel,
> true);
> + relallfrozen = visibilitymap_count(rel, false);
> + }
> else
> /* don't bother for indexes */
> + {
> relallvisible = 0;
> +
> relallfrozen = 0;
> + }
>
> I think in this function, you have forgotten to update the
> relallfrozen value in pg_class.

Fixed.

> 9.
> vacuumlazy.c
>
> @@ -253,14 +258,16 @@ lazy_vacuum_rel(Relation onerel, int options,
> VacuumParams *params,
> * NB: We
> need to check this before truncating the relation, because that
> * will change ->rel_pages.
> */
> -
> if (vacrelstats->scanned_pages < vacrelstats->rel_pages)
> + if ((vacrelstats->scanned_pages +
> vacrelstats->vmskipped_pages)
> + < vacrelstats->rel_pages)
> {
> - Assert(!scan_all);
>
> Why you have removed this Assert, won't the count of
> vacrelstats->scanned_pages + vacrelstats->vmskipped_pages be
> equal to vacrelstats->rel_pages when scall_all = true.

Fixed.

> 10.
> vacuumlazy.c
> lazy_vacuum_rel()
> ..
> + scanned_all |= scan_all;
> +
>
> Why this new assignment is added, please add a comment to
> explain it.

It's not necessary, removed.

> 11.
> lazy_scan_heap()
> ..
> + * Also, skipping even a single page accorind to all-visible bit of
> + * visibility map means that we can't update relfrozenxid, so we only want
> + * to do it if we can skip a goodly number. On the other hand, we count
> + * both how many pages we skipped according to all-frozen bit of visibility
> + * map and how many pages we freeze page, so we can update relfrozenxid if
> + * the sum of their is as many as tuples per page.
>
> a.
> typo
> /accorind/according

Fixed.

> b.
> is the second part of comment (starting from On the other hand)
> right? I mean you are comparing sum of pages skipped due to
> all_frozen bit and number of pages freezed with tuples per page.
> I don't understand how are they related?
>

It's wrong, I wanted to say at last sentence that, "so we can update
relfrozenxid if the sum of them is as many as pages of table."

> 12.
> @@ -918,8 +954,13 @@ lazy_scan_heap(Relation onerel, LVRelStats
> *vacrelstats,
> else
> {
> num_tuples += 1;
> + ntup_in_blk += 1;
> hastup = true;
>
> + /* If current tuple is already frozen, count it up */
> + if (HeapTupleHeaderXminFrozen(tuple.t_data))
> + already_nfrozen += 1;
> +
> /*
> * Each non-removable tuple must be checked to see if it needs
> * freezing. Note we already have exclusive buffer lock.
>
> Here, if tuple is already_frozen, can't we just continue and
> check for next tuple?

I think it's impossible because the logic related to old-style VACUUM
FULL is remained yet in HeapTupleHeaderXminFrozen().

> 13.
> +extern XLogRecPtr log_heap_frozenmap(RelFileNode rnode, Buffer heap_buffer,
> + Buffer fm_buffer);
>
> It seems like this function is not used.

Fixed.

> You have added relallfrozen similar to relallvisible, but how you
> are planning to use it, is there any usecase for it?

Yep, the value of relallvisible would be effective for in case where
the user want to know how the vacuuming takes time to do.
If this value is low score, it's a usually good idea to do VACUUM
FREEZE manually to prevent unpredictable anti-wrapping vacuum.

> a. please explain in comment why it is safe if someone clear the
> frozen bit concurrently
> b. won't skipping pages intermittently due to set frozen bit break the
> readahead mechanism? In this regard, if possible, I think we should
> do some tests to see the benefit of this patch. I understand that in
> general, it will be good to skip pages, however it seems better to check
> that with some different kind of tests.

In latest patch, we can skip the all-visible or all-frozen page until
we find next_not_all_visible_block,
and then we do re-check whether this page is all-frozen to skip to
vacuum this page even if scan_all is true.
Also, I added the message about number of the skipped frozen pages to
the verbose log for test.

> Please place it in core. I see value in having a diagnostic function for
> general use on production systems.

I added new heapfuncs.c file for related heap function which the DBA
uses, and then added theses function to that file.
But test cases are not yet, I'm making them.

Also something for pg_upgrade is also not yet.

TODO
- Test case for this feature
- pg_upgrade support.

Regards,

--
Sawada Masahiko

Attachment Content-Type Size
000_add_frozen_bit_into_visibilitymap_v6.patch application/octet-stream 57.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2015-07-09 18:08:30 Re: PL/pgSQL, RAISE and error context
Previous Message Tom Lane 2015-07-09 17:27:43 Re: WAL logging problem in 9.4.3?