Re: Freeze avoidance of very large table.

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: sawada(dot)mshk(at)gmail(dot)com
Cc: alvherre(at)2ndquadrant(dot)com, Jim(dot)Nasby(at)bluetreble(dot)com, robertmhaas(at)gmail(dot)com, michael(dot)paquier(at)gmail(dot)com, bruce(at)momjian(dot)us, jeff(dot)janes(at)gmail(dot)com, amit(dot)kapila16(at)gmail(dot)com, andres(at)anarazel(dot)de, simon(at)2ndquadrant(dot)com, josh(at)agliodbs(dot)com, masao(dot)fujii(at)gmail(dot)com, petr(at)2ndquadrant(dot)com, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Freeze avoidance of very large table.
Date: 2016-02-10 07:39:15
Message-ID: 20160210.163915.163976704.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

At Thu, 4 Feb 2016 02:32:29 +0900, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote in <CAD21AoB1HnZ7thWYjqKve78gQ5+PyedbbkjAPbc5zLV3oA-CuA(at)mail(dot)gmail(dot)com>
> On Tue, Feb 2, 2016 at 7:22 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> > Masahiko Sawada wrote:
> >> I think we have two options.
> >>
> >> 1. Change page layout(PG_PAGE_LAYOUT_VERSION) to 5. pg_upgrade detects
> >> it and then converts only VM files.
> >> 2. Change pg_upgrade plugin mechanism so that it can handle other name
> >> conversion plugins (e.g., convertLayout_vm_to_vfm)
> >>
> >> I think #2 is better. Thought?
> >
> > My vote is for #2 as well. Maybe we just didn't have forks when this
> > functionality was invented; maybe the author just didn't think hard
> > enough about what would be the right interface to do it.
>
> I've almost wrote up the very rough patch. (it can pass regression test)
> Windows supporting is not yet, and Makefile is not correct.
>
> I've divided the main patch into two patches; add frozen bit patch and
> pg_upgrade support patch.
> 000 patch is almost same as previous code. (includes small fix)
> 001 patch provides rewriting visibility map as a pageConverter routine.
> 002 patch is for enhancement debug message in visibilitymap.c

Thanks, it becomes easy to read.

> In order to support pageConvert plugin, I made the following changes.
> * Main changes
> - Remove PAGE_CONVERSION
> - pg_upgrade plugin is located to 'src/bin/pg_upgrade/plugins' directory.
> - Move directory having plugins from '$(bin)/plugins' to '$(lib)/plugins'.

These seem fair.

> - Add new page-converter plugin function for visibility map.
> - Current code doesn't allow us to use link mode (-k) in the case
> where page-converter is required.
>
> But I changed it so that if page-converter for fork file is
> specified, we convert it actually even when link mode.
>
> * Interface designe
> convertFile() and convertPage() are plugin function for main relation
> file, and these functions are dynamically loaded by
> loadConvertPlugin().

Though I haven't looked this so closer, loadConverterPlugin looks
to continue deciding what plugin to load using old and new page
layout versions. Currently the acutually possible versions are 4
and if we increment it now, 5.

On the other hand, _vm came at the *catalog version* 201107031
(9.1 release) and _fsm came at 8.4 release. Both of them are of
page layout version 4. Are we allowed to increment page layout
version for this reason? And is this framework under
reconstruction is flexible enough for this kind of changes in
future? I don't think so.

We have added _vm and _fsm so far so we must use a version number
that can determine when _vm, _fsm and _vfm are introduced. I'm
afraid that it is out of purpose, catalog version seems to be
most usable, it is already used to know when the crash safe VM
has been introduced.

Using the catalog version, the plugin we provide first would be
converteLayout_201105231_201602071.so which has only a converter
from _vm to _vfm. This plugin is loaded for the combination of
the source cluster with the catalog version of 201105231(when VM
has been introduced) or later and the destination cluster with
that *before* 201602071 (this version).

If we change the format of fsm (vm no longer exists), we would
have a new plugin maybe named
converteLayout_200904091_2017xxxxx.so which has an, maybe,
inplace file converter for fsm. It will be loaded when a source
database is of the catalog version of 200904091(when FSM has been
introduced) or later and a destination before 2017xxxxx(the
version). Catalog version seems to work fine.

So far, I assumed that we can safely assume that the name of
files to be converted is <oid>[fork_name] so the possible types
of conversions would be the following.

- per-page conversion
- per-file conversion between the files with the same fork name.
- per-file conversion between the files with different fork names.

Since the plugin filename doesn't tell such things, they should
be told by the plugin itself. So a plugin is to provide the
following interface,

typedef struct ConverterTable
{
char *src_fork_name;
char *dst_fork_name;
FileConverterFunc file_conveter;
PageConverterFunc page_conveter;
} ConverterTable[];

Following such name convention of plugins, we may load multiple
plugins at once, so we collect all entries of the table of all
loaded plugins and check if any src_fork_name from them don't
duplicate.

Here, we have sufficient information to choose what conveter to
invoke and execute conversion like this.

for (fork_name in all_fork_names_including_"" )
{
find a converter comparing fork_name with src_fork_name.
check dst_fork_name and rename the target file if needed.
invoke the converter.
}

If we need to convert clogs or similars and need to prepare for
such events, the ConverterTable might have an additional member
and change the meaning of some of existing members.

typedef struct ConverterTable
{
enum target_type; /* FILE_NAME or FORK_NAME */
char *src_name;
char *dst_name;
FileConverterFunc file_conveter;
PageConverterFunc page_conveter;
} ConverterTable[];

when target_type == FILE_NAME, src_name and dst_name represents
the target file names relatively to $PGDATA.

# Yeah, I know it is too complicated.

> I added a new pageConvert plugin function converVMFile() for
> visibility map (fork file).
> If layout of CLOG, FSM or etc will be changed in the future, we could
> expand some new pageConvert plugin functions like convertCLOGFile() or
> convertFSMFile(), and these functions are dynamically loaded by
> loadAdditionalConvertPlugin().
> It means that main file and other fork file conversion are executed
> independently, and conversion for fork file are executed even if link
> mode is specified.
> Each conversion plugin is loaded and used only when it's required.

As I asked upthread, It is one of the most important design point
of plugin mechanism that what characteristics of src and/or dest
cluster to trigger loading of a plugin. And if page layout format
is it, are we allowed to increment for such irrelevant events? Or
using another characteristics like catalog version?

> I still agree with this plugin approach, but I felt it's still
> complicated a bit, and I'm concerned that patch size has been
> increased.
> Please give me feedbacks.

Yeah, I feel the same. What make it worse, the plugin mechanism
will get further complex if we make it more flexible for possible
usage as I proposed above. It is apparently too complicated for
deciding whether to load *just one*, for now, converter
function. And no additional converter is in sight.

I incline to pull out all the plugin stuff of pg_upgrade. We are
so prudent to make changes of file formats so this kind of events
will happen with several-years intervals. The plugin mechanism
would be valuable if we are encouriged to change file formats
more frequently and freely by providing it, but such situation
absolutely introduces more untoward things..

> If there are not objections about this, I'm going to spend time
> to improve it.

Sorry, but I do have strong objection to this... Does anyone else
have opinions for that?

regareds,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2016-02-10 07:57:05 Re: proposal: function parse_ident
Previous Message Michael Paquier 2016-02-10 07:31:34 Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby