Re: Cache Hash Index meta page.

From: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Cache Hash Index meta page.
Date: 2016-12-20 19:25:37
Message-ID: CAD__OuiAjLG_3mDchgy193s5-hhTjO6B8e4jvuPGGSv_tAzorA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 20, 2016 at 3:51 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Dec 16, 2016 at 5:16 AM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
> wrote:
>
>>> Shouldn't _hash_doinsert() be using the cache, too?
>>
>> Yes, we have an opportunity there; I have added the same in the code. But
>> one difference is that at the end we need to read the meta page at least
>> once to split and/or write. The performance improvement might not be as
>> much as in the read-only case.
>
> Why would you need to read it at least once in one case but not the other?
>

For read-only: in _hash_first, if the target bucket has not been split after
caching the meta page contents, we never need to read the metapage again. But
for insert: in _hash_doinsert, at the end we modify the meta page to store the
number of tuples.

+ * Write-lock the metapage so we can increment the tuple count. After
+ * incrementing it, check to see if it's time for a split.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ metap->hashm_ntuples += 1;

> Well, I guess that makes it more appealing to cache the whole page at least
> in terms of raw number of bytes, but I suppose my real complaint here is
> that there don't seem to be any clear rules for where, whether, and to what
> extent we can rely on the cache to be valid.
>

-- Okay, I will revert back to caching the entire meta page.

> Without that, I think this patch is creating an extremely serious
> maintenance hazard for anyone who wants to try to modify this code in the
> future. A future patch author needs to be able to tell what data they can
> use and what data they can't use, and why.
>

-- If it is okay, I can document for each member of HashMetaPageData whether
it can be read from the cached meta page or must be read directly from the
current meta page. Below I have briefly commented on each member. If you
approve of this approach, I will produce a neat patch for the same.

typedef struct HashMetaPageData
{

1. uint32 hashm_magic; /* magic no. for hash tables */

-- This should remain the same, so it can be read from the cached meta page.

2. uint32 hashm_version; /* version ID */

-- This is initialized once and never changed afterwards, so it can be read
from the cached metapage.

3. double hashm_ntuples; /* number of tuples stored in the table */

-- This changes on every insert, so it should not be read from the cached data.

4. uint16 hashm_ffactor; /* target fill factor (tuples/bucket) */

-- This is initialized once and never changed afterwards, so it can be read
from the cached metapage.

5. uint16 hashm_bsize; /* index page size (bytes) */

-- This is initialized once and never changed afterwards, so it can be read
from the cached metapage.

6. uint16 hashm_bmsize; /* bitmap array size (bytes) - must be a power of 2 */

-- This is initialized once and never changed afterwards, so it can be read
from the cached metapage.

7. uint16 hashm_bmshift; /* log2(bitmap array size in BITS) */

-- This is initialized once and never changed afterwards, so it can be read
from the cached metapage.

8. If hashm_maxbucket, hashm_highmask and hashm_lowmask are all read and
cached at the same time while the metapage is locked, then the key-to-bucket
mapping function _hash_hashkey2bucket() should always produce the same output.
If a bucket is split after caching these elements (which can be detected
because old bucket pages never move once allocated, and we mark bucket pages'
hasho_prevblkno with the incremented hashm_highmask), we can invalidate them
and re-read the same from the meta page. If the intention is not to save a
metapage read while trying to map the key to a bucket page, then do not read
them from the cached meta page.

uint32 hashm_maxbucket; /* ID of maximum bucket in use */

uint32 hashm_highmask; /* mask to modulo into entire table */

uint32 hashm_lowmask; /* mask to modulo into lower half of table */

9. uint32 hashm_ovflpoint; /* splitpoint from which ovflpgs being
                            * allocated */

-- Since this is used for allocation of overflow pages, the latest value
should be read directly from the meta page.

10. uint32 hashm_firstfree; /* lowest-number free ovflpage (bit#) */

-- Should always be read from the metapage directly.

11. uint32 hashm_nmaps; /* number of bitmap pages */

-- Should always be read from the metapage directly.

12. RegProcedure hashm_procid; /* hash procedure id from pg_proc */

-- Never used till now; when we do use it, we will need to decide its policy.

13. uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before
                                                * each splitpoint */

-- The spares before a given splitpoint never change, and bucket pages never
move. So when combined with cached hashm_maxbucket, hashm_highmask and
hashm_lowmask, all read at the same time under a lock, BUCKET_TO_BLKNO should
always produce the same output, pointing to the right physical block. This
should only be used to save a meta page read when we want to map a key to its
bucket block, as said above.

14. BlockNumber hashm_mapp[HASH_MAX_BITMAPS]; /* blknos of ovfl bitmaps */

-- Always read from the metapage directly.

} HashMetaPageData;

> Any existing bugs should be the subject of a separate patch.
>
>
-- Sorry, I will make a new patch for the same separately.

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
