Re: Drop type "smgr"?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Shawn Debnath <sdn(at)amazon(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Drop type "smgr"?
Date: 2019-03-01 22:04:06
Message-ID: CA+hUKGK5c7pCTtXY8=K-Ow96FHppG1F+GTCf78fKz9XTxV7G2w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 1, 2019 at 9:11 PM Konstantin Knizhnik
<k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
> One more thing... From my point of view one of the drawbacks of Postgres
> is that it requires an underlying file system and is not able to work with
> raw partitions.
> It seems to me that bypassing the file system layer can significantly improve
> performance and give more possibilities for IO performance tuning.
> Certainly it will require a lot of changes in the Postgres storage layer so
> this is not what I suggest to implement or even discuss right now.
> But it may be useful to keep it in mind in discussions concerning
> "generic storage manager".

Hmm. Speculation-around-the-water-cooler-mode: I think the arguments
for using raw partitions are approximately the same as the arguments
for using a big data file that holds many relations. The three I can
think of are (1) the space is entirely preallocated, which *might*
have performance and safety advantages, (2) if you encrypt it, no one
can see the structure (database OIDs, relation OIDs and sizes) and (3)
it might allow pages to be moved from one relation to another without
copying or interleaved in interesting ways (think SQL Server parallel
btree creation that "stitches together" btrees produced by parallel
workers, or Oracle "clustered" tables where the pages of two tables
are physically interleaved in an order that works nicely when you join
those two tables, or perhaps schemes for moving data between
relations/partitions quickly). On the other hand, to make that work
you have to own the problem of space allocation/management that we
currently leave to the authors of XFS et al, who have worked on it for
*years and years*, and whose file systems work really well. If
you made all that work for big preallocated data files, then sure, you
could also make it work for raw partitions, but I'm not sure how much
performance advantage there is for that final step. I suspect that a
major reason for Oracle to support raw block devices many years ago
was because back then there was no other way to escape from the page
cache. Direct IO hasn't always been available or portable, and hasn't
always worked well. That said, it does seem plausible that we could
do the separation of (1) block -> pathname/offset mappings and (2)
actual IO operations in a way that you could potentially write your
own pseudo-filesystem that stores a whole PostgreSQL cluster inside
big data files or raw partitions. Luckily we don't need to tackle
such mountainous terrain to avoid the page cache, today.
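
To make the separation of (1) and (2) a bit more concrete, here is a
rough sketch, not a proposal. The names BlockLocation,
map_segment_files() and read_block() are made up for illustration and
are not existing PostgreSQL APIs. BLCKSZ and RELSEG_SIZE are assumed to
have their default build values, and the path layout only approximates
what md.c does for the default tablespace.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed values: PostgreSQL's default build settings. */
#define BLCKSZ      8192    /* bytes per block */
#define RELSEG_SIZE 131072  /* blocks per segment file (1GB) */

/* Hypothetical: where a block lives, independent of how it is read. */
typedef struct BlockLocation
{
    char     path[128];
    uint64_t offset;
} BlockLocation;

/*
 * Layer (1): block -> pathname/offset.  One file per relation segment,
 * roughly the scheme md.c uses today.  A pseudo-filesystem could supply
 * a different mapper that returns one big data file or a raw device
 * plus an offset obtained from its own space management.
 */
void
map_segment_files(uint32_t db, uint32_t rel, uint32_t blkno,
                  BlockLocation *loc)
{
    uint32_t segno = blkno / RELSEG_SIZE;

    if (segno == 0)
        snprintf(loc->path, sizeof(loc->path), "base/%u/%u", db, rel);
    else
        snprintf(loc->path, sizeof(loc->path), "base/%u/%u.%u",
                 db, rel, segno);
    loc->offset = (uint64_t) (blkno % RELSEG_SIZE) * BLCKSZ;
}

/*
 * Layer (2): the actual IO.  It knows nothing about relations, only
 * about a path and an offset.  Returns bytes read, or -1 on error.
 */
ssize_t
read_block(const BlockLocation *loc, void *buf)
{
    int     fd = open(loc->path, O_RDONLY);
    ssize_t nread;

    if (fd < 0)
        return -1;
    nread = pread(fd, buf, BLCKSZ, (off_t) loc->offset);
    close(fd);
    return nread;
}
```

The point is only that read_block() would not need to change if the
mapper were swapped out. The hard part, space allocation inside a big
file or partition, is exactly what this sketch leaves out.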

--
Thomas Munro
https://enterprisedb.com
