Re: Raw device on PostgreSQL

From: Jose Luis Tallon <jltallon(at)adv-solutions(dot)net>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andreas Karlsson <andreas(at)proxel(dot)se>, Benjamin Schaller <benjamin(dot)schaller(at)s2018(dot)tu-chemnitz(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Raw device on PostgreSQL
Date: 2020-05-01 10:22:39
Message-ID: 435d05a4-acd6-856c-3050-4dae70b85d00@adv-solutions.net
Lists: pgsql-hackers

On 30/4/20 6:22, Thomas Munro wrote:
> On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Yeah, I think the question is what are the expected benefits of using
>> raw devices. It might be an interesting exercise / experiment, but my
>> understanding is that most of the benefits can be achieved by using file
>> systems but with direct I/O and async I/O, which would allow us to
>> continue reusing the existing filesystem code with much less disruption
>> to our code base.
> Agreed.
>
> [snip] That's probably the main work
> required to make this work, and might be a valuable thing to have
> independently of whether you stick it on a raw device, a big data
> file, NV RAM
   ^^^^^^  THIS, with NV DIMMs / PMEM (persistent memory) possibly
becoming a hot topic in the not-too-distant future
> or some other kind of storage system -- but it's a really
> difficult project.

Indeed... But you might have already pointed out the *only* feature
required for this to work: a "database" mapping each relfilenode to a
"set of extents" where the current I/O primitives would be able to
write. The key is actually an int, or rather a tuple (relfilenode,
segment) whose components are both 32-bit currently: that is, a 64-bit
"objectID" of sorts. And yes, extents, not blocks: sequential I/O is
still faster on all known persistent storage (as opposed to RAM).
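
To make the idea concrete, here is a minimal sketch of such a mapping in
Python. It is only an illustration, not PostgreSQL code: the names Extent,
object_id, extent_map and add_extent are hypothetical, and a real
implementation would need a persistent, crash-safe structure rather than
an in-memory dict.

```python
from typing import Dict, List, NamedTuple


class Extent(NamedTuple):
    """A contiguous byte range on the raw device (hypothetical type)."""
    start: int   # byte offset of the extent within the tablespace/device
    length: int  # extent length in bytes


def object_id(relfilenode: int, segment: int) -> int:
    """Pack the two 32-bit components into a single 64-bit key."""
    if not (0 <= relfilenode < 2**32 and 0 <= segment < 2**32):
        raise ValueError("both components must fit in 32 bits")
    return (relfilenode << 32) | segment


# Hypothetical "database": objectID -> extents, in logical (file) order.
extent_map: Dict[int, List[Extent]] = {}


def add_extent(relfilenode: int, segment: int, start: int, length: int) -> None:
    """Append one more extent to the given (relfilenode, segment)."""
    key = object_id(relfilenode, segment)
    extent_map.setdefault(key, []).append(Extent(start, length))
```

Keeping the extents in logical order means the n-th extent in the list
holds the n-th chunk of the "file", so a lookup is a simple walk over the
list.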

Some conversion from "absolute" (within the "file") to "relative"
(within the "tablespace") offsets would need to happen before delegating
to the kernel (or even before dereferencing a pointer into an mmap'd
region!), but not much more, ISTM (though I'm far from an expert in this
area).
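
As a rough sketch of that conversion, assuming each "file" is described
by an ordered list of (start, length) extents on the device, the
translation could look like the following. The function name
to_tablespace_offset and the extent representation are hypothetical; an
I/O that crosses an extent boundary would additionally have to be split,
which is not shown here.

```python
from typing import List, Tuple


def to_tablespace_offset(extents: List[Tuple[int, int]], file_offset: int) -> int:
    """Map an offset within the "file" to an offset within the tablespace.

    extents holds (start, length) pairs in logical order; start is the
    extent's byte position within the tablespace/device.
    """
    if file_offset < 0:
        raise ValueError("offset must be non-negative")
    remaining = file_offset
    for start, length in extents:
        if remaining < length:
            # The offset falls inside this extent.
            return start + remaining
        remaining -= length
    raise ValueError("offset lies beyond the allocated extents")


MB = 1024 * 1024
# Two 4MB extents that are not adjacent on the device.
example_extents = [(8 * MB, 4 * MB), (32 * MB, 4 * MB)]
second_extent_hit = to_tablespace_offset(example_extents, 4 * MB + 8192)
```

The result is what would then be handed to the kernel (e.g. as the offset
of a positioned read or write) or added to the base address of a mapped
region.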

Off the top of my head:

CREATE TABLESPACE tblspcname [other_options] LOCATION '/dev/nvme1n2'
WITH (kind=raw, extent_min=4MB);

  or something similar to that approach might do it.

    Please note that I have purposefully specified "namespace 2" of an
"enterprise" NVMe device, to show the possibility.

OR

  use some filesystem (e.g. XFS) with DAX[1] (mount -o dax) where
available, along with something equivalent to WITH (kind=mmaped)

... though the locking we currently get "for free" from the kernel would
need to be replaced by something else.

Indeed, it seems like an enormous amount of work... but it may well pay
off. I can't fully assess the effort, though.

Just my .02€

[1] https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Thanks,

    / J.L.
