Re: mdnblocks() sabotages error checking in _mdfd_getseg()

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mdnblocks() sabotages error checking in _mdfd_getseg()
Date: 2015-12-10 16:10:10
Message-ID: CA+Tgmobvu=CwkjSOL0z9tdxCYevAjSTg6C=E5ZuhL5Yg2VS0TA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 9, 2015 at 4:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> The comment in mdnblocks.c says this:
>
>> * Because we pass O_CREAT, we will create the
>> next segment (with
>> * zero length) immediately, if the last
>> segment is of length
>> * RELSEG_SIZE. While perhaps not strictly
>> necessary, this keeps
>> * the logic simple.
>
>> I don't really see how this "keeps the logic simple".
>
> My (vague) recollection is that this is defending some assumptions
> elsewhere in md.c. You sure you haven't broken anything with this?
> Relation extension across a segment boundary, perhaps?

It doesn't look like that particular thing gets broken here. The key
to extending across a segment boundary is that _mdfd_openseg() needs
to get O_CREAT as its fourth argument, but _mdfd_getseg() thankfully
does not rely on mdnblocks() to have already done that: it passes
O_CREAT itself. I did also test it by creating a 13GB
pgbench_accounts table, and at least in that simple case it works
fine. It's possible that there is some other thing I've missed, but
in general I think that if there is some dependency on mdnblocks()
creating new segments, that's a bad idea and we should try to get rid
of it. A call that supposedly reports the number of blocks in a
relation really has no business changing on-disk state.

(More broadly, as Kevin was pointing out to me yesterday, md.c looks
like it could do with a face lift. Keeping a linked list of 1GB
segments and chasing down the list to find the length of the file may
have been fine when relations over 1GB were rare, but that's now
routine. Some relations may be over 1TB, and using a linked list of
1000 entries to keep track of all of those segments does not seem like
an especially good choice. In fact, having no way to get the relation
length other than scanning 1000 files doesn't seem like an especially
good choice even if we used a better data structure. Putting a header
page in the heap would make getting the length of a relation O(1)
instead of O(segments), and for a bonus, we'd be able to reliably
detect it if a relation file disappeared out from under us. That's a
difficult project and definitely not my top priority, but this code is
old and crufty all the same.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-12-10 16:10:55 Re: proposal: multiple psql option -c
Previous Message Robert Haas 2015-12-10 15:59:55 Re: [PROPOSAL] VACUUM Progress Checker.