Re: [HACKERS] Parallel Hash take II

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: [HACKERS] Parallel Hash take II
Date: 2017-12-14 08:06:34
Message-ID: CAEepm=0XjazRX_bLCD7dSKwCQ+vS1Vczxbvpnj57U5aO7VhEWg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 14, 2017 at 11:45 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> + bool overflow; /* Continuation of previous chunk? */
> + char data[FLEXIBLE_ARRAY_MEMBER];
> +} SharedTuplestoreChunk;
>
> Ah. I was thinking we could have the 'overflow' variable be an int,
> indicating the remaining length of the oversized tuple. That'd allow us
> to skip ahead to the end of the oversized tuple in concurrent processes
> after hitting it.

Right, that is a bit better as it avoids extra read-skip cycles for
multi-overflow-chunk cases. Done that way.
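
For reference, here's a minimal sketch of the resulting chunk header
(comments mine; the exact committed layout may differ):

    typedef struct SharedTuplestoreChunk
    {
        int     ntuples;    /* number of tuples in this chunk */
        int     overflow;   /* 0 for a regular chunk; for an oversized
                             * tuple, the number of overflow chunks
                             * remaining including this one, so that a
                             * concurrent reader can skip past the whole
                             * tuple in one step */
        char    data[FLEXIBLE_ARRAY_MEMBER];
    } SharedTuplestoreChunk;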

> + if (accessor->write_pointer + accessor->sts->meta_data_size >=
> + accessor->write_end)
> + elog(ERROR, "meta-data too long");
>
> That seems more like an Assert than a proper elog? Given that we're
> calculating size just a few lines above...

It's an error because the logic is not smart enough to split the
optional meta-data and tuple size over multiple chunks. I have added
comments there to explain. That error can be reached by calling
test_sharedtuplestore(1, 1, 1, 32756, 1), but 32755 is OK. My goal
here is to support arbitrarily large tuples, not arbitrarily large
per-tuple meta-data, since for my use case I only need 4 bytes (a hash
value). This could be improved later if required: anyone wanting more
general meta-data would probably want variable-sized meta-data anyway,
whereas this is fixed-size, and it would also be nice if oversized
tuples didn't have to start at the beginning of a new chunk.

I fixed two nearby fencepost bugs: I made the limit that triggers that
error smaller by sizeof(uint32), and fixed a problem with small tuples
appearing after an oversized tuple in a final overflow chunk (found by
hacking the test module to create mixtures of different sized tuples).
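
To illustrate, the corrected check looks roughly like this (a sketch
using the field names from the quoted patch, not the committed code):

    /*
     * The fixed-size meta-data and the uint32 tuple length are never
     * split across chunks, so both must fit in the current one.
     */
    if (accessor->write_pointer + accessor->sts->meta_data_size +
        sizeof(uint32) >= accessor->write_end)
        elog(ERROR, "meta-data too long");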

> + if (accessor->sts->meta_data_size > 0)
> + memcpy(accessor->write_pointer, meta_data,
> + accessor->sts->meta_data_size);
> + written = accessor->write_end - accessor->write_pointer -
> + accessor->sts->meta_data_size;
> + memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
> + tuple, written);
>
> Also, shouldn't the same Assert() be here as well if you have it above?

No, when it comes to the tuple we just write as much of it as will
fit, and write the rest in the loop below. Comments improved to make
that clear.
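
In outline, the write path goes something like this (a sketch; names
follow the quoted patch):

    /* Copy the fixed-size meta-data, which is known to fit. */
    if (accessor->sts->meta_data_size > 0)
        memcpy(accessor->write_pointer, meta_data,
               accessor->sts->meta_data_size);

    /* Copy as much of the tuple as fits in this chunk... */
    written = accessor->write_end - accessor->write_pointer -
        accessor->sts->meta_data_size;
    memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
           tuple, written);

    /* ...and spill the rest into overflow chunks in the loop below. */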

> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not read from shared tuplestore temporary file"),
> + errdetail("Short read while reading meta-data")));
>
> The errdetail doesn't follow the style guide (not a sentence ending with
> .), and seems internal-ish. I'm ok with keeping it, but perhaps we
> should change all these to be errdetail_internal()? Seems pointless to
> translate all of them.

Done.
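
The converted reports now look along these lines (the exact wording is
illustrative):

    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not read from shared tuplestore temporary file"),
             errdetail_internal("Short read while reading meta-data.")));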

> + LWLockAcquire(&p->lock, LW_EXCLUSIVE);
> + eof = p->read_page >= p->npages;
> + if (!eof)
> + {
> + read_page = p->read_page;
> + p->read_page += STS_CHUNK_PAGES;
> + }
> + LWLockRelease(&p->lock);
>
> So if we went to the world I'm suggesting, with overflow containing the
> length till the end of the tuple, this'd probably have to look a bit
> different.

Yeah. I almost wanted to change it back to a spinlock, but now the
critical section has grown bigger again...
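
For what it's worth, the reader side's skip-ahead ends up looking
roughly like this (a sketch; read_next_page is a hypothetical
accessor-local field remembering where the oversized tuple ends):

    /* After reading chunk_header for a newly claimed chunk: */
    if (chunk_header.overflow > 0)
    {
        /* Skip the remainder of the oversized tuple in one step. */
        accessor->read_next_page = read_page +
            chunk_header.overflow * STS_CHUNK_PAGES;
        continue;
    }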

> + if (!eof)
> + {
> + SharedTuplestoreChunk chunk_header;
> +
> + /* Make sure we have the file open. */
> + if (accessor->read_file == NULL)
> + {
> + char name[MAXPGPATH];
> +
> + sts_filename(name, accessor, accessor->read_participant);
> + accessor->read_file =
> + BufFileOpenShared(accessor->fileset, name);
> + if (accessor->read_file == NULL)
> + elog(ERROR, "could not open temporary file %s", name);
>
> Isn't this more an Assert or just not anything? There's no way
> BufFileOpenShared should ever return NULL, no?

Right. As of commit 923e8dee this can no longer return NULL (instead
it would raise an error), so I removed this redundant check.

> + /* If this is an overflow chunk, we skip it. */
> + if (chunk_header.overflow)
> + continue;
> +
> + accessor->read_ntuples = 0;
> + accessor->read_ntuples_available = chunk_header.ntuples;
> + accessor->read_bytes = offsetof(SharedTuplestoreChunk, data);
>
> Perhaps somewhere around here comment that we'll just loop around and
> call sts_read_tuple() in the next loop iteration?

Done.
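
The new comment reads something like this (wording mine):

    /* If this is an overflow chunk, we skip it. */
    if (chunk_header.overflow > 0)
        continue;

    accessor->read_ntuples = 0;
    accessor->read_ntuples_available = chunk_header.ntuples;
    accessor->read_bytes = offsetof(SharedTuplestoreChunk, data);

    /*
     * Loop around; the next iteration will call sts_read_tuple() and
     * start returning tuples from this chunk.
     */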

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
parallel-hash-v29.patchset.tgz application/x-gzip 44.2 KB
