Re: WIP: [[Parallel] Shared] Hash

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-03-26 21:47:26
Message-ID: 20170326214726.sqk4zt2bbq5ofthd@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> Here is a new patch series responding to feedback from Peter and Andres:

+
+/* Per-participant shared state. */
+typedef struct SharedTuplestoreParticipant
+{
+ LWLock lock;

Hm. No padding (ala LWLockMinimallyPadded / LWLockPadded) - but that's
probably ok, for now.

+ bool error; /* Error occurred flag. */
+ bool eof; /* End of file reached. */
+ int read_fileno; /* BufFile segment file number. */
+ off_t read_offset; /* Offset within segment file. */

Hm. I wonder if it'd not be better to work with 64bit offsets, and just
separate that out upon segment access.

+/* The main data structure in shared memory. */

"main data structure" isn't particularly meaningful.

+struct SharedTuplestore
+{
+ int reading_partition;
+ int nparticipants;
+ int flags;

Maybe add a comment saying /* flag bits from SHARED_TUPLESTORE_* */?

+ Size meta_data_size;

What's this?

+ SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER];

I'd add a comment here, that there's further data after participants.

+};

+
+/* Per-participant backend-private state. */
+struct SharedTuplestoreAccessor
+{

Hm. The name and it being backend-local are a bit conflicting.

+ int participant; /* My partitipant number. */
+ SharedTuplestore *sts; /* The shared state. */
+ int nfiles; /* Size of local files array. */
+ BufFile **files; /* Files we have open locally for writing. */

Shouldn't this mention that it's indexed by partition?

+ BufFile *read_file; /* The current file to read from. */
+ int read_partition; /* The current partition to read from. */
+ int read_participant; /* The current participant to read from. */
+ int read_fileno; /* BufFile segment file number. */
+ off_t read_offset; /* Offset within segment file. */
+};

+/*
+ * Initialize a SharedTuplestore in existing shared memory. There must be
+ * space for sts_size(participants) bytes. If flags is set to the value
+ * SHARED_TUPLESTORE_SINGLE_PASS then each partition may only be read once,
+ * because underlying files will be deleted.

Any reason not to use flags that are compatible with tuplestore.c?

+ * Tuples that are stored may optionally carry a piece of fixed sized
+ * meta-data which will be retrieved along with the tuple. This is useful for
+ * the hash codes used for multi-batch hash joins, but could have other
+ * applications.
+ */
+SharedTuplestoreAccessor *
+sts_initialize(SharedTuplestore *sts, int participants,
+ int my_participant_number,
+ Size meta_data_size,
+ int flags,
+ dsm_segment *segment)
+{

Not sure I like that the naming here has little in common with
tuplestore.h's api.

+
+MinimalTuple
+sts_gettuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{

This needs docs.

+ SharedBufFileSet *fileset = GetSharedBufFileSet(accessor->sts);
+ MinimalTuple tuple = NULL;
+
+ for (;;)
+ {

...
+ /* Check if this participant's file has already been entirely read. */
+ if (participant->eof)
+ {
+ BufFileClose(accessor->read_file);
+ accessor->read_file = NULL;
+ LWLockRelease(&participant->lock);
+ continue;

Why are we closing the file while holding the lock?

+
+ /* Read the optional meta-data. */
+ eof = false;
+ if (accessor->sts->meta_data_size > 0)
+ {
+ nread = BufFileRead(accessor->read_file, meta_data,
+ accessor->sts->meta_data_size);
+ if (nread == 0)
+ eof = true;
+ else if (nread != accessor->sts->meta_data_size)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from temporary file: %m")));
+ }
+
+ /* Read the size. */
+ if (!eof)
+ {
+ nread = BufFileRead(accessor->read_file, &tuple_size, sizeof(tuple_size));
+ if (nread == 0)
+ eof = true;

Why is it legal to have EOF here, if metadata previously didn't have an
EOF? Perhaps add an error if accessor->sts->meta_data_size != 0?

+ if (eof)
+ {
+ participant->eof = true;
+ if ((accessor->sts->flags & SHARED_TUPLESTORE_SINGLE_PASS) != 0)
+ SharedBufFileDestroy(fileset, accessor->read_partition,
+ accessor->read_participant);
+
+ participant->error = false;
+ LWLockRelease(&participant->lock);
+
+ /* Move to next participant's file. */
+ BufFileClose(accessor->read_file);
+ accessor->read_file = NULL;
+ continue;
+ }
+
+ /* Read the tuple. */
+ tuple = (MinimalTuple) palloc(tuple_size);
+ tuple->t_len = tuple_size;

Hm. Constantly re-allocing this doesn't strike me as a good idea (not to
mention that the API doesn't mention this is newly allocated). Seems
like it'd be a better idea to have a per-accessor buffer where this can
be stored in - increased in size when necessary.

- Andres

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Dolgov 2017-03-26 21:57:10 Re: [PATCH] few fts functions for jsonb
Previous Message Corey Huinker 2017-03-26 21:42:10 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)