SerializeParamList vs machines with strict alignment

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: SerializeParamList vs machines with strict alignment
Date: 2018-09-10 03:27:12
Message-ID: 11629.1536550032@sss.pgh.pa.us
Lists: pgsql-hackers

I wondered why buildfarm member chipmunk has been failing hard
for the last little while. Fortunately, it's supplying us with
a handy backtrace:

Program terminated with signal 7, Bus error.
#0 EA_flatten_into (allocated_size=<optimized out>, result=0xb55ff30e, eohptr=0x188f440) at array_expanded.c:329
329 aresult->dataoffset = dataoffset;
#0 EA_flatten_into (allocated_size=<optimized out>, result=0xb55ff30e, eohptr=0x188f440) at array_expanded.c:329
#1 EA_flatten_into (eohptr=0x188f440, result=0xb55ff30e, allocated_size=<optimized out>) at array_expanded.c:293
#2 0x003c3dfc in EOH_flatten_into (eohptr=<optimized out>, result=<optimized out>, allocated_size=<optimized out>) at expandeddatum.c:84
#3 0x003c076c in datumSerialize (value=3934060, isnull=<optimized out>, typByVal=<optimized out>, typLen=<optimized out>, start_address=0xbea3bd54) at datum.c:341
#4 0x002a8510 in SerializeParamList (paramLI=0x1889f18, start_address=0xbea3bd54) at params.c:195
#5 0x002342cc in ExecInitParallelPlan (planstate=0xffffffff, estate=0x18863e0, sendParams=0x46e, nworkers=1, tuples_needed=-1) at execParallel.c:700
#6 0x002461dc in ExecGather (pstate=0x18864f0) at nodeGather.c:151
#7 0x00236b20 in ExecProcNodeFirst (node=0x18864f0) at execProcnode.c:445
#8 0x0022fc2c in ExecProcNode (node=0x18864f0) at ../../../src/include/executor/executor.h:237
#9 ExecutePlan (execute_once=<optimized out>, dest=0x188a108, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x18864f0, estate=0x18863e0) at execMain.c:1721
#10 standard_ExecutorRun (queryDesc=0x188a138, direction=<optimized out>, count=0, execute_once=true) at execMain.c:362
#11 0x0023d630 in postquel_getnext (fcache=0x1888408, es=0x1889d68) at functions.c:867
#12 fmgr_sql (fcinfo=0x701c7c) at functions.c:1164

This is remarkably hard to replicate on other machines, but I eventually
managed to duplicate it on gaur's host, after which it became really
obvious that the parallel-query data transfer logic has never been
stressed very hard on machines with strict data alignment rules.

In particular, SerializeParamList does this:

    /* Write flags. */
    memcpy(*start_address, &prm->pflags, sizeof(uint16));
    *start_address += sizeof(uint16);

immediately followed by this:

    datumSerialize(prm->value, prm->isnull, typByVal, typLen,
                   start_address);

and datumSerialize might do this:

    EOH_flatten_into(eoh, (void *) *start_address, header);
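
To make the arithmetic concrete: the backtrace shows result=0xb55ff30e, an address ending in 0xe, which is 2 bytes past a 4-byte boundary. The following is a minimal standalone sketch, not PostgreSQL code, of how writing a 2-byte flags field at an aligned cursor leaves the cursor in that state. The names is_aligned, write_flags, and misalignment_after_flags are hypothetical helpers invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p is a multiple of 'align' (a power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t) p & (align - 1)) == 0;
}

/* Mimics the "write flags, then advance the cursor" step quoted above. */
static char *write_flags(char *cursor, uint16_t flags)
{
    memcpy(cursor, &flags, sizeof(uint16_t));   /* memcpy is safe at any alignment */
    return cursor + sizeof(uint16_t);
}

/*
 * Starts from an 8-byte-aligned buffer, writes one flags field, and
 * returns the cursor address modulo 4.
 */
static size_t misalignment_after_flags(void)
{
    /* The union member forces 8-byte alignment of the byte array. */
    union { char bytes[64]; uint64_t force_align; } buf;
    char *cursor;

    assert(is_aligned(buf.bytes, 8));
    cursor = write_flags(buf.bytes, 0x0001);
    return (size_t) ((uintptr_t) cursor % 4);
}
```

Any code that then stores a 4-byte field through a typed pointer at that cursor is relying on the hardware to tolerate the misaligned access.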

Now, I will plead mea culpa that the expanded-object API doesn't
say in large red letters that the target address for EOH_flatten_into
is supposed to be maxaligned. It only says

     * The flattened representation must be a valid in-line, non-compressed,
     * 4-byte-header varlena object.

Still, one might reasonably infer from that comment that *at least* 4-byte
alignment is expected. This code path isn't providing such alignment,
and machines that require it will crash. The only reason we've not
noticed, AFAICS, is that nobody has been running with
force_parallel_mode = regress on alignment-picky hardware.
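
One possible way around this kind of problem, sketched here under stated assumptions rather than as the fix adopted in PostgreSQL, is to flatten into a temporary buffer that is known to be suitably aligned and then memcpy the bytes to the unaligned destination. Everything in the sketch (DemoHeader, demo_flatten_into, demo_serialize, demo_read_dataoffset) is a hypothetical stand-in written for illustration; none of it is the real expanded-object or datum API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the header of a flattened object. */
typedef struct DemoHeader
{
    int32_t total_size;
    int32_t dataoffset;
} DemoHeader;

/*
 * Stand-in for a flatten callback. It stores through a typed pointer,
 * so 'result' must be aligned for DemoHeader.
 */
static void demo_flatten_into(void *result, size_t allocated_size)
{
    DemoHeader *hdr = (DemoHeader *) result;

    hdr->total_size = (int32_t) allocated_size;
    hdr->dataoffset = (int32_t) sizeof(DemoHeader);
}

/*
 * Serialize 'size' bytes at *cursor, which may have any alignment.
 * Returns 0 on success, -1 on a too-small size or allocation failure.
 */
static int demo_serialize(char **cursor, size_t size)
{
    void *tmp;

    if (size < sizeof(DemoHeader))
        return -1;
    tmp = malloc(size);             /* malloc memory is aligned for any object type */
    if (tmp == NULL)
        return -1;
    demo_flatten_into(tmp, size);   /* typed stores hit aligned memory only */
    memcpy(*cursor, tmp, size);     /* byte copy has no alignment requirement */
    free(tmp);
    *cursor += size;
    return 0;
}

/* Reads the header back from an arbitrarily aligned address via memcpy. */
static int32_t demo_read_dataoffset(const char *src)
{
    DemoHeader hdr;

    memcpy(&hdr, src, sizeof(hdr));
    return hdr.dataoffset;
}
```

The cost is one extra allocation and copy per flattened object. The alternative is to pad the serialized stream so the cursor is aligned before each datum, which changes the on-the-wire layout and requires the reader to apply the same padding.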

regards, tom lane
