Possible issue with expanded object infrastructure on Postgres 9.6.1

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Possible issue with expanded object infrastructure on Postgres 9.6.1
Date: 2017-01-20 01:45:37
Message-ID: CAM3SWZRmGTKcO3yB1gwuGte9JLogxy53YQUjj5JegYe8=QXivQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

A customer is on 9.6.1, and complains of a segfault observed at least
3 times. In all cases, the logs look like this:

Jan 11 16:11:07 ip-10-0-118-82 kernel: [41913.530453] traps:
postgres[6561] general protection ip:55fcf08b0491 sp:7ffc17dfa650
error:0 in postgres[55fcf0557000+638000]
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-1] LOG: server
process (PID 643) was terminated by signal 11: Segmentation fault
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-2] DETAIL: Failed
process was running: SELECT COUNT(count_column) FROM (SELECT 1 AS
count_column FROM "albums" INNER JOIN "album_photo_assignments" ON
"albums"."id" = "album_photo_assignments"."album_id" WHERE
"albums"."deleted_at" IS NULL AND "album_photo_assignments"."photo_id"
= $1 LIMIT 1000 OFFSET 0) subquery_for_count
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [5-1] LOG: terminating
any other active server processes
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-1] WARNING:
terminating connection because of crash of another server process
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-2] DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-3] HINT: In a
moment you should be able to reconnect to the database and repeat your
command.

The "general protection ip:55fcf08b0491 sp:7ffc17dfa650 error:0 in
postgres[55fcf0557000+638000]" details are identical in all 3
segfaults observed. Note that the kernel in question is
"4.4.0-57-generic #78~14.04.1-Ubuntu". These crashes were days apart,
but I have yet to isolate the problem. The customer tells me that this
is a representative query plan:

explain SELECT COUNT(count_column)
FROM (SELECT 1 AS count_column
FROM "albums"
INNER JOIN "album_photo_assignments" ON "albums"."id" =
"album_photo_assignments"."album_id"
WHERE "albums"."deleted_at" IS NULL AND
"album_photo_assignments"."photo_id" =
'fd0bcbb3-9c4c-4036-954a-cf01935d004c'
LIMIT 1000 OFFSET 0)
subquery_for_count ;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
Aggregate (cost=16419.94..16419.94 rows=1 width=8)
-> Limit (cost=1000.06..16419.94 rows=1 width=4)
-> Nested Loop (cost=1000.06..16419.94 rows=1 width=4)
-> Gather (cost=1000.00..16415.87 rows=1 width=16)
Workers Planned: 1
-> Parallel Seq Scan on album_photo_assignments
(cost=0.00..15415.77 rows=1 width=16)
Filter: (photo_id =
'fd0bcbb3-9c4c-4036-954a-cf01935d004c'::uuid)
-> Index Scan using albums_pkey on albums
(cost=0.06..4.06 rows=1 width=16)
Index Cond: (id = album_photo_assignments.album_id)
Filter: (deleted_at IS NULL)
(10 rows)

Note that this isn't a prepared statement, unlike the problematic
query we see from the logs. And, I've not yet managed to verify that
the problem case involved execution of this particular "$1 substitute
constant", since that isn't displayed in the logs. So, while this
query plan might well be representative of the problem, it may also
not be.

I can use GDB to get details of the instruction pointer that appeared
in the kernel trap error, which shows a function from the expanded
value representation infrastructure:

(gdb) info symbol 0x55fcf08b0491
EOH_get_flat_size + 1 in section .text of /usr/lib/postgresql/9.6/bin/postgres
(gdb) info symbol 0x55fcf08b0490
EOH_get_flat_size in section .text of /usr/lib/postgresql/9.6/bin/postgres
(gdb) disassemble 0x55fcf08b0490
Dump of assembler code for function EOH_get_flat_size:
0x000055fcf08b0490 <+0>: push %rbp
0x000055fcf08b0491 <+1>: mov 0x8(%rdi),%rax
0x000055fcf08b0495 <+5>: mov %rsp,%rbp
0x000055fcf08b0498 <+8>: pop %rbp
0x000055fcf08b0499 <+9>: mov (%rax),%rax
0x000055fcf08b049c <+12>: jmpq *%rax
End of assembler dump.

(I believe that Linux outputs VMA-wise addresses in the kernel trap
error quoted above, an address from the userland program (Postgres)
address space, so GDB does in fact display interesting disassembly --
the documentation on this seems extremely limited.)

Note that there is a text array column in the table "albums". I note
that there have been fixes in this area already, that are not in 9.6.1
(commit 77cd0dc, for one). I know next to nothing about this code, and
would appreciate some guidance.

--
Peter Geoghegan

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-01-20 01:45:57 Re: Re: Clarifying "server starting" messaging in pg_ctl start without --wait
Previous Message Andres Freund 2017-01-20 01:35:29 Re: Re: Clarifying "server starting" messaging in pg_ctl start without --wait