RFC: seccomp-bpf support

From: Joe Conway <mail(at)joeconway(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Cc: Joshua Brindle <joshua(dot)brindle(at)crunchydata(dot)com>
Subject: RFC: seccomp-bpf support
Date: 2019-08-28 15:13:27
Message-ID: bc032e95-7e8b-ed00-8d87-ed9db449bdd6@joeconway.com
Lists: pgsql-hackers

SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall
filtering mechanism which allows reduction of the kernel attack surface
by blocking, or at least audit-logging, syscalls that are normally unused.

Quoting from this link:
https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt

"A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the
process. As system calls change and mature, bugs are found and
eradicated. A certain subset of userland applications benefit by
having a reduced set of available system calls. The resulting set
reduces the total kernel surface exposed to the application. System
call filtering is meant for use with those applications."

Recent security best-practices recommend, and certain highly
security-conscious organizations are beginning to require, that SECCOMP
be used to the extent possible. The major web browsers, container
runtime engines, and systemd are all examples of software that already
support seccomp.

---------
A seccomp (bpf) filter consists of a default action and a set of
rules with actions pertaining to specific syscalls (possibly with even
more specific sets of arguments). Once loaded into the kernel, a filter
is inherited by all child processes and cannot be removed. It can,
however, be overlaid with another filter. For any given syscall match,
the most restrictive (a.k.a. highest precedence) action will be taken by
the kernel. PostgreSQL has already been run "in the wild" under seccomp
control in containers, and possibly systemd. Adding seccomp support into
PostgreSQL itself mitigates issues with these approaches, and has
several advantages:

* Container seccomp filters tend to be extremely broad/permissive,
typically allowing about 6 out of 7 of all syscalls. They must do this
because the use cases for containers vary widely.
* systemd does not implement seccomp filters by default. Packagers may
decide to do so, but there is no guarantee. Adding them post install
potentially requires cooperation by groups outside the control of
the database admins.
* In the container and systemd case there is no particularly good way to
inspect what filters are active. It is possible to observe actions
taken, but again, control is possibly outside the database admin
group. For example, the best way to understand what happened is to
review the auditd log, which is likely not readable by the DBA.
* With built-in support, it is possible to lock down backend processes
more tightly than the postmaster.
* With built-in support, it is possible to lock down different backend
processes differently than each other, for example by using ALTER ROLE
... SET or ALTER DATABASE ... SET.
* With built-in support, it is possible to calculate and return (in the
form of an SRF) the effective filters being applied to the postmaster
and the current backend.
* With built-in support, it could be possible (this part not yet
implemented) to have separate filters for different backend types,
e.g. autovac workers, background writer, etc.
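
The precedence rule described above (when filters are overlaid, the most
restrictive action wins) can be sketched roughly as follows. This is a
minimal Python model, not code from the patch: PRECEDENCE, filter_action()
and effective_action() are hypothetical names, and the ranking
kill > error > log > allow is assumed to mirror how the kernel ranks the
corresponding SECCOMP_RET_* values.

```python
# Model of how stacked seccomp filters resolve to one action per syscall.
# Assumption: four actions, ranked the way the kernel ranks the matching
# SECCOMP_RET_* values (kill > errno > log > allow).
PRECEDENCE = {"allow": 0, "log": 1, "error": 2, "kill": 3}


def filter_action(syscall, default, rules):
    """Action a single filter yields: the syscall's rule, else the default."""
    return rules.get(syscall, default)


def effective_action(*actions):
    """Return the most restrictive action among those given."""
    if not actions:
        raise ValueError("at least one filter action is required")
    return max(actions, key=PRECEDENCE.__getitem__)


if __name__ == "__main__":
    # Hypothetical postmaster-level and session-level filters.
    global_rules = {"read": "allow", "clone": "allow"}
    session_rules = {"read": "allow"}
    for name in ("read", "clone", "ptrace"):
        g = filter_action(name, "log", global_rules)
        s = filter_action(name, "kill", session_rules)
        print(name, effective_action(g, s))
```

In this toy setup a backend keeps "read", while "clone" and anything
unlisted fall through to the stricter session default. A later filter
can only tighten the result, never loosen it.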

---------
Attached is a patch for discussion, adding support for seccomp-bpf
(nowadays generally just called seccomp) syscall filtering at
configure-time using libseccomp. I would like to get this in shape to be
committed by the end of the November CF if possible.

The code itself has been through several rounds of revision based on
discussions I have had with the author of libseccomp as well as a few
other folks. However, at the moment the following are still missing:

* Documentation: general discussion is missing entirely
* Regression tests

---------
For convenience, here are a couple of additional links to relevant
information regarding seccomp:
https://en.wikipedia.org/wiki/Seccomp
https://github.com/seccomp/libseccomp

---------
Specific feedback requested:
1. Placement of pg_get_seccomp_filter() in
src/backend/utils/adt/genfile.c
originally made sense but after several rewrites no longer does.
Ideas where it *should* go?
2. Where should a general discussion section go in the docs, if at all?
3. Currently this supports a global filter at the postmaster level,
which is inherited by all child processes, and a secondary filter
at the client backend session level. It likely makes sense to
support secondary filters for other types of child processes,
e.g. autovacuum workers, etc. Add that now (pg13), later release,
or never?
4. What is the best way to approach testing of this feature? TAP
testing perhaps?
5. Default GUC values - should we provide "starter" lists, or only a
procedure for generating a list (as below)?

---------
Notes on usage:
===============
In order to determine your minimally required allow lists, do something
like the following on a non-production server with the same architecture
as production:

0. Setup:
* install libseccomp, libseccomp-dev, and seccomp
* install auditd if not already installed
* configure postgres --with-seccomp and maybe --enable-tap-tests to
improve feature coverage (see below)

1. Modify postgresql.conf and/or create <pg_source_dir>/postgresql_tmp.conf
8<--------------------
seccomp = on
global_syscall_default = allow
global_syscall_allow = ''
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''
session_syscall_default = log
session_syscall_allow = '*'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<--------------------

2. Modify /etc/audit/auditd.conf
* set disp_qos = 'lossless'
* set max_log_file_action = 'ignore'

3. Stop auditd, clear out all audit.logs, start auditd:
* systemctl stop auditd.service # if running
* echo -n "" > /var/log/audit/audit.log
* systemctl start auditd.service

4. Start/restart postgres.

5. Exercise postgres as much as possible (one or more of the following):
* make installcheck-world
* make check-world \
EXTRA_REGRESS_OPTS=--temp-config=<pg_source_dir>/postgresql_tmp.conf
* run your application through its paces
* other random testing of relevant postgres features

Note: at this point audit.log will start growing quickly. During `make
check-world` mine grew to just under 1 GB.

6. Process results:
a) systemctl stop auditd.service
b) Run the provided "get_syscalls.sh" script
c) Cut and paste the result as the value of session_syscall_allow.

7. Optional:
a) global_syscall_default = 'log'
b) Repeat steps 3-5
c) Repeat steps 6a and 6b
d) Cut and paste the result as the value of global_syscall_allow

8. Iterate steps 3-6b.
* Output should be empty.
* If there are any new syscalls, add to global_syscall_allow and
session_syscall_allow.
* Iterate until output of "get_syscalls.sh" script is empty.

9. Optional:
* Change global and session defaults to "error" or "kill"
* Reduce the allow lists if desired
* This can be done for specific database users, by doing
ALTER ROLE... SET session_syscall_allow to '<some reduced allow list>'

10. Adjust settings to taste, restart postgres, and monitor audit.log
going forward.
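
The collect-and-merge loop in steps 6 through 8 can be sketched as
follows. This is not the attached get_syscalls.sh, whose contents are
not reproduced here. It is a hypothetical Python stand-in that assumes
the usual auditd layout for SECCOMP records ("type=SECCOMP ...
syscall=<number> ..."). The sample records, logged_syscalls(),
merge_allow_list() and the small number-to-name map are illustrative
only; the x86_64 numbers are taken from the table later in this message.

```python
import re

# Subset of x86_64 syscall numbers, copied from the table in this message.
SYSCALL_NAMES = {0: "read", 1: "write", 3: "close", 39: "getpid",
                 56: "clone", 257: "openat"}

# Assumed record layout: only SECCOMP records are of interest.
SECCOMP_RE = re.compile(r"^type=SECCOMP\b.*?\bsyscall=(\d+)\b")

# Made-up audit records for illustration (not real log output).
SAMPLE_LOG = [
    'type=SECCOMP msg=audit(1567005207.101:901): uid=26 pid=4242 '
    'comm="postgres" sig=0 arch=c000003e syscall=39 compat=0 code=0x7ffc0000',
    'type=SECCOMP msg=audit(1567005207.102:902): uid=26 pid=4242 '
    'comm="postgres" sig=0 arch=c000003e syscall=56 compat=0 code=0x7ffc0000',
    'type=SYSCALL msg=audit(1567005207.103:903): arch=c000003e syscall=2',
    'type=SECCOMP msg=audit(1567005207.104:904): uid=26 pid=4243 '
    'comm="postgres" sig=0 arch=c000003e syscall=39 compat=0 code=0x7ffc0000',
]


def logged_syscalls(lines):
    """Collect the distinct syscall names found in SECCOMP audit records."""
    found = set()
    for line in lines:
        match = SECCOMP_RE.search(line)
        if match:
            num = int(match.group(1))
            found.add(SYSCALL_NAMES.get(num, f"unknown_{num}"))
    return sorted(found)


def merge_allow_list(current, observed):
    """Return (updated comma-separated allow list, newly added names)."""
    have = {name for name in current.split(",") if name}
    new = [name for name in observed if name not in have]
    return ",".join(sorted(have | set(new))), new


if __name__ == "__main__":
    observed = logged_syscalls(SAMPLE_LOG)
    allow, added = merge_allow_list("read,write", observed)
    print("session_syscall_allow = '%s'" % allow)
    print("newly added:", added)
```

When a second pass over a fresh audit.log yields an empty "newly added"
list, the iteration in step 8 is done.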

Below are some values from my system. Note that I have made no attempt
thus far to do static code analysis -- this list was built by running
`make check-world` several times.
8<-------------------------
seccomp = on

global_syscall_default = log
global_syscall_allow =
'accept,access,bind,brk,chmod,clone,close,connect,dup,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getppid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,listen,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,pipe,poll,prctl,pread64,prlimit64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,seccomp,select,sendto,setitimer,set_robust_list,setsid,setsockopt,shmat,shmctl,shmdt,shmget,shutdown,socket,stat,statfs,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,wait4,write'
global_syscall_log = ''
global_syscall_error = ''
global_syscall_kill = ''

session_syscall_default = log
session_syscall_allow =
'access,brk,chmod,close,connect,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,poll,pread64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,select,sendto,setitimer,setsockopt,shutdown,socket,stat,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,write'
session_syscall_log = '*'
session_syscall_error = '*'
session_syscall_kill = '*'
8<-------------------------

That results in the following effective filters at the global and
session levels (see the "context" column):

8<-------------------------
select * from pg_get_seccomp_filter() order by 4,1;
syscall | syscallnum | filter_action | context
-----------------+------------+----------------+---------
accept | 43 | global->allow | global
access | 21 | global->allow | global
bind | 49 | global->allow | global
brk | 12 | global->allow | global
chmod | 90 | global->allow | global
clone | 56 | global->allow | global
close | 3 | global->allow | global
connect | 42 | global->allow | global
<default> | -1 | global->log | global
dup | 32 | global->allow | global
epoll_create1 | 291 | global->allow | global
epoll_ctl | 233 | global->allow | global
epoll_wait | 232 | global->allow | global
exit_group | 231 | global->allow | global
fadvise64 | 221 | global->allow | global
fallocate | 285 | global->allow | global
fcntl | 72 | global->allow | global
fdatasync | 75 | global->allow | global
fstat | 5 | global->allow | global
fsync | 74 | global->allow | global
ftruncate | 77 | global->allow | global
futex | 202 | global->allow | global
getdents | 78 | global->allow | global
getegid | 108 | global->allow | global
geteuid | 107 | global->allow | global
getgid | 104 | global->allow | global
getpeername | 52 | global->allow | global
getpid | 39 | global->allow | global
getppid | 110 | global->allow | global
getrandom | 318 | global->allow | global
getrusage | 98 | global->allow | global
getsockname | 51 | global->allow | global
getsockopt | 55 | global->allow | global
getuid | 102 | global->allow | global
ioctl | 16 | global->allow | global
kill | 62 | global->allow | global
link | 86 | global->allow | global
listen | 50 | global->allow | global
lseek | 8 | global->allow | global
lstat | 6 | global->allow | global
mkdir | 83 | global->allow | global
mmap | 9 | global->allow | global
mprotect | 10 | global->allow | global
mremap | 25 | global->allow | global
munmap | 11 | global->allow | global
openat | 257 | global->allow | global
pipe | 22 | global->allow | global
poll | 7 | global->allow | global
prctl | 157 | global->allow | global
pread64 | 17 | global->allow | global
prlimit64 | 302 | global->allow | global
pwrite64 | 18 | global->allow | global
read | 0 | global->allow | global
readlink | 89 | global->allow | global
recvfrom | 45 | global->allow | global
recvmsg | 47 | global->allow | global
rename | 82 | global->allow | global
rmdir | 84 | global->allow | global
rt_sigaction | 13 | global->allow | global
rt_sigprocmask | 14 | global->allow | global
rt_sigreturn | 15 | global->allow | global
seccomp | 317 | global->allow | global
select | 23 | global->allow | global
sendto | 44 | global->allow | global
setitimer | 38 | global->allow | global
set_robust_list | 273 | global->allow | global
setsid | 112 | global->allow | global
setsockopt | 54 | global->allow | global
shmat | 30 | global->allow | global
shmctl | 31 | global->allow | global
shmdt | 67 | global->allow | global
shmget | 29 | global->allow | global
shutdown | 48 | global->allow | global
socket | 41 | global->allow | global
stat | 4 | global->allow | global
statfs | 137 | global->allow | global
symlink | 88 | global->allow | global
sync_file_range | 277 | global->allow | global
sysinfo | 99 | global->allow | global
umask | 95 | global->allow | global
uname | 63 | global->allow | global
unlink | 87 | global->allow | global
utime | 132 | global->allow | global
wait4 | 61 | global->allow | global
write | 1 | global->allow | global
accept | 43 | session->log | session
access | 21 | session->allow | session
bind | 49 | session->log | session
brk | 12 | session->allow | session
chmod | 90 | session->allow | session
clone | 56 | session->log | session
close | 3 | session->allow | session
connect | 42 | session->allow | session
<default> | -1 | session->log | session
dup | 32 | session->log | session
epoll_create1 | 291 | session->allow | session
epoll_ctl | 233 | session->allow | session
epoll_wait | 232 | session->allow | session
exit_group | 231 | session->allow | session
fadvise64 | 221 | session->allow | session
fallocate | 285 | session->allow | session
fcntl | 72 | session->allow | session
fdatasync | 75 | session->allow | session
fstat | 5 | session->allow | session
fsync | 74 | session->allow | session
ftruncate | 77 | session->allow | session
futex | 202 | session->allow | session
getdents | 78 | session->allow | session
getegid | 108 | session->allow | session
geteuid | 107 | session->allow | session
getgid | 104 | session->allow | session
getpeername | 52 | session->allow | session
getpid | 39 | session->allow | session
getppid | 110 | session->log | session
getrandom | 318 | session->allow | session
getrusage | 98 | session->allow | session
getsockname | 51 | session->allow | session
getsockopt | 55 | session->allow | session
getuid | 102 | session->allow | session
ioctl | 16 | session->allow | session
kill | 62 | session->allow | session
link | 86 | session->allow | session
listen | 50 | session->log | session
lseek | 8 | session->allow | session
lstat | 6 | session->allow | session
mkdir | 83 | session->allow | session
mmap | 9 | session->allow | session
mprotect | 10 | session->allow | session
mremap | 25 | session->allow | session
munmap | 11 | session->allow | session
openat | 257 | session->allow | session
pipe | 22 | session->log | session
poll | 7 | session->allow | session
prctl | 157 | session->log | session
pread64 | 17 | session->allow | session
prlimit64 | 302 | session->log | session
pwrite64 | 18 | session->allow | session
read | 0 | session->allow | session
readlink | 89 | session->allow | session
recvfrom | 45 | session->allow | session
recvmsg | 47 | session->allow | session
rename | 82 | session->allow | session
rmdir | 84 | session->allow | session
rt_sigaction | 13 | session->allow | session
rt_sigprocmask | 14 | session->allow | session
rt_sigreturn | 15 | session->allow | session
seccomp | 317 | session->log | session
select | 23 | session->allow | session
sendto | 44 | session->allow | session
setitimer | 38 | session->allow | session
set_robust_list | 273 | session->log | session
setsid | 112 | session->log | session
setsockopt | 54 | session->allow | session
shmat | 30 | session->log | session
shmctl | 31 | session->log | session
shmdt | 67 | session->log | session
shmget | 29 | session->log | session
shutdown | 48 | session->allow | session
socket | 41 | session->allow | session
stat | 4 | session->allow | session
statfs | 137 | session->log | session
symlink | 88 | session->allow | session
sync_file_range | 277 | session->allow | session
sysinfo | 99 | session->allow | session
umask | 95 | session->allow | session
uname | 63 | session->allow | session
unlink | 87 | session->allow | session
utime | 132 | session->allow | session
wait4 | 61 | session->log | session
write | 1 | session->allow | session
(170 rows)
8<-------------------------
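
The session rows marked session->log above are the syscalls that appear
in global_syscall_allow but not in session_syscall_allow. A quick way to
cross-check the two lists is sketched below. This is illustrative Python,
not part of the patch: to_set() and the variable names are hypothetical,
and the two strings are copied from the settings shown earlier.

```python
# Allow lists copied from the example settings in this message.
GLOBAL_ALLOW = (
    "accept,access,bind,brk,chmod,clone,close,connect,dup,epoll_create1,"
    "epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,"
    "fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,"
    "getpeername,getpid,getppid,getrandom,getrusage,getsockname,getsockopt,"
    "getuid,ioctl,kill,link,listen,lseek,lstat,mkdir,mmap,mprotect,mremap,"
    "munmap,openat,pipe,poll,prctl,pread64,prlimit64,pwrite64,read,readlink,"
    "recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,"
    "seccomp,select,sendto,setitimer,set_robust_list,setsid,setsockopt,"
    "shmat,shmctl,shmdt,shmget,shutdown,socket,stat,statfs,symlink,"
    "sync_file_range,sysinfo,umask,uname,unlink,utime,wait4,write"
)
SESSION_ALLOW = (
    "access,brk,chmod,close,connect,epoll_create1,epoll_ctl,epoll_wait,"
    "exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,"
    "futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getrandom,"
    "getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,lseek,lstat,"
    "mkdir,mmap,mprotect,mremap,munmap,openat,poll,pread64,pwrite64,read,"
    "readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,"
    "rt_sigreturn,select,sendto,setitimer,setsockopt,shutdown,socket,stat,"
    "symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,write"
)


def to_set(value):
    """Split a comma-separated GUC-style list into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


global_allow = to_set(GLOBAL_ALLOW)
session_allow = to_set(SESSION_ALLOW)

# Allowed for the postmaster but only logged for client backends.
postmaster_only = sorted(global_allow - session_allow)

# Anything allowed at session level but not globally would be pointless,
# since the stricter global filter would still apply.
session_extras = sorted(session_allow - global_allow)

if __name__ == "__main__":
    print(len(global_allow), len(session_allow), len(postmaster_only))
    print(",".join(postmaster_only))
    print("session-only entries:", session_extras)
```

The difference is mostly process-management, listening-socket and
shared-memory calls (clone, accept, bind, listen, shmget, and so on),
which fits the idea of locking backends down more tightly than the
postmaster.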

If you made it all the way to here, thank you for your attention :-)

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

Attachment Content-Type Size
get_syscalls.sh application/x-shellscript 508 bytes
seccomp-2019.08.28.00.diff text/x-patch 58.3 KB
