Re: Where to store Blobs?

From: Morris de Oryx <morrisdeoryx(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Where to store Blobs?
Date: 2019-04-20 12:27:35
Message-ID: CAKqncchTyF3_M+rnBw=5Cn0cFLa6VQy6htt=VD7XUUpdoTiVMQ@mail.gmail.com
Lists: pgsql-general

Good question, and there are some excellent thoughts and cautionary tales
in the thread already. I've faced this question several times down the
years and, as others have said, the best answer depends. Roughly speaking,
I can think of three obvious places to store documents:

* A database.

* A file system.

* A commodity file-system-like cloud service, such as S3, B2 Cloud, Azure,
or Google Cloud.

There are pros and cons to each, and circumstances that will make one
option or another untenable. A lot also depends on your overall
architecture, whether documents can change, how your searches work, your
security/regulatory requirements, the overall workflow, etc.

But given all of that, I *hate* bloating a database with BLOBs that aren't
themselves searchable. That's what file systems are for. This past week,
I've been coding against the B2 Cloud API and it's quite nice...simpler
than S3, and much cheaper. In this setup, we keep a document catalog in the
database that ties together a source record and an external document. The
"cloud_doc" record contains identifiers and information from both sides of
this story: the descriptors, tags, and IDs needed when you're looking for
something. In our case, that's large XML log files. These logs are either
of zero value or great value. We need to keep them, but 99.99999% of the
time they're pure bloat. We've already got a record describing the event
each log details. The cloud_doc record includes a reference back to that
original record, and to the storage location of the full log: the service
type, bucket ID, document ID, etc. The catalog record thus links our real
data with the remote document, the same as you would by storing file paths
to a local file tree.
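
To make that concrete, here's a rough sketch of what such a catalog table
might look like, driven from Python with psycopg2. The table and column
names are invented for illustration; our real schema differs:

    # Hypothetical "cloud_doc" catalog table; every name here is
    # illustrative, not a real schema.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS cloud_doc (
        id           bigserial PRIMARY KEY,
        source_id    bigint NOT NULL,   -- local record the log relates to
        service_type text   NOT NULL,   -- e.g. 'b2' or 's3'
        bucket_id    text   NOT NULL,
        document_id  text   NOT NULL,   -- key/ID assigned by the cloud service
        tags         jsonb,             -- descriptors used when searching
        status       text   NOT NULL DEFAULT 'ready',  -- see sketch below
        created_at   timestamptz NOT NULL DEFAULT now()
    );
    """

    with psycopg2.connect("dbname=app") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)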

Storing the data externally has some risks and limitations that make it
unsuitable in some situations. For example, you can't include the file or
cloud system in a transaction in any normal sense. Also, depending on
platform, there may be an unacceptable lag between pushing a file up and it
becoming visible to other systems. But for common cases, external (file or
cloud) storage can be pretty great.
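
There's no clean fix for the transaction gap; one common mitigation (an
illustration using the invented names from the sketch above, not a
description of our production code) is to commit the catalog row in a
"pending" state, upload, then confirm, so a sweeper job can later reconcile
anything left dangling:

    # "Pending row" handshake around a non-transactional upload.
    # upload_to_cloud() is a stand-in for the real B2/S3 call.
    import psycopg2

    def store_document(conn, source_id, local_path):
        # 1. Record intent inside a normal database transaction.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO cloud_doc"
                " (source_id, service_type, bucket_id, document_id, status)"
                " VALUES (%s, 'b2', 'log-archive', '', 'pending')"
                " RETURNING id",
                (source_id,),
            )
            doc_id = cur.fetchone()[0]
        conn.commit()

        # 2. Upload outside the transaction; this can fail or lag.
        remote_id = upload_to_cloud(local_path)

        # 3. Confirm. Rows stuck in 'pending' mark uploads to re-check.
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE cloud_doc SET document_id = %s, status = 'ready'"
                " WHERE id = %s",
                (remote_id, doc_id),
            )
        conn.commit()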

Likewise, it's kind of scary to have the files someplace else where Bad
Things Might Happen and your database and file tree are out of sync. That's
genuinely worrisome. It's likely overkill, but I like to store references
back into the database in the meta-data for the document up on the cloud.
On S3 and B2, you store meta-data as "tags". I haven't used the document
storage systems on Azure or Google, but they likely have similar
functionality. A "tag" up on the cloud repository is a name-value-pair that
you can add custom data to. So, for example, I'll store the ID of our cloud
document catalog record in the meta-data (tags) for the document on the
cloud service. I also stash the ID of the source record that the log
relates to. If all goes well, we'll never need these tags, but if worst
comes to worst, they provide a way of reconstructing the parents of each
external document.
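
For illustration, here's roughly how that looks with boto3 against S3
(bucket, key, and IDs are placeholders, and the equivalent B2 call would be
analogous; S3 strictly calls these user-defined metadata rather than tags):

    # Stash database back-references in the object's metadata so the
    # parent rows can be found again from the object alone.
    import boto3

    s3 = boto3.client("s3")

    with open("event-1234.xml", "rb") as f:
        s3.put_object(
            Bucket="log-archive",            # placeholder bucket
            Key="logs/event-1234.xml",       # placeholder key
            Body=f,
            Metadata={
                "cloud-doc-id": "42",   # PK of the cloud_doc catalog row
                "source-id": "1234",    # PK of the record the log relates to
            },
        )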

That last paragraph about tags points out something important: Cloud
document storage can be preferable to a standard file system for reasons
other than simplified provisioning and cost. Many file systems don't let
you readily add and search meta-data like this, whereas it's a quite
accessible feature of cloud storage systems.

Going in another direction entirely, there may be times when what you need
is a local file storage system for documents. File systems are...scary, but
what about SQLite? You get a single file with modern SQL syntax where you
can stuff BLOBs. You don't bloat your Postgres data files, but you still
have a database for the documents instead of a file system. SQLite only
aims to solve a pretty specific set of problems but, as it turns out, those
cases are also quite common. Combining it with a Postgres setup sounds
pretty exotic, but there might be a time for it. It depends on your setup.
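
A minimal sketch of that arrangement, using Python's sqlite3 module (all
names invented for illustration):

    # A single-file SQLite database holding the blobs themselves, kept
    # alongside (not inside) the Postgres data.
    import sqlite3

    db = sqlite3.connect("documents.db")
    db.execute(
        """CREATE TABLE IF NOT EXISTS document (
               id        INTEGER PRIMARY KEY,
               source_id INTEGER NOT NULL,  -- points back at the Postgres row
               name      TEXT    NOT NULL,
               body      BLOB    NOT NULL   -- the raw file contents
           )"""
    )

    with open("event-1234.xml", "rb") as f:
        db.execute(
            "INSERT INTO document (source_id, name, body) VALUES (?, ?, ?)",
            (1234, "event-1234.xml", f.read()),
        )
    db.commit()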

On Sat, Apr 20, 2019 at 10:59 AM Jamesie Pic <jpic(at)yourlabs(dot)org> wrote:

> I forgot to mention that my deployments include automated migrations as
> often as possible, sometimes destructive ones for refactoring purposes. As
> such, to maintain PostgreSQL on a basic Linux box I am:
>
> - for taking an automated backup beforehand in any deployment script
> that may run destructive migrations,
> - against the needless overhead of coupling binary and relational data
> in operations, which slows the whole thing down or makes it less reliable
>
> I also have quite a few more points against, mixed with more detail on
> the points briefly laid out in my previous email, about how this fits
> into the big picture of my personal practice ... and how it destabilized
> my production for months:
>
> https://blog.yourlabs.org/post/184290880553/storing-hd-photos-in-a-relational-database-recipe
>
> tl;dr
> If you store media files in PostgreSQL on a production server, then do
> take disk space alarms seriously even if they happen only during backups.
> Otherwise I fail to see how to avoid a pattern of recurring incidents of
> "manually unblocking automated deployments" (a double interest payment on
> the debt, because it also defeats the purpose of automating deployment),
> if not outright filling up a disk during the nightly backup dump ...
>
> Hope this helps,
>
> Have a great day
>
