On 19/05/25 14:41, Achilleas Mantzios wrote:
>
>
> On 5/19/25 09:14, Moreno Andreo wrote:
>>
>> On 16/05/25 21:33, Achilleas Mantzios wrote:
>>> On 16/5/25 18:45, Moreno Andreo wrote:
>>>
>>>> Hi,
>>>> we are moving away from our old binary data approach, migrating the
>>>> data from bytea fields in a table to external storage (making the
>>>> database smaller and related operations faster and smarter).
>>>> In short, we have a job that runs in the background, copies data
>>>> from the table to an external file and then sets the bytea field to
>>>> NULL.
>>>> (UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>)
>>>>
>>>> At the end of the operation this results in a table that's less
>>>> than one tenth of its original size.
>>>> We have a multi-tenant architecture (100s of schemas with identical
>>>> structure, all inheriting from public) and we are performing the
>>>> task on one table per schema.
>>>>
>>> So? Toasted data is kept in separate TOAST tables; unless those
>>> bytea cols are selected, you won't even touch them. I cannot
>>> understand what you are trying to achieve here.
>>>
>>> Years ago, when I made the mistake of going for a coffee and letting
>>> my developers "improvise", the result was a design similar to what you
>>> are trying to achieve. Years later, I am seriously considering
>>> moving those data back to PostgreSQL.
>> The "related operations" I was talking about are backups and database
>> maintenance when needed, cluster/replica management, etc. With a
>> smaller database size they would be easier in timing and effort, right?
> Ok, but you'll lose replica functionality for those blobs, which means
> you don't care about them, correct me if I am wrong.
I'm not saying I don't care about them, quite the opposite: they are
protected with Object Versioning and soft deletion, which should ensure
good protection against e.g. ransomware if someone manages to get in there
(and if that happens, we'll have bigger troubles than this).
>> We are mostly talking about costs here. To call things by their name,
>> I'm moving bytea contents (85% of total data) to files in Google
>> Cloud Storage buckets, which cost a fraction of the disks
>> holding my database (on GCE, to be clear).
> May I ask the size of the bytea data (uncompressed)?
Single records vary from 150 kB to 80 MB; the grand total is more than
8.5 TB out of a circa 10 TB data footprint.
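For what it's worth, a query along these lines (names are placeholders,
'tbl' standing for the per-schema table holding the blobs) is how I'd
check how much of that footprint is heap versus TOAST and indexes:

SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       pg_size_pretty(pg_relation_size(c.oid))       AS heap_size,
       pg_size_pretty(pg_total_relation_size(c.oid)
                      - pg_relation_size(c.oid))     AS toast_plus_indexes,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.relname = 'tbl'   -- placeholder: the table with the bytea column
ORDER BY pg_total_relation_size(c.oid) DESC;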
>> This data is not accessed frequently (just by the owner when he needs
>> it), so there's no need to keep it on expensive hardware.
>> Over the years I've read that keeping many big bytea fields in a
>> database is not recommended, but I might have misunderstood that.
>
> Ok, I assume those are unimportant data, but let me ask: what is the
> longevity or expected legitimacy of those? I haven't worked with
> those, just reading:
>
> https://cloud.google.com/storage/pricing#storage-pricing
>
> would you choose e.g. "Anywhere Cache storage"?
>
Absolutely not, this is *not* unimportant data, and we are using
Standard Storage at $0.02/GB/month plus operations, which compared to
$0.17/GB/month for an SSD (or even more for the Hyperdisks we are using)
is a good price drop.
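Back-of-the-envelope, assuming roughly 8,500 GB end up offloaded:
8,500 GB x $0.02/GB/month is about $170/month on Standard Storage, versus
about $1,445/month at $0.17/GB/month on SSD, before operation charges.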
>
>> Another way would have been to move these tables to a different
>> tablespace on cheaper storage, but it would still have cost 3 times
>> as much as the buckets.
>>
> Can you actually mount those Cloud Storage buckets under a supported
> FS in Linux and just move them to tablespaces backed by that storage?
>
Never tried; I mounted them via FUSE and ran some simple operations in
the past, but I'm not sure it can handle database workloads in terms of
I/O bandwidth.
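For completeness, the SQL side of that experiment would be trivial; it's
the mount underneath I don't trust. A minimal sketch, assuming the bucket
were FUSE-mounted at /mnt/gcs-bucket (path and tablespace name are made
up):

CREATE TABLESPACE cheap_blobs LOCATION '/mnt/gcs-bucket/tablespaces';
-- rewrites the table under an ACCESS EXCLUSIVE lock
ALTER TABLE some_schema.tbl SET TABLESPACE cheap_blobs;

but an object store behind FUSE is nothing like the block storage
PostgreSQL expects, so I wouldn't bet on it for real database I/O.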
>
>> Why are you considering to get data back to database tables?
> Because now, if we need to migrate from cloud to on-premise, or just
> upgrade or move the specific server which holds those data, I will have
> an extra headache. Also, this is a single point of failure, or at best
> a cause of fragmented technology introduced just for the sake of
> keeping things out of the DB.
This is managed as a hierarchical disk structure, so the calling server
can be literally anywhere; it just needs an account (or a service
account) to get in there.
>>>
>>>
>>>> The problem is: this is generating BIG table bloat, as you may
>>>> imagine.
>>>> Running a VACUUM FULL on an ex-22GB table on a standalone test
>>>> server is almost immediate.
>>>> If I had only one server, I'd process one table at a time, with a
>>>> nightly script, and issue a VACUUM FULL on tables that have already
>>>> been processed.
>>>>
>>>> But I'm in a logical replication architecture (we are using a
>>>> multimaster system called pgEdge, but I don't think it will make a
>>>> big difference, since it's based on logical replication), and I'm
>>>> building a test cluster.
>>>>
>>> So you use PgEdge, but you wanna lose all the benefits of
>>> multi-master, since your binary data won't be replicated...
>> I don't think I need it to be replicated, since this data cannot be
>> "edited", so either it's there or it's been deleted. Buckets have
>> protections against data deletion or events like ransomware attacks
>> and such.
>> Also, multi-master was an absolute requirement one year ago because of
>> a project we were building, but it has been abandoned and now simple
>> logical replication would be enough, but let's do one thing at a
>> time.
> Multi-master is cool, you can configure your pooler / clients to take
> advantage of it for a fully load-balanced architecture, but if it's not
> a strict requirement, you can live without it, as so many of us do, and
> employ other means of load balancing the reads.
That's what we are doing, and it's a really cool feature, but I've
experienced (maybe because it uses the old pglogical extension) that the
replication is a bit fragile, especially when dealing with those bytea
fields: when I ingest big loads, say 25-30 GB or more, it has happened
that replication broke. Recreating a replica from scratch is not a big
deal for "normal size" tables, since it can be achieved automatically
(they normally fit in shared memory and can be transferred by the
replicator), but you can imagine the effort and the downtime needed to
create a base backup, transfer it to the replica, build the DB and
restart a 10-TB database (ATM we are running a 2-node cluster).
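When ingesting those big loads, something like this, run on the origin
node, is what I'd use to keep an eye on how far behind the logical slots
fall (it assumes the slots show up in pg_replication_slots as usual; I'd
double-check how pgEdge/Spock exposes them):

SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_type = 'logical';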
>>>> I've been instructed to issue VACUUM FULL on both nodes, nightly,
>>>> but before proceeding I read in the docs that VACUUM FULL can disrupt
>>>> logical replication, so I'm a bit concerned about how to proceed.
>>>> Rows are cleared one at a time (one transaction, one row, to confine
>>>> errors to the record that raised them).
> Mind sharing the specific doc?
Of course I can't find it now with a quick search; I'll dig deeper, I
don't think I dreamed it :-).
>>>>
>>> PgEdge is based on the old pglogical, the old 2ndQuadrant
>>> extension, not the native logical replication we have had since
>>> pgsql 10. But I might be mistaken.
>> I don't know about this; it keeps running on the latest pg versions
>> (we are about to upgrade to 17.4, if I'm not wrong), but I'll ask.
>>>> I read about extensions like pg_squeeze, but I wonder whether they,
>>>> too, might be dangerous for replication.
>>>>
>>> What's pgEdge's take on that, I mean the bytea thing you are trying
>>> to achieve here?
>> They are positive about it; they're the ones who suggested doing
>> VACUUM FULL on both nodes... I'm quite new to replication, so I'm
>> looking for some advice here.
>
> As I told you, pgEdge logical replication (old 2ndquadrant BDR) !=
> native logical replication. You may look here:
>
> https://github.com/pgEdge/spock
>
> If multi-master is not a must, you could convert to vanilla PostgreSQL
> and focus on standard physical and logical replication.
>
No, multimaster is cool, but as I said, the project has been
discontinued and it's not a must anymore. This is the first step,
actually. We are planning to return to plain PostgreSQL, or Cloud SQL
for PostgreSQL, using logical replication (which seems the most reliable
of the two). We created a test case for both options, and they seem to
be OK for now, even if I still have to run adequate stress tests. And
when I do the migration, I'd like to migrate plain data only and leave
the blobs where they are.
>
>>>> Thanks for your help.
>>>> Moreno.-