Backing up PostgreSQL

2016-06-29T07:00:00 BST

You know the good ol' IT saying: "There are two types of people: those who back up their data and those who are yet to start." Backing up databases is especially important, since they usually contain information critical not only to day-to-day business operations, but also to business continuity. People leave, software fails, the world changes. What remains is your data, and if you have it, you can always start from scratch - with new software, new people, in a new world. Let's have a look at the various backup strategies available for PostgreSQL databases so that we can prepare for the worst.

This article was published by compose.io as part of the Write Stuff programme. Unlike other articles on my blog, this one is licensed under the terms of CC BY-NC-SA 4.0.

Backup types: physical vs logical

When talking about database backups, the first distinction mentioned is usually about filesystem-level snapshots vs SQL dumps. A database is essentially a bunch of files sitting somewhere on a drive in some datacentre. Backing up files is easy, isn't it? But from another point of view, a database can be perceived as a sequence of SQL statements used to create and manipulate the data. How hard can it be to repeat all these statements?

Both backup types are useful, but each comes with its own set of advantages and disadvantages. And both have some caveats.

Physical backups

Physical backups, also known as filesystem-level backups, are basically snapshots of files constituting the database. But with constant writes going to the underlying drives it can be quite hard to make these snapshots internally consistent. PostgreSQL documentation describes two key notions - continuous archiving and point-in-time recovery. These complement each other and it's important to understand them.

In order for physical backups to be consistent, we have to have a way of guaranteeing transaction atomicity and durability - in other words, we have to make sure that we either commit the entire transaction or leave the database unchanged. And if we do commit, it has to be forever. PostgreSQL uses write-ahead logging to provide the A and D of the ACID acronym. Write-ahead log segments are exactly the files that the continuous archiving process takes care of, well, archiving. The information kept in WAL segments allows the database engine to keep the data consistent, but also to bring it back to a consistent state after a crash.

While it is normal for some database files to change during filesystem backup creation, not all the changes are safe and some may lead to an inconsistent snapshot being created. That's why, apart from some nice command line tools, PostgreSQL also provides a low-level API for taking physical backups. Executing pg_start_backup() before and pg_stop_backup() after ensures that no unsafe changes are made to the data files during the process. But it's still important to remember that a physical backup is useless without all the WAL segments generated between executing the two functions mentioned above.
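
As a rough sketch, a minimal backup driven by this low-level API could look as follows (the paths, connection details and backup label are made-up examples; this assumes a 9.x-era server with WAL archiving already configured):

    # Enter backup mode; the second argument requests an immediate
    # checkpoint instead of waiting for a scheduled one.
    psql -U postgres -c "SELECT pg_start_backup('nightly', true);"

    # Copy the data directory while backup mode is active.
    rsync -a /var/lib/postgresql/9.5/main/ /mnt/backups/base-$(date +%F)/

    # Leave backup mode; PostgreSQL also waits for the WAL needed
    # to make the copy consistent to be archived.
    psql -U postgres -c "SELECT pg_stop_backup();"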

A filesystem-level snapshot along with all the WAL segments necessary for its restoration is called a "base backup".

Stand-alone physical backups vs PITR-enabled physical backups

While it was said that continuous archiving and point-in-time recovery complement each other, they can also be decoupled.

It is very useful to have continuous archiving configured so that you are always sure you have access to the necessary WAL segments, but in order to have a functional base backup you only need the ones generated while the backup was being created. If you keep only those segments and do not archive the ones generated after your backup is ready, you have a "stand-alone physical backup".
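
As an aside, the pg_basebackup tool shipped with PostgreSQL can produce such a stand-alone backup in one step; with -X stream it also receives the WAL generated while the copy runs, making the result self-contained (the target directory below is just an example):

    # -D: target directory; -X stream: include the WAL produced during
    # the backup, so no separate archive is needed; -P shows progress.
    pg_basebackup -U postgres -D /mnt/backups/standalone -X stream -P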

But you can also continue to archive write-ahead log segments (and thus actually have continuous archiving). What that gives you is the ability to restore your database from the base backup you have and then keep replaying WAL segments on it. This is called "point-in-time recovery": you can stop replaying the WAL at any moment and still have a consistent state, so essentially you can restore your database to any point in time from the moment your base backup was created onwards.
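
To sketch how this is wired up on a 9.x-era server: WAL archiving is configured in postgresql.conf, and replay is driven by a recovery.conf file placed in the restored data directory (the archive location and target time below are invented for the example):

    # postgresql.conf - ship every completed WAL segment to an archive
    archive_mode = on
    archive_command = 'cp %p /mnt/wal_archive/%f'

    # recovery.conf - placed in the restored base backup's data directory
    restore_command = 'cp /mnt/wal_archive/%f %p'
    recovery_target_time = '2016-06-28 12:00:00'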

Logical backups

Logical backups are also known as SQL dumps. This is because they contain the SQL statements used to create the database schema and fill it with data. A SQL dump always represents a consistent state of the database at a given moment, since the dumping process follows exactly the same semantics as any other ordinary database session. This also means that it takes locks and, on the other hand, can be forced to wait for other sessions to release their locks.

PostgreSQL documentation gives an overview of the process. The dumping software simply goes through all the tables, discovering their schema and fetching all the rows. Nothing overly complicated, though it has to be clever enough to walk the relations in an order that lets all the constraints be satisfied during restoration.

It is perfectly normal for SQL dumps to have tables' data spread over time: while the last row of one table may carry timestamp A, some other table might have rows with timestamp B. If you have rules outside your database about how such rows correlate, you have to account for that yourself.

Entire instance logical backups vs single database logical backups

In one PostgreSQL installation we can have multiple databases - this is referred to as a "cluster" in the official documentation, but that term is confusing enough to be worth avoiding, so let's simply call it an "instance".

Although dumping a single database is enough to restore all of its schema and data, there are some extra objects that may be necessary to bring the restored database to full operation.

All privileges are dumped with the database, but database roles, memberships and attributes are global objects that exist at the instance level and are therefore not included in a single-database dump.

PostgreSQL provides two tools for creating logical backups. pg_dumpall can be used to dump all the objects in an instance, but it can also serve as a complementary tool to the other one and dump only the global objects. The other tool is pg_dump, which dumps a single database from an instance.
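
A typical combination of the two could look like the sketch below (the database name and file names are illustrative):

    # Dump only the global objects: roles, memberships, attributes.
    pg_dumpall --globals-only > globals.sql

    # Dump a single database in the custom format, which pg_restore
    # can later restore selectively (even a single table).
    pg_dump --format=custom --file=mydb.dump mydb

    # Restore: global objects first, then the database itself.
    psql -f globals.sql postgres
    pg_restore --create --dbname=postgres mydb.dump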

The good, the bad and the caveat

Physical backups

Pros:

* no locking - physical backups do not cause lock contention and do not depend on other connections releasing their locks
* entire instance backed up - there's no need to restore single databases; one restore process is enough for the entire instance
* almost instantaneous restore process - no need to execute SQL statements to get the data back
* PITR capabilities - the ability to go back in time
* just works - you simply run postgres on the backed-up data directory and it just works

Cons:

* high I/O pressure - copying the entire data directory may put significant pressure on your storage, though this doesn't necessarily apply if you're able to use some sort of filesystem-level snapshotting
* entire instance backed up - there's no way to restore single databases or single tables; you have to restore all or nothing
* very large base backups - physical backups include all the indices and bloat and may therefore be much, much larger than SQL dumps
* need to archive WAL segments - you have to provide additional storage for write-ahead log files
* binary format - having a backup in binary files means you are limited to restoring it with exactly the same PostgreSQL version it was created with

Logical backups

Pros:

* less data to copy - logical backups do not copy all the data files, so they may impose lower I/O pressure on the system than physical backups
* single database objects can be backed up - it's easy to back up and restore even a single table
* smaller backup files - SQL dumps do not contain index data or bloat, so they are much smaller than physical backups
* no need for storing extra files - backups are self-contained
* cross-version compatibility - SQL dumps do not depend on being restored with the same version that was used to create them

Cons:

* locking - logical backups are both prone to lock waits and a cause of lock waits
* long restoration process - SQL statements have to be executed to get the schema and data back
* no PITR capabilities - a SQL dump can serve only as a snapshot of the data at the time it was created
* extra effort required - care has to be taken to back up global objects, tune the server to allow for faster restoration, etc.

Compose.io supported backups

Compose gives you stand-alone physical backups. This means you should take extra care to restore them with the same PostgreSQL version they were created with.

Since the standard pg_dump tool uses a regular database session, there's nothing stopping you from creating SQL dumps of a Compose-hosted database. But since you don't have access to the database superuser account, you cannot use pg_dumpall to dump the global database instance objects.
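
For example, such a dump could be taken over the wire like this (the host, port, user and database name are placeholders - substitute your deployment's connection details):

    # pg_dump accepts a connection string; sslmode=require because
    # hosted deployments typically only accept SSL connections.
    pg_dump "host=your-deployment.compose.io port=10000 user=admin dbname=mydb sslmode=require" --format=custom --file=compose.dump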

When it comes to restoring from backups, all you can really do is use the ones available to you in Compose's WebUI. You might be able to partially restore your SQL dumps, but you'll get errors if you try to restore them in full.


If you like this article, please consider sharing it with your friends.