Plain text & un*x tools are still the best

Today I learned (2019-07-06)
Tagged: awk sql postgres

I did some freelance work for a friendly company that had a problem with its production dataset. They had a Postgres database, running on a 1TB drive, holding historic data in a ~500GB table.

The historic data wasn't needed for anything aside from archival reasons, so the developers had tried — in between feature work — to remove the old rows while maintaining DB uptime. Unfortunately, all their previous attempts at clearing out obsolete data had failed.

All of their experiments, run on clones of the production VM, failed to inspire confidence: they either filled up the available hard drive space with temporary data, or took so long that the potential downtime was deemed unacceptable (looking at you, VACUUM FULL). In particular, any operation that tried to rebuild the samples table afresh would fail, because the disk was too small to hold two copies of the table, as dictated by the atomicity requirement of SQL operations.

After some experimentation, we determined that the fastest way to achieve our goals of deleting old data, slimming down index bloat, and reclaiming space for the OS, was to:

  1. export the entire DB to a text file of SQL statements (a.k.a. a dump)
  2. remove the existing database completely
  3. clean up the dump in text form
  4. import the trimmed-down dump into a fresh database

This operation proved to be the fastest and most efficient in terms of disk-space required, although we were forced to incur downtime.
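
In shell terms, the whole procedure boiled down to something like the sketch below. The database name, file paths, and awk script name are placeholders (not the ones actually used), so treat this as an outline rather than a transcript of the real session:

  # 1. Dump the entire cluster to a plain-text SQL file
  pg_dumpall -U postgres > /mnt/backup/full_dump.sql

  # 2. Remove the existing database completely, freeing its disk space
  psql -U postgres -c 'DROP DATABASE app_production;'

  # 3. Trim the old samples rows out of the dump, in text form
  awk -f trim_samples.awk /mnt/backup/full_dump.sql > /mnt/backup/trimmed_dump.sql

  # 4. Import the trimmed-down dump into a fresh database
  psql -U postgres -f /mnt/backup/trimmed_dump.sql postgres

The awk script doing the work in step 3 is the one shown in the next section.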

The joy of AWK

A concise AWK script was perfect for the job:

1.   BEGIN { in_samples=0; min_ts=1514764800; }
2.   /^\\.$/ { if (in_samples==1) { in_samples=0; } }
3.   // {
4.     if (in_samples) {
5.       if ($1 >= min_ts) { print } else { } ;
6.     } else {
7.       print
8.     }
9.   }
10.  /COPY public.samples/ { in_samples=1; }

In short, the script can be in 2 states:

in_samples := 0 | 1

It starts out in in_samples=0 (1.), and copies over every line of text (7.). When it finds the line marking the beginning of the samples table's data (10.), it switches to in_samples=1. In this state, it will only copy over samples dated January 1, 2018 or later (5.). If it finds the end-of-table-data marker while in in_samples=1, it exits this state (2.). Unless there are two tables in the dump named public.samples (there aren't), the script will never enter in_samples=1 again, and will simply copy over all other rows verbatim (line 7., again).
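
For context, the part of the dump that the script cares about looks something like this (the column names are invented for illustration, and fields are tab-separated in a real dump; what matters is that the first field of every data row is a unix-epoch timestamp, which is what the $1 >= min_ts comparison relies on):

  COPY public.samples (recorded_at, device_id, value) FROM stdin;
  1498867200	42	0.731
  1520812800	17	0.112
  \.

The COPY header switches the script into in_samples=1, the data rows are then filtered on their first field (the 2017 row above would be dropped, the 2018 row kept), and the lone \. line marks the end of the table's data.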

It's important to note that awk evaluates all clauses in order for every line of input (except for the special BEGIN and END clauses), so a single line of input might 'hit' several awk statements. This is beneficial, but it also means that the order of the clauses matters. (Consider what would happen if the clause on line 10. had been placed before the clause on line 3.)
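
As a toy illustration of this evaluation model, here is a throwaway one-liner (unrelated to the project) where one input line hits two clauses and the other hits only one:

  $ printf 'a\nb\n' | awk '/a/ { print "matched a" } { print "saw:", $0 }'
  matched a
  saw: a
  saw: b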

Summing up

The entire operation of pg_dumpall, awk, and psql import took around 3 hours, which was acceptable downtime for this database as part of scheduled nighttime maintenance. The space taken up by the Postgres data directory went down from ~760GB to ~200GB.

That day I learned that plain text is still the lingua franca of UN*X, and that 40-year-old tools are still excellent at what they were built to do.