Plain text & un*x tools are still the best
Today I learned (2019-07-06)
I did some freelance work for a friendly company that had a problem with its production dataset. They had a Postgres database, running on a 1TB drive, holding historic data in a ~500GB table.
The historic data wasn't needed for anything aside from archival reasons, so the developers had tried — in between feature work — to remove the old rows while maintaining DB uptime. Unfortunately, all their previous attempts at clearing out obsolete data had failed.
Here's what they tried (on clones of the production VM):
- DELETE FROM samples WHERE timestamp < 1234567890
- VACUUM
- VACUUM FULL
- INSERT INTO new_samples SELECT * FROM samples WHERE (..), followed by TRUNCATE TABLE samples and a rename
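For reference, a rough sketch of how the first attempts would be run (the database name is a placeholder):

# DELETE only marks the old rows as dead; it does not shrink the files on disk.
psql -d production -c "DELETE FROM samples WHERE timestamp < 1234567890;"
# Plain VACUUM makes the dead space reusable inside the table,
# but generally does not hand it back to the operating system.
psql -d production -c "VACUUM samples;"
# VACUUM FULL rewrites the table into a new file, which needs roughly as much
# free disk as the table itself and takes an exclusive lock for the duration.
psql -d production -c "VACUUM FULL samples;"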
All of these experiments failed to inspire confidence, as they either filled up the available hard drive space with temporary data, or took so long that the potential downtime was deemed unacceptable (looking at you, VACUUM FULL). In particular, any operation that tried to rebuild the samples table afresh would fail, because the disk was too small to hold two copies of the table, as dictated by the atomicity requirement of SQL operations.
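To make the disk-space problem concrete, the rebuild attempt looked roughly like the following sketch (the filter condition and the INCLUDING ALL clause are assumptions):

psql -d production <<'SQL'
CREATE TABLE new_samples (LIKE samples INCLUDING ALL);
-- During this INSERT the old and the new table occupy disk space at the
-- same time, which is exactly what the drive could not accommodate.
INSERT INTO new_samples SELECT * FROM samples WHERE timestamp >= 1514764800;
TRUNCATE TABLE samples;   -- as in the attempt above; the emptied table still
DROP TABLE samples;       -- has to go before new_samples can take its name
ALTER TABLE new_samples RENAME TO samples;
SQL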
After some experimentation, we determined that the fastest way to achieve our goals of deleting old data, slimming down index bloat, and reclaiming space for the OS was to:
- export the entire DB to a text file of SQL statements (a.k.a. a dump)
- remove the existing database completely
- clean up the dump in text form
- import the trimmed-down dump into a fresh database
This operation proved to be the fastest and most efficient in terms of disk-space required, although we were forced to incur downtime.
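In shell terms, the whole procedure came down to something like this sketch (the file names, database name, and the script name trim_samples.awk are placeholders; the AWK filter itself is shown in the next section):

pg_dumpall > dump.sql                            # export the whole cluster as SQL text
dropdb production                                # remove the existing database
awk -f trim_samples.awk dump.sql > trimmed.sql   # clean up the dump in text form
psql -f trimmed.sql postgres                     # import the trimmed-down dump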
The joy of AWK
A concise AWK script was perfect for the job:
1.  BEGIN { in_samples=0; min_ts=1514764800; }
2.  /^\\.$/ { if (in_samples==1) { in_samples=0; } }
3.  // {
4.      if (in_samples) {
5.          if ($1 >= min_ts) { print } else { } ;
6.      } else {
7.          print
8.      }
9.  }
10. /COPY public.samples/ { in_samples=1; }
In short, the script can be in 2 states:
in_samples := 0 | 1
It starts out in in_samples=0 (1.), and copies over every line of text (7.). If it finds the first line of the beginning of the samples table (10.), it switches to in_samples=1. In this state, it will only copy over samples that are NEWER than January 1, 2018 (5.). If it finds the end-of-table-data marker while in_samples is set, it will exit this state (2.). Unless there are two tables in the dump named public.samples (there aren't), the script will never enter in_samples=1 again, and will simply copy over all other rows verbatim (line 7., again).
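To make the states concrete, the relevant part of a plain-text dump looks roughly like the excerpt below (column names and values are invented for illustration; the data rows are tab-separated, and the table section ends with a lone \. marker):

COPY public.samples (timestamp, value) FROM stdin;
1514764800    42.0
1514768400    43.5
\.

The header line is what line 10. matches, the data rows are what line 5. filters on their first field, and the final \. is the end-of-table-data marker caught by line 2.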
It's important to note that awk evaluates all clauses in order for every line of input (except for the special BEGIN and END clauses), so some lines of input might 'hit' several awk statements. This is beneficial, but it also means that the order of the clauses is important. (Consider what would happen if the clause on line 10. had been placed before the clause on line 3.)
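As a minimal, self-contained illustration of this behaviour (input and patterns invented for the example):

printf 'foo\nbar\n' | awk '/foo/ { print "pattern clause: " $0 } // { print "catch-all: " $0 }'
# pattern clause: foo
# catch-all: foo
# catch-all: bar

The line 'foo' hits both clauses, in the order they appear in the program; 'bar' only hits the catch-all.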
Summing up
The entire operation of pg_dumpall, awk, and psql import took around 3 hours, which was acceptable downtime for this database as part of scheduled nighttime maintenance. The space taken up by the Postgres data directory went down from ~760GB to ~200GB.
That day I learned that plain text is still the lingua franca of UN*X, and that 40-year-old tools are still excellent at what they were built to do.