PostgreSQL Tips and Tricks
Introduction
Background
PostgreSQL, sometimes referred to as Postgres, can seem daunting when compared to MySQL. The sheer number of features and options could make you back off a bit when approaching this database server. I wrote this article of tips and tricks on tuning PostgreSQL performance because I found there was a lack of concise information on the subject on the internet.
I assume you are already familiar with setting up PostgreSQL on your favourite Linux distribution (try emerge postgresql or apt-get install postgresql as root). The tricks I'll be presenting here can be applied to versions 7.x and 8.x, as far as I was able to test them; specifically, I tested them on PostgreSQL 7.3.5 on Gentoo Linux and on PostgreSQL 8.0.1 on Ubuntu 5.10 "Breezy Badger".
Outsmarting The Query Planner
A head start, or so I thought
A few years ago, I thought I was smart. Really smart. I was writing a new database application and I had chosen PostgreSQL as the backend server. Being relatively new to it and having heard rumors that "PostgreSQL has more features than MySQL but is slower," I decided that features and ease of maintenance outweighed performance. I already knew MySQL; I had only just met PostgreSQL.
In a typical development scenario I wrote my tables, created views and made stored procedures. All went well, and at this point I was happy I had abandoned MySQL, simply because I now had referential integrity and features like views. Mind you, MySQL was at version 3.x back then, slowly moving to 4.x. I was happy and confident I had made the right choice.
Testing went well, and as the weeks went by I filled the database with the data I wanted. The database was going to fuel a website, and I just needed a little more time to get all the data in before the launch. I was on track and the project progressed at a fast pace.
The right tools for the job
A week before the launch of the website I started using the built-in query planner, through EXPLAIN ANALYZE statements, to check the execution times of key queries. Much to my surprise, it seemed that PostgreSQL wasn't utilizing my column indices the way I wanted it to. The query planner reported execution times of around 40 milliseconds for simple one-record queries like this (just a fictive example):
SELECT authorization_level, name FROM authorization_levels WHERE administrator_id = 7;
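For reference, prefixing a query with EXPLAIN ANALYZE makes PostgreSQL print the chosen plan together with the actual execution time. A planned sequential scan shows up in the output roughly like this (the plan and timings below are illustrative, not copied from my server):
EXPLAIN ANALYZE SELECT authorization_level, name FROM authorization_levels WHERE administrator_id = 7;
-- Seq Scan on authorization_levels  (cost=0.00..1540.00 rows=1 width=36) (actual time=0.05..39.71 rows=1 loops=1)
--   Filter: (administrator_id = 7)
-- Total runtime: 39.82 msec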
Delving into the query planner's analysis, I saw a lot of sequential table scans ("seq scans" for short) being planned for my "slow" queries. I knew there had to be some way around this self-proclaimed madness. Skimming the mailing list archives and the online documentation, I found that there is an option in the postgresql.conf configuration file tailored specifically to this goal. Look for this line:
#enable_seqscan = true
With the hash removed and the enable_seqscan option set to false, my query started using the index on the administrator_id column. It now executed in around 25 milliseconds, which seemed great at the time. Being overly confident, I analyzed some other simple queries like this and started adding indices to all columns that were being used in WHERE clauses.
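For the record, adding such an index is a one-liner; the index name below is a hypothetical one of my own choosing:
-- index the column used in the WHERE clause (hypothetical name)
CREATE INDEX idx_admin_id ON authorization_levels (administrator_id);
-- refresh the planner's statistics for the table
ANALYZE authorization_levels;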
Something happening, slowly but surely
The website launched, and with great success at that. The first days, weeks and months passed and all was well; no active maintenance was necessary, and I made sure I had daily backups. In the beginning I even ran the backup job every hour, but that seemed overkill after a while. The daily backup job consists of a simple VACUUM FULL ANALYZE and a pg_dumpall, so I can load the raw SQL data into a different server when the need arises.
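As a minimal sketch, the SQL half of that daily job boils down to a single statement; the pg_dumpall step runs as a separate shell command from the same cron job:
-- reclaim dead rows, compact the tables and refresh planner statistics
VACUUM FULL ANALYZE;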
Now you might think this story turns into a miserable tale about failed backups, incomplete copies and data loss. Well, it won't. This article continues with the fact that after two years I noticed the server's RAID array slowly filling up day by day, and query performance dropping badly, despite the regularly performed VACUUMs.
What was going on? I was determined to figure it out.
Disk drive drivel
The aforementioned RAID array was mounted as the /var partition, on which the PostgreSQL server stored its data. With the partition filling up to nearly 90%, excluding superuser-reserved blocks, I figured some maintenance was necessary or the whole website would grind to an unpleasant halt. A first analysis using du revealed that it was indeed the /var/lib/postgresql directory gobbling up the space.
I must admit that using du isn't really helpful in the case of PostgreSQL. Of course, you get to see the disk usage in the form you requested, but it doesn't give you a lot of information to judge the cause. See this short snippet from a smaller PostgreSQL server:
postgres@breezy-server:~/8.0/main$ du -k
176 ./global
4 ./pg_xlog/archive_status
49220 ./pg_xlog
12 ./pg_clog
12 ./pg_subtrans
4712 ./base/1
4712 ./base/17229
8120 ./base/17230
4 ./base/17231/pgsql_tmp
27176 ./base/17231
4872 ./base/17232
4712 ./base/17233
12756 ./base/17234
67064 ./base
4 ./pg_tblspc
116504 .
Anyone feeling lucky enough to give a clue as to where all those megabytes are going? I wasn't, so I had to look further.
If you're in trouble you can send a telegram to the Right People
I've got one hard-to-find piece of advice coming up, which could take you hours to dig up if you don't know what to look for. I tried searching for ways to retrieve the disk usage per table, or the table size, for PostgreSQL, but that didn't get me what I wanted. So let me offer you this next, simple but effective, SQL gem for retrieving the disk usage of tables and indices in your PostgreSQL database:
-- select size of tables and indices in random order
SELECT relname, reltuples, relpages FROM pg_class ;
-- select size of tables and indices in descending order of size
SELECT relname, reltuples, relpages FROM pg_class ORDER BY relpages DESC ;
-- select size of tables and indices in descending order of tuple- / recordcount
SELECT relname, reltuples, relpages FROM pg_class ORDER BY reltuples DESC ;
Go ahead now, try these on your own database. I find the middle query the most informative, because you see exactly where the major part of your disk usage comes from. Look at the number of pages taken up by each entry; the default page size is 8 KB, so multiply the number you see by eight and you have an idea of how large each table and index is on disk. Keep in mind that reltuples and relpages are estimates, refreshed by VACUUM and ANALYZE, so run an ANALYZE first if the numbers look stale.
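If you prefer the arithmetic done for you, a slight variation on the middle query reports approximate sizes in megabytes, assuming the default 8 KB page size:
-- approximate on-disk size in megabytes, assuming 8 KB pages
SELECT relname, relpages * 8 / 1024.0 AS approx_mb FROM pg_class ORDER BY relpages DESC ;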
I faced reality the hard way, sort of
Looking at the results of the size query, I saw that the top entries were all column indices: the very indices I had added in the beginning of the site to evade sequential scans. Then it dawned on me that these indices, which at more than 80 megabytes each were quite out of proportion, were the root cause of both the disk usage and the performance drops.
I imagined my poor server machine reading a full 80 megabyte index from disk just to decide which single record would have to be returned.
I decided to check out whether or not my configuration change of disabling sequential scans still had any impact on the query performance. It did. The execution time without sequential scans was over 40,000 milliseconds, or 40 seconds. With sequential scans it dropped to 600 milliseconds, just over half a second. That's a big difference.
Enabling or disabling sequential scans can be done from within the PostgreSQL console using these statements:
-- disable sequential scans for the current connection
SET enable_seqscan = false ;
-- enable sequential scans for the current connection
SET enable_seqscan = true ;
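This makes it easy to compare both plans for the same query back to back; a short sketch using the fictive example query from earlier:
-- compare the planner's behaviour with and without sequential scans
SET enable_seqscan = false ;
EXPLAIN ANALYZE SELECT authorization_level, name FROM authorization_levels WHERE administrator_id = 7 ;
SET enable_seqscan = true ;
EXPLAIN ANALYZE SELECT authorization_level, name FROM authorization_levels WHERE administrator_id = 7 ;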
If your psql client was built with readline support, you can easily hop back and forth through previous SQL statements using the up and down arrow keys.
Do fiddle around with all these settings; the knowledge will save you one day when you least expect it.
Cause and effect
After identifying the problem, I quickly re-enabled sequential scans server-wide from within the configuration file. The query planner had proved to be effective at scale, at least good enough for managing the queries on my database. Afterwards, none of my enforced column indices were being used anymore and I decided to get rid of them. End result: same data, faster queries, at a fraction of the disk usage.
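Getting rid of an index is again a one-liner; the name below is the hypothetical one from before. Dropping an index removes its file, so the disk space is reclaimed right away:
-- remove an index the planner no longer uses (hypothetical name)
DROP INDEX idx_admin_id ;
-- refresh the statistics afterwards
ANALYZE authorization_levels ;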
Conclusion
If it ain't broke (enough) don't fix it
I was tempted to put a question mark at the end of this paragraph's header because I figured it'd be a nice cynical touch. The conclusion already seems clear: premature optimization like this can be fatal. Luckily I had the tools and knowledge available to tackle the problem, because otherwise I might have been severely disappointed in PostgreSQL.
As I see it now, PostgreSQL is a really good database server, capable of offering advanced RDBMS features to anyone able to install a Linux distribution. I have provided you with some of the tools and knowledge I found valuable; I hope you take the time to contact me if you want to talk about this article. I will post new findings and anecdotes in the future.
Related Pages
See also the MySQL Tips and Tricks page on this site.
About this article
This article was added on the 16th of December 2005 and updated on the 15th of March 2006. On the 10th of September 2008 I corrected a small error in one of the table size queries, thanks to Dana Olson.