The Startup CTO’s Handbook To Running on Postgres
So you’re starting a new project – or company – and you’ve decided that you’re going to run on Postgres. The hard part (picking your database) is over, and now the fun part begins: making sure that you don’t need to think about it again for another few years.
This post will run through some lesser-known tips for configuring Postgres for the long haul, from connection pooling to security basics to extensions and indices.
Set Up Connection Pooling for Postgres With PGBouncer
By default, Postgres forks a separate backend process off the main server process for each client connection. At low volume, the time it takes to create and destroy those processes – plus the fact that they’re never reused – won’t matter. You can raise max_connections manually, but eventually, as you scale up, you’ll likely run into issues here. Connection pooling helps you essentially “cache” these connections and reuse them as clients connect and disconnect from your database.
Though you can build connection pooling into your application logic, most opt for a third-party tool, and in Postgres’s case, that’s PGBouncer. It’s an open source, lightweight connection pooler that you can install on either your database server or your application server. You’ve got 3 levels of pooling to choose from:
- Session pooling: stays true to the “client connections are indefinite” model and keeps a connection open for the entire time a client is connected.
- Transaction pooling: connections last for a single transaction, after which they’re sent back to the pool.
- Statement pooling: connections last for a single statement, so multi-statement transactions aren’t possible in this mode.
Most choose session pooling – it’s the most conservative option and the least likely to break your application’s assumptions about its connections – but every app is different, and you’ll need to figure out the right mode for your constraints.
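To make this concrete, here’s a minimal pgbouncer.ini sketch – the database name, host, and userlist path are placeholders you’d swap for your own. Your application then connects to PGBouncer’s port (6432 by default) instead of straight to Postgres on 5432:

[databases]
; placeholder entry: route "mydb" to the local Postgres server
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; session, transaction, or statement
pool_mode = session
; how many client connections PGBouncer will accept
max_client_conn = 500
; how many server connections to keep per database/user pair
default_pool_size = 20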
Connection Pooling’s Performance Impact
The million-dollar question, though: does this actually work? Percona ran a series of benchmarks to figure out how PGBouncer impacts performance. With a small number of concurrent clients (<60), PGBouncer actually degrades transactions per second (TPS) by a good deal because of the overhead of pooling. But by the time you scale up to >100, you start to see meaningful performance benefits.
So do you need a connection pooler right away to support your first few users? Probably not. But using PGBouncer will help you once you reach even low/moderate traffic.
Postgres Security for Dummies
For the first few weeks you’re working on your project, it’s usually just one or two developers working on an empty database; security is not top of mind. But as you launch your app to the world, it’s going to need to be. And with databases, it sometimes seems like there are a million different ways to lock things down.
Restricting Access at the Host or User Level
Let’s start with access. Postgres restricts access in two ways:
- At the host level – defining IP addresses and domains with access rights
- At the user level – defining database users and their permissions
The pg_hba.conf file in the PGDATA directory is where you define who can connect to which databases. If a client doesn’t have an entry in there, it won’t be able to access the database at all. For example, here’s how you’d trust any TCP/IP connection coming from the database machine itself:
# Trust any connection via TCP/IP from this machine
host all all 127.0.0.1 255.255.255.255 trust
Outside of just “trust any connections from this machine,” there are tons of different ways to authenticate your client with the database server, from password to ident to certificates. And if you’ve eschewed the great comforts of RDS (or Kinsta) and are running your backend on the same server as your database, you can connect via Unix sockets instead of TCP/IP.
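For reference, here’s roughly what those entries might look like – the user name and address are placeholders you’d replace with your application server’s actual values:

# A remote application server authenticating with a password (scram-sha-256)
host    all    app_user    10.0.0.5/32    scram-sha-256

# Local connections over a Unix socket, authenticated by OS user name
local   all    all                        peer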
Authorization and Privileges
Once your client itself is authenticated, you need to deal with the question of authorization. The SQL standard defines a privilege system, and each object in Postgres (like a table, row, etc.) has different privileges relating to it that can be assigned to users: things like SELECT and UPDATE, but also TRUNCATE, REFERENCES, TRIGGER, etc. You bestow privileges upon users with the GRANT command.
Best practice is to follow the principle of least privilege, so the database user that you create for your client(s) should only be able to access whatever it needs to access.
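As a sketch of what that looks like in practice (the role, database, and table names here are made up), you might create a dedicated application role and grant it only what it needs:

-- Hypothetical application role with only the privileges it needs
CREATE ROLE app_user WITH LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE my_app TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE ON orders, customers TO app_user;
-- Deliberately no TRUNCATE, no DDL, and no access to any other tables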
Row Level Security
The last thing to cover here is row level security. RLS exists from the perspective of the table (not the user) and restricts which rows can be accessed, updated, etc. By default, tables do not have RLS enabled, so your user will be able to do whatever their access policies dictate. To enable RLS for a table, you’d start with:
ALTER TABLE [table_name] ENABLE ROW LEVEL SECURITY;
And then add a policy. Let’s say you want to restrict read access to your lightsaber_internals table to trustworthy individuals, who are already defined in the jedi user group, such that only a lightsaber’s owner can see its internal details. Assuming the table stores each owner’s database user name in a lightsaber_jedi column, here’s how you’d do it:
ALTER TABLE lightsaber_internals ENABLE ROW LEVEL SECURITY;
CREATE POLICY jedi_only ON lightsaber_internals FOR SELECT TO jedi
USING (lightsaber_jedi = current_user);
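To sanity check the policy, you can impersonate a member of the jedi group (the obi_wan role here is hypothetical, and you still need the underlying table privilege – RLS filters rows, it doesn’t replace GRANT):

GRANT SELECT ON lightsaber_internals TO jedi;

SET ROLE obi_wan;                     -- hypothetical member of the jedi group
SELECT * FROM lightsaber_internals;   -- returns only rows where lightsaber_jedi = 'obi_wan'
RESET ROLE;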
RLS policies like this are useful when you need security at a more granular level than just tables (situations with PII, etc.).
Think in Advance About Scaling Problems
At every startup I’ve ever worked at, there has been some degree of manual scaling when it comes to the database. One day you’ll wake up, and Datadog is freaking out because your Postgres server is completely out of space. You’ll investigate, update the incidents page, and eventually upgrade the disk size until it happens again (although next time, it may be a RAM issue). Getting ahead of this stuff can help! A few suggestions:
1. Set Up Database Monitoring
Most of the companies I’ve worked at use Datadog for their db monitoring. If you’re using a managed database service, you can probably get by for a bit using their native stuff. Datadog has a good post on their blog covering major metrics you should be watching, like read and write throughput, sequential scans, data written to disk, etc.
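If you want a quick look at a few of these numbers without any external tooling, Postgres’s built-in statistics views already expose them. A couple of illustrative queries (not a substitute for real monitoring):

-- Tables getting lots of sequential scans (possible candidates for an index)
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;

-- Current size of the database on disk
SELECT pg_size_pretty(pg_database_size(current_database()));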
2. Put Together Guidelines for Scaling Vertically
When your team gets paged – and it will happen – the last thing you want is all hands on deck to resolve the issue when, in most cases, a simple scale-up solves the problem. It’s good to put together a basic plan for your team on what’s within scope when you’re running out of space or compute.
3. Vacuuming and Tuning Your Autovacuum
When you DELETE data in Postgres or UPDATE data (which is functionally equivalent to deleting and inserting), Postgres doesn’t actually delete that data right away. Instead, rows are “marked” as deleted by storing the transaction ID of the delete in an xmax header; the reason for this is that it makes MVCC in Postgres more straightforward. But if these rows aren’t really deleted eventually, they’ll start to waste disk space and give you problems.

The easiest way to get rid of these rows is using the VACUUM command. You could run a vacuum manually whenever dead rows build up or even just set it up to run every x minutes, but a better strategy is to autovacuum based on how many dead rows have accumulated. Tuning your autovacuum is a nuanced topic beyond the scope of this post: I’d highly recommend reading 2ndQuadrant’s post about it.
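As a starting point, you can check how many dead rows each table is carrying and, if one particularly hot table needs it, tighten its autovacuum settings (the table name and thresholds below are illustrative, not recommendations):

-- Dead vs. live tuples and the last autovacuum run, per table
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Make autovacuum kick in sooner on one busy table
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold = 1000
);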
4. Set Up a Read Replica (or Two)
This one is easy. If you anticipate a meaningful increase in traffic (a launch coming up, etc.), you can easily create read-only replicas (or at least one); they’ll help offload some work from the main DB instance.
If you opt for multiple replicas, you’ll get the added benefit of improved availability if any one of them goes down for any reason. Adding replicas is pretty straightforward with most DBaaS providers; just keep an eye on cost: they’re often priced at the same level as a main DB instance despite being read-only.
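If you do add replicas, it’s worth watching replication lag as well. Using the standard built-in views and functions, the primary reports each replica’s position and a replica can estimate how far behind it is:

-- On the primary: one row per connected replica
SELECT client_addr, state, replay_lsn FROM pg_stat_replication;

-- On a replica: approximate time since the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;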
Add Indexes To Your (Anticipated) Largest Tables
Database indexes help speed up read queries by creating ancillary data structures that make your scans faster. For many use cases, adding an index to a table or two is basically a no-brainer. In Postgres, you can create an index with the CREATE INDEX command (duh). When you query a table, the database will check to see if an index exists, and use it if it does (you can verify this is happening with EXPLAIN, by the way).

The most popular type of index in Postgres – and the default one when using CREATE INDEX – is a B-tree index. Essentially, it takes the column you want to create an index on, sorts it, and stores pointers to the sorted rows. That way, you can get binary search efficiency on whatever column you want, not just the column that the actual table is sorted on (if there even is one). You can read more in-depth about how these trees are implemented in the Postgres docs here.
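For example (the orders table and customer_id column are hypothetical), you might index a column you constantly filter on and then confirm the planner actually uses it:

-- CONCURRENTLY avoids locking out writes while the index builds
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

-- Check the plan: look for "Index Scan using idx_orders_customer_id"
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;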
Though useful, indexes are not all fun and games; they take up space, and if you aren’t careful with how many and what type you create, they can actually start to degrade database performance. Nobody says it better than the Postgres docs themselves:
“Indexes are primarily used to enhance database performance (though inappropriate use can result in slower performance).”
Under the hood, when you create an index, Postgres builds a separate data structure containing the indexed values along with pointers to the corresponding table rows. Too many of those structures eat up disk space, make INSERT and UPDATE queries take longer, and force the query planner to consider more options before choosing how to execute a query.
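One way to keep that in check is to periodically look for indexes the planner never uses and consider dropping them:

-- Indexes that have rarely (or never) been used since stats were last reset
SELECT relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 10;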
Bonus: Add Some Postgres Extensions
One thing that makes Postgres unique is native support for extensions from third parties. You can write them in SQL or C, and they can be as small as a couple of statements or as large as an entire software library. Using publicly available / open source extensions helps you in the same way that using a software package does; why write your own code when you can use someone else’s? Here are a few of the more popular Postgres extensions:
Timescale
Timescale is a Postgres extension for working with time series data. In short, it makes your queries (a lot) faster and stores time series data very efficiently. You can find installation instructions here, or consider Timescale’s cloud-hosted option if you’re really running your business on time-series data (although you are probably already aware of this if you are).
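As a quick sketch of what using it looks like (the metrics table is made up, and the exact call may differ between Timescale versions), you turn an ordinary table into a hypertable and then query it with plain SQL:

CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Hypothetical table of sensor readings
CREATE TABLE metrics (
  time        TIMESTAMPTZ NOT NULL,
  device_id   TEXT NOT NULL,
  temperature DOUBLE PRECISION
);

-- Partition it by time under the hood
SELECT create_hypertable('metrics', 'time');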
PostGIS
PostGIS adds support to Postgres for storing, indexing, and querying geographic data (think lines, polygons, locations, etc.). Most cloud providers pre-install it, but if you need to install it yourself, you can find installation instructions here.
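Once it’s available, enabling it and running a simple proximity query looks something like this (the stores table and the coordinates are made up):

CREATE EXTENSION IF NOT EXISTS postgis;

-- Hypothetical table of store locations
CREATE TABLE stores (
  id       SERIAL PRIMARY KEY,
  name     TEXT,
  location GEOGRAPHY(Point, 4326)
);

-- Stores within 5 km of a given longitude/latitude
SELECT name
FROM stores
WHERE ST_DWithin(location, ST_SetSRID(ST_MakePoint(-73.98, 40.75), 4326)::geography, 5000);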
pg_stat_statements
pg_stat_statements creates a view in your Postgres database with statistics on every query run against the database. You can see stats like how long a query takes to run (mean, min, max, standard deviation, etc.), who ran the query, block cache hits, number of blocks written, and a lot more (over 40 columns in that view, depending on your Postgres version). To install, add it to shared_preload_libraries in postgresql.conf and restart the server.
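A minimal setup sketch, plus the sort of query you’d run against the view (the column names shown are for Postgres 13 and later; older versions call them total_time / mean_time):

# in postgresql.conf, then restart
shared_preload_libraries = 'pg_stat_statements'

CREATE EXTENSION pg_stat_statements;

-- The ten queries consuming the most total execution time
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;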
pg_audit
pg_audit helps companies that are subject to detailed audits (e.g., governmental or financial ones). You can get Postgres to log every single statement against the database by setting `log_statement=all`, but that doesn’t mean the information you need will be easy to search and find. pg_audit uses Postgres’s internal logging functions to make it easier to find and work with the logs an auditor might need. You can find installation instructions here.
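A configuration sketch (the log classes below are examples of pg_audit’s settings; pick whichever your auditors actually need) – note that the extension registers itself as pgaudit:

# in postgresql.conf, then restart
shared_preload_libraries = 'pgaudit'
pgaudit.log = 'write, ddl, role'

After the restart, run CREATE EXTENSION pgaudit; in the database, and the audit records will show up in the standard Postgres log.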
Summary
Postgres is a great (and highly popular) option to build your company on and one we’re proud to support at Kinsta. We hope these tips help you get up and running and ready for scale. Other tips or thoughts from your experience? Let us know here.