At Hardfin, we use AWS RDS for managed PostgreSQL instances to simplify our infrastructure and focus on product. For the most part, RDS is “just PostgreSQL” and acts like the PostgreSQL instances that we run on our development machines. Unfortunately, it differs in a few key ways — and that opens the door for dreaded “it works on my machine” issues.

We recently encountered a surprising error in RDS, which we couldn’t replicate in our local development setup. In order to prevent future issues like this, we took steps to make our Dockerized PostgreSQL instance much more similar to an RDS instance.

We are big proponents of Infrastructure-as-Code at Hardfin. We utilize Terraform to describe as much of our infrastructure as we can, including AWS RDS instances running PostgreSQL.

In addition to AWS resources, we use Terraform to directly manage some PostgreSQL resources. A fresh RDS instance comes with a single role (the RDS master user) available to DB operators. We only use this this RDS master user via the PostgreSQL provider to create new application roles with minimal privileges.

Getting locked out

RDS best practices recommend against using the RDS master user within applications; it has elevated privileges beyond what applications should be able to do.

As part of scoping a minimal set of privileges, we utilize grants to create (1) an admin role used during migrations and other offline operations, and, (2) an application role used directly while running the application. The admin role permissions allow a full set of DDL operations while the application role has a scope limited to DML operations.

As part of the process of limiting access, we merged a change to limit which roles can CONNECT to a PostgreSQL DATABASE and to even further limit which roles can CREATE in the DATABASE. After running this against a Dockerized PostgreSQL instance used for local development, all changes were applied successfully.

Once unit tests passed in CI, we merged the change and ran the Terraform plan in our sandbox environment against an actual RDS instance. These changes too were applied successfully, or so we thought. Later that day, another small change was made to the same Terraform workspace. With no signs of trouble in testing, this change was merged. When we went to generate a plan in sandbox, we found out the RDS master user was locked out:

$ terraform apply
...
postgresql_grant.revoke_public: Refreshing state... [id=public_initech_database]
...
│ Error: error detecting capabilities: error PostgreSQL version: pq: permission denied for database "initech"
│   with postgresql_grant.revoke_public,
│   on grants.tf line 6, in resource "postgresql_grant" "revoke_public":
│    6: resource "postgresql_grant" "revoke_public" {
...

What went wrong

The RDS master user got locked out as a result of a change that introduced three DATABASE grants:

resource "postgresql_grant" "revoke_public" {
  database    = postgresql_database.initech.name
  role        = "public"
  object_type = "database"
  privileges  = []
}

resource "postgresql_grant" "app_role_connect" {
  database    = postgresql_database.initech.name
  role        = postgresql_role.app_role_gibbons.name
  object_type = "database"
  privileges  = ["CONNECT"]
}

resource "postgresql_grant" "admin_role_full" {
  database    = postgresql_database.initech.name
  role        = postgresql_role.admin_role_lumbergh.name
  object_type = "database"
  privileges  = ["CREATE", "TEMPORARY", "CONNECT"]
}

After applying, only the admin and application role were allowed to connect to the database1:

initech=> \l initech
                                 List of databases
  Name   |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
---------+----------+----------+-------------+-------------+-----------------------
 initech | lumbergh | UTF8     | en_US.UTF-8 | en_US.UTF-8 | lumbergh=CTc/lumbergh+
         |          |          |             |             | gibbons=c/lumbergh
(1 row)

Crucially, this list does not include the RDS master user:

$ PGPASSWORD=$(cat ./very-secret.txt) psql \
>   "postgres://initech_master_user@initech-sandbox.cthulhu2hyup.zz-outercentral-7.rds.amazonaws.com:5432/initech"
psql: error: connection to server at "initech-sandbox.cthulhu2hyup.zz-outercentral-7.rds.amazonaws.com" (10.54.42.10), port 5432 failed: FATAL:  permission denied for database "initech"
DETAIL:  User does not have CONNECT privilege.

Detecting issues locally

We never detected this CONNECT issue because we were using a superuser in our Dockerized PostgreSQL instance during local development with the Terraform provider. A superuser can never get locked out because it can do anything and everything in the PostgreSQL instance.

To fix this, we introduced an equivalent to the RDS master user locally and stopped using the superuser from Terraform. After switching to this user in the Terraform provider, we were able to replicate the CONNECT issue.

To create the RDS master user locally, we jumped on an RDS instance and examined the attributes of the RDS master user:

initech=> \du initech_master_user
                             List of roles
      Role name      |          Attributes           |    Member of
---------------------+-------------------------------+-----------------
 initech_master_user | Create role, Create DB       +| {rds_superuser}
                     | Password valid until infinity |

initech=> SELECT * FROM pg_catalog.pg_roles WHERE rolname = 'initech_master_user';
       rolname       | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication | rolconnlimit | rolpassword | rolvaliduntil | rolbypassrls | rolconfig |  oid
---------------------+----------+------------+---------------+-------------+-------------+----------------+--------------+-------------+---------------+--------------+-----------+-------
 initech_master_user | f        | t          | t             | t           | t           | f              |           -1 | ********    | infinity      | f            |           | 13370
(1 row)

Using this information, we created an SQL file (rds_master_user.sql) to create this role in the Dockerized PostgreSQL instance:

CREATE ROLE initech_master_user
  WITH ENCRYPTED PASSWORD '...'
  VALID UNTIL 'infinity'
  NOSUPERUSER
  INHERIT
  CREATEROLE
  CREATEDB
  LOGIN
  NOREPLICATION
  CONNECTION LIMIT -1
  NOBYPASSRLS;

In order to ensure this role gets created during PostgreSQL startup, we volume mount it at /docker-entrypoint-initdb.d/rds_master_user.sql.

Fixing the issue

In order to fix the broken RDS instance, we temporarily granted the admin role to the RDS master user:

GRANT lumbergh TO initech_master_user;
-- AFTER: REVOKE lumbergh FROM initech_master_user;

and made sure the master user could connect as well:

locals {
  aws_master_user = "initech_master_user"
}

resource "postgresql_grant" "aws_master_user_connect" {
  database    = postgresql_database.initech.name
  role        = local.aws_master_user
  object_type = "database"
  privileges  = ["CONNECT"]
}

Conclusion

Subtle behavior differences between local development and production can lead to not-so-subtle issues. There is a long tail of problems that can arise if the development setup is too different, particularly issues around permissions.

Emulating the PostgreSQL RDS master user allows Terraform to manage roles locally in the same way they are managed in production. This is just one of many changes needed to make the local development database faithful to production. Luckily, using Terraform locally allows us to bridge many other gaps by giving us the same roles and grants that we have in RDS.

Bill Lumbergh; Initech


  1. The DATABASE privileges (e.g. lumbergh=CTc/lumbergh+) corresponds to CREATE (C), TEMPORARY (T) and CONNECT (c). ↩︎