May Contain Blueberries

the sometimes journal of Jeremy Beker


Shortly after I started my new job at Kolide in March, I was presented with a very interesting problem. We dynamically create views in our database to perform certain queries that are specifically scoped within our system. These views are then queried by our webapp before being destroyed when they are no longer needed. This had been working perfectly fine since the original feature was created, but due to the fact that the view needs to be created, all of the work had to be performed on our primary (writer) database. This presented several problems:

  • It added load to our primary database for what was functionally a read-only operation
  • It required that we do some machinations to switch users to ensure that the view was restricted as much as possible
  • There was a danger that an out of control query could impact our primary, writer database

My project was to come up with a way to do the same operations, but use our read only replica to perform the queries themselves to isolate them as much as possible.

The high level solution was pretty straightforward:

  • Create the view on the primary
  • Execute the query using the view on the read-only replica
  • Remove the view on the primary

A simple implementation would look something like this in ruby:

ActiveRecord::Base.connected_to(role: :writing) do
  create_view_function
end

ActiveRecord::Base.connected_to(role: :reading) do
  execute_query_using_view
end

ActiveRecord::Base.connected_to(role: :writing) do
  delete_view_function
end

However, if you run this code, you will fail on execute_query_using_view with an error stating that the view does not exist. And this is where things get complicated.

Tangent on Postgres replicas: When you have Postgres databases set up in a primary-replica relationship the replica is kept up to date by receiving updates from the primary as changes are made to the primary database. This is done through the WAL (write ahead log). Before the primary database writes out data tables to disk, the complete set of transactions are recorded in the WAL. This allows the primary to recover the database to a consistent state in the event of a crash. But it also facilitates replication. The replica streams the contents of the WAL from the primary and applies those changes to its data tables resulting in an eventually consistent copy of the primary. And therein lies the rub, the replica is potentially always a little behind the primary.

So the error we received that the view did not exist on the replica is really that the view does not exist on the replica yet. The replica has not had enough time to replay the WAL and get the view in place before we tried to query against it.

This requires a more refined solution then:

  • Create the view on the primary
  • Wait until the view has been replicated
  • Execute the query using the view on the read-only replica
  • Remove the view on the primary

Tangent on transaction IDs: Every transaction that is committed to the database is assigned a transaction ID. Per the documentation “The internal transaction ID type xid is 32 bits wide and wraps around every 4 billion transactions.” There are useful functions that can be used to query the status of the transactions that have been committed to the databases.

So, in order to make this work, we have defined two functions that can be used to help understand what has been committed and where.

#
# Return the txn id of what has actually been written out from the server's internal buffers
#
def committed_wal_location(connection)
  # Get the current writer WAL location
  results = connection.exec_query("select pg_current_wal_lsn()::text")

  # the result is a single row with a single column, pull that out
  results.rows.first.first
end

and:

#
# Checks to see if the connection passed has replayed the txn to its tables
#
def has_replica_committed_wal_location?(connection, wal_location)
  # create bind variable
  binds = [ ActiveRecord::Relation::QueryAttribute.new(
      "id", wal_location, ActiveRecord::Type::String.new
  )]

  # Get the current writer WAL location
  results = connection.exec_query("select pg_wal_lsn_diff ( pg_last_wal_replay_lsn(), $1 )", 'sql', binds)

  # the result is a single row with a single column, pull that out and convert to a number
  # negative means that this connection is behind the WAL LSN that was sent in (lag)
  # zero means we are in sync
  # positive means the replica is in the future, shouldn't happen :)
  !results.rows.first.first.to_i.negative?
end

Using these methods, we can build a more robust system that accomplishes our goal:

ActiveRecord::Base.connected_to(role: :writing) do
  # Check out a connection to the pool and return it once done
  ActiveRecord::Base.connection_pool.with_connection do |connection|
    begin
      # Generate dynamic views
      connection.exec_query_with_types(create_statements)

      # Explicitly commit this so that it gets pushed to the replica
      connection.commit_db_transaction

      # Get the WAL current position so that we can wait until the replica has
      # caught up before executing the query over there
      view_creation_timestamp = committed_wal_location(connection)

      # Run the query on the replica
      ActiveRecord::Base.connected_to(role: :reading) do
        ActiveRecord::Base.connection_pool.with_connection do |ro_connection|
          # Verify that the replica is at least up to date with the primary before executing query
          retries = 0
          while !has_replica_committed_wal_location?(ro_connection, view_creation_timestamp)
            raise ReplicaLagError.new("Replica has not committed data in #{retries} tries.") if retries > MAX_REPLICA_RETRIES
            retries +=1

            # Exponential backoff to wait for replica to catch up
            sleep(0.1 * retries**2)
          end

          results = ro_connection.exec_query_with_types(view_sql)
        end
      end
    ensure
      # Drop the views that we created at the beginning
      connection.exec_query_with_types(drop_statements)
    end
  end
end

With this code, we accomplished our goal of being able to create views dynamically, use views to restrict data access, but still gain the advantages and protections of actually executing queries against the views on our replica systems. As of this writing, this has been in production for several months without any issues.