Postgres failover: Documentation

Rationale
=========

Since introduction of the WAL-based replication into the PostgreSQL, it is
possible to create high-availability and load-balancing clusters.

However, there is no support for failover in the client libraries. So, only
way to provide transparent for client application failover is IP address
migration. This approach has some limitation, i.e. it requires that
master and backup servers reside in the same subnet or may not be
feasible for other reasons. It may seem that dynamic DNS update allows
to overcome this limitation. But DNS information is typically heavily
cached, and it is scarcely possible that DNS changes would propagate in
the period of few seconds. Moreover, it might be quite difficult to
allow dynamic DNS updates by failover servers due to political reasons.

Commercial RDBMS, such as Oracle, employ more flexible approach. They
allow to specify multiple servers in the connect string, so if primary
server is not available, client library tries to connect to other ones.

PostgreSQL JDBC driver already implements such feature and allows to
specify multiple servers in the URL.

This approach allows to use geographically distributed failover clusters
and also is a cheap way to implement load-balancing (which is not
possible with IP address migration).

Implementation of this feature at the client library level would provide
uniform interface, usable by all applications and save much work
compared with the implementation on the application level.

Proposed change
===============

Allow to specify multiple hosts in the libpq connect string. 

If multiple hosts are specified, connection could be attempted to 
these server either in sequential order, in random order or
simultaneously, then choosing one connection to use and closing other.

Each variant has its own advantages and drawbacks and each can be
preferred in some situations.

Attempt to connect servers in order and trying the second one only if
first fails minimizes load on the server, but can result in quite long
connect times (if server is powered off completely and connection
attempt fails only when connect\_timeout is reached).

Random order allows to balance load to multiple servers. Unfortunately,
load-balancing has very limited applications on Postgresql, because
multimaster replication is not supported yet. But there can be
situations where multiple readonly clients exists, and they can be
loadbalanced among pool of hot standby servers. Also there are some
proprietary multimaster solutions, which users can benefit from
loadbalancing on the client library level.

Simultaneous connection allows to minimize connect time in case master
server fails, and one of standby servers takes over. But it places
extra load on standby server, if it does accept connections.

In this case each client connects to all servers at once, then checks
read-only status of the connections and closes connections to all
standbys, leaving connection to the master operable.

It is evident that this load should be bearable by standby servers,
because it is exactly same amount of the connections, as master accepts,
and standby servers should be able to take over master role at any time.

If cluster administrator don't want standby server to carry this load,
one can make standby reject connection either by postgresql configuration
or using OS firewall. This approach has evident drawback that
standby to master promotion procedure have to change this configuration
too and make server accept client connection.

Syntax
------


Libpq connect string can be either set of the keyword=value pairs
or an URL. 

In the first form it can be just allowed to specify keyword host
multiple times.

    "host=main-server host=standby1 host=standby2 port=5432 dbname=database"

But this approach has drawback, that it doesn't allow to specify
different ports for different servers.

In order to allow different port on each cluster node, it is allowed
to specify port in the host parameter, using colon as separator (as it
is done in the URL)

   "host=main-server:5432 host=standby1 host=standby2 port=6453"


In this example main-server use port 5432, and all standbys use port,
specified in the `port' parameter
syntax:


In this form connect procedure should walk over the lists of hosts and
ports in parallel.	

In the second form host can be specified either in the first part of URL
or in the query parameters.

    postgresql://user@host/database

    postgresql:///database?host=hostname&user=username

If host is specified as a parameter, it is also possible to allow
multiple host parameters without breaking existing syntax.


But we propose syntax, compatible with postgresql JDBC:

    postgresql://host1:port1,host2:port2/database

In order to implement three modes of operations described above, 
additional parameters should be added:

*. hostorder=keyword, where keyword is one of sequential, parallel and
random. Default is sequential. Sequential is try to connect first server
in the connect string, if it fails, use second and so on. Parallel - try
to connect all servers simultaneously and use first one which would be
able to serve (by default, we need read-write connection, so after
successful authentication we execute `pg_in_recovery()' sql function
operator and consider connection successful only if it returns false)

*. loadBalanceHosts = boolean (as in JDBC and with same semantics). If
true, than connection to multiple hosts is made in random order. For the
compatibility with other parameters, as connect\_timout, this parameter
should probably have an alias load\_balalce\_hosts. Equivalent of
hostorder=random

*. readonly = boolean. If false, then connection is assumed to be
successful only after `select pg_in_recovery()` statement is executed
and returned False. Default false (i.e. any server which accepts
connection assumed to be usable).

*. failover\_timeout = seconds. Number of seconds to retry connection if
no usable master server found. Must be sufficient to allow cluster
watchdog software to complete promotion of one of standbys to master.

Support for this syntax extensions is added to the PQconnectdb, 
PQconnectdbParams, PQConnectStart and PQConnectStartParams,
but not PQsetdb/PQsetdblogin functions. 


Behavior
--------

If PQconnectdb encounters connect string with multiple hosts specified,
it attempts to establish connection with all these hosts according the
hostorder parameter value. If hostorder is sequential (the default) it
tries hosts in order they are specified in the command line, and moves
to next one only if connection to previous is failed. Be sure to specify 
sensible connect_timeout.

If hostorder is random, tries hosts in random order.

If hostorder is parallel, then tries to connect all host simultaneously
and uses that connection which is fully established first. By default.

If the load\_balance\_hosts parameter is true, it is equvalent of 
hostorder=random.

If the parameter readonly is false, after authenticating with server it
executes show transaction\_read\_only, to find out whether current
connection is to the master or to the hot standby, and connection is
	considered successful only if server allows read write transactions.
This allows to have clients which write to the database and clients
which perform read-only access. Read-only clients would be load-balanced
between the master and slave servers, and read-write clients connect only to
the master (whichever server has this role at the moment of connection).

If no usable server found, waits for failover\_delay seconds and then
repeats the procedure until  failover\_timeout seconds passed.

Implementation notes
====================

Information of the alternate servers should be stored in the PGconn structure.
There is already room for storing several server IPs if DNS name
resolves into multiple IPs.

Function PQreset should be able to take advantage of new syntax and
possibly open connection to the new master, if failover occurred
during lifetime of the connection.

Most of changes should reside in the PQconnectPoll state machine. This
function is used by all public connect functions.
Probably couple of new states should be added to ConnStatusType enum.

There also should be changes in the option parsing and in the pg_conn
structure to incorporate multiple hosts data.

There should be new interface function, to allow asynchroneous
applications to get all the currently open sockets to integrate in the
event loop.

If we can manage to avoid simultaneous connections by the applications,
which don't call this function, we probably can achieve 100%
compatibility with existing applications, as it is already documented
feature of PGsocket, that socket descriptor can change during connection
setup phase.


Possible drawbacks
==================

Compatibility
-------------

Proposed patch requires no modifications to the server or protocol, and 
modification of synchronous function (PQconnectdb, PQconnectdbParams) 
doesn't introduce incompatible changes to the client library. 

Even if connect string with multiple host would be erroneously used
with version of libpq, which do not support this feature, it is not an
error.  It just use last host specified in the connect string.

There could be some compatibility problems with asynchronous connections
created with PQConnectStart functions. Problem is that we are trying
to establish several connections at once, and there are several sockets
which should be integrated into application event loop.

Even if we would try servers in some particular order (such as randomized
order during load balancing), file descriptor of socket can change during
execution PQConnectPoll, and existing applications are not prepared to it.

Performance impact
------------------

Performance impact seems to be negligible.

1. If connect string contain only one host, the only complication is the
maintenance of the data structure, which possible can hold more than
one host name. Connection process itself would not be affected.

2. If there is pure high-availability cluster, i.e. standby servers do
not accept client connections on the specified port, there is no extra
load on standby servers, and almost no (only several unsuccessful
connect calls) on client.

3. If there is high-availability cluster with standby servers accepting
connection, then these standby servers have to process same amount of
connection as master does, and client has to handle as many connections
as there are servers in the connect string, executing show
transaction\_read\_only on all of them. It would insufficiently increase
connect time.

4. If there is load balancing cluster, there is no performance impacts
for read-only client, but each read-write client causes standby servers
to process extra connection to the point where server can report
read-only state of transaction (i.e. including SSL handshake and
postgresql authentication). Typically, in the situation where read-only
clients should be load-balanced using this feature, there are much more 
read-only clients, than read-write ones. So some extra load related with 
read-write connection seems to be justified by simplification of client
configuration.

-- 
                  Victor Wagner <vitus@wagner.pp.ru>