
Commit 78132d1

Implement online enable/disable monitor support. (#591)
This allows for adjusting to a lost monitor while keeping our services online. At run-time, we can now

  $ pg_autoctl disable monitor
  $ pg_autoctl enable monitor --monitor ...

Taking care of the order of operations, the cluster can continue behaving correctly with a minimum amount of disturbance.
1 parent dfa3c97 commit 78132d1

28 files changed

Lines changed: 1153 additions & 76 deletions

.travis.yml

Lines changed: 12 additions & 4 deletions
@@ -14,14 +14,22 @@ env:
 matrix:
   fast_finish: true
   include:
-    - env: PGVERSION=10 TEST=single
-    - env: PGVERSION=11 TEST=single
-    - env: PGVERSION=12 TEST=single
-    - env: PGVERSION=13 TEST=single
     - env: PGVERSION=10 TEST=multi
     - env: PGVERSION=11 TEST=multi
     - env: PGVERSION=12 TEST=multi
     - env: PGVERSION=13 TEST=multi
+    - env: PGVERSION=10 TEST=single
+    - env: PGVERSION=11 TEST=single
+    - env: PGVERSION=12 TEST=single
+    - env: PGVERSION=13 TEST=single
+    - env: PGVERSION=10 TEST=ssl
+    - env: PGVERSION=11 TEST=ssl
+    - env: PGVERSION=12 TEST=ssl
+    - env: PGVERSION=13 TEST=ssl
+    - env: PGVERSION=10 TEST=monitor
+    - env: PGVERSION=11 TEST=monitor
+    - env: PGVERSION=12 TEST=monitor
+    - env: PGVERSION=13 TEST=monitor
     - env: LINTING=true
 before_install:
   - git clone -b v0.7.18 --depth 1 https://github.com/citusdata/tools.git

Makefile

Lines changed: 31 additions & 4 deletions
@@ -7,18 +7,45 @@ DOCKER_RUN_OPTS = --privileged -ti --rm
 
 NOSETESTS = $(shell which nosetests3 || which nosetests)
 
+# Tests for the monitor
+TESTS_MONITOR = test_extension_update
+TESTS_MONITOR += test_installcheck
+TESTS_MONITOR += test_monitor_disabled
+TESTS_MONITOR += test_replace_monitor
+
+# Tests for single standby
+TESTS_SINGLE = test_auth
+TESTS_SINGLE += test_basic_operation
+TESTS_SINGLE += test_basic_operation_listen_flag
+TESTS_SINGLE += test_create_run
+TESTS_SINGLE += test_create_standby_with_pgdata
+TESTS_SINGLE += test_debian_clusters
+TESTS_SINGLE += test_ensure
+TESTS_SINGLE += test_skip_pg_hba
+
+# Tests for SSL
+TESTS_SSL = test_enable_ssl
+TESTS_SSL += test_ssl_cert
+TESTS_SSL += test_ssl_self_signed
+
 # Tests for multiple standbys
-MULTI_SB_TESTS = $(basename $(notdir $(wildcard tests/test*_multi*)))
-MULTI_SB_TESTS += $(basename $(notdir $(wildcard tests/test*_disabled*)))
+TESTS_MULTI = test_multi_async
+TESTS_MULTI += test_multi_ifdown
+TESTS_MULTI += test_multi_maintenance
+TESTS_MULTI += test_multi_standbys
 
 # TEST indicates the testfile to run
 TEST ?=
 ifeq ($(TEST),)
 TEST_ARGUMENT = --where=tests
 else ifeq ($(TEST),multi)
-TEST_ARGUMENT = --where=tests --tests=$(MULTI_SB_TESTS)
+TEST_ARGUMENT = --where=tests --tests=$(TESTS_MULTI)
 else ifeq ($(TEST),single)
-TEST_ARGUMENT = --where=tests --exclude='_multi_' --exclude='disabled'
+TEST_ARGUMENT = --where=tests --tests=$(TESTS_SINGLE)
+else ifeq ($(TEST),monitor)
+TEST_ARGUMENT = --where=tests --tests=$(TESTS_MONITOR)
+else ifeq ($(TEST),ssl)
+TEST_ARGUMENT = --where=tests --tests=$(TESTS_SSL)
 else
 TEST_ARGUMENT = $(TEST:%=tests/%.py)
 endif
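
As a rough illustration of the dispatch in this Makefile change, the Python sketch below mirrors how a `TEST` value appears to map to a nosetests argument list. The names `TEST_GROUPS` and `nose_arguments` are hypothetical and exist only for this illustration; the Makefile remains the source of truth.

```python
# Hypothetical mirror of the Makefile's TEST dispatch; names are illustrative.
TEST_GROUPS = {
    "monitor": ["test_extension_update", "test_installcheck",
                "test_monitor_disabled", "test_replace_monitor"],
    "single": ["test_auth", "test_basic_operation",
               "test_basic_operation_listen_flag", "test_create_run",
               "test_create_standby_with_pgdata", "test_debian_clusters",
               "test_ensure", "test_skip_pg_hba"],
    "ssl": ["test_enable_ssl", "test_ssl_cert", "test_ssl_self_signed"],
    "multi": ["test_multi_async", "test_multi_ifdown",
              "test_multi_maintenance", "test_multi_standbys"],
}


def nose_arguments(test=""):
    """Return the argument list that TEST_ARGUMENT would expand to."""
    if test == "":
        # No TEST given: run everything under tests/.
        return ["--where=tests"]
    if test in TEST_GROUPS:
        # Make expands the list space-separated after --tests=, so the first
        # name stays attached to the option and the rest become extra words.
        first, *rest = TEST_GROUPS[test]
        return ["--where=tests", "--tests=" + first, *rest]
    # Any other value names test files, as in $(TEST:%=tests/%.py).
    return ["tests/%s.py" % name for name in test.split()]
```

One consequence of the explicit lists is that a new test file is no longer picked up by a wildcard; it has to be added to one of the `TESTS_*` variables to run in a named group.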

docs/conf.py

Lines changed: 2 additions & 2 deletions
@@ -71,9 +71,9 @@ def __init__(self, **options):
 # built documents.
 #
 # The short X.Y version.
-version = "1.4"
+version = "1.5"
 # The full version, including alpha/beta/rc tags.
-release = "1.4.2"
+release = "1.5.0.2"
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

docs/faq.rst

Lines changed: 6 additions & 22 deletions
@@ -165,26 +165,10 @@ node metadata. Specifically we need the nodes state to be in sync with what
 each ``pg_autoctl`` process has received the last time they could contact
 the monitor, before it has been unavailable.
 
-Barring that, the way forward is to register your nodes again to the new
-monitor. To be able to register again, we need to have a clean initial local
-state on every node, and the ``pg_autoctl drop node`` command achieves that.
+It is possible to register nodes that are currently running to a new monitor
+without restarting Postgres on the primary. For that, the procedure
+mentioned in :ref:`replacing_monitor_online` must be followed, using the
+following commands::
 
-.. warning::
-
-   This procedure includes a step where the Postgres service has to be
-   stopped and started again.
-
-On every Postgres node, starting with the current primary, remove the local node state and register the node
-again to the new running monitor::
-
-  # when running with systemd, stop the systemd service first
-  $ sudo systemctl stop pgautofailover
-
-  # drop node ignores connection error to the monitor, and stops Postgres
-  $ pg_autoctl drop node
-
-  # register again, and restart Postgres on the node
-  $ pg_autoctl create postgres --monitor <new monitor uri> <--same options --as the --first time>
-
-  # when running with systemd, now start the systemd service again
-  $ sudo systemctl start pgautofailover
+  $ pg_autoctl disable monitor
+  $ pg_autoctl enable monitor

docs/operations.rst

Lines changed: 38 additions & 1 deletion
@@ -51,7 +51,7 @@ As a result, here is the standard upgrade plan for pg_auto_failover:
 1. Upgrade the pg_auto_failover package on all the nodes, monitor
    included.
 
-   When using a debian based OS, this looks like the following command when
+   When using a debian based OS, this looks like the following command when
    upgrading from 1.4 to 1.5::
 
      sudo apt-get remove pg-auto-failover-cli-enterprise-1.4 postgresql-11-auto-failover-enterprise-1.4
@@ -361,6 +361,43 @@ The monitor reports every state change decision to a LISTEN/NOTIFY channel
 named ``state``. PostgreSQL logs on the monitor are also stored in a table,
 ``pgautofailover.event``, and broadcast by NOTIFY in the channel ``log``.
 
+.. _replacing_monitor_online:
+
+Replacing the monitor online
+----------------------------
+
+When the monitor node is not available anymore, it is possible to create a
+new monitor node and then switch existing nodes to the new monitor by using
+the following commands.
+
+1. Apply the STONITH approach on the old monitor to make sure this node is
+   not going to show up again during the procedure. This step is sometimes
+   referred to as “fencing”.
+
+2. On every node, ending with the (current) Postgres primary node for each
+   group, disable the monitor while ``pg_autoctl`` is still running::
+
+     $ pg_autoctl disable monitor --force
+
+3. Create a new monitor node::
+
+     $ pg_autoctl create monitor ...
+
+4. On the current primary node first, so that it's registered first and as
+   a primary still, for each group in your formation(s), enable the
+   monitor online again::
+
+     $ pg_autoctl enable monitor --monitor postgresql://...
+
+5. On every other (secondary) node, enable the monitor online again::
+
+     $ pg_autoctl enable monitor --monitor postgresql://...
+
+This operation relies on the fact that a ``pg_autoctl`` process can be
+operated without a monitor, and when reconnecting to a new monitor, this
+process resets the parts of the node state that come from the monitor, such
+as the node identifier.
+
 Trouble-Shooting Guide
 ----------------------
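
The ordering constraints in the documented procedure (disable on primaries last, enable on primaries first) can be sketched as a small planning function. This is only an illustration under stated assumptions: the `Node` type, its fields, and `replacement_plan` are hypothetical names, the function only builds the list of commands and does not run them, and fencing the old monitor (step 1) is left out because it depends on the environment.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    name: str         # hypothetical host label
    group: int        # hypothetical group number of the node
    is_primary: bool  # True for the current primary of its group


def replacement_plan(nodes: List[Node], monitor_uri: str) -> List[Tuple[str, str]]:
    """Return (where, command) pairs in the order the procedure describes."""
    plan = []
    # Step 2: disable the monitor on every node, ending with the primaries.
    # sorted() is stable and False sorts before True, so secondaries come first.
    for node in sorted(nodes, key=lambda n: n.is_primary):
        plan.append((node.name, "pg_autoctl disable monitor --force"))
    # Step 3: create the new monitor (options are elided in the documentation).
    plan.append(("new monitor", "pg_autoctl create monitor ..."))
    # Steps 4 and 5: enable on the primaries first, then on the other nodes.
    enable = "pg_autoctl enable monitor --monitor " + monitor_uri
    for node in sorted(nodes, key=lambda n: not n.is_primary):
        plan.append((node.name, enable))
    return plan
```

For a single group with primary `a` and secondaries `b` and `c`, the plan visits `b`, `c`, `a` for the disable step, then the new monitor, then `a`, `b`, `c` for the enable step.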

docs/ref/reference.rst

Lines changed: 2 additions & 0 deletions
@@ -76,11 +76,13 @@ keeper::
     secondary    Enable secondary nodes on a formation
     maintenance  Enable Postgres maintenance mode on this node
     ssl          Enable SSL configuration on this node
+    monitor      Enable a monitor for this node to be orchestrated from
 
   pg_autoctl disable
     secondary    Disable secondary nodes on a formation
     maintenance  Disable Postgres maintenance mode on this node
     ssl          Disable SSL configuration on this node
+    monitor      Disable the monitor for this node
 
   pg_autoctl get
   + node         get a node property from the pg_auto_failover monitor
