Commit 99ae2b8

Implement drop-at-a-distance semantics. (#734)
* Implement drop-at-a-distance semantics.

  This allows running `pg_autoctl drop node --name foo` from the monitor, or even using the SQL API directly, and having the node realise it has been dropped and then stop. The configuration file, state file, and PGDATA are not touched; to remove them, run `pg_autoctl drop node --pgdata ... --destroy` on the node itself.

* Allow pg_autoctl create node on top of a dropped node.

* Allow pg_autoctl drop node to finish removing a lost node.

  When a node won't come back up again, the first call to pg_autoctl drop node sets the assigned role to DROPPED, but the node is not in a position to call node_active() to clean up this entry. Another call to pg_autoctl drop node from the monitor now cleans up the node for real, with the force option set to true.

  If the node still comes back up again, the situation is properly detected and the command pg_autoctl create node --run is now able to re-register the node with its old nodeid and continue from there.

* Per review, introduce pg_autoctl drop node --force.

* Per review, fix some pg_autoctl drop node cases.

  In particular, dropping a local node that is running in the background was broken. Refactor and simplify the code to make all cases work:

  1. stop local node and drop
  2. drop local node while it's running in the background
  3. drop node from the monitor while it's running
  4. drop node from the monitor after having stopped it

  In case 4 we then need to resort to pg_autoctl drop node --force, as expected.

* Allow Postgres 14 failures.

* Improve handling of a node being dropped during init phases, per review.

* Per review, allow re-creating a primary node from a dropped node.
1 parent f7fa45c commit 99ae2b8

42 files changed

Lines changed: 1944 additions & 979 deletions

.travis.yml

Lines changed: 5 additions & 0 deletions
@@ -35,6 +35,11 @@ matrix:
     - env: PGVERSION=13 TEST=ssl
     - env: PGVERSION=14 TEST=ssl
     - env: LINTING=true
+  allow_failures:
+    - env: PGVERSION=14 TEST=multi
+    - env: PGVERSION=14 TEST=single
+    - env: PGVERSION=14 TEST=monitor
+    - env: PGVERSION=14 TEST=ssl
 before_install:
   - git clone -b v0.7.18 --depth 1 https://github.com/citusdata/tools.git
   - sudo make -C tools install

docs/conf.py

Lines changed: 2 additions & 2 deletions
@@ -71,9 +71,9 @@ def __init__(self, **options):
 # built documents.
 #
 # The short X.Y version.
-version = "1.5"
+version = "1.6"
 # The full version, including alpha/beta/rc tags.
-release = "1.5.2"
+release = "1.6.0"

 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

docs/failover-state-machine.rst

Lines changed: 12 additions & 0 deletions
@@ -255,6 +255,18 @@ Missing WAL bytes are fetched from one of the most advanced standby nodes by
 using Postgres cascading replication features: it is possible to use any
 standby node in the ``primary_conninfo``.

+Dropped
+^^^^^^^
+
+The dropped state is assigned to a node when the ``pg_autoctl drop node``
+command is used. This allows the node to implement specific local actions
+before being entirely removed from the monitor database.
+
+When a node reports reaching the dropped state, the monitor removes its
+entry. If a node is not reporting anymore, maybe because it's completely
+unavailable, then it's possible to run the ``pg_autoctl drop node --force``
+command, and then the node entry is removed from the monitor.
+
 Failover logic
 --------------
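
The Dropped state documented in this hunk can be pictured with a small model of the monitor side. This is only a sketch under stated assumptions, not the monitor's implementation: the type `MonitorNode` and the functions `monitor_drop_node` and `monitor_node_active` are hypothetical names introduced for illustration. It shows the two documented paths: a regular drop assigns the dropped state and waits for the node to report it, while a forced drop removes the entry immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of node states: only what this sketch needs. */
typedef enum
{
	NODE_STATE_SECONDARY,
	NODE_STATE_DROPPED
} NodeState;

/* Hypothetical monitor-side entry for one registered node. */
typedef struct
{
	bool registered;         /* is the entry still known to the monitor? */
	NodeState goalState;     /* state assigned by the monitor */
	NodeState reportedState; /* last state reported by the node */
} MonitorNode;

/*
 * monitor_drop_node models `pg_autoctl drop node [--force]`. Without force
 * the node is assigned DROPPED and keeps its entry until it reports that
 * state. With force the entry is removed right away. Returns true when the
 * entry is gone from the monitor.
 */
bool
monitor_drop_node(MonitorNode *node, bool force)
{
	assert(node != NULL);

	if (!node->registered)
	{
		return true;
	}

	if (force)
	{
		node->registered = false;
		return true;
	}

	node->goalState = NODE_STATE_DROPPED;
	return false;
}

/*
 * monitor_node_active models a node reporting its current state. When the
 * node reports DROPPED after having been assigned DROPPED, the monitor
 * removes its entry. Returns true while the node is still registered.
 */
bool
monitor_node_active(MonitorNode *node, NodeState reported)
{
	assert(node != NULL);

	node->reportedState = reported;

	if (node->goalState == NODE_STATE_DROPPED &&
		reported == NODE_STATE_DROPPED)
	{
		node->registered = false;
	}

	return node->registered;
}
```

In this model a node that never reports again stays registered with the assigned state DROPPED until `monitor_drop_node` is called a second time with `force` set to true, which matches the lost-node scenario described in the commit message.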

docs/fsm.png

21.6 KB

docs/ref/pg_autoctl_drop_node.rst

Lines changed: 29 additions & 11 deletions
@@ -19,6 +19,7 @@ This command drops a Postgres node from the pg_auto_failover monitor::
     --hostname    drop the node with given hostname and pgport
     --pgport      drop the node with given hostname and pgport
     --destroy     also destroy Postgres database
+    --force       force dropping the node from the monitor

 Description
 -----------
@@ -37,6 +38,11 @@ options ``--hostname`` and ``--pgport`` or the pair of options
 monitor database, and get it removed from the known list of nodes on the
 monitor.

+Then option ``--force`` can be used when the target node to remove does not
+exist anymore. When a node has been lost entirely, it's not going to be able
+to finish the procedure itself, and it is then possible to instruct the
+monitor of the situation.
+
 Options
 -------

@@ -75,20 +81,32 @@ Options
   Postgres database for the monitor. When using ``--destroy``, the Postgres
   installation is also deleted.

+--force
+
+  By default a node is expected to reach the assigned state DROPPED when it
+  is removed from the monitor, and has the opportunity to implement clean-up
+  actions. When the target node to remove is not available anymore, it is
+  possible to use the option ``--force`` to immediately remove the node from
+  the monitor.
+
 Examples
 --------

 ::

    $ pg_autoctl drop node --destroy --pgdata ./node3
-   17:49:42 12504 INFO Removing local node from the pg_auto_failover monitor.
-   17:49:42 12504 INFO Removing local node state file: "/Users/dim/dev/MS/pg_auto_failover/tmux/share/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node3/pg_autoctl.state"
-   17:49:42 12504 INFO Removing local node init state file: "/Users/dim/dev/MS/pg_auto_failover/tmux/share/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node3/pg_autoctl.init"
-   17:49:42 12504 INFO Removed pg_autoctl node at "/Users/dim/dev/MS/pg_auto_failover/tmux/node3" from the monitor and removed the state file "/Users/dim/dev/MS/pg_auto_failover/tmux/share/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node3/pg_autoctl.state"
-   17:49:42 12504 INFO Stopping PostgreSQL at "/Users/dim/dev/MS/pg_auto_failover/tmux/node3"
-   17:49:42 12504 INFO /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl --pgdata /Users/dim/dev/MS/pg_auto_failover/tmux/node3 --wait stop --mode fast
-   17:49:42 12504 INFO /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl status -D /Users/dim/dev/MS/pg_auto_failover/tmux/node3 [3]
-   17:49:42 12504 INFO pg_ctl: no server running
-   17:49:42 12504 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
-   17:49:42 12504 INFO Removing "/Users/dim/dev/MS/pg_auto_failover/tmux/node3"
-   17:49:42 12504 INFO Removing "/Users/dim/dev/MS/pg_auto_failover/tmux/config/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node3/pg_autoctl.cfg"
+   17:52:21 54201 INFO Reaching assigned state "secondary"
+   17:52:21 54201 INFO Removing node with name "node3" in formation "default" from the monitor
+   17:52:21 54201 WARN Postgres is not running and we are in state secondary
+   17:52:21 54201 WARN Failed to update the keeper's state from the local PostgreSQL instance, see above for details.
+   17:52:21 54201 INFO Calling node_active for node default/4/0 with current state: PostgreSQL is running is false, sync_state is "", latest WAL LSN is 0/0.
+   17:52:21 54201 INFO FSM transition to "dropped": This node is being dropped from the monitor
+   17:52:21 54201 INFO Transition complete: current state is now "dropped"
+   17:52:21 54201 INFO This node with id 4 in formation "default" and group 0 has been dropped from the monitor
+   17:52:21 54201 INFO Stopping PostgreSQL at "/Users/dim/dev/MS/pg_auto_failover/tmux/node3"
+   17:52:21 54201 INFO /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl --pgdata /Users/dim/dev/MS/pg_auto_failover/tmux/node3 --wait stop --mode fast
+   17:52:21 54201 INFO /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl status -D /Users/dim/dev/MS/pg_auto_failover/tmux/node3 [3]
+   17:52:21 54201 INFO pg_ctl: no server running
+   17:52:21 54201 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
+   17:52:21 54201 INFO Removing "/Users/dim/dev/MS/pg_auto_failover/tmux/node3"
+   17:52:21 54201 INFO Removing "/Users/dim/dev/MS/pg_auto_failover/tmux/config/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/config/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node3/pg_autoctl.cfg"
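
Read together with the four drop cases listed in the commit message, the `--force` documentation in this file suggests a simple rule for when the option is needed. The following is a minimal sketch of that rule, assuming the only relevant inputs are where the command runs and whether the target node is still running; the `DropRequest` type and the `drop_needs_force` function are hypothetical and do not appear in the pg_autoctl sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of how `pg_autoctl drop node` is being invoked. */
typedef struct
{
	bool fromMonitor; /* the command runs on the monitor, not on the node */
	bool nodeRunning; /* the target node's pg_autoctl is still running */
} DropRequest;

/*
 * drop_needs_force returns true when the target node cannot be expected to
 * reach the DROPPED state and report it on its own, which is case 4 of the
 * commit message: a drop issued from the monitor for a node that has been
 * stopped. In the other three cases either the node is running and can
 * report, or the command runs on the node itself.
 */
bool
drop_needs_force(const DropRequest *request)
{
	assert(request != NULL);

	return request->fromMonitor && !request->nodeRunning;
}
```

Under these assumptions only the combination "from the monitor" and "node not running" calls for `pg_autoctl drop node --force`; a caller would otherwise issue the plain command first.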

src/bin/pg_autoctl/cli_common.c

Lines changed: 0 additions & 168 deletions
@@ -41,9 +41,6 @@ int ssl_flag = 0;
 /* stores --node-id, only used with --disable-monitor */
 int monitorDisabledNodeId = -1;

-static void stop_postgres_and_remove_pgdata_and_config(ConfigFilePaths *pathnames,
-													   PostgresSetup *pgSetup);
-
 /*
  * cli_common_keeper_getopts parses the CLI options for the pg_autoctl create
  * postgres command, and others such as pg_autoctl do discover. An example of a
@@ -1361,171 +1358,6 @@ cli_pprint_json(JSON_Value *js)
 }


-/*
- * cli_drop_local_node drops the local Postgres node.
- */
-void
-cli_drop_local_node(KeeperConfig *config, bool dropAndDestroy)
-{
-	Keeper keeper = { 0 };
-
-	/*
-	 * Now also stop the pg_autoctl process.
-	 */
-	if (file_exists(config->pathnames.pid))
-	{
-		pid_t pid = 0;
-
-		if (read_pidfile(config->pathnames.pid, &pid))
-		{
-			log_info("An instance of pg_autoctl is running with PID %d, "
-					 "stopping it.", pid);
-
-			if (kill(pid, SIGQUIT) != 0)
-			{
-				log_error(
-					"Failed to send SIGQUIT to the keeper's pid %d: %m", pid);
-				exit(EXIT_CODE_INTERNAL_ERROR);
-			}
-		}
-	}
-
-	/* only keeper_remove when we still have a state file around */
-	if (!config->monitorDisabled)
-	{
-		if (file_exists(config->pathnames.state))
-		{
-			/* keeper_remove uses log_info() to explain what's happening */
-			if (!keeper_remove(&keeper, config))
-			{
-				log_fatal("Failed to remove local node from the pg_auto_failover "
-						  "monitor, see above for details");
-
-				exit(EXIT_CODE_BAD_STATE);
-			}
-
-			log_info("Removed pg_autoctl node at \"%s\" from the monitor and "
-					 "removed the state file \"%s\"",
-					 config->pgSetup.pgdata,
-					 config->pathnames.state);
-		}
-		else
-		{
-			log_warn("Skipping node removal from the monitor: "
-					 "state file \"%s\" does not exist",
-					 config->pathnames.state);
-		}
-	}
-	else
-	{
-		/* when the monitor is disabled, just remove the state files */
-		if (!unlink_file(config->pathnames.init))
-		{
-			log_error("Failed to remove state init file \"%s\"",
-					  config->pathnames.init);
-		}
-
-		if (!unlink_file(config->pathnames.state))
-		{
-			log_error("Failed to remove state file \"%s\"",
-					  config->pathnames.state);
-		}
-	}
-
-	/*
-	 * Either --destroy the whole Postgres cluster and configuration, or leave
-	 * enough behind us that it's possible to re-join a formation later.
-	 */
-	if (dropAndDestroy)
-	{
-		(void)
-		stop_postgres_and_remove_pgdata_and_config(
-			&config->pathnames,
-			&config->pgSetup);
-	}
-	else
-	{
-		/*
-		 * We need to stop Postgres now, otherwise we won't be able to drop the
-		 * replication slot on the other node, because it's still active.
-		 */
-		log_info("Stopping PostgreSQL at \"%s\"", config->pgSetup.pgdata);
-
-		if (!pg_ctl_stop(config->pgSetup.pg_ctl, config->pgSetup.pgdata))
-		{
-			log_error("Failed to stop PostgreSQL at \"%s\"",
-					  config->pgSetup.pgdata);
-			exit(EXIT_CODE_PGCTL);
-		}
-
-		/*
-		 * Now give the whole picture to the user, who might have missed our
-		 * --destroy option and might want to use it now to start again with a
-		 * fresh environment.
-		 */
-		log_warn("Configuration file \"%s\" has been preserved",
-				 config->pathnames.config);
-
-		if (directory_exists(config->pgSetup.pgdata))
-		{
-			log_warn("Postgres Data Directory \"%s\" has been preserved",
-					 config->pgSetup.pgdata);
-		}
-
-		log_info("drop node keeps your data and setup safe, you can still run "
-				 "Postgres or re-join the pg_auto_failover cluster later");
-		log_info("HINT: to completely remove your local Postgres instance and "
-				 "setup, consider `pg_autoctl drop node --destroy`");
-	}
-}
-
-
-/*
- * stop_postgres_and_remove_pgdata_and_config stops PostgreSQL and then removes
- * PGDATA, and then config and state files.
- */
-static void
-stop_postgres_and_remove_pgdata_and_config(ConfigFilePaths *pathnames,
-										   PostgresSetup *pgSetup)
-{
-	log_info("Stopping PostgreSQL at \"%s\"", pgSetup->pgdata);
-
-	if (!pg_ctl_stop(pgSetup->pg_ctl, pgSetup->pgdata))
-	{
-		log_error("Failed to stop PostgreSQL at \"%s\"", pgSetup->pgdata);
-		log_fatal("Skipping removal of directory \"%s\"", pgSetup->pgdata);
-		exit(EXIT_CODE_PGCTL);
-	}
-
-	/*
-	 * Only try to rm -rf PGDATA if we managed to stop PostgreSQL.
-	 */
-	if (directory_exists(pgSetup->pgdata))
-	{
-		log_info("Removing \"%s\"", pgSetup->pgdata);
-
-		if (!rmtree(pgSetup->pgdata, true))
-		{
-			log_error("Failed to remove directory \"%s\": %m", pgSetup->pgdata);
-			exit(EXIT_CODE_INTERNAL_ERROR);
-		}
-	}
-	else
-	{
-		log_warn("Skipping removal of \"%s\": directory does not exists",
-				 pgSetup->pgdata);
-	}
-
-	log_info("Removing \"%s\"", pathnames->config);
-
-	if (!unlink_file(pathnames->config))
-	{
-		/* errors have already been logged. */
-		exit(EXIT_CODE_BAD_CONFIG);
-	}
-}
-
-
 /*
  * logLevelToString returns the string to use to enable the same logLevel in a
  * sub-process.

src/bin/pg_autoctl/cli_common.h

Lines changed: 0 additions & 1 deletion
@@ -177,7 +177,6 @@ void keeper_cli_destroy_node(int argc, char **argv);
 bool cli_getopt_ssl_flags(int ssl_flag, char *optarg, PostgresSetup *pgSetup);
 bool cli_getopt_accept_ssl_options(SSLCommandLineOptions newSSLOption,
 								   SSLCommandLineOptions currentSSLOptions);
-void cli_drop_local_node(KeeperConfig *config, bool dropAndDestroy);

 char * logLevelToString(int logLevel);
