a short overview of my case:
Originally I had a simple running consul cluster with one server (server1) and several clients. Here is my server1.json configuration file:
{
"server": true,
"datacenter": "dc1",
"data_dir": "/opt/consul",
"bind_addr": "0.0.0.0",
"client_addr":"0.0.0.0",
"bootstrap_expect": 3,
"ui": true,
"retry_join": ["provider=aws tag_key=Function tag_value=consul-server"],
"encrypt":"WT2T9..."
}
NOTE: bootstrap_expect was originally = 1, and I changed to 3 when I added new servers.
For some testing purposes I want to add two more servers to the cluster.
So I configure and add the servers one by one. Everything works as expected and I can confirm that my cluster now has 3 servers with two new servers (server1 and server2) being followers.
Here is my server2.json configuration file:
The config file for server3 looks the same.
Now when my cluster is running, for testing purposes I stoped the service of the server1, which was the leader. As a result, new leader was elected.
consul members
command shows:
Node Address Status Type Build Protocol DC Segment
server2 xx.xx.x.xx1:8301 alive server 1.9.3 2 dc1 <all>
server1 xx.xx.x.xx2:8301 left server 1.8.3 2 dc1 <all>
server3 xx.xx.x.xx3:8301 alive server 1.9.3 2 dc1 <all>
Looks good! Server1 has left.
Now I want to bring server1 back to my cluster.
So I restart the consul.service of the server1 and I expect that it will rejoin the cluster.
And it joins it, but in several seconds it is failed and leaves the cluster.
Here are some outputs:
The output of onsul members
:
consul.service - Consul server agent
Loaded: loaded (/etc/systemd/system/consul.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2021-02-04 13:33:25 UTC; 3s ago
Docs: https://www.consul.io/
Process: 18632 ExecStart=/usr/local/bin/consul agent -ui -config-dir=/etc/consul.d/ (code=exited, status=2)
Main PID: 18632 (code=exited, status=2)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0007ec400, 0x40, 0x40)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:113 +0x75
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM(0xc000302900)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:219 +0x42f
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000302900, 0xc00085e340)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:146 +0x55
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: created by github.com/hashicorp/raft.(*raftState).goFunc
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:144 +0x66
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Failed with result 'exit-code'.
The output of journalctl -xe -u consul
shows me the following
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.267Z [INFO] agent: discover-aws: Instance i-0ae416a280967e345 has private ip 10.10.1.555: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.268Z [INFO] agent: discover-aws: Instance i-0e5394a632853e200 has private ip 10.10.1.777: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.268Z [INFO] agent: discover-aws: Instance i-019966401267318b4 has private ip 10.10.1.222: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.268Z [INFO] agent: Discovered servers: cluster=LAN cluster=LAN servers="10.10.1.555 10.10.1.777 10.10.1.222"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.268Z [INFO] agent: (LAN) joining: lan_addresses=[10.10.1.555, 10.10.1.777, 10.10.1.222]
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.271Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: ip-10-10-1-777 10.10.1.777
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.271Z [INFO] agent.server.serf.lan: serf: EventMemberJoin: ip-10-10-1-222 10.10.1.222
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.272Z [INFO] agent.server: Adding LAN server: server="ip-10-10-1-777 (Addr: tcp/10.10.1.777:8300) (DC: dc1)"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.272Z [INFO] agent.server: Adding LAN server: server="ip-10-10-1-222 (Addr: tcp/10.10.1.41:8300) (DC: dc1)"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.274Z [INFO] agent: (LAN) joined: number_of_nodes=3
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.275Z [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.278Z [WARN] agent.server.raft: failed to get previous log: previous-index=93400 last-index=93262 error="log not found"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.505Z [INFO] agent: Synced node info
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: 2021-02-04T13:33:25.526Z [INFO] agent: Deregistered check: check=vault:10.10.1.555:8200:vault-sealed-check
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: panic: failed to decode request: invalid config entry kind: service-intentions
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: goroutine 37 [running]:
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).applyConfigEntryOperation(0xc000923c80, 0xc0008b0421, 0x150, 0x150, 0x169ef, 0x0, 0x0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /home/circleci/project/consul/agent/consul/fsm/commands_oss.go:453 +0xb27
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.New.func1(0xc0008b0421, 0x150, 0x150, 0x169ef, 0xc000538a50, 0xc000eb9380)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /home/circleci/project/consul/agent/consul/fsm/fsm.go:85 +0x56
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).Apply(0xc000923c80, 0xc0013702d0, 0x0, 0x0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /home/circleci/project/consul/agent/consul/fsm/fsm.go:120 +0x24b
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/go-raftchunking.(*ChunkingFSM).Apply(0xc000984780, 0xc0013702d0, 0x53a3f80, 0xbfff1bc9745fc84c)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:66 +0x5b
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000956fe0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:90 +0x2c1
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0007ec400, 0x40, 0x40)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:113 +0x75
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM(0xc000302900)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:219 +0x42f
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000302900, 0xc00085e340)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:146 +0x55
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: created by github.com/hashicorp/raft.(*raftState).goFunc
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:144 +0x66
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Failed with result 'exit-code'.
As you can see from the second output, the server1 is actually joining the cluster, but then smth happens and it leaves it.
panic: failed to decode request: invalid config entry kind: service-intentions
I tried to modify configuration file in different ways and restart the consul server1 --> without success.
Finally I emptied the consul data_dir and restarted the server1 anew --> without success.
Has anybody experienced something like this and can support here?
question from:
https://stackoverflow.com/questions/66054490/consul-server-fails-to-rejoin-the-cluster-leaves-after-few-seconds