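ECS migration broke the RabbitMQ cluster
Failure scenario
- On 2023-02-11, the three ECS instances hosting the RabbitMQ cluster were migrated. The migration changed their hostnames, and the cluster stopped working afterwards. Checking the cluster status returned the following error: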
[root@u-69197-iot ~]# rabbitmqctl cluster_status
Error: unable to perform an operation on node 'rabbit@u-69197-iot'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit@u-69197-iot
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@u-69197-iot']
rabbit@u-69197-iot:
* connected to epmd (port 4369) on u-69197-iot
* epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic
* TCP connection succeeded but Erlang distribution failed
* Node name (or hostname) mismatch: node "rabbit@iZbp12le2q9si3aqiyr4zfZ" believes its node name is not "rabbit@iZbp12le2q9si3aqiyr4zfZ" but something else.
All nodes and CLI tools must refer to node "rabbit@iZbp12le2q9si3aqiyr4zfZ" using the same name the node itself uses (see its logs to find out what it is)
Current node details:
* node name: 'rabbitmqcli-1739-rabbit@u-69197-iot'
* effective user's home directory: /root
* Erlang cookie hash: 2Hi4RI3kn7cLq4+Xw+tVlQ==
Key errors
TCP connection succeeded but Erlang distribution failed
Node name (or hostname) mismatch: node "rabbit@iZbp12le2q9si3aqiyr4zfZ" believes its node name is not "rabbit@iZbp12le2q9si3aqiyr4zfZ" but something else.
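The mismatch follows from how nodes are named: by default a RabbitMQ node is called rabbit@ plus the host's short hostname, so a renamed host no longer matches the name the cluster was built with. A minimal shell sketch of that comparison follows. The helper names node_name_for and check_node_name are hypothetical, and the sketch assumes the default rabbit@ prefix (no custom RABBITMQ_NODENAME).

```shell
# Hypothetical helper: build the default node name for a hostname.
# Assumes the default "rabbit@<short hostname>" naming (no custom RABBITMQ_NODENAME).
node_name_for() {
    # Strip any domain suffix so that only the short hostname remains.
    printf 'rabbit@%s\n' "${1%%.*}"
}

# Hypothetical helper: compare the node name the cluster expects with the
# name derived from a hostname (defaults to this machine's hostname).
check_node_name() {
    expected="$1"
    actual="$(node_name_for "${2:-$(hostname)}")"
    if [ "$expected" = "$actual" ]; then
        echo "OK: $actual"
    else
        echo "MISMATCH: cluster expects $expected, host would start as $actual"
        return 1
    fi
}
```

Running check_node_name rabbit@u-69197-iot on node 1 should report a mismatch for as long as the host still carries the hostname assigned by the migration.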
Solution
- Step 1 [all nodes in the RabbitMQ cluster]: restore the original hostname
Node 1: hostnamectl set-hostname u-69197-iot
Node 2: hostnamectl set-hostname u-248214-iot
Node 3: hostnamectl set-hostname u-10099-iot
- Step 2 [all nodes in the RabbitMQ cluster]: edit /etc/hosts and add an entry for every node
vi /etc/hosts
192.168.0.61 u-69197-iot
192.168.0.62 u-248214-iot
192.168.0.63 u-10099-iot
- Step 3 [primary node of the RabbitMQ cluster]: restart the RabbitMQ service (if the stop command fails, try killing the process)
rabbitmqctl stop
rabbitmq-server -detached
- Step 4 [every other node in the RabbitMQ cluster]: rejoin the cluster through the primary node
rabbitmqctl stop_app
rabbitmqctl join_cluster --ram rabbit@u-69197-iot
rabbitmqctl start_app
- Step 5 [any node in the RabbitMQ cluster]: confirm that the cluster status has recovered
rabbitmqctl cluster_status
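
The checks behind steps 2 and 5 can be scripted. The sketch below is a rough aid, not part of the original procedure. missing_hosts and nodes_missing_from_status are hypothetical helper names. The first reads a file in /etc/hosts format. The second only looks for each node name anywhere in the text printed by rabbitmqctl cluster_status, whose layout differs between RabbitMQ versions, so it confirms that a node is mentioned, not that it is running.

```shell
# Hypothetical helper: print every hostname from the argument list that has
# no entry in the given hosts-format file.
missing_hosts() {
    hosts_file="$1"
    shift
    for name in "$@"; do
        # Skip comment lines, stop at inline comments, and match the name
        # as a whole field after the address column.
        if ! awk -v n="$name" '
            $1 !~ /^#/ {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^#/) break
                    if ($i == n) found = 1
                }
            }
            END { exit !found }
        ' "$hosts_file"; then
            printf '%s\n' "$name"
        fi
    done
}

# Hypothetical helper: read cluster status text on stdin and print every
# expected node name that does not appear in it at all.
nodes_missing_from_status() {
    status="$(cat)"
    for node in "$@"; do
        if ! printf '%s\n' "$status" | grep -qF -- "$node"; then
            printf '%s\n' "$node"
        fi
    done
}
```

Possible usage: missing_hosts /etc/hosts u-69197-iot u-248214-iot u-10099-iot on each node after step 2, and rabbitmqctl cluster_status | nodes_missing_from_status rabbit@u-69197-iot rabbit@u-248214-iot rabbit@u-10099-iot after step 5. Empty output means nothing is missing. The list of running nodes in the status output still needs to be read by eye.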