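Rejoining nodes to a Redis cluster after their directories were accidentally deleted

Incident report

  • 2022.08.09: a developer reported that a Java service could not connect to the Redis cluster. Restarting the Java service restored the connection only briefly, and the Java log showed "No reachable node in cluster":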
RedisConnectionFailureException: No reachable node in cluster; 
nested exception is redis.clients.jedis.exceptions.JedisNoReachableClusterNodeException: No reachable node in cluster

Analysis

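  • Original Redis role layout:
| Node | IP address | Redis ports and roles |
|---|---|---|
| Node 1 (failed) | 192.168.220.28 | 6371 master, 6372 replica |
| Node 2 (healthy) | 192.168.220.73 | 6371 master, 6372 replica |
| Node 3 (healthy) | 192.168.222.40 | 6371 master, 6372 replica |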
  • All nodes: running docker ps -a showed that both Redis containers on node 1 had exited and that the Redis directory /data/znfk/apps/redis-cluster had been deleted:
[root@redis01 ~]# docker ps -a
CONTAINER ID   IMAGE            COMMAND                  CREATED         STATUS                   PORTS                       NAMES
a8870bc03c7c   redis:7.0.4      "docker-entrypoint.s…"   2 weeks ago     Exited (1) 2 weeks ago   0.0.0.0:6371->6379/tcp      redis-cluster1
529ad06a7940   redis:7.0.4      "docker-entrypoint.s…"   2 weeks ago     Exited (1) 2 weeks ago   0.0.0.0:6372->6379/tcp      redis-cluster2

[root@redis01 ~]# docker inspect a8870bc03c7c | grep Source
        "Source": "/data/znfk/apps/redis-cluster/r1/redis.conf",
        "Source": "/data/znfk/apps/redis-cluster/r1/data",

[root@redis01 ~]# ll /data/znfk/apps/redis-cluster
ls: cannot access /data/znfk/apps/redis-cluster: No such file or directory
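  • Healthy nodes: running cluster nodes showed both Redis instances of node 1 as disconnected, while the four healthy instances consisted of 3 masters + 1 replica (slave). This means the master role of node 1 had failed over to 192.168.220.73:6372, which was originally a replica: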
[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster nodes'
6ef7549247dfd29d5c6ffaa5ca4e225056c44224 192.168.220.73:6372@16372 master - 0 1660040148711 7 connected 0-5460
d45842e53979850ff8c194402078e021ff5d7a21 192.168.220.73:6371@16371 myself,master - 0 1660040149000 3 connected 5461-10922
55187a945811d08dfa7dcd487ddbabfb9bf9525b 192.168.222.40:6371@16371 master - 0 1660040148000 5 connected 10923-16383
f9dda14726f18ab82ba8d522af62a2dfb53c80df 192.168.222.40:6372@16372 slave d45842e53979850ff8c194402078e021ff5d7a21 0 1660040149715 3 connected
c94149a9b8021df612f7d8daa5340a889568cadd :0@0 slave,fail,noaddr 55187a945811d08dfa7dcd487ddbabfb9bf9525b 1659491084661 1659491082651 5 disconnected
bf42831856172fe6e04d5eb5221612a4f6d150b9 :0@0 master,fail,noaddr - 1659491091597 1659491089000 1 disconnected
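  • Finally, the developers confirmed that they had accidentally deleted the two Redis directories on node 1; a later reboot of the machine then left both Redis instances disconnected from the cluster.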

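  • In summary, the planned roles in the Redis cluster and how they changed after the failure:

| Node | IP address | Redis port | Original role | Current role | Change after the failure |
|---|---|---|---|---|---|
| Node 1 (failed) | 192.168.220.28 | 6371 | Master, shard A1 | - | Disconnected... |
| Node 2 (healthy) | 192.168.220.73 | 6371 | Master, shard B1 | Master, shard B1 | Unchanged |
| Node 3 (healthy) | 192.168.222.40 | 6371 | Master, shard C1 | Master, shard C1 | Unchanged |
| Node 1 (failed) | 192.168.220.28 | 6372 | Replica, shard C2 | - | Disconnected... |
| Node 2 (healthy) | 192.168.220.73 | 6372 | Replica, shard A2 | Master, shard A1 | Replica promoted to master |
| Node 3 (healthy) | 192.168.222.40 | 6372 | Replica, shard B2 | Replica, shard B2 | Unchanged |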
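
The role counts used in this analysis (3 masters + 1 replica, 2 failed nodes) can also be read off mechanically from the cluster nodes output. The following is a minimal sketch, not part of the original procedure; the function name summarize_cluster_nodes is hypothetical, and it assumes the standard output layout in which the third field holds the comma-separated flags.

```shell
#!/usr/bin/env bash
# Summarize `CLUSTER NODES` output read from stdin.
# Field 3 holds comma-separated flags, e.g. "myself,master" or "slave,fail,noaddr".
summarize_cluster_nodes() {
  awk '
    NF >= 8 {
      # Nodes flagged "fail" are counted separately, whatever their role was.
      if ($3 ~ /(^|,)fail(,|$)/) { failed++; next }
      if ($3 ~ /(^|,)master(,|$)/) masters++
      if ($3 ~ /(^|,)slave(,|$)/)  replicas++
    }
    END { printf "masters=%d replicas=%d failed=%d\n", masters, replicas, failed }
  '
}

# Possible live usage (container name as used in this article):
#   docker exec redis-cluster1 redis-cli cluster nodes | summarize_cluster_nodes
```

Fed the listing captured on node 2 above, this prints masters=3 replicas=1 failed=2, matching the manual reading.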

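Recovery procedure

Note: in the steps below, "failed node" = node 1, and "healthy nodes" = node 2 and node 3.

1. Failed node: start the standalone Redis services

  • 1.1. Node 1: create the two Redis directories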
mkdir -p /data/znfk/apps/redis-cluster/r1/data
mkdir -p /data/znfk/apps/redis-cluster/r2/data
  • 1.2. Node 1: create the two Redis configuration files and docker-compose.yaml, using node 2 as a reference for their contents; remember to change the IP in redis.conf
touch /data/znfk/apps/redis-cluster/docker-compose.yaml
touch /data/znfk/apps/redis-cluster/r1/redis.conf
touch /data/znfk/apps/redis-cluster/r2/redis.conf
  • 1.3. Node 1: start the two Redis services
cd /data/znfk/apps/redis-cluster
docker-compose -f docker-compose.yaml down
docker-compose -f docker-compose.yaml up -d
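
The article does not show the contents of redis.conf, so the following is only a sketch of what step 1.2 might generate. The helper name write_cluster_conf and every setting value are assumptions: port 6379 and the bus port 16371 are inferred from the docker ps and cluster nodes output shown earlier, and the remaining directives are common cluster-mode settings. The real values should be copied from node 2.

```shell
#!/usr/bin/env bash
# write_cluster_conf DIR ANNOUNCE_IP HOST_PORT
# Creates DIR/data and writes a minimal cluster-mode DIR/redis.conf.
write_cluster_conf() {
  local dir=$1 ip=$2 port=$3
  mkdir -p "$dir/data"
  cat > "$dir/redis.conf" <<EOF
# Assumed settings; copy the real values from a healthy node's redis.conf.
port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
appendonly yes
# Address that other cluster members and clients use to reach this instance.
cluster-announce-ip $ip
cluster-announce-port $port
cluster-announce-bus-port $((port + 10000))
EOF
}

# Possible usage on node 1 (paths and IP as used in this article):
#   write_cluster_conf /data/znfk/apps/redis-cluster/r1 192.168.220.28 6371
#   write_cluster_conf /data/znfk/apps/redis-cluster/r2 192.168.220.28 6372
```

Generating both files from one function keeps the two instances identical except for the announced port, which is the only per-instance difference this sketch assumes.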

2. Healthy nodes: remove the disconnected nodes

  • 2.1. Node 2: list the cluster nodes (two of them are disconnected)
[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster nodes | grep disconnected'
c94149a9b8021df612f7d8daa5340a889568cadd :0@0 slave,fail,noaddr 55187a945811d08dfa7dcd487ddbabfb9bf9525b 1659491084661 1659491082651 5 disconnected
bf42831856172fe6e04d5eb5221612a4f6d150b9 :0@0 master,fail,noaddr - 1659491091597 1659491089000 1 disconnected
  • 2.2. Node 2: on both healthy Redis instances, use cluster forget to remove every disconnected node (that is, every node whose state is disconnected)
[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster forget c94149a9b8021df612f7d8daa5340a889568cadd'
[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster forget bf42831856172fe6e04d5eb5221612a4f6d150b9'

[root@redis02 ~]# docker exec -it redis-cluster2 bash -c 'redis-cli cluster forget c94149a9b8021df612f7d8daa5340a889568cadd'
[root@redis02 ~]# docker exec -it redis-cluster2 bash -c 'redis-cli cluster forget bf42831856172fe6e04d5eb5221612a4f6d150b9'
  • 2.3. Node 3: on both healthy Redis instances, use cluster forget to remove every disconnected node (that is, every node whose state is disconnected)
[root@redis03 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster forget c94149a9b8021df612f7d8daa5340a889568cadd'
[root@redis03 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster forget bf42831856172fe6e04d5eb5221612a4f6d150b9'

[root@redis03 ~]# docker exec -it redis-cluster2 bash -c 'redis-cli cluster forget c94149a9b8021df612f7d8daa5340a889568cadd'
[root@redis03 ~]# docker exec -it redis-cluster2 bash -c 'redis-cli cluster forget bf42831856172fe6e04d5eb5221612a4f6d150b9'
  • 2.4. Any healthy node: run cluster nodes to confirm that the disconnected nodes are gone (no node remains in the disconnected state)
[root@redis03 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster nodes'
6ef7549247dfd29d5c6ffaa5ca4e225056c44224 192.168.220.73:6372@16372 master - 0 1660040148711 7 connected 0-5460
d45842e53979850ff8c194402078e021ff5d7a21 192.168.220.73:6371@16371 myself,master - 0 1660040149000 3 connected 5461-10922
55187a945811d08dfa7dcd487ddbabfb9bf9525b 192.168.222.40:6371@16371 master - 0 1660040148000 5 connected 10923-16383
f9dda14726f18ab82ba8d522af62a2dfb53c80df 192.168.222.40:6372@16372 slave d45842e53979850ff8c194402078e021ff5d7a21 0 1660040149715 3 connected
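
Steps 2.2 and 2.3 repeat the same cluster forget call for every healthy instance and every failed node ID. Because a forgotten node is only kept on the receiving node's ban list for 60 seconds, all healthy instances should receive the command within that window, or gossip may reintroduce the node. The sketch below only prints the commands for one host so they can be reviewed before running; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the IDs of nodes flagged "fail" in `CLUSTER NODES` output (stdin).
failed_node_ids() {
  awk '$3 ~ /(^|,)fail(,|$)/ { print $1 }'
}

# Print (do not run) one CLUSTER FORGET command per local container and failed ID.
# Usage: print_forget_commands "<container names>" < cluster-nodes-output
print_forget_commands() {
  local containers=$1 id c
  failed_node_ids | while read -r id; do
    for c in $containers; do
      echo "docker exec $c redis-cli cluster forget $id"
    done
  done
}

# Possible usage on each healthy host (container names as used in this article):
#   docker exec redis-cluster1 redis-cli cluster nodes \
#     | print_forget_commands "redis-cluster1 redis-cluster2"
```

Piping the printed lines to sh on node 2 and node 3 in quick succession would reproduce steps 2.2 and 2.3.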

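3. Healthy nodes: add node 1 back to the cluster

  • 3.1. Node 1: clear the RDB and AOF files of both Redis instances on node 1 and restart them; otherwise adding them to the cluster fails because the instances are not empty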
cd /data/znfk/apps/redis-cluster
docker-compose -f docker-compose.yaml down
rm -rf /data/znfk/apps/redis-cluster/r1/data/*
rm -rf /data/znfk/apps/redis-cluster/r2/data/*
docker-compose -f docker-compose.yaml up -d
  • 3.2. Node 2: add the master node (node 1's 6371) to the cluster; the cluster now has 4 masters + 1 replica
[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli --cluster add-node 192.168.220.28:6371 192.168.220.73:6371'
  # ip1:port1    // the new master node to add to the cluster
  # ip2:port2    // any existing master node in the cluster

[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster nodes'
6ef7549247dfd29d5c6ffaa5ca4e225056c44224 192.168.220.73:6372@16372 master - 0 1660049872864 7 connected 0-5460
d45842e53979850ff8c194402078e021ff5d7a21 192.168.220.73:6371@16371 myself,master - 0 1660049873000 3 connected 5461-10922
55187a945811d08dfa7dcd487ddbabfb9bf9525b 192.168.222.40:6371@16371 master - 0 1660049874000 5 connected 10923-16383
f9dda14726f18ab82ba8d522af62a2dfb53c80df 192.168.222.40:6372@16372 slave d45842e53979850ff8c194402078e021ff5d7a21 0 1660049874871 3 connected
202efd357b6cac7e10659e2a5f8699caf14c7f90 192.168.220.28:6371@16371 master - 0 1660049873000 0 connected
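  • 3.3. Node 2: add the replica node (node 1's 6372) to the cluster as a replica of node 3's master (192.168.222.40:6371); the cluster should then have 4 masters + 2 replicas. Note that the master ID in the command recorded below, d45842e53979850ff8c194402078e021ff5d7a21, belongs to 192.168.220.73:6371 (node 2); to attach the replica to node 3's master as described, pass 55187a945811d08dfa7dcd487ddbabfb9bf9525b instead.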
[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli --cluster add-node --slave --cluster-master-id d45842e53979850ff8c194402078e021ff5d7a21 192.168.220.28:6372 192.168.220.73:6371'
  # --slave   // add the new node as a replica
  # --cluster-master-id   // the master to attach it to; the id is *****
  # ip1:port1    // the replica node to add
  # ip2:port2    // any existing master node in the cluster

[root@redis02 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster nodes'
6ef7549247dfd29d5c6ffaa5ca4e225056c44224 192.168.220.73:6372@16372 master - 0 1660049872864 7 connected 0-5460
d45842e53979850ff8c194402078e021ff5d7a21 192.168.220.73:6371@16371 myself,master - 0 1660049873000 3 connected 5461-10922
55187a945811d08dfa7dcd487ddbabfb9bf9525b 192.168.222.40:6371@16371 master - 0 1660049874000 5 connected 10923-16383
f9dda14726f18ab82ba8d522af62a2dfb53c80df 192.168.222.40:6372@16372 slave d45842e53979850ff8c194402078e021ff5d7a21 0 1660049874871 3 connected
202efd357b6cac7e10659e2a5f8699caf14c7f90 192.168.220.28:6371@16371 master - 0 1660049873000 0 connected
e9d81848d497ccac925fa237de0d75f904625f94 192.168.220.28:6372@16372 slave d45842e53979850ff8c194402078e021ff5d7a21 0 1660050422549 10 connected
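
After adding replicas it is worth checking which masters still have none. The sketch below is an illustration only (the function name masters_without_replica is hypothetical) and relies on the fourth field of a replica's line being its master's ID.

```shell
#!/usr/bin/env bash
# List the addresses of healthy masters that have no healthy replica,
# based on `CLUSTER NODES` output read from stdin.
masters_without_replica() {
  awk '
    $3 ~ /(^|,)fail(,|$)/   { next }                 # ignore failed nodes
    $3 ~ /(^|,)master(,|$)/ { addr[$1] = $2 }        # master: node ID -> address
    $3 ~ /(^|,)slave(,|$)/  { has_replica[$4] = 1 }  # field 4 = master ID of a replica
    END {
      for (id in addr) if (!(id in has_replica)) print addr[id]
    }
  ' | LC_ALL=C sort
}

# Possible live usage (container name as used in this article):
#   docker exec redis-cluster1 redis-cli cluster nodes | masters_without_replica
```

Applied to the listing recorded above, it reports 192.168.220.28:6371, 192.168.220.73:6372 and 192.168.222.40:6371 as masters without a replica, since both replicas in that listing point at d45842e53979850ff8c194402078e021ff5d7a21.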
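  • 3.4. Node 2: log in to node 2's 6372 instance (192.168.220.73:6372, the former replica that was promoted to master) and make it a replica of node 1's master (192.168.220.28:6371), so that node 1's shard has both a master and a replica; the cluster then has 3 masters + 3 replicas. Note that Redis accepts cluster replicate on a master only if that master serves no hash slots and holds no keys.
[root@redis02 ~]# docker exec -it redis-cluster2 bash -c 'redis-cli cluster replicate 202efd357b6cac7e10659e2a5f8699caf14c7f90'
  # replicate   // make the current node a replica of the given node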

[root@redis03 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli cluster nodes'
6ef7549247dfd29d5c6ffaa5ca4e225056c44224 192.168.220.73:6372@16372 slave 202efd357b6cac7e10659e2a5f8699caf14c7f90 - 0 1660049872864 7 connected 0-5460
d45842e53979850ff8c194402078e021ff5d7a21 192.168.220.73:6371@16371 myself,master - 0 1660049873000 3 connected 5461-10922
55187a945811d08dfa7dcd487ddbabfb9bf9525b 192.168.222.40:6371@16371 master - 0 1660049874000 5 connected 10923-16383
f9dda14726f18ab82ba8d522af62a2dfb53c80df 192.168.222.40:6372@16372 slave d45842e53979850ff8c194402078e021ff5d7a21 0 1660049874871 3 connected
202efd357b6cac7e10659e2a5f8699caf14c7f90 192.168.220.28:6371@16371 master - 0 1660049873000 0 connected
e9d81848d497ccac925fa237de0d75f904625f94 192.168.220.28:6372@16372 slave d45842e53979850ff8c194402078e021ff5d7a21 0 1660050422549 10 connected
  • 3.5. Node 1: it is also advisable to rebalance the whole cluster so that the slots are redistributed evenly across the nodes
[root@redis01 ~]# docker exec -it redis-cluster1 bash -c 'redis-cli --cluster rebalance --cluster-threshold 1 192.168.220.28:6371'
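
To judge whether a rebalance is needed, and to confirm afterwards that all 16384 slots are still covered, the slot ranges in the cluster nodes output can be summed per master. This is a minimal sketch with a hypothetical function name. By default redis-cli --cluster rebalance leaves masters that hold no slots untouched, so the --cluster-use-empty-masters option may be needed if a newly added master should receive slots.

```shell
#!/usr/bin/env bash
# Print the number of slots owned by each healthy master, plus the total,
# based on `CLUSTER NODES` output read from stdin.
slots_per_master() {
  awk '
    $3 ~ /(^|,)master(,|$)/ && $3 !~ /(^|,)fail(,|$)/ {
      n = 0
      for (i = 9; i <= NF; i++) {          # slot entries start at field 9
        if ($i ~ /^\[/) continue           # skip slots being migrated or imported
        split($i, r, "-")
        n += (r[2] == "" ? 1 : r[2] - r[1] + 1)   # single slot or start-end range
      }
      printf "%s %d\n", $2, n
      total += n
    }
    END { printf "total %d\n", total }
  '
}

# Possible live usage (container name as used in this article):
#   docker exec redis-cluster1 redis-cli cluster nodes | slots_per_master
```

A total below 16384 means some slots are unassigned, and a master reported with 0 slots is not yet serving any data.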
Copyright © www.sqlfans.cn 2023 All Rights Reserved. Last updated: 2023-07-25 17:48:31
