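Azkaban multi-executor mode: job fails with "No such file or directory"
Incident report
One day a colleague on the big data team reported that a newly deployed Azkaban job (a mysqldump backup) was failing with "No such file or directory", although the same script ran fine when executed manually after logging in to the server.
Troubleshooting
- 1. Changing every path in the script to an absolute path (for example, sh to /bin/sh) did not help; the job still failed.
- 2. The log showed "effective user is: azkaban", but no such user existed on the server. After creating the azkaban user and retrying several times, the job sometimes succeeded and sometimes failed, which ruled out the user as the cause.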
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Starting job rizhao_mysqldump at 1712237103312
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - job JVM args: '-Dazkaban.flowid=rizhao_source' '-Dazkaban.execid=3731409' '-Dazkaban.jobid=rizhao_mysqldump'
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - user.to.proxy property was not set, defaulting to submit user azkaban
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Building command job executor.
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Failed with 5 inputs with exception e = null
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Memory granted for job rizhao_mysqldump
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - 1 commands to execute.
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - cwd=/home/fusion_data/package/azkaban-exec-server/executions/3731409/rizhao_import
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - effective user is: azkaban
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Command: sh /data/it_jobs/jobs.sh
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Environment variables: {JOB_OUTPUT_PROP_FILE=/home/fusion_data/package/azkaban-exec-server/executions/3731409/rizhao_import/rizhao_mysqldump_output_382962359329691643_tmp, JOB_PROP_FILE=/home/fusion_data/package/azkaban-exec-server/executions/3731409/rizhao_import/rizhao_mysqldump_job_props_8893110997836978845_tmp, KRB5CCNAME=/tmp/krb5cc__rizhao_import__rizhao_source__rizhao_mysqldump__3731409__azkaban, JOB_NAME=rizhao_mysqldump}
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Working directory: /home/fusion_data/package/azkaban-exec-server/executions/3731409/rizhao_import
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Spawned process with id 225312
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - sh: /data/it_jobs/jobs.sh: No such file or directory
04-04-2024 21:25:03 CST rizhao_mysqldump INFO - Process with id 225312 completed unsuccessfully in 0 seconds.
04-04-2024 21:25:03 CST rizhao_mysqldump ERROR - Job run failed!
java.lang.RuntimeException: azkaban.jobExecutor.utils.process.ProcessFailureException: Process exited with code 127
- 3. Querying the Azkaban database showed two executors, while the script had been deployed on only one of them, task-1.
mysql> use fusion_metabase_azkaban
mysql> select * from executors limit 20;
+----+----------------+-------+--------+
| id | host           | port  | active |
+----+----------------+-------+--------+
| 29 | task-4.bigdata | 12321 |      1 |
| 30 | task-1.bigdata | 12321 |      1 |   # the script is deployed on this host
+----+----------------+-------+--------+
mysql> select executor_id,count(1),FROM_UNIXTIME(max(update_time)/1000) from execution_flows where executor_id in (29,30) and update_time>=1645459200000 group by executor_id;
+-------------+----------+--------------------------------------+
| executor_id | count(1) | FROM_UNIXTIME(max(update_time)/1000) |
+-------------+----------+--------------------------------------+
|          29 |   118342 | 2024-04-04 21:44:32.9970             |
|          30 |   122038 | 2024-04-04 21:46:33.0120             |
+-------------+----------+--------------------------------------+
- 4. Logged in to the other executor server and deployed the script used by the Azkaban job there.
[root@task-1 ~]# ping task-4.bigdata
PING task-4.bigdata (192.168.1.4) 56(84) bytes of data.
64 bytes from task-4.bigdata (192.168.1.4): icmp_seq=1 ttl=64 time=0.165 ms
[root@task-1 ~]# ssh -p1618 root@192.168.1.4
[root@task-4 ~]# mkdir -p /home/azkaban/it_jobs/dump
[root@task-4 ~]# vi jobs.sh
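- 5. Ran the Azkaban job again, and it succeeded.
Note: some of the retries in step 2 did succeed, presumably because those runs were scheduled onto the task-1.bigdata executor.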
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Starting job rizhao_mysqldump at 1712238580810
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - job JVM args: '-Dazkaban.flowid=rizhao_source' '-Dazkaban.execid=3731477' '-Dazkaban.jobid=rizhao_mysqldump'
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - user.to.proxy property was not set, defaulting to submit user azkaban
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Building command job executor.
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Failed with 5 inputs with exception e = null
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Memory granted for job rizhao_mysqldump
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - 1 commands to execute.
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - cwd=/home/fusion_data/package/azkaban-exec-server/executions/3731477/rizhao_import
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - effective user is: azkaban
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Command: /bin/sh /home/azkaban/it_jobs/jobs.sh
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Environment variables: {JOB_OUTPUT_PROP_FILE=/home/fusion_data/package/azkaban-exec-server/executions/3731477/rizhao_import/rizhao_mysqldump_output_2013999851185447972_tmp, JOB_PROP_FILE=/home/fusion_data/package/azkaban-exec-server/executions/3731477/rizhao_import/rizhao_mysqldump_job_props_3619555286561983768_tmp, KRB5CCNAME=/tmp/krb5cc__rizhao_import__rizhao_source__rizhao_mysqldump__3731477__azkaban, JOB_NAME=rizhao_mysqldump}
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Working directory: /home/fusion_data/package/azkaban-exec-server/executions/3731477/rizhao_import
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Spawned process with id 319255
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Current date and time: 2024-04-04 21:49:40
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - Dumping database: fusiondb2_zzhw
04-04-2024 21:49:40 CST rizhao_mysqldump INFO - mysqldump: [Warning] Using a password on the command line interface can be insecure.
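Steps 3 and 4 can be scripted so that a missing file is noticed before a job fails. The following is only a minimal sketch and was not part of the original incident: check_script_on_executors is a hypothetical helper, REMOTE_CMD is an assumed override variable (plain ssh by default), and the host names are passed in by hand, for example copied from the executors table.

```shell
#!/bin/sh
# Hypothetical helper: verify that a job script exists on every executor host.
# Usage: check_script_on_executors SCRIPT_PATH HOST [HOST...]
# REMOTE_CMD (an assumption of this sketch) is the command used to reach a
# host; it defaults to "ssh" and may carry options, e.g. "ssh -p 1618".
check_script_on_executors() {
  script_path=$1
  shift
  missing=0
  for host in "$@"; do
    # Run "test -f" on the remote side; a non-zero status means the file
    # is absent (or the host could not be reached).
    if ${REMOTE_CMD:-ssh} "$host" test -f "$script_path"; then
      echo "OK      $host"
    else
      echo "MISSING $host"
      missing=1
    fi
  done
  # 0 if every host has the script, 1 if at least one does not.
  return "$missing"
}

# Example call (commented out because it needs real hosts):
# REMOTE_CMD="ssh -p 1618" check_script_on_executors \
#   /home/azkaban/it_jobs/jobs.sh task-1.bigdata task-4.bigdata
```

The sketch assumes key-based SSH access and a script path without spaces. A non-zero return value means at least one executor would fail the job in the same way as the log above.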
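Solution
- In multi-executor mode, either deploy the script on every executor node, or pass the useExecutor flow parameter when running the Azkaban job to pin it to a specific executor (for example, id 30, the executor for task-1.bigdata).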
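As an illustration of the useExecutor option, the sketch below triggers a flow through the Azkaban web server's AJAX API and passes useExecutor as a flow override. Treat it as an assumption-laden example rather than a verified recipe: execute_flow_on_executor and the CURL override variable are hypothetical, the URL and session id are placeholders, and the project and flow names (rizhao_import, rizhao_source) are inferred from the log above. In the Azkaban versions I am aware of, useExecutor is honored only for users with admin permission, so check this against your deployment.

```shell
#!/bin/sh
# Hypothetical helper: start one flow and pin it to a single executor.
# Usage: execute_flow_on_executor AZKABAN_URL SESSION_ID PROJECT FLOW EXECUTOR_ID
# CURL (an assumption of this sketch) allows swapping the curl binary.
execute_flow_on_executor() {
  azkaban_url=$1   # placeholder, e.g. https://azkaban.example.com:8443
  session_id=$2    # obtained beforehand through the web server's login action
  project=$3
  flow=$4
  executor_id=$5   # "id" column of the executors table, e.g. 30

  # %5B and %5D are the URL-encoded brackets of flowOverride[useExecutor].
  ${CURL:-curl} --silent --get \
    --data-urlencode "session.id=$session_id" \
    --data-urlencode "ajax=executeFlow" \
    --data-urlencode "project=$project" \
    --data-urlencode "flow=$flow" \
    --data "flowOverride%5BuseExecutor%5D=$executor_id" \
    "$azkaban_url/executor"
}

# Example call (commented out because it needs a live Azkaban web server):
# execute_flow_on_executor https://azkaban.example.com:8443 "$SESSION_ID" \
#   rizhao_import rizhao_source 30
```

Pinning a job to one executor removes the need to copy the script everywhere, but it also ties the job to that host's availability, so deploying the script on all executors remains the more robust option.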