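Cause of graph gaps, part one:
When a gap appears, the server log shows: Zabbix agent item "net.tcp.listen[58535]" on host "syshotname" failed: first network error, wait for 120 seconds
This is almost always because some item takes too long to execute (the maximum should be 30 s). Source of net.tcp.listen: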

int	NET_TCP_LISTEN(const char *cmd, const char *param, unsigned flags, AGENT_RESULT *result)
{
	FILE		*f = NULL;
	char		tmp[MAX_STRING_LEN], pattern[64];
	unsigned short	port;
	zbx_uint64_t	listen = 0;
	int		ret = SYSINFO_RET_FAIL;

	if (num_param(param) > 1)
		return ret;

	if (0 != get_param(param, 1, tmp, sizeof(tmp)))
		return ret;

	if (SUCCEED != is_ushort(tmp, &port))
		return ret;

	if (NULL != (f = fopen("/proc/net/tcp", "r")))
	{
		zbx_snprintf(pattern, sizeof(pattern), "%04X 00000000:0000 0A", (unsigned int)port);

		while (NULL != fgets(tmp, sizeof(tmp), f))
		{
			if (NULL != strstr(tmp, pattern))
			{
				listen = 1;
				break;
			}
		}
		zbx_fclose(f);

		ret = SYSINFO_RET_OK;
	}

	if (0 == listen && NULL != (f = fopen("/proc/net/tcp6", "r")))
	{
		zbx_snprintf(pattern, sizeof(pattern), "%04X 00000000000000000000000000000000:0000 0A", (unsigned int)port);

		while (NULL != fgets(tmp, sizeof(tmp), f))
		{
			if (NULL != strstr(tmp, pattern))
			{
				listen = 1;
				break;
			}
		}
		zbx_fclose(f);

		ret = SYSINFO_RET_OK;
	}

	SET_UI64_RESULT(result, listen);

	return ret;
}
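For illustration, the IPv4 match performed by the function above can be approximated in shell. This is a minimal sketch, not part of Zabbix: the names listen_pattern and is_listening are hypothetical, and the optional file argument exists only so the check can be tried against a saved copy of /proc/net/tcp.

```shell
#!/bin/bash
# Build the search string the same way the C code does for /proc/net/tcp:
# 4-digit uppercase hex port, all-zero remote address and port, state 0A (LISTEN).
listen_pattern() {
    printf '%04X 00000000:0000 0A' "$1"
}

# Succeed if the port appears as listening in a /proc/net/tcp-style file.
# The second argument is an assumption added for testing; it defaults to /proc/net/tcp.
is_listening() {
    local port=$1 file=${2:-/proc/net/tcp}
    grep -qF "$(listen_pattern "$port")" "$file"
}
```

For the port in the log line, listen_pattern 58535 prints E4A7 00000000:0000 0A. Like the C code, the check has to scan the file line by line until it finds a match, so its cost grows with the number of sockets.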

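Reading the source shows that net.tcp.listen gets its data from /proc/net/tcp. Counting the lines in /proc/net/tcp confirmed that the file was indeed very large. The Zabbix agent allows at most 30 s to execute a key, and the check did not finish within 30 s, which triggered the problem. The server then retries the connection to the Zabbix agent after 120 s, and during those 120 s none of the items are collected, which is why the graphs show gaps.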
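The size check described above can be sketched as follows. The helper names count_sockets and read_seconds are hypothetical, and the timing has only whole-second resolution, which is enough to spot a read that approaches the 30 s limit.

```shell
#!/bin/bash
# Count the socket entries in a /proc/net/tcp-style file (header line excluded).
count_sockets() {
    local file=${1:-/proc/net/tcp}
    local lines
    lines=$(wc -l < "$file") || return 1
    echo $(( lines > 0 ? lines - 1 : 0 ))
}

# Print how many whole seconds one full read of the file takes.
read_seconds() {
    local file=${1:-/proc/net/tcp}
    local start=$SECONDS
    cat "$file" > /dev/null || return 1
    echo $(( SECONDS - start ))
}
```

Running count_sockets and read_seconds without arguments on the affected host gives a rough idea of how long net.tcp.listen needs in the worst case, when the port is not listening and the whole file is scanned.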

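Solution one:
Use a custom key backed by ss.
UserParameter=ss[*],/opt/zabbix/etc/monitor_scripts/ss.sh $1
#!/bin/bash
# Print 1 if the given TCP port is listening, otherwise 0.
port=$1
if test -z "$port"; then
    echo 0
    exit
fi
# Release 5.x: the local address is column 3 of the ss output.
function os5()
{
    # grep -x matches the whole port, so 80 does not match 8080.
    ss -nlp | awk '{print $3}' | grep -iv local | awk -F: '{print $NF}' | grep -qx "$port"
    if [ $? -eq 0 ]; then
        echo 1
    else
        echo 0
    fi
}
# Release 6.x: the local address is column 4 of the ss output.
function os6()
{
    ss -nlp | awk '{print $4}' | grep -iv local | awk -F: '{print $NF}' | grep -qx "$port"
    if [ $? -eq 0 ]; then
        echo 1
    else
        echo 0
    fi
}
# Load the tcp_diag kernel module if it is not loaded yet.
function tcp_diag()
{
    lsmod | grep -iq tcp_diag
    if [ $? -ne 0 ]; then
        modprobe tcp_diag
    fi
}
# Pick the column layout from the release number in /etc/issue.
function ostype()
{
    tcp_diag
    grep -iq "6\." /etc/issue
    if [ $? -eq 0 ]; then
        os6
    else
        os5
    fi
}
ostype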
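One detail of the script is worth isolating: the port comparison should match the whole port number, otherwise a check for port 80 also succeeds when only 8080 is listening. A minimal sketch of an exact match, assuming the input is one "address:port" value per line (the helper name port_in_list is hypothetical):

```shell
#!/bin/bash
# Read "address:port" values on stdin and succeed only if the last
# colon-separated field equals the wanted port exactly.
port_in_list() {
    awk -F: -v want="$1" '$NF == want { found = 1 } END { exit !found }'
}
```

With the 6.x column layout used by the script, it could be called as ss -nl | awk '{print $4}' | port_in_list 58535, followed by echo 1 or echo 0 depending on the exit status.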

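Solution two:
Raise the maximum time the agent may spend executing a key. The source code caps this at 30 s, so the source has to be modified. Testing also showed that changing only the agent's maximum execution time is not enough, because the server or proxy side has the same 30 s maximum. If only the agent's 30 s limit is changed, collecting the item as an active check is sufficient.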
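Before patching the source it may be worth confirming what the agent is currently configured to use. The sketch below reads the Timeout parameter from an agent configuration file. The helper name agent_timeout and the default path are assumptions (the path is guessed from the script location used above), and the fallback of 3 s is the agent's documented default as far as I know, so verify it for your version.

```shell
#!/bin/bash
# Print the Timeout value set in a Zabbix agent configuration file.
# Falls back to 3 (assumed agent default) when the parameter is not set.
agent_timeout() {
    local conf=${1:-/opt/zabbix/etc/zabbix_agentd.conf}   # assumed path
    local value
    # Ignore commented lines; if Timeout appears more than once, use the last one.
    value=$(sed -n 's/^[[:space:]]*Timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    echo "${value:-3}"
}
```

If this already prints 30, the configuration is at the cap described above, and only a source change or an active check can give the item more time.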
