We use Zabbix to monitor our servers, but recently the monitoring agent has been causing some problems of its own.
About once a week we send a fairly large mailshot out to our users. Zabbix monitors the size of the postfix mail queue on each of our mail servers, and stores it in its database so it can draw graphs and send us an alert if the mail queue gets too big. But here’s the problem: counting the mail queue is itself quite intensive, and it seems to be locking up the server when it runs.
After some investigation I found (in /etc/zabbix/zabbix_agentd.conf) that we were using the following command to measure the mailq:
[root@mx1 ~]# time mailq | grep -c '^[0-9A-Z]'
This took 6.59 seconds to run on a queue of about 35,000 messages. You could also run the postqueue command and look at the end of the output:
[root@mx1 ~]# time postqueue -p | tail -5
-- 158346 Kbytes in 34621 Requests.
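If only the number is wanted, the summary line can be parsed rather than read by eye. This is a minimal sketch, assuming the summary keeps the "-- N Kbytes in M Requests." shape shown above; the function name postqueue_count is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `postqueue -p` output on stdin and print the
# request count taken from the trailing summary line.
# Prints 0 when no summary line is present (for example, an empty queue).
postqueue_count() {
    awk '/^-- .* Requests?\.$/ { n = $5 } END { print n + 0 }'
}

# Usage on a mail server:
#   postqueue -p | postqueue_count
```

It still has to wait for postqueue to walk the whole queue, so it is no faster; it only makes the output easier to feed to a monitoring agent.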
But again, this takes over 5 seconds for 35,000 mails. A much quicker way is to count the queue files directly with find:
[root@mx1 ~]# time find /var/spool/postfix/deferred/ /var/spool/postfix/active/ /var/spool/postfix/maildrop/ | wc -l
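Using find is over 100 times faster than the other two methods. Each of those commands reports a slightly different size for the mail queue, but they are pretty close. (Part of the difference is that find, as written, also counts the queue directories themselves; adding -type f would count only the files.) If anyone knows of an even quicker way to measure the queue size then please let me know!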
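For anyone wanting to plug the find approach into the agent, here is a rough sketch of a wrapper script. It assumes the same three queue directories as above and adds -type f so that only files are counted; the function name count_queue_files and the default spool path argument are my own choices, not anything Zabbix or postfix requires.

```shell
#!/bin/sh
# Hypothetical helper: count postfix queue files under a spool directory.
# The default path is an assumption matching the directories used above.
count_queue_files() {
    spool="${1:-/var/spool/postfix}"
    # -type f skips directories, so only queue files are counted.
    # Errors (such as a missing queue directory) are discarded.
    find "$spool/deferred" "$spool/active" "$spool/maildrop" -type f 2>/dev/null \
        | wc -l | tr -d ' '
}

# Print the count for the spool directory given as the first argument,
# or for the default spool directory if none is given.
count_queue_files "$@"
```

Saved somewhere like /usr/local/bin/postfix_queue_count.sh (a made-up path), it could be wired into zabbix_agentd.conf with a line of the form UserParameter=postfix.queue.files,/usr/local/bin/postfix_queue_count.sh, where the key name is also just an example. The agent's user needs read access to the queue directories for the count to be right.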