Michael Petrov
Co-Founder, CEO
11/6/2012
Non-Trivial Monitoring Tuning

A client asked us whether a server had been down after they noticed a reboot in their system. The first resource we use for this is the monitoring logs, and monitoring had not detected any downtime. To verify whether the server had actually rebooted, our technician logged into the server and checked the event log. We discovered that the virtual server had in fact rebooted automatically.

Our client then asked how often we ping for ICMP monitoring and why we had not detected the reboot in the first place. For up and down events we have multiple sensors with two probing cycles. Our normal probing cycle for this particular client is 60 seconds for ICMP and 10 minutes for port probing and the heartbeat report. Because it is a virtual server, the startup was extremely fast. The whole reboot completed within a 45-second window, so it fit entirely between two probes of our standard 60-second cycle and went undetected.
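The timing can be checked with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes probes fire at exact multiples of the interval and that any probe landing inside the outage would see the server as down.

```python
def miss_probability(outage_s: float, interval_s: float) -> float:
    """Chance that an outage falls entirely between two consecutive probes.

    Assumes probes fire exactly every interval_s seconds and that the
    outage starts at a uniformly random moment within the cycle.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return max(0.0, 1.0 - outage_s / interval_s)


def outage_is_missed(outage_start_s: float, outage_s: float, interval_s: float) -> bool:
    """True if no probe (at t = 0, interval, 2 * interval, ...) lands inside the outage."""
    # Time of the first probe at or after the moment the outage begins.
    next_probe = -(-outage_start_s // interval_s) * interval_s
    # The outage is missed if the server is back before that probe fires.
    return next_probe >= outage_start_s + outage_s


# A 45-second reboot against a 60-second ICMP cycle.
print(miss_probability(45, 60))
# Reboot starts 5 seconds after a probe and is over 10 seconds before the next one.
print(outage_is_missed(5, 45, 60))
```

Under these assumptions a 45-second reboot slips through a 60-second cycle roughly one time in four, and any outage shorter than the probing interval can be missed.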

Our clients count on us to tell them about any unexpected reboot. A server could go down in the middle of their transactions, so detecting every reboot is essential. To resolve this issue we implemented another monitoring sensor, which checks whether the server's up-time is smaller than it was at the last heartbeat. With both scripts in place we can detect downtime that falls between the 60-second probes of the first monitor, check for log changes, and be alerted on the 10-minute cycle.
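The post does not show the actual script, so the following is only a minimal sketch of the idea, reading the check as "up-time went backwards since the previous heartbeat". The class name `UptimeSensor` is hypothetical, and how the up-time value is collected from the server is left out.

```python
from typing import Optional


class UptimeSensor:
    """Flags a reboot when reported up-time goes backwards between heartbeats."""

    def __init__(self) -> None:
        self.last_uptime_s: Optional[float] = None  # no heartbeat seen yet

    def check(self, uptime_s: float) -> bool:
        """Record a heartbeat; return True if a reboot happened since the last one."""
        previous = self.last_uptime_s
        self.last_uptime_s = uptime_s
        # Up-time only grows while a machine stays up, so a smaller value
        # than last time means the counter was reset by a reboot.
        return previous is not None and uptime_s < previous


sensor = UptimeSensor()
for reported in (86400, 87000, 555, 1155):  # seconds of up-time at each heartbeat
    print(reported, sensor.check(reported))
```

Because the comparison uses the server's own counter, the length of the reboot does not matter: even a restart that completes between two ICMP probes shows up at the next heartbeat.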

This case shows that out-of-the-box solutions do not always work, and that IT teams have to keep modifying and improving their operations and techniques to address day-to-day business needs.

   
