Simple Website Monitoring

How do you easily track the availability of a handful of websites? I have a few personal resource websites hosted in my basement, and a few others out on the web. My ADSL modem had started to go offline intermittently. Since I was not monitoring it, I was unsure how often this was happening. Once every few weeks in the winter could just be due to local power failures in the neighbourhood. Or maybe the modem was really dying a slow death? What was the uptime of my cloud-based sites? I was embarrassed that I could not quickly answer these questions.

My problem was simple. I did not need extravagant high powered monitoring of intricate internal details. Just a simple periodic health check with failure notification would be sufficient. My goal was twofold:

  • learn how often the failures occurred
  • receive proactive notification of a failure so that I was not surprised by an outage

I have prior DevOps experience and am familiar with the Zabbix monitoring tool. It is excellent, sophisticated and highly rated, but overkill for my needs.

My needs were simple:

  • periodically perform an HTTP GET of a website home page
  • send me an email on change of status
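
Those two needs are small enough to sketch in a few lines of Python. This is only an illustration, not what any of the services actually run. The names is_up and StatusWatcher are made up for the example, the 30 second default timeout is an assumption, and the notify callable stands in for whatever sends the email.

```python
import urllib.request
import urllib.error


def is_up(url, timeout=30.0, opener=urllib.request.urlopen):
    """Return True if an HTTP GET of url answers with a 2xx or 3xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False


class StatusWatcher:
    """Remembers the last known status per site and reports only changes."""

    def __init__(self, notify):
        self.notify = notify  # hypothetical callable(message), e.g. an email sender
        self.last = {}        # url -> True (up) or False (down)

    def record(self, url, up):
        previous = self.last.get(url)
        self.last[url] = up
        # The first observation only sets a baseline; later changes notify.
        if previous is not None and previous != up:
            self.notify(f"{url} is now {'UP' if up else 'DOWN'}")
```

Run on a timer, each pass would call watcher.record(url, is_up(url)) for every site, so an email goes out only when a site changes state rather than on every failed check.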

I reasoned that there must be some low cost (free!) SaaS solutions available.

So I did some searching and quickly found a long list of free website monitoring services. It should have been simple to find one that met my needs. In reality it was quite time consuming and frustrating.

All of the services I found did an excellent job of allowing you to sign up and create an account. They mostly made it easy to create a "monitor" or "test". They did a poor job of clearly communicating what features their service provided. And some were just useless, as the test was so simplistic that it was guaranteed to generate false positives. False positives would create more work for me, not less.

My Zabbix experience had quickly taught me that a simple "test & alert" scheme was insufficient for the real world. Failures of the actual test mechanism occur. The monitoring system needs to be resilient to transient network conditions. It needs to be able to reliably distinguish between a failure of the test mechanism itself and a genuine failure of the site being tested.
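
One way to picture that distinction is a control probe: when the target fails, the monitor also checks a site it believes to be healthy before blaming the target. The sketch below is my own illustration of the idea, not how Zabbix or any particular service does it. The names classify and Verdict are hypothetical, and the probes are assumed to be simple functions that return True on success.

```python
from enum import Enum


class Verdict(Enum):
    UP = "up"
    SITE_DOWN = "site down"
    MONITOR_SUSPECT = "monitor suspect"


def classify(probe_target, probe_control):
    """Classify one check using a control probe against a known-good site.

    Both arguments are zero-argument callables returning True on success.
    """
    if probe_target():
        return Verdict.UP
    # The target failed. Only blame the site if the monitor can still
    # reach something else from the same network path.
    if probe_control():
        return Verdict.SITE_DOWN
    return Verdict.MONITOR_SUSPECT
```

A MONITOR_SUSPECT result would be logged rather than emailed, since it says more about the monitor's own connection than about the website.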

The tests also needed to be useful. A single hourly GET with a 10 second timeout would likely report false positives because the timeout is so short, and would miss short transient outages because the tests are so far apart.

What I wanted was:

  • a test at least once every 5 minutes
  • ability to set a long response timeout, as I was not interested in performance
  • a smart failure detection mechanism that would include some sort of fast retry
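
The fast retry in the last item could look something like the following sketch. It is an assumption about how such a mechanism might work, not a description of any service I tested. The name confirmed_down, the three retries and the 10 second delay are all illustrative choices.

```python
import time


def confirmed_down(probe, retries=3, retry_delay=10.0, sleep=time.sleep):
    """Return True only if the first probe and every fast retry all fail.

    probe is a zero-argument callable returning True when the site responds.
    """
    if probe():
        return False
    for _ in range(retries):
        sleep(retry_delay)  # short pause, much shorter than the normal interval
        if probe():
            return False    # transient blip: the site answered on a retry
    return True
```

With a 5 minute schedule, a single dropped request is rechecked within seconds and forgotten if the site answers, so an alert is raised only for a failure that persists through all the retries.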

So I signed up for a bunch of the free services and let them all run in parallel so I could compare their behaviour. The next post will cover the services and the results.