Confessions of a Monitoring Junkie

You may be asking yourself, “Self, why should I cough up cash for SQL Monitor or spend time (or pay someone to spend time) setting up Nagios?”

The answer, of course, is:  Do you want to know something is wrong now, or when a customer calls you and says their app is broken and they can’t do business?

(By the way, I don’t care if you’re a nonprofit.  They’re still a customer who can’t do business.  The students can’t print their term papers, the web site can’t take donations, they can’t do business.  As you were.)

You’ll also learn unexpected things, like that the settings you made were changed by the last patch you applied, or that that relatively small customer is beating on your server when you’re not looking.

I’m going to briefly discuss the two products I’ve used that I like, and will politely ignore the one that I was less taken with. (To be fair, I think their implementation is less mature and they’ll improve.)

If you don’t have Microsoft SQL Server, you’re going to want to skip down to Nagios.  If you do, read on, McDuff.

Thus far, I’ve found Red Gate’s SQL Monitor to be really nice right out of the box.  You give it credentials, and it starts nomming up your disk space with shiny happy data.  The very first day I had it installed, our sysadmin came over and asked me, “Do you know any reason why this particular server would have high CPU and memory use?” and, without even looking at the server, I was able to tell him that a very large customer was running some really intensive reporting.  Of course, I knew this because they were generating a large number of long-running query alerts.  Oops.  (Those queries have since been optimized.)

Which leads, of course, to a few caveats.  Namely, if you haven’t done any monitoring before, you might receive a large number of alerts.  In fact, you might actually receive an email from SQL Monitor saying, “I’ve sent you 1000 emails today.  You cannot have more emails.  You have an email problem.”  (This isn’t the actual message, but maybe it should be.)  You’ll probably want to up the alert thresholds and slowly work them back down.  You also might want to set up Outlook filters and send the emails only to yourself, not your boss, until things settle down.

The other caveat is:  You’re aware that when an application creates a SQL database, the defaults might not be optimal for your particular situation, right?  SQL Monitor starts off with a 1MB database that autogrows by 1MB.  If you have terabytes of data, you might see a bit of disk I/O from that.  Ahem.  Change those numbers to sane numbers.  (You don’t know what size to make the database?  Guess.  Your worst guess is better than 1MB.  Make the autogrow a large number, too.  And you do know about Instant File Initialization, don’t you?)

And then, there’s Nagios.

I love Nagios because it’s crazy extensible.  There are a lot of great user-contributed scripts over at Nagios Exchange in addition to the built-in stuff, and, depending on who you are, someone in your organization may already be running it and you might be able to get in on the action for the low price of beer, cookies, or pizza.  If none of the scripts at Nagios Exchange suit your needs, you can write one in any language that will return a valid return code to NRPE and/or NSClient++, including Bash, Python, Perl, VBScript, batch, PowerShell, etc.  It can also read SNMP data.
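
To give a feel for how little a homegrown check needs, here’s a rough Python sketch of a disk check.  The one part that comes from the Nagios plugin conventions is the exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) plus a one-line status message.  Everything else (the function names, the 80/90 thresholds, the wording of the message) is made up for illustration, so treat it as a starting point rather than a finished plugin.

```python
"""Sketch of a Nagios-style disk check (illustrative, not an official plugin)."""
import shutil

# Exit codes from the Nagios plugin conventions.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to an (exit_code, status_line) pair."""
    if percent_used >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif percent_used >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after the "|" is performance data that graphing add-ons can pick up.
    perfdata = f"used_pct={percent_used:.1f}%;{warn};{crit}"
    return code, f"DISK {label} - {percent_used:.1f}% used | {perfdata}"


def check_disk(path, warn=80.0, crit=90.0):
    """Check one path; anything we cannot measure is reported as UNKNOWN."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    return evaluate(100.0 * usage.used / usage.total, warn, crit)


def main(argv):
    """Print the status line and return the exit code for the given path."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code

# When installed as a plugin, end the script with:
#     import sys; sys.exit(main(sys.argv))
```

The exit code is what Nagios acts on; the printed line is what shows up in the alert, so make it something you’d want to read at 3 AM.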

So, Nagios caveats:  Out of the box, it pretty much tells you if your disk space, CPU, and memory are okay and whether a service is running (all of which is awesome, but it’s maybe not the entire universe of things you want to know).  There are supplied check scripts for things like SMTP, FTP, HTTP, etc., and user-supplied scripts for all kinds of wild and crazy things (Available security patches!  Oracle!  Weather alerts!  Cat water dishes!).  If you want to know about the battery life of the UPS that doesn’t have a generator, you’re probably going to have to dig around under the hood, so make sure you (or the person running the system) enjoy that sort of thing.  Because, in my not-particularly-humble opinion:

  • If you’re just checking disk space, memory, and CPU, you’re not harnessing Nagios’ true power.  Find new and exciting things to check!  When something breaks, figure out how to monitor it!
  • If you’re sending the Exchange Administrator an email when the Web Server goes down, and the same guy/gal doesn’t manage them both, you’re doing it wrong.  Make contact groups your friend.
  • If you have 100 web servers and 100 service entries for HTTP, you’re doing it wrong.  Make host and service groups your friends.  (Of course, groups do sometimes assume standards that may not exist at your workplace, but group as best as you can.)
  • Create separate contacts for email-worthy and pager-worthy alerts.  Do not page someone in the night about their scratch disk being 80% full.  That’s bad karma.
  • You want Nagiosgraph.  Trust me.

And now, general monitoring caveats:

Some people just don’t get monitoring.  They see it as “being bugged.”  I see it as “finding out things are about to break before they actually break.”

A paraphrased but otherwise true conversation, which I will call “F my F: Drive”:

Guy:  Can you turn off email alerts for the F: drive?  I don’t care if it fills up.
Me: Even during business hours?
Guy: Yeah. Don’t care.
Me: What’s on it?
Guy: Our logging app writes logs there.
Me: Do you care if the logging app is running?
Guy: Oh, yeah! Make sure I get an alert for that!
Me: Will it stop running if F: fills up?
Guy: Yes…
Me: Then you care if F: fills up.
Guy: No, I don’t.  I only care if our logging app is running.

If you’re the one setting up Nagios for someone else, just smile and nod at these people and let their F: drive fill up.  But you?  You care if your F: drive fills up.

As for which one you want, well, they’re not really equivalent products.  SQL Monitor is only for SQL servers, and is an analysis and performance tuning tool.  It’s really nice for SQL Servers, and it would be really hard to duplicate all of the functionality in Nagios.  On the other hand, if you don’t have the money for SQL Monitor, well, I think setting up Nagios instead is worth it.  Free is everyone’s favorite price, and I want to run both because, you know, crazy extensible.  Besides, Nagios’ main purpose in life is to tell you that something is broken or about to break, which I want to know!

As such, Nagios is smarter than SQL Monitor about alerts.  In Nagios, I can define that my connection to the two instances on SQL Server 1 is dependent upon my ability to get through two routers, and if the network faceplants Nagios will tell me that the server is unreachable, not that it’s down and every service on it is down.  During the last application of Microsoft patches, SQL Monitor told me that the host was down, and the instances were all down, and that all the databases were unavailable, and that I cannot have more emails because I have an email problem.  And then I investigated “scheduled downtime.”
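
If the “unreachable versus down” distinction is fuzzy, here’s a toy Python sketch of the idea.  It is not how Nagios is actually implemented, and the host names and the classify function are invented for the example; it only assumes that each host lists the hosts it sits behind, and that a failed host whose upstream hosts have all failed too gets blamed on the network instead.

```python
def classify(host, parents, is_up):
    """Return "UP", "DOWN", or "UNREACHABLE" for one host.

    parents maps a host to the hosts it sits behind (hypothetical names);
    is_up maps every host to the result of its own check.
    """
    if is_up[host]:
        return "UP"
    upstream = parents.get(host, [])
    if not upstream:
        # Nothing sits in front of this host, so the failure is its own.
        return "DOWN"
    if all(classify(p, parents, is_up) != "UP" for p in upstream):
        # Every path to the host is broken: blame the network, not the host.
        return "UNREACHABLE"
    return "DOWN"


# SQL Server 1 sits behind two routers in a row.
parents = {"sql1": ["router-b"], "router-b": ["router-a"]}

network_faceplant = {"router-a": False, "router-b": False, "sql1": False}
server_problem = {"router-a": True, "router-b": True, "sql1": False}

for name, checks in (("faceplant", network_faceplant), ("server", server_problem)):
    print(name, {h: classify(h, parents, checks) for h in checks})
```

In the first case only router-a comes back as DOWN and everything behind it is UNREACHABLE, which is one problem to chase instead of three.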
