Tag Archives: nagios

Nagios Event Handlers on Windows

Nagios event handlers are WHERE IT’S AT, BABY, YEAH!  There are some services that I can just automagically restart without any problems.  (WSUS, SQL Agent, etc.) This way, instead of notifying me, Nagios can just fix the problem for me and We Need Never Know.

These instructions assume I’m running NSClient++.

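The script (scripts\runcmd.bat) is

@echo off
rem Start the Windows service named in the first argument.
net start %1
rem Exit 0 (OK) regardless of what net start returned.
@exit 0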

(This is kept intentionally minimal so it’ll be reusable.)  I’m referring to this in nsclient.ini, under the “; A list of scripts available to run from the CheckExternalScripts module. Syntax is: <command>=<script> <arguments>” header.

restartwsus=scripts\runcmd.bat wsusservice

On the Nagios server, I’ve defined the command in commands.cfg as:

define command{
 command_name restartwsus
 command_line /usr/lib/nagios/plugins/check_nrpe -H '$HOSTADDRESS$' -c restartwsus
}

and in the service definition as:

define service{
        use                     generic-service
        host_name               wsusserver
        service_description     WSUS
        contacts                me
        notification_options    w,c,r
        notification_period     24x7
        notification_interval   0
        check_command           check_nt!SERVICESTATE!-d SHOWALL -l WsusService
        event_handler           restartwsus
}

It looks like this is copy and paste-able.
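
One thing the batch file above doesn't do is look at why it was called. Nagios runs an event handler on service state changes, recoveries included, and passes along whatever macros you put on the command line. Here's a minimal Python sketch of the decision logic, assuming the handler is handed $SERVICESTATE$, $SERVICESTATETYPE$, and $SERVICEATTEMPT$ as arguments. The function name and the "third soft attempt" cutoff are illustrative assumptions, not part of the setup above.

```python
#!/usr/bin/env python3
"""Sketch: decide whether an event handler should attempt a restart."""
import sys


def should_restart(state, state_type, attempt, restart_on_attempt=3):
    """Return True only for failures that look real.

    state       -- value of $SERVICESTATE$ (OK, WARNING, CRITICAL, UNKNOWN)
    state_type  -- value of $SERVICESTATETYPE$ (SOFT or HARD)
    attempt     -- value of $SERVICEATTEMPT$ as an integer
    restart_on_attempt is a hypothetical cutoff chosen for this example.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    # Give the service a chance to recover on its own before stepping in.
    return state_type == "SOFT" and attempt == restart_on_attempt


def main(argv):
    if len(argv) != 4:
        print("usage: handler.py STATE STATETYPE ATTEMPT")
        return 1
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    if should_restart(state, state_type, attempt):
        # The real restart (for example the check_nrpe call) would go here.
        print("would restart the service")
    else:
        print("nothing to do")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

For a service where an unconditional "net start" is harmless, the simpler batch file is fine; a filter like this matters more when the handler does something you don't want repeated on every state change.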

Filed under monitoring

Thanksgiving Gluttony

Yum, Nagios gluttony!

I’m donating Nagios monitoring to a couple of nonprofits, and this brings up how Nagios configurations grow.  In short, you learn over time what you need to keep an eye on.

For example:  On one nonprofit, someone forgot to renew the domain (oops!).  It just so happens that there’s a plugin for that.  Godaddy outage hoses DNS?  Add a check for that.  SSL cert expires (oops!)?  Add a check for that.  The web site returns 200 OK (thereby showing up as okay in Nagios) but no content appears?  Add a check for that.

And then apply all those checks to your other hosts.  So the same thing doesn’t happen to them.

And this is how you end up with so many checks.

define command {
command_name check_content
command_line $USER1$/check_http -r "</body>" -H $HOSTADDRESS$
}

define command {
command_name DNS_resolving
command_line $USER1$/check_dns -H $HOSTADDRESS$
}

define command {
command_name check_domain
command_line $USER1$/check_domain -d $HOSTADDRESS$
}

define command {
command_name check_cert
command_line $USER1$/check_http --ssl -C 14 -H '$HOSTADDRESS$'
}
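
The "200 OK but no content" case is worth spelling out, since it's the one that fools a plain status check. What follows is a rough Python sketch of the idea behind the content check, not what check_http does internally: a page only counts as healthy if the status is fine and the body matches a pattern. The function name and messages are made up for illustration.

```python
import re
from typing import Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_response(status: int, body: str, pattern: str = r"</body>") -> Tuple[int, str]:
    """Map an HTTP status and body to a Nagios state and message.

    A 200 with an empty or truncated page is treated as CRITICAL,
    which is the failure a status-only check would miss.
    """
    if status >= 400:
        return CRITICAL, "HTTP CRITICAL - status %d" % status
    if not re.search(pattern, body):
        return CRITICAL, "HTTP CRITICAL - pattern not found in body"
    return OK, "HTTP OK - status %d, pattern found" % status


if __name__ == "__main__":
    # A page that answers 200 but renders nothing.
    print(classify_response(200, ""))
```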

Yes, I added checks on Thanksgiving.  *facepalm*

Filed under monitoring

Clearly, you’re doing it wrong.

So, I have this friend.  (No, really, it’s my friend, it’s not me, I set up my own Nagios server.)  She’s a DBA with no responsibility for anything outside of a bunch of SQL Servers. Nagios wakes her up in the middle of the night if the web server goes down.

If you page people in the middle of the night over things that aren’t their responsibility, you’re just training them to ignore their pagers.  I once worked with someone who was, according to legend, the only person at [name of company redacted] ever to successfully flush a pager.  (And they didn’t even have Nagios at that time!)

I feel the same way about people who receive daily “CRITICAL!!!” emails that their servers’ drives are 98% full.   Nagios is supposed to be informing you about things that are unusual.  If your SQL Server typically uses 96% of its RAM (mine do), don’t turn off warnings and only receive notifications for critical, and don’t receive daily emails saying that the servers are using too much RAM.  Up the thresholds to sane numbers that indicate an unusual condition.  What do you think happens if, in the slew of daily emails about “CRITICAL!!!” there’s a disk that usually isn’t 100% full, or a service down, or a memory leak?  No, no.  You don’t want your slew of “Situation Normal:  All Frelled Up” emails, you want to know when something unusual is occurring.

If you’re like me, you resist this. “Dammit, my C: drive should be at least 20% free!”  There comes a time when you have to accept that a number is not an attainable number and work from there.

Filed under monitoring

Your Servers’ Baby Monitor

Do you know about Write or Die?  It’s described as “putting the prod into productivity” and is for procrastinating writers to force themselves to write.  (Writers procrastinate.  It’s a thing.  You can spend hours surfing the web for baby name pages to come up with the perfect name for your walk-on character.  Or you can name him John Doe and fix it in revisions.  The latter is probably more productive.)  You enter a word goal and a time limit and click “Write,” and any time you stop writing the screen turns red.  If you stop long enough, an annoying sound will play.  You might get RickRolled, or have painfully bad violin practice, but I’ve set my copy of the desktop edition to exclusively play the crying babies sound.

This is the perfect metaphor for my monitoring philosophy.

For this reason, it makes me a little insane that I have 392 new email messages from SQL Monitor today about fragmented indexes.  (My phone said 687.)  That’s a whole lot of crying babies. Apparently, I have some work to do.

I’m much happier when my baby monitor is silent and the monitoring page shows a lot of happy, peaceful servers. You know, when I come in in the morning and look and they’re all cheerfully perking away doing their thing.  I used to keep my Nagios screen 100% green, and it made me a little wacky when we merged Nagios servers and I added the servers of the guys who actually like getting their daily, “Yes, your hard drive is still 100% full!” emails.  *twitch*  Ah well, it makes them happy.

There’s a Nagios plugin for Firefox that plays a sound when you have a problem.  I’d like to get that to play the crying babies sound.  The advantage to that would be that if anything of mine ever broke, not only would my sensibilities be offended, but if I didn’t fix it promptly my coworkers would kill me and no one would ever find my body!  Now there’s putting the prod into productivity!

And now?  Apparently, I need to go into the nursery and shut up, er, calm some babies.  (Not about the indexes, about Something Else.)

Filed under monitoring

Confessions of a Monitoring Junkie

You may be asking yourself, “Self, why should I cough up cash for SQL Monitor or spend time (or pay someone to spend time) setting up Nagios?”

The answer, of course, is:  Do you want to know something is wrong now, or when a customer calls you and says their app is broken and they can’t do business?

(By the way, I don’t care if you’re a nonprofit.  They’re still a customer who can’t do business.  The students can’t print their term papers, the web site can’t take donations, they can’t do business.  As you were.)

You’ll also learn unexpected things, like that the last patch changed settings you had made, or that one relatively small customer is beating on your server when you’re not looking.

I’m going to briefly discuss the two products I’ve used that I like, and will politely ignore the one that I was less taken with. (To be fair, I think their implementation is less mature and they’ll improve.)

If you don’t have Microsoft SQL Server, you’re going to want to skip down to Nagios.  If you do, read on, McDuff.

Thus far, I’ve found Red Gate’s SQL Monitor to be really nice right out of the box.  You give it credentials, and it starts nomming up your disk space with shiny happy data.  The very first day I had it installed, our sysadmin came over and asked me, “Do you know any reason why this particular server would have high CPU and memory use?”  Without even looking at the server, I was able to tell him that a very large customer was running some really intensive reporting.  Of course, I knew this because they were generating a large number of long-running query alerts.  Oops.  (Those queries have since been optimized.)

Which leads, of course, to a few caveats.  Namely, that if you haven’t done any monitoring before you might receive a large number of alerts.  In fact, you might actually receive an email from SQL Monitor saying, “I’ve sent you 1000 emails today.  You cannot have more emails.  You have an email problem.”  (This isn’t the actual message, but maybe it should be.)  You’ll probably want to up the alert thresholds and slowly work them back down.  You also might want to set up Outlook filters and only send the emails to you and not your boss until things settle down.

The other caveat is:  You’re aware that when an application creates a SQL database, the defaults might not be optimal for your particular situation, right?  SQL Monitor starts off with a 1MB database that autogrows by 1MB.  If you have terabytes of data, you might see a bit of disk I/O from that.  Ahem.  Change those numbers to sane numbers.  (You don’t know what size to make the database?  Guess.  Your worst guess is better than 1MB.  Make the autogrow a large number, too.  And you do know about Instant File Initialization, don’t you?)

And then, there’s Nagios.

I love Nagios because it’s crazy extensible.  There are a lot of great user-contributed scripts over at Nagios Exchange in addition to the built-in stuff, and depending on who you are, someone in your organization may already be running it and you might be able to get in on the action for the low price of beer, cookies, or pizza.  If none of the scripts at Nagios Exchange suit your needs, you can write one in any language that will return a valid return code to NRPE and/or NSClient++, including bash, Python, Perl, VBScript, batch, PowerShell, etc.  It can also read SNMP data.
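
For the curious, "a valid return code" means the plugin prints one line of status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). Here is a minimal Python sketch of a threshold-style plugin under those conventions. The script name, the "percent used" framing, and the function names are assumptions for illustration, and the part after the pipe follows the usual performance-data layout of label=value;warn;crit.

```python
#!/usr/bin/env python3
"""Sketch of a threshold plugin: check_pct.py VALUE WARN CRIT (hypothetical name)."""
import sys

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Compare a measured value against warning and critical thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(name, value, warn, crit):
    """Build the (exit code, status line) pair, with performance data after the pipe."""
    code = evaluate(value, warn, crit)
    line = "%s %s - %.0f%% used | used=%.0f%%;%.0f;%.0f" % (
        name, LABELS[code], value, value, warn, crit)
    return code, line


def main(argv):
    try:
        # Raises ValueError on missing or non-numeric arguments.
        value, warn, crit = (float(a) for a in argv[1:4])
    except ValueError:
        print("UNKNOWN - usage: check_pct.py VALUE WARN CRIT")
        return UNKNOWN
    code, line = format_output("PCT", value, warn, crit)
    print(line)
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

In a real plugin the value would come from whatever you are measuring rather than the command line; taking it as an argument just keeps the sketch self-contained.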

So, Nagios caveats:  Out of the box, it pretty much tells you if your disk space, CPU, and Memory are okay and whether a service is running (all of which is awesome, but it’s maybe not the entire universe of things you want to know).  There are supplied check scripts for things like SMTP, FTP, HTTP, etc., and user-supplied scripts for all kinds of wild and crazy things (Available security patches!  Oracle!  Weather alerts!  Cat water dishes!).  If you want to know about the battery life of the UPS that doesn’t have a generator, you’re probably going to have to dig around under the hood, so make sure you (or the person running the system) enjoy that sort of thing.  Because, in my not-particularly-humble opinion:

  • If you’re just checking disk space, memory, and CPU, you’re not harnessing Nagios’ true power.  Find new and exciting things to check!  When something breaks, figure out how to monitor it!
  • If you’re sending the Exchange Administrator an email when the Web Server goes down, and the same guy/gal doesn’t manage them both, you’re doing it wrong.  Make contact groups your friend.
  • If you have 100 web servers and 100 service entries for HTTP,  you’re doing it wrong.  Make host and service groups your friends.  (Of course, groups do sometimes assume standards that may not exist at your workplace, but group as best as you can.)
  • Create separate contacts for email-worthy and pager-worthy alerts.  Do not page someone in the night about their scratch disk being 80% full.  That’s bad karma.
  • You want Nagiosgraph.  Trust me.

And now, general monitoring caveats:

Some people just don’t get monitoring.  They see it as “being bugged.”  I see it as “finding out things are about to break before they actually break.”

A paraphrased but otherwise true conversation, which I will call “F my F: Drive”:

Guy:  Can you turn off email alerts for the F: drive?  I don’t care if it fills up.
Me: Even during business hours?
Guy: Yeah. Don’t care.
Me: What’s on it?
Guy: Our logging app writes logs there.
Me: Do you care if the logging app is running?
Guy: Oh, yeah! Make sure I get an alert for that!
Me: Will it stop running if F: fills up?
Guy: Yes…
Me: Then you care if F: fills up.
Guy: No, I don’t.  I only care if our logging app is running.

If you’re the one setting up Nagios for someone else, just smile and nod at these people and let their F: drive fill up.  But you?  You care if your F: drive fills up.

As for which one you want, well, they’re not really equivalent products.  SQL Monitor is only for SQL servers, and is an analysis and performance tuning tool.  It’s really nice for SQL Servers, and it would be really hard to duplicate all of the functionality in Nagios.  On the other hand, if you don’t have the money for SQL Monitor, well, I think setting up Nagios instead is worth it.  Free is everyone’s favorite price, and I want to run both because, you know, crazy extensible.  Besides, Nagios’ main purpose in life is to tell you that something is broken or about to break, which I want to know!

As such, Nagios is smarter than SQL Monitor about alerts.  In Nagios, I can define that my connection to the two instances on SQL Server 1 is dependent upon my ability to get through two routers, and if the network faceplants Nagios will tell me that the server is unreachable, not that it’s down and every service on it is down.  During the last application of Microsoft patches, SQL Monitor told me that the host was down, and the instances were all down, and that all the databases were unavailable, and that I cannot have more emails because I have an email problem.  And then I investigated “scheduled downtime.”

Filed under monitoring

Patch Tuesday

If you’re like me, you have Patch Tuesday on your calendar.  In fact, if you’re like me, you’re running Nagios at home, with your Linux desktop checking itself and your Windows desktop for available patches!

No?  It’s just me?

(It’s not just me.  I’m sadly trumped by the man monitoring his cat’s water dish with Nagios.)

Filed under geekiness

Nagios Plugin – SQL Job Status

I checked out Nagios Exchange, and didn’t see anything that checked the status of a job and used Windows/AD credentials/trusted connection.  So I wrote this.

It’s intended to run as an NRPE script, and doesn’t require anything that doesn’t come with Windows and SQL Server.  There’s a stored procedure and a batch file, both of which can be modified to suit your purposes.

You also need to enable NRPE and external scripts in nsc.ini, and define the checks in the [External Scripts] section.  Run the script with the job name as an argument (“scripts\last_sql_job_run_status.cmd some_job_name”).  An optional second argument is an integer giving how many days can pass since the job last ran before Something Is Wrong [TM] (“scripts\last_sql_job_run_status.cmd some_job_name 3”).  If no time period is specified, it assumes a week.  The plugin returns critical if the job failed and warning if it succeeded but is late.

For trusted connections, you can run NSClient++ as a service account and give that account read access to sysjobs and sysjobactivity in msdb and execute on the procedure itself, although it should work out of the box for a default NSClient++ install checking localhost.  You can also use SQL Server authentication if you edit the line invoking sqlcmd accordingly.
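
The plugin itself is a stored procedure plus a batch file, but the rules it applies are simple enough to restate. A rough Python sketch of that decision logic, with hypothetical names throughout; treating a job with no run history as UNKNOWN is an assumption of this sketch, not something the plugin is documented to do.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def job_status(succeeded: bool,
               last_run: Optional[datetime],
               now: datetime,
               max_age_days: int = 7) -> Tuple[int, str]:
    """Apply the rules: failed means critical, succeeded but late means warning."""
    if last_run is None:
        # Assumption for this sketch: no history at all is reported as UNKNOWN.
        return UNKNOWN, "UNKNOWN - no run history found"
    if not succeeded:
        return CRITICAL, "CRITICAL - last run failed"
    if now - last_run > timedelta(days=max_age_days):
        return WARNING, "WARNING - last run succeeded but is more than %d days old" % max_age_days
    return OK, "OK - last run succeeded"


if __name__ == "__main__":
    now = datetime(2012, 1, 10, 8, 0)
    # Succeeded four days ago, checked with a three-day limit.
    print(job_status(True, datetime(2012, 1, 6, 8, 0), now, max_age_days=3))
```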


Filed under monitoring, scripting