DBAs and Windows Event Logs

While watching a military documentary with my son the other night, I was struck by just how crazy it is that snipers need to account for wind. I mean, yeah: I get it – they’re shooting at targets that are ridiculously far away and I’ve known for a long time that they have to account for wind. But, on the other hand, the notion that a slug of metal hurtling along at roughly 3,300 ft/sec (or roughly 2,250 mph) can be ‘pushed’ off-target by a 3 or 5 mph breeze just seems, well, a bit crazy in some ways.

Then again, if a breeze can have such a tangible impact on how hard it can be to place ordnance on-target, then it probably stands to reason that the operating environment of a SQL Server can play some role in how well it behaves under load or under pressure.

Checking on the Windows System Event Logs

In some organizations, DBAs simply won’t have access to the Windows Event Logs – for auditing purposes (i.e., to prevent DBAs from potentially ‘covering their tracks’ and so on). And in other organizations, Sys Admins might keep track of Windows Event Logs, and responsibilities might be ‘sequestered’ enough that DBAs aren’t encouraged to look at them. But, on plenty of SMB (small to medium business) servers, IT folks or Sys Admins might not ever really check into these logs, and that then means that DBAs should be looking at them.

More to the point: Regardless of whose responsibility it is (since that can vary from organization to organization), SOMEONE needs to be reviewing event log entries on a regular basis. Because if no one is checking these event logs on a regular or semi-regular basis, it’s all too easy for hardware problems, network/security configuration issues, and other problems to ‘creep’ into production and have what I’d characterize as a much greater impact on production than wind has on bullets.

Things to Check For

Happily, checking Windows Event Logs is pretty easy – and only takes a few minutes per week per server in situations where DBAs need to check the logs – especially if DBAs are able to ‘clear’ the logs once they’ve reviewed them. (Which just means that if you can clear the logs, then the next time you review them you only have to review what’s in the log, instead of trying to work out where your previous review left off.)
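In situations where clearing the logs isn’t an option (say, because they need to be kept around for auditing), one alternative is to limit each review to the period since the previous one. What follows is just a minimal Python sketch of that idea, assuming you keep track of when you last reviewed a given log yourself. The function name is hypothetical, and the query it builds uses the same timediff form that Event Viewer generates for its own ‘Last 7 days’ style filters, so the result could be pasted into the XML tab of the filter dialog:

```python
from datetime import datetime, timezone


def since_last_review_query(last_review, now=None):
    """Build an Event Log XPath query matching events newer than last_review.

    Both arguments are timezone-aware datetimes; 'now' defaults to the
    current UTC time. The query compares each event's age in milliseconds
    against the time elapsed since the previous review.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if last_review > now:
        raise ValueError("last_review is in the future")
    age_ms = int((now - last_review).total_seconds() * 1000)
    return f"*[System[TimeCreated[timediff(@SystemTime) <= {age_ms}]]]"


if __name__ == "__main__":
    # Hypothetical example: the previous review happened exactly 7 days ago.
    now = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
    last = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    print(since_last_review_query(last, now))
```

The dates in the example are made up; the only thing that matters is the gap between them.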

Whenever I’m checking event logs, I keep the review pretty simple – thanks to the ability to easily set filters.

To check Windows Event Logs, just open Server Manager and then navigate into the Diagnostics > Event Viewer > Windows Logs node. Then, to filter, click on the ‘Filter Current Log…’ option to the right of the screen.

When making my checks of event logs I start by checking the Application Event Log for anything that looks like it might potentially cause problems or be problematic – and I do so by filtering against only Critical and Error events – as way too many applications ‘spam’ pretty useless information in the form of ‘warnings’ to the Application log.

Then, when checking the Security log, unless I’m looking for something in particular, I just look for Audit Failures by using the option from the Keywords dropdown. And, when evaluating failures, I typically just look for any large block or series of failures back to back – as such a block of entries might be part of the signature of someone trying to brute-force their way on to a box or compromise a particular resource.
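If you’ve exported the Audit Failure entries and want something more systematic than eyeballing them, the ‘large block of failures back to back’ check can be sketched in a few lines of Python. This is only an illustration: it assumes you already have the failure timestamps in a list, the function name is hypothetical, and the 5-minute gap and 10-entry threshold are arbitrary starting points rather than recommendations.

```python
from datetime import datetime, timedelta


def find_failure_bursts(timestamps, max_gap=timedelta(minutes=5), threshold=10):
    """Return (start, end, count) tuples for runs of closely spaced failures.

    A run keeps growing while each failure occurs no more than 'max_gap'
    after the previous one. Runs with fewer than 'threshold' entries are
    treated as background noise and ignored.
    """
    bursts = []
    run = []
    for ts in sorted(timestamps):
        # A gap larger than max_gap closes the current run.
        if run and ts - run[-1] > max_gap:
            if len(run) >= threshold:
                bursts.append((run[0], run[-1], len(run)))
            run = []
        run.append(ts)
    # Do not forget the run that is still open at the end of the data.
    if len(run) >= threshold:
        bursts.append((run[0], run[-1], len(run)))
    return bursts


if __name__ == "__main__":
    # Made-up data: 12 failures ten seconds apart, then two isolated ones.
    start = datetime(2024, 1, 1, 3, 0, 0)
    failures = [start + timedelta(seconds=10 * i) for i in range(12)]
    failures += [start + timedelta(hours=4), start + timedelta(hours=9)]
    for first, last, count in find_failure_bursts(failures):
        print(f"{count} failures between {first} and {last}")
```

A handful of scattered failures (someone fat-fingering a password) won’t trip this, while a tight series of them will, which is roughly the distinction being made above.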

Then, for the System Event Log, I always make sure to check or review Critical, Error, and Warning messages – simply because imminent hard-drive failures and/or other kinds of warnings and alerts that show up in the System Event Log might warrant further attention or be a decent indicator of potential problems occurring on your box – which might have an impact on how SQL Server operates.
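For anyone who would rather script the level filters described for the Application and System logs than click through the filter dialog each time, here is a rough Python sketch. It builds the same kind of XPath query that Event Viewer shows on the XML tab of its filter dialog (where Critical, Error, and Warning are levels 1, 2, and 3), plus an argument list for the wevtutil command-line tool that ships with Windows. The function names and the default of 50 events are my own assumptions, and nothing is actually executed here.

```python
# Numeric event levels as used in Event Log XPath queries.
LEVELS = {"Critical": 1, "Error": 2, "Warning": 3, "Information": 4}


def level_query(*level_names):
    """Build an XPath query matching events at any of the named levels."""
    unknown = [name for name in level_names if name not in LEVELS]
    if not level_names or unknown:
        raise ValueError(f"expected one or more of {sorted(LEVELS)}, got {level_names}")
    clause = " or ".join(f"Level={LEVELS[name]}" for name in level_names)
    return f"*[System[({clause})]]"


def wevtutil_command(log_name, query, count=50):
    """Return a wevtutil argument list for the newest 'count' matching events.

    The list is suitable for subprocess.run on a Windows machine; it is
    only constructed here, not run.
    """
    return ["wevtutil", "qe", log_name, f"/q:{query}",
            "/f:text", f"/c:{count}", "/rd:true"]


if __name__ == "__main__":
    # Application log: Critical and Error only.
    print(wevtutil_command("Application", level_query("Critical", "Error")))
    # System log: Critical, Error, and Warning.
    print(wevtutil_command("System", level_query("Critical", "Error", "Warning")))
```

Keeping the query builder separate from the command builder means the same query string can also be pasted straight into Event Viewer.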

How to Respond to Potential Problems

Of course, if you find anything that looks problematic or worrisome, then you’ll probably need to do a bit more research. Typically I pull this off by dropping the event IDs and other key words/details into Google. And, of course, whenever I do this, I always get ready to treat any responses or ‘hits’ that I find with a great degree of skepticism. However, if you take the approach of looking for CONTEXT on the kind of error and problem you’re seeing – then usually you’ll be able to find some details or links to things that can help enlighten you as to what the error or warning is detailing, and then (once you have a better handle on the existing context) you’ll typically be able to actually make enough sense out of the error message itself to get a good feeling for whether it’s a problem or not – or whether it requires additional attention.

Stated differently: If you go looking for ‘answers’ to what to do about a particular problem or event entry, then you’re likely to find lots of bad advice and guidance from plenty of people who really didn’t know what they were doing (or who knew what they were doing but were dealing with a similar, but not identical, problem). If you instead look for information and understanding, then you’ll typically come up with much better options and insights about how to treat specific log entries as they occur. That, and remember: you’re not looking to solve every problem that ever occurred – trying that would be futile. Instead, the primary goal of evaluating your Windows Event Logs on a regular basis is to keep you apprised and aware of any long-term shifts or problems that might be cropping up – as these kinds of issues can/will manifest later on as problems that you’ll see in SQL Server. (Such as if you’ve got a hard drive that’s failing, or the box is having problems communicating with the domain controller, or there are RPC failures or DNS issues, and so on.)
