NIST CSF: Detect: Signal to Noise Ratio

The Detect phase of a NIST CSF-based Cybersecurity program is about monitoring your Protect phase actions which protect what your Identify phase identified. In practice, this means looking at lots of log entries. Every damn day.

If you look at your logs every day, you will quickly figure out that what you are seeing falls into one of only a few categories:

Notifications, e.g. "Web server shutdown at {time} {date}"
Warnings, e.g. "process NNNN is unresponsive"
Errors, e.g. "this program tried to use an object before creating that object"

Notifications are, ideally, signposts or checkpoints showing that things are probably OK. If you restart your web server at midnight every night, then seeing that in the log tells you everything is working normally; if you see a restart at any other time, you might want to make sure that it was legit.
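
As a rough sketch of that check, here is some Python that flags restarts outside the expected window. The log format (an ISO timestamp, a space, then the message), the ten-minute window after midnight, and the function name are all assumptions made for illustration, not anything a particular server produces.

```python
from datetime import datetime, time

# Assumption: restarts are expected between 00:00 and 00:10; adjust to your schedule.
WINDOW_START = time(0, 0)
WINDOW_END = time(0, 10)

def unexpected_restarts(lines, marker="Web server shutdown"):
    """Return log lines that record a restart outside the expected window.

    Assumes a hypothetical line format: '<ISO timestamp> <message>'.
    """
    flagged = []
    for line in lines:
        stamp, _, message = line.partition(" ")
        if marker not in message:
            continue  # not a restart notification
        when = datetime.fromisoformat(stamp).time()
        if not (WINDOW_START <= when <= WINDOW_END):
            flagged.append(line)
    return flagged

sample = [
    "2024-03-01T00:00:05 Web server shutdown",
    "2024-03-01T14:32:10 Web server shutdown",
    "2024-03-01T14:33:00 process 1234 is unresponsive",
]
print(unexpected_restarts(sample))
```

With the sample lines, only the mid-afternoon shutdown is reported; the midnight one is treated as the normal checkpoint.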

Warnings are, ideally, just that: putting you on alert that something has happened which might be OK by itself, but might also be the precursor to something bad. A process being unresponsive might simply be a sign of a system under heavy load, but it also might be a sign that the process has gone bonkers and needs to be killed. Or, worse, that some system service needed by that process has died or gone bonkers and your system may be about to start behaving erratically. You need to check.

Errors are, to programmers, generally in one of two categories: fatal and non-fatal. We are not a very nuanced bunch. Fatal errors in the log tell you why a process crashed or exited. Non-fatal errors tell you that something bad (or unexpected) happened and that the software was able to recover. But such self-recovery is notoriously unreliable or incomplete, so you probably want to try to avoid these errors.

In the modern age of robust environments, non-fatal errors are becoming more common, simply because errors which were fatal in the bad old days can now be recovered from. That doesn't mean that the logic errors in the code which led to those errors are OK, just that such errors don't crash software as often. When I started programming, I was working in C on a Unix system, and pointer mismanagement resulted in a "core dump": the operating system detecting a fatal error and dumping the contents of my app's memory space (my chunk of "core memory", which is now RAM) into a file so that I could grovel over that memory image with a low-level debugger. Good times. Now you get an error message, which is likely logged, and program execution goes on (until it doesn't).

A few days ago, when checking the error log on a client's server, I saw a bunch of error messages of this form: some helper app was accessing an object that had yet to be created. I check this log every day. I already have to filter out a few kinds of notification because these notifications do not reflect the server as a whole, but rather what a particular app is doing. I find my daily slog through the log boring enough without having to ignore screenfuls of errors that are specific to one particular application.

So I sent one of the client's programmers a bug report with the errors that I found so annoying. She confirmed that she had been using that application that day and that the application had returned the result she wanted, so from her perspective everything was fine. She wanted to know if I still wanted her to fix the app.

The answer is yes, of course, because filling the log with avoidable, meaningless errors has a number of effects, all of which are negative:

Unusual errors will not stand out, making the next crisis harder to handle
It will become normal for the log to be filled with errors, obscuring new problems
It will become normal for this app to throw errors, obscuring future failures of this app

In other words, when monitoring the logs, the signal to noise ratio will tilt toward noise, obscuring the signal. In this case, the signal is information about new and urgent problems on your system and the noise is everything else in the log.
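
To put a rough number on that tilt, here is a minimal Python sketch that separates entries you have already triaged as noise from everything else and reports what share of the log the noise takes up. The line format ('LEVEL app: message'), the app names, and the `noise_report` function are hypothetical, made up for this example.

```python
from collections import Counter

def noise_report(lines, known_noise):
    """Split log lines into signal and noise, and count noise by source.

    Assumes a hypothetical line format: '<LEVEL> <app>: <message>'.
    known_noise is a set of (level, app) pairs already triaged as noise.
    """
    noise_by_source = Counter()
    signal = []
    for line in lines:
        head, _, _message = line.partition(": ")
        level, _, app = head.partition(" ")
        if (level, app) in known_noise:
            noise_by_source[(level, app)] += 1
        else:
            signal.append(line)
    total = len(lines)
    noise_share = sum(noise_by_source.values()) / total if total else 0.0
    return signal, noise_by_source, noise_share

sample = [
    "ERROR helper: object used before creation",
    "ERROR helper: object used before creation",
    "ERROR helper: object used before creation",
    "WARN httpd: process 1234 is unresponsive",
]
signal, noise, share = noise_report(sample, {("ERROR", "helper")})
print(signal)
print(f"{share:.0%} of entries were known noise")
```

A filter like this only hides the noise from whoever reads the log. It does nothing about the habit of treating errors as normal, which is why getting the app fixed is still the better answer.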

This programmer's attitude is an example of just how much of cybersecurity is behavioral, and of how subtle that part of cybersecurity can be. She clearly isn't committed to cybersecurity. She doesn't care if the logs are hard to read because she doesn't read the logs. Worse, although I know her to be diligent and competent, she does not see how Detect is part of her job, which makes me certain that she doesn't see Protect as part of her job either. So her section of the rampart is unguarded. That might not ever be a problem, but in today's threat environment, I wouldn't bet on it.

We have a short video on this topic on our YouTube channel.
