Estimating where your Prometheus Blackbox TCP query-response check failed
As covered recently, the normal way to check simple services from outside in a Prometheus environment is with Prometheus Blackbox, which is somewhat complicated to understand. One of its abstractions is a prober, a generic way of checking some service using HTTP, DNS queries, a TCP connection, and so on. The TCP prober supports conducting a query-response dialog once you connect, but currently (as of Blackbox 0.28.0) it doesn't directly expose metrics that tell you where your TCP probe with a query-response set failed (and why), and sometimes you'd like to know.
A somewhat typical query-response probe looks like this:
  smtp_starttls:
    prober: tcp
    tcp:
      query_response:
        - expect: "^220"
        - send: "EHLO something\r"
        - expect: "^250-STARTTLS"
        - expect: "^250 "
        - send: "STARTTLS\r"
        - expect: "^220"
        - starttls: true
        - expect: "^220"
        - send: "QUIT\r"
To understand what metrics we can look for on failure, we need to both understand how each important option in a step can fail, and what metrics they either set on failure or create when they succeed.
- starttls will fail if it can't successfully negotiate a TLS connection with the server, possibly including if the server's TLS certificate fails to verify. It sets no metrics on failure, but on success it will set various TLS related metrics such as the probe_ssl_* family and probe_tls_version_info.
- send will fail if there is an error sending the line, such as the TCP connection closing on you. It sets no metrics on either success or failure.
- expect reads lines from the TCP connection until either a line matches your regular expression, it hits EOF, or it hits a network error. If it hits a network error, including from the other end abruptly terminating the connection in a way that raises a local error, it sets no metrics. If it hits EOF, it sets the metric probe_failed_due_to_regex to 1; if it matches a line, it sets that metric to 0.
  One important case of 'network error' is if the check you're doing times out. This is internally implemented partly by putting a (Go) deadline on the TCP connection, which will cause an error if it runs too long. Typical Blackbox module timeouts aren't very long (how long depends on both configuration settings and how frequent your checks are; they have to be shorter than the check interval).
  If you have multiple 'expect' steps and your check fails at one of them, there's (currently) no way to find out which one it failed at unless you can determine this from other metrics, for example the presence or absence of TLS metrics.
- expect_bytes fails if it doesn't immediately read those bytes from the TCP connection. If it failed because of an error or because it read fewer bytes than required (including no bytes, ie an EOF), it sets no metrics. If it read enough bytes it sets the probe_failed_due_to_bytes metric to either 0 (if they matched) or 1 (if they didn't).
In many protocols, a consequence of how expect works is that if
the server at the other end spits out some error response instead
of the response you expect, your expect will skip over it and then
keep waiting for a line that never comes. For instance, if the SMTP
server you're probing gives you an SMTP 4xx temporary failure
response in either its greeting banner or its reply to your EHLO,
your 'expect' will sit there trying to read another line that might
match its pattern, such as one starting with '220' for the banner.
Eventually either your check will time out or the SMTP server will,
and probably it will be your check (resulting in a 'network error'
that leaves no traces in metrics).
Generally this means you can only see a probe_failed_due_to_regex
of 1 in a TCP probe based module if the other end cleanly closed
the connection, so that you saw EOF. This tends to be pretty rare.
(We mostly see it for SSH probes against overloaded machines, where
we connect but then the SSH daemon immediately closes the connection
without sending the banner, giving us an EOF in our 'expect' for
the banner.)
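As a rough sketch, the clean-EOF case could be picked out with a Prometheus alerting rule along these lines. The group name, the alert name, and the job value blackbox_ssh are made up for illustration; only the metric name comes from Blackbox.

```yaml
groups:
  - name: blackbox_tcp_eof_hints   # hypothetical group name
    rules:
      # Hypothetical alert. A value of 1 means some expect step read EOF
      # before any line matched; it does not say which step that was.
      - alert: BlackboxTCPExpectSawEOF
        expr: probe_failed_due_to_regex{job="blackbox_ssh"} == 1
```

The job selector is only there to restrict the rule to the TCP-based module you care about; adjust it to whatever labels your own scrape configuration produces.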
If the probe failed because of a DNS resolution failure, I believe
that probe_ip_addr_hash will be 0 and I think probe_ip_protocol
will also be 0.
If the check involves TLS, the presence of the TLS metrics in the result means that you got a connection and got as far as starting TLS. In the example above, this would mean that you got almost all of the way to the end.
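One hedged way to split failures on that basis is a pair of rules that check whether probe_tls_version_info exists for the failed target. This is only a sketch: the job value blackbox_smtp and the alert names are assumptions, and it presumes your Blackbox metrics carry the usual job and instance labels.

```yaml
groups:
  - name: blackbox_smtp_tls_hints   # hypothetical group name
    rules:
      # Failed, but TLS metrics exist: the failure came after starttls
      # had already succeeded.
      - alert: BlackboxSMTPFailedAfterTLS
        expr: >-
          probe_success{job="blackbox_smtp"} == 0
          and on (job, instance) probe_tls_version_info
      # Failed with no TLS metrics: DNS, the connection, an earlier step,
      # or the TLS negotiation itself went wrong.
      - alert: BlackboxSMTPFailedBeforeTLS
        expr: >-
          probe_success{job="blackbox_smtp"} == 0
          unless on (job, instance) probe_tls_version_info
```

The second rule is deliberately broad, since the absence of TLS metrics can't distinguish between the earlier failure points.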
I'm not sure if there's any good way to detect that the connection
attempt failed. You might be able to reasonably guess that from an
abnormally low probe_duration_seconds value. If you know the
relevant timeout values, you can detect a probe that failed due to
timeout by looking for a suitably high probe_duration_seconds
value.
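Those two duration guesses might be written as rules like the following. Everything numeric here is an assumption: the sketch supposes an effective probe timeout of about 5 seconds and treats anything under 50 milliseconds as 'abnormally low', and the job value and alert names are invented.

```yaml
groups:
  - name: blackbox_tcp_duration_hints   # hypothetical group name
    rules:
      # Assumes an effective probe timeout of roughly 5 seconds.
      - alert: BlackboxTCPProbablyTimedOut
        expr: >-
          probe_success{job="blackbox_smtp"} == 0
          and probe_duration_seconds{job="blackbox_smtp"} > 4.9
      # A guess only: a very quick failure may mean the connection
      # attempt itself failed.
      - alert: BlackboxTCPProbablyConnectFailed
        expr: >-
          probe_success{job="blackbox_smtp"} == 0
          and probe_duration_seconds{job="blackbox_smtp"} < 0.05
```

Both thresholds would need tuning against the durations you actually observe for healthy and failed probes.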
If one of your expect steps uses the special labels action, then the presence of a
probe_expect_info metric means that the check got to that step.
If you don't have any particular information that you want to capture
from an expect line, you can use labels (once) to mark that
you've succeeded at some expect step by using a constant value
for your label.
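A minimal sketch of that marker trick, based on the example module earlier. The label name step and its value saw_starttls are invented here, and the module is cut off after the marked step.

```yaml
modules:
  smtp_starttls:
    prober: tcp
    tcp:
      query_response:
        - expect: "^220"
        - send: "EHLO something\r"
        - expect: "^250-STARTTLS"
          # Constant label value, used purely as a 'got this far' marker.
          # Only this one step carries labels.
          labels:
            - name: step
              value: "saw_starttls"
        - expect: "^250 "
        # The remaining send, expect, and starttls steps follow unchanged.
```

With something like this in place, a failed probe that still reports probe_expect_info with step="saw_starttls" got at least as far as the server's EHLO reply.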
(Hopefully all of this will improve at some point and Blackbox will
provide, for example, a metric that tells you the step number that
a query-response block failed on. See issue #1528, and also issue
#1527, where I wish for a way to make an 'expect' fail immediately
and definitively if it receives known error responses, such as an
SMTP 4xx code.)