Estimating where your Prometheus Blackbox TCP query-response check failed

utcc.utoronto.ca/~cks, cks, 2026-02-02 04:20

As covered recently, the normal way to check simple services from outside in a Prometheus environment is with Prometheus Blackbox, which is somewhat complicated to understand. One of its abstractions is a prober, a generic way of checking some service using HTTP, DNS queries, a TCP connection, and so on. The TCP prober supports conducting a query-response dialog once you connect, but currently (as of Blackbox 0.28.0) it doesn't directly expose metrics that tell you where your TCP probe with a query-response set failed (and why), and sometimes you'd like to know.

A somewhat typical query-response probe looks like this:

  smtp_starttls:
    prober: tcp
    tcp:
      query_response:
        - expect: "^220"
        - send: "EHLO something\r"
        - expect: "^250-STARTTLS"
        - expect: "^250 "
        - send: "STARTTLS\r"
        - expect: "^220"
        - starttls: true
        - expect: "^220"
        - send: "QUIT\r"

To understand what metrics we can look for on failure, we need to understand both how each important option in a step can fail and what metrics it either sets on failure or creates on success.

  • starttls will fail if it can't successfully negotiate a TLS connection with the server, possibly including if the server's TLS certificate fails to verify. It sets no metrics on failure, but on success it will set various TLS related metrics such as the probe_ssl_* family and probe_tls_version_info.

  • send will fail if there is an error sending the line, such as the TCP connection closing on you. It sets no metrics on either success or failure.

  • expect reads lines from the TCP connection until either a line matches your regular expression, it hits EOF, or it hits a network error. If it hit a network error, including from the other end abruptly terminating the connection in a way that raises a local error, it sets no metrics. If it hit EOF, it sets the metric probe_failed_due_to_regex to 1; if it matched a line, it sets that metric to 0.

    One important case of 'network error' is if the check you're doing times out. This is internally implemented partly by putting a (Go) deadline on the TCP connection, which will cause an error if it runs too long. Typical Blackbox module timeouts aren't very long (how long depends on both configuration settings and how frequent your checks are; they have to be shorter than the check interval).

    If you have multiple 'expect' steps and your check fails at one of them, there's (currently) no way to find out which one it failed at unless you can determine this from other metrics, for example the presence or absence of TLS metrics.

  • expect_bytes fails if it doesn't immediately read those bytes from the TCP connection. If it failed because of an error or because it read fewer bytes than required (including no bytes, ie an EOF), it sets no metrics. If it read enough bytes it sets the probe_failed_due_to_bytes metric to either 0 (if they matched) or 1 (if they didn't).

In many protocols, a consequence of how expect works is that if the server at the other end sends some error response instead of the response you expect, your expect will skip over it and then wait for a matching line that never arrives. For instance, if the SMTP server you're probing gives you an SMTP 4xx temporary failure response in either its greeting banner or its reply to your EHLO, your 'expect' will sit there trying to read another line that matches (for the banner, one that starts with '220'). Eventually either your check will time out or the SMTP server will, and probably it will be your check (resulting in a 'network error' that leaves no traces in metrics). Generally this means you can only see a probe_failed_due_to_regex of 1 in a TCP probe based module if the other end cleanly closed the connection, so that you saw EOF. This tends to be pretty rare.

(We mostly see it for SSH probes against overloaded machines, where we connect but then the SSH daemon immediately closes the connection without sending the banner, giving us an EOF in our 'expect' for the banner.)
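For illustration, an SSH banner check of this sort might look like the following sketch, which is modelled on the ssh_banner example that ships with Blackbox and isn't necessarily our exact configuration. The banner 'expect' is the first step, so it is the one that sees the EOF and sets probe_failed_due_to_regex to 1.

```yaml
modules:
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        # An overloaded sshd that closes the connection before sending
        # its banner produces an EOF here, at the very first step.
        - expect: "^SSH-2.0-"
        - send: "SSH-2.0-blackbox-ssh-check"
```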

If the probe failed because of a DNS resolution failure, I believe that probe_ip_addr_hash will be 0 and I think probe_ip_protocol will also be 0.

If the check involves TLS, the presence of the TLS metrics in the result means that you got a connection and got as far as starting TLS. In the example above, this would mean that you got almost all of the way to the end.

I'm not sure if there's any good way to detect that the connection attempt failed. You might be able to reasonably guess that from an abnormally low probe_duration_seconds value. If you know the relevant timeout values, you can detect a probe that failed due to timeout by looking for a suitably high probe_duration_seconds value.
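As a rough sketch, these guesses can be written down as Prometheus recording rules. Everything specific here is an assumption for illustration: the job name blackbox_smtp, the rule names, matching on the job and instance labels, a 5 second effective probe timeout (so 4.5 seconds counts as 'suitably high'), and 0.05 seconds as 'abnormally low'. You would need to tune all of them for your environment, and the last rule in particular is only a guess.

```yaml
groups:
  - name: blackbox_tcp_failure_guesses   # hypothetical group name
    rules:
      # The other end cleanly closed the connection (EOF) during an 'expect'.
      - record: blackbox:tcp_failed:expect_eof
        expr: probe_success{job="blackbox_smtp"} == 0 and on (job, instance) probe_failed_due_to_regex == 1

      # No TLS metrics in the result: we failed somewhere before STARTTLS
      # finished (this also covers DNS and connection failures).
      - record: blackbox:tcp_failed:before_tls
        expr: probe_success{job="blackbox_smtp"} == 0 unless on (job, instance) probe_tls_version_info

      # TLS metrics are present: we failed after TLS was negotiated.
      - record: blackbox:tcp_failed:after_tls
        expr: probe_success{job="blackbox_smtp"} == 0 and on (job, instance) probe_tls_version_info

      # Probably a timeout; 4.5 assumes a 5 second effective timeout.
      - record: blackbox:tcp_failed:likely_timeout
        expr: probe_success{job="blackbox_smtp"} == 0 and on (job, instance) probe_duration_seconds > 4.5

      # Possibly a failed connection attempt; the 0.05 threshold is a guess.
      - record: blackbox:tcp_failed:maybe_connect_failure
        expr: probe_success{job="blackbox_smtp"} == 0 and on (job, instance) probe_duration_seconds < 0.05
```

The on (job, instance) matching is needed because metrics like probe_tls_version_info carry extra labels of their own, so they won't match probe_success on the full label set.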

If you have some use of the special labels action, then the presence of a probe_expect_info metric means that the check got to that step. If you don't have any particular information that you want to capture from an expect line, you can use labels (once) to mark that you've succeeded at some expect step by using a constant value for your label.
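A minimal sketch of such a marker, applied to the SMTP module from earlier, might look like this. The label name 'reached' and its value are made up for illustration; any constant will do. Putting it on the EHLO reply step fills in the gap that the TLS metrics can't cover, since it tells you that the probe got past both the greeting banner and the EHLO.

```yaml
modules:
  smtp_starttls:
    prober: tcp
    tcp:
      query_response:
        - expect: "^220"
        - send: "EHLO something\r"
        - expect: "^250-STARTTLS"
          # Constant marker; probe_expect_info only appears in the
          # results if this expect matched a line.
          labels:
            - name: reached
              value: "ehlo_reply"
        - expect: "^250 "
        - send: "STARTTLS\r"
        - expect: "^220"
        - starttls: true
        - expect: "^220"
        - send: "QUIT\r"
```

With this in place, a failed probe that has probe_expect_info but no TLS metrics failed somewhere between the EHLO reply and the end of TLS negotiation.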

(Hopefully all of this will improve at some point and Blackbox will provide, for example, a metric that tells you the step number that a query-response block failed on. See issue #1528, and also issue #1527 where I wish for a way to make an 'expect' fail immediately and definitively if it receives known error responses, such as an SMTP 4xx code.)