Grafana: Monitoring Grafana

Created on 21 Nov 2015  ·  34 Comments  ·  Source: grafana/grafana

It's time to monitor the monitoring! It'd be great to have a /status or /health endpoint that returns Grafana health data as JSON.

Things I'd like to get from a status endpoint are:

  • configured data sources are reachable (when I configure a new Graphite source I can test the connection; I'd love to have that via the /status API)
  • DB is available
  • configured authorization sources are reachable
  • version

e.g.:

/status

{ "date_sources_ok": True, "database_ok": True, "authorization_ok": True, "grafana_version": "2.5.1" }

help wanted · priority/important-longterm · type/feature-request

Most helpful comment

Added a simple HTTP endpoint to check Grafana health:

GET /api/health 
{
  "commit": "349f3eb",
  "database": "ok",
  "version": "4.1.0"
}

If the database (mysql/postgres/sqlite3) is not reachable it will return "failing" in the database field. Grafana will still answer with status code 200. Not sure what is correct in that case.

The most important thing about this endpoint is that it will never cause sessions to be created (something other API calls might do if you do not call them with an API key or basic auth).

All 34 comments

++

:+1:

Make sure the health URL does not generate sessions.

:+1:

+1, this would be very useful for running Grafana behind a load balancer; the load balancer will call the /health endpoint to verify that Grafana returns HTTP 200 OK.

I've put together something dead simple, but I'm not particularly happy with it at the moment.

If anyone would like to take a look at current state vs master: https://github.com/grafana/grafana/compare/master...theangryangel:feature/health_check

It returns something like:

{"current_timestamp":"2016-06-04T18:43:49+01:00","database_ok":true,"session_ok":true,"version":{"built":1464981754,"commit":"v3.0.4+158-g7cbaf06-dirty","version":"3.1.0"}}

For the database check I was originally returning some stats, but I've cut that out. I could switch the query to something much simpler like "select 1" and check that it doesn't error. Not sure if it's worth it.

I'm not particularly happy with the session check either. There doesn't seem to be an easy way to test it without standing up a test macaron server and recover()ing from the panic it throws when starting a session provider, or modifying macaron/session to add a test feature to each of the providers. As it is right now it irritatingly returns a Set-Cookie header, which I don't particularly want. I'd appreciate some input on where to take this from someone more experienced with macaron 😞

Checking data sources doesn't seem particularly sane to attempt through this, given how Grafana is written. It's probably more sane to add that to your regular monitoring system.

I was facing the same issue and, as a workaround, I use an API call from the load balancer with a dedicated authentication API key. I'm using HAProxy, which has a useful "hidden" feature of setting custom HTTP headers in option httpchk:

option httpchk GET /api/org HTTP/1.0\r\nAccept:\ application/json\r\nContent-Type:\ application/json\r\nAuthorization:\ Bearer\ your_api_key\r\n

(I need to use HTTP/1.0 rather than 1.1, since the latter requires setting the Host header and I can't get it dynamically in the HAProxy config).

/api/org seems to be the simplest request with little overhead and returns HTTP 200, which is exactly what the load balancer needs -- and does not create any new sessions.

Any progress or PR on this issue?

+1

I would split this into separate /liveness and /readiness endpoints, as is best practice in Kubernetes. /liveness only indicates whether Grafana itself is up and running; /readiness indicates whether it's ready to receive traffic and will check whether its dependencies are reachable.

In Kubernetes the liveness endpoint is probed, and when it fails to respond with 200 OK a number of times the container is killed and replaced with a new one. The readiness endpoint is used to make the container part of a service and send traffic its way, like adding and removing it from a load balancer.
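For illustration, a minimal sketch of how such a split could be wired into a Kubernetes pod spec; note that /liveness and /readiness are the hypothetical endpoints proposed here (they don't exist in Grafana today), and the container name, image and port 3000 are assumptions:

# Sketch only: /liveness and /readiness are proposed, not-yet-existing paths;
# container name, image and Grafana's default port 3000 are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: grafana
spec:
  containers:
    - name: grafana
      image: grafana/grafana
      ports:
        - containerPort: 3000
      livenessProbe:            # process up at all; repeated failures restart the container
        httpGet:
          path: /liveness
          port: 3000
      readinessProbe:           # dependencies reachable; failures remove it from the service
        httpGet:
          path: /readiness
          port: 3000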

+1

what about adding a /metrics Prometheus endpoint?

+1

For whoever needs health checks on services like Amazon ECS:
Use this hack: path /public/img/grafana_icon.svg, HTTP code 200.

+1

In the meantime, if you're only looking for a simple HTTP 200, just use /login. My colleague and I just deployed Grafana to a Kubernetes cluster, and that endpoint worked just fine for the liveness/readiness probes. It also works for the Google Compute Engine load balancer.
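For reference, a minimal sketch of the probe configuration described here, assuming Grafana's default port 3000 (the timing values are arbitrary):

# Sketch: /login answers 200 on a running instance, so it can double as a
# probe target until a dedicated health endpoint exists; port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /login
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /login
    port: 3000
  periodSeconds: 10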

I think everyone knows how to technically achieve this, but the point is to explicitly support monitoring of service health, including external dependencies.

I'd like to add our specific use case: we need a simple HTTP endpoint for checking if a user can log in and display graphs. I know that we can use the static resources and endpoints such as /login to work around the absence of this, but we really need something that checks that the Grafana internals are running as expected. We don't necessarily need status checks for retrieving data from data sources, as we have separate health checks for those.

+1 to this.

So in 4.0 there is currently a /api/metrics endpoint with some internal metrics.

But the issue requests something like this:

{ "date_sources_ok": True, "database_ok": True, "authorization_ok": True, "grafana_version": "2.5.1" }

Would be good to have a more detailed description of what is expected here. Should the health API call do a live check of all data sources in all orgs? Should it be done on the fly as the /health API call is made?
What does "authorization ok" mean?

@torkelo going to toss out an idea, but I definitely think /health should allow both grafana-server and installed plugins to register arbitrary things to report on:

{
    "ok": false,
    "items": {
        "datasources": {
            "ok": true
        },
        "database": {
            "ok": false,
            "msg": "Cannot communicate ###.###.###.###/XXXXXXX"
        },
        ...
    }
}

By default, health checks perform live checks of all things when the endpoint is called. If people want to isolate health checks to specific things, you can do something like Elasticsearch does for cluster health. When a thing is an external service (authorization, database, etc.), then a connectivity test is done at a minimum, plus any other sanity check that is reasonable for that thing (e.g. SELECT 1 for the database, an LDAP bind test for authorization, etc.).

Having output like this will allow monitoring checks to look for issues holistically while still identifying specific problems and reporting accordingly.

+1

@torkelo sorry for the delayed answer, just saw your questions.

TL;DR
@andyfeller did a good job in his comment and it's pretty much what I had in mind.

The endpoint (or endpoints) used to monitor Grafana should answer 2 questions with details:
A) Is this Grafana instance up and ready?
B) Is this Grafana instance running as expected according to its configuration intents?

"configuration intents" is key here, what I mean by intent is that when for example the admin adds as a data source she expects it to be available regardless of whether or not the saved configuration is right. Thus if a configured data source is not available to Grafana the monitoring end point should say so and why, in the same fashion the extremely useful "test" button works.

It helps me to think in terms of a plane taking off: first I need to know the plane has finished taking off and is in the air, then I need to know the plane is flying towards its destination as expected (let's not get into what "reaching cruise altitude" means ;-) ).

This can be somewhat compared to the /live and /ready endpoints others have pointed out, or to /health (1) and /state (2) in the Elasticsearch model, or /health and /info in Sensu (3).
IMHO one endpoint is enough, but seeing 2 endpoints in most modern tools is _kinda_ changing my mind; let's just say I'm not persuaded yet, as I think B is a subset of A, so I'd make the returned JSON reflect that instead of having 2 endpoints. Then one day, when Grafana can be clustered, a "/cluster_state" endpoint can be added.

Now, regarding the details of each answer, here are my (non-exhaustive) initial thoughts:
A details:

  • Status (e.g. red/yellow/green)
  • Status comment (e.g. "All is good"/"Couldn't start component Foo"/"Starting")
  • Version (e.g. v4.1.1-1)

B details:

  • DB Status (e.g. red/yellow/green)
  • DB details (e.g. "couldn't connect, bad auth", or "connection OK to MySQL v4.1 at xxx.yyy.zzz:3306, schema version v34132"; yes, SQL schemas should be versioned (4))
  • Authentication/Authorization (e.g. LDAP connection to xx.xx.xx:389 ok)
  • Data sources (e.g. Datasource 1, type Graphite, status Red, status comment "auth failure"; Datasource 2, type Elasticsearch, status Green, status comment "all good")

There is much more that could go in B, which is why breaking the monitoring into 2 endpoints might make more sense, meh.

As to what happens when the endpoint is queried (on the fly, APIs, etc.), I would defer to whoever ends up implementing it.

A couple of (obvious?) pieces of advice though:

  • be very mindful of the resources used to collect monitoring data and be very "protective" with the instrumentation code; help Grafana admins avoid "my monitoring of Grafana took Grafana down" or "Grafana has slowed down by X% since I started monitoring it" situations.

  • be as certain as you can about the provided monitoring data; alert fatigue is a plague.

(1) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
(2) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html
(3) https://sensuapp.org/docs/0.23/api/health-and-info-api.html#the-info-api-endpoint
(4) https://blog.codinghorror.com/get-your-database-under-version-control/

So 4.2.0 just came out and there is still no way to probe the service? (think k8s cluster)

@torkelo I think @dynek has a point; this is not optional anymore. Whether it's a new section in the docs dedicated to "how to monitor Grafana", documenting what can be done today with the existing instrumentation (e.g. leveraging the admin or metrics page), or a fully fleshed-out dedicated API like in this proposal, we need something yesterday.
Please don't take this the wrong way, I don't mean to tell you what the priorities should be. It's just that it's a tough sell for an application to be "Enterprise Ready" without a dedicated story for how to monitor it.

+1

Added a simple HTTP endpoint to check Grafana health:

GET /api/health 
{
  "commit": "349f3eb",
  "database": "ok",
  "version": "4.1.0"
}

If the database (mysql/postgres/sqlite3) is not reachable it will return "failing" in the database field. Grafana will still answer with status code 200. Not sure what is correct in that case.

The most important thing about this endpoint is that it will never cause sessions to be created (something other API calls might do if you do not call them with an API key or basic auth).

Wouldn't it be best to return with status code 503 when the database is unreachable?

Kubernetes uses:

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

Yes, I think a 503 status code when the db check fails is best, will update.

The 503 means the /api/health endpoint is best used only for the readiness check in Kubernetes. If this check is used for liveness, a database issue will lead to all pods getting killed. Is there a query parameter to leave out the database check?

@JorritSalverda you could probably use tcpSocket check in livenessProbe

/metrics will not create sessions or issue a db request.

We typically have aggressive readiness checks and relaxed liveness checks: 1 second, 1 fail for readiness, while it's 60 seconds, 10 fails, 1 success for liveness. This allows for responsive rerouting when there is an issue, but at the same time, if self-recovery is possible, prevents unnecessary pod restarts. But a persistent DB issue would cause a restart, which might actually help if it was due to some bad container state.
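As a hedged sketch of that pattern, combining the tcpSocket suggestion above with /api/health for readiness; the thresholds mirror the numbers described here, and the port is assumed to be Grafana's default 3000:

# Sketch: aggressive readiness (fast rerouting when /api/health returns 503),
# relaxed tcpSocket liveness so a DB outage alone does not restart the pod.
readinessProbe:
  httpGet:
    path: /api/health
    port: 3000
  periodSeconds: 1
  failureThreshold: 1
livenessProbe:
  tcpSocket:
    port: 3000
  periodSeconds: 60
  failureThreshold: 10
  successThreshold: 1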
