Grafana: building alerting system for grafana

Created on 22 Jun 2015  ·  294 Comments  ·  Source: grafana/grafana

Hi everyone,
I recently joined raintank and I will be working with @torkelo, @mattttt, and you, on alerting support for Grafana.

From the results of the Grafana User Survey it is obvious that alerting is the most commonly missed feature for Grafana.
I have worked on/with a few alerting systems in the past (nagios, bosun, graph-explorer, etsy's kale stack, ...) and I'm excited about the opportunity in front of us:
we can take the best of said systems, but combine them with Grafana's focus on a polished user experience, resulting in a powerful alerting system, well-integrated and smooth to work with.

First of all, terminology sync:

  • alerting: executing logic (threshold checks or more advanced) to know the state of an entity. (ok, warning, critical)
  • notifications: emails, text messages, posts to chat, etc to make people aware of a state change
  • monitoring: this term covers everything about monitoring (data collection, visualizations, alerting) so I won't be using it here.

I want to spec out requirements, possible implementation ideas and their pros/cons. With your feedback, we can adjust, refine and choose a specific direction.

General thoughts:

  • integration with existing tools vs built-in: there are some powerful alerting systems out there (bosun, kale) that deserve integration.
    Many alerting systems are more basic (define expression/threshold, get notification when breached), for those it seems integration is not worth the pain (though I won't stop you)
    The integrations are a long term effort. I think the low hanging fruit ("meet 80% of the needs with 20% of the effort") can be met with a system
    that is more closely tied to Grafana, i.e. compiled into the grafana binary.
    That said, a lot of people confuse separation of concerns with "must be different services".
    If the code is sane, it'll be decoupled packages but there's nothing necessarily wrong with compiling them together. i.e. you could run:

    • 1 grafana binary that does everything (grafana as you know it + all alerting features) for simplicity

    • multiple grafana binaries in different modes (visualization instances and alerting instances) even highly available/redundant setups if you want to, using an external worker queue

That said, we don't want to reinvent the wheel: we want alerting code and functionality to integrate well with Grafana, but if high-quality code is compatible, we should use it. In fact, I have a prototype that leverages some existing bosun code. (see "Current state")

  • polling vs stream processing: they have different performance characteristics,
    but they should be able to take the same or similar alerting rule definitions (thresholds, boolean logic, ..). They mostly differ in how the actual rules are executed and don't
    change much about how rules are defined. Since polling is much simpler and should be able to scale fairly far, this should IMHO be our initial focus.

Current state

The raintank/grafana version currently has an alerting package
with a simple scheduler, a worker bus (in-process as well as rabbitmq-based), an alert executor and email notifications.
It uses the bosun expression libraries, which give us the ability to evaluate arbitrarily complex expressions (use several metrics, use boolean logic, math, etc).
This package is currently raintank-specific but we will merge a generic version of this into upstream grafana. This will provide an alert execution platform, but notably still missing are

  1. an interface to create and manage alerting rules
  2. state management (acknowledgements etc)

these are harder problems, which I hope to tackle with your input.

Requirements, Future implementations

First off, I think bosun is a pretty fantastic system for alerting (not so much for visualization).
You can make your alerting rules as advanced as you want, and it enables you to fine-tune over time, backtest on historical data, so you can get them just right.
And it has a good state machine.
In theory we could just compile bosun straight into grafana, and leverage bosun via its REST api instead of its Golang api, but then we have less fine-grained control, and
for now I feel more comfortable trying things out piece by piece (piece meaning golang package) and making the integration decision on a case by case basis. Though the integration
may look different down the road based on experience and as we figure out what we want our alerting to look like.

Either way, we don't just want great alerting. We want great alerting combined with great visualizations, notifications with context, and a smooth workflow where you can manage
your alerts in the same place you manage your visualizations. So it needs to be nicely integrated into Grafana. To that end, there's a few things to consider:

  1. some visualized metrics (metrics plotted on graphs) are not alerted on
  2. some visualized metrics are alerted on:

    • A: with simple threshold checks: easy to visualize alerting logic

    • B: with more advanced logic (e.g. look at standard deviation of the series being plotted, compare current median against historical median, etc): can't easily be visualized next to the input series

  3. some metrics used in alerting logic are not to be visualized

Basically, there's a bunch of stuff you may want visualized (V), and a bunch of stuff you want alerts on (A), and V and A have some overlap.
I need to think about this a bit more and wonder what y'all think.
There will definitely need to be 1 central place where you can get an overview of all the things you're alerting on, irrespective of where those rules are defined.

There are a few more complications which I'll explain through an example sketch of what alerting could look like:
[sketch: example graph panel with query fields A-E, where E holds the alert expression]

let's say we have a timeseries for requests (A) and one for erroneous requests (B), and this is what we want to plot.
we then use fields C, D, E to put the supporting logic that we don't want to plot.
C contains the formula for the ratio of error requests against the total.

we may for example want to alert (see E) if the median of this ratio over the last 5 minutes is more than 1.5 times what the ratio was in the same 5-minute period last week, and also
if the errors seen in the last 5 minutes are worse than the errors seen from 2 months ago until 5 minutes ago.
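
To make that concrete, here is a rough, hypothetical sketch of what the expression in E could look like, in a bosun-like pseudo-syntax. The function names, query format and the stats.site1.* series names are illustrative only, not a final rule language:

$ratio_now  = median(query("divideSeries(stats.site1.errors, stats.site1.requests)", "5m ago", "now"))
$ratio_week = median(query("divideSeries(stats.site1.errors, stats.site1.requests)", "1w5m ago", "1w ago"))
$err_now    = sum(query("stats.site1.errors", "5m ago", "now"))
$err_hist   = median(query("stats.site1.errors", "2mon ago", "5m ago"))
warn = ($ratio_now > 1.5 * $ratio_week) && ($err_now > $err_hist)   # combine with && or || depending on intent

Note how each query is reduced to a single number (median, sum) before the boolean comparison; that is exactly the reduction step mentioned in the notes below.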

notes:

  • some queries use different timeranges than what is rendered
  • in addition to processing by tsdb (such as Graphite's sum(), divide() etc which return series) we need to be able to reduce series to single numbers. fairly easy to implement (and in fact currently the bosun library does this for us)
  • we need boolean logic (bosun also gives us this)
  • in this example the expression only uses variables defined within the same panel, but it might make sense to include expressions of other panels/graphs.

other ponderings:

  • do we integrate with current grafana graph threshold settings (which are currently for viz only, not for processing) ? if the expression is a threshold check, we could automatically
    display a threshold line
  • using the letters is a bit clunky, could we refer to the aliases instead? like #requests and #errors?
  • if the expressions are stats.$site.requests and stats.$site.errors, and we want to have separate alert instances for every site (but only set up the rule once)? what if we only want it for a select few of the sites? what if we want different parameters based on which site? bosun actually supports all these features, and we could expose them, though we should probably build a UI around them.

I think for an initial implementation every graph could have two fields, like so:

warn: - expression
      - notification settings (email, http hook, ..)
crit: - expression
      - notification settings

where the expression is something like what I put in E in the sketch.
for logic/data that we don't want to visualize, we just toggle off the visibility icon.
grafana would replace the variables in the formulas, execute the expression (with the current bosun-based executor). results (state changes) could be fed into something like elasticsearch and displayed via the annotations system.
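
As a rough illustration only (not a committed schema), those per-panel fields might serialize into the dashboard JSON something like this; all field names here are hypothetical:

{
  "targets": [
    { "refId": "A", "target": "stats.site1.requests" },
    { "refId": "B", "target": "stats.site1.errors" }
  ],
  "alerting": {
    "warn": { "expression": "median(C, 5m) > 1.5 * median(C, 5m, 1w ago)", "notifications": ["email"] },
    "crit": { "expression": "median(C, 5m) > 2.0 * median(C, 5m, 1w ago)", "notifications": ["email", "http-hook"] }
  }
}

Whether something like this lives in the dashboard JSON or somewhere else is of course still an open question.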

Thoughts?
Do you have concerns or needs that I didn't address?


Most helpful comment

The alerting branch has now been merged to master. :raised_hands:

We appreciate all the feedback that we have received from this issue. Thanks to all of you!
For future discussion and feedback, please post in the corresponding alerting issue or create a new one. This helps us organize and prioritize our future work. I'm closing this ticket in favor of the new ones, but feel free to keep up the discussion in this issue.

So what's next?

  • Alpha release (docs and blogpost)
  • Gather feedback from the community.
  • Keep working on the remaining issues for alerting
  • Release Grafana 4.0 with alerting.

Try it out?

  • You have to enable alerting in the config (see the config sketch after this list).
  • You can now find alerting in the side menu.
  • You can add an alert by going to a graph panel and selecting the alert tab.
  • Use the _Test alert_ button to verify your alert.
  • To save the alert you just have to save the dashboard.
  • Set up notifications on /alerting/notifications to be notified about firing alerts.
  • Add the notifier to an alert in the alert tab.
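
For reference, the first step above is a single config switch; a minimal sketch of the relevant grafana.ini section (the exact key names may differ between builds, so treat this as a sketch):

[alerting]
enabled = true

You will typically need to restart grafana-server for the change to take effect.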

Current limitations

  • So far we only support graphite.
  • For this release only the graph panel has support for alerting.

Example dashboards

You can find example dashboards in the examples folder.
The example dashboards are based on the data from our fake graphite data writer. You can start graphite and the fake-data-writer from our docker-compose files.

cd docker/
./create_docker_compose.sh graphite
docker-compose up

This should only be considered a rough guide and we will add more documentation about alerting in the following weeks.

Happy alerting! :cocktail: :tada:

All 294 comments

I'd love to help out with this! My suggestion would be to stick with the nagios-style guidelines. That way the tools could easily be used with other monitoring tools. e.g. Nagios, Zenoss, Icinga, etc..

The biggest thing about this feature is getting the basic architecture right.

Some questions I would like to explore:
1) What components are required and how are they run (in-proc in grafana, out of proc)?
2) How should things be coordinated?
3) Should we ignore "in stream" alerting (and only focus on pull based)?

Going more in depth into 1)
I am worried about making grafana-server into a monolith. Would like to find a way to separate grafana-server into services that are more isolated from each other (and can be run either in-proc or as separate processes). This was kind of the plan with the bus abstraction. Another option would be to have the alerting component only speak to grafana via the HTTP api; that might limit integration, not sure.

I agree with torkelo. In my experience with other projects with everything "built-in" it can get quite cumbersome to troubleshoot. I like the idea of the service running externally, but a nice config page in grafana that talks to the service through the HTTP api to handle managing all the alerts. Also, for large scale deployments this would probably end up being a requirement as performance would eventually degrade (I would at least have this as a configuration option).

do we integrate with current grafana graph threshold settings (which are currently for viz only, not for processing) ? if the expression is a threshold check, we could automatically display a threshold line

I think that could be a good place to start. Alert if it's set, don't if it's not.

Back to number 1. I think that if the bosun service could run separately but still have the ability to completely configure everything through grafana that would be, in my opinion, ideal.

Keep up the awesome work.

The only shortcoming I have seen with bosun is the data sources it can use. If you could leverage the language for expressing bosun alerting but also integrate with existing data sources that are configured via the regular grafana UI it would certainly be ideal.

Being able to represent alerting thresholds, when you are close to them, as well as automatically push annotations for when they have triggered in my mind make an ideal single pane UI.

Looking forward to the work that will be done here!

  1. It should use the thresholds defined in the Dashboard to alert on
    Let's keep it simple; if the Dashboard shows the color for warning it should be alerting.
  2. This would likely be something outside of the grafana-server process itself.
    ... Something that would use the rest api to scrape the dashboards and their settings, render them, and alert using an external command.
  3. Alerting level; just a box to drop in the editor saying that this Dashboard should be monitored, and it should be checked every minute. If there's no data for a period, should it still alert? (checkbox)

Lastly: as we depend on Grafana more, I admit 2. could be something I'd be willing to pay for.

I'm curious why people think this should be included into Grafana at all?
Grafana neither receives nor stores the actual data but "only" visualizes it. Any alerting system should instead be based on the data in the metric store.
If this is really integrated into Grafana I hope this can be disabled because over here we already use Icinga for alerting so any kind of alerting in Grafana would only clutter the GUI more even though it wouldn't be used at all.

Absolutely correct @dennisjac; Grafana only renders things.

But as we've moved things server side it's no longer just client rendering; the possibility of a worker process that could check your metrics and alert is less far-fetched.

Data is in a database; provided it's sprinkled with the data that tells it to check the metric ...

Some people may agree or disagree that we should not cross the streams and make Grafana do more than visualize it (roughly) but I'm not them.

I'm not really opposed to the feature for people who want it to be integrated but I hope it will be made optional for people who already have monitoring/alerting systems available.

The new Telegraf project (metric collector from the influxdb guys) also is looking at monitoring/alerting features, which I dislike for the same reason. I elaborated on this here:
https://influxdb.com/blog/2015/06/19/Announcing-Telegraf-a-metrics-collector-for-InfluxDB.html#comment-2114821565

I think torkelo has done a really good job at giving us features in Grafana2 that we don't have to enable.

As far as influxdb they're going to have to make some money somehow; either off of support of influxdb and professional services or products for it.

The latter sounds much more viable

Another angle on this. There seems to be upcoming support for elasticsearch as a metric storage for grafana. Bosun can right now query elasticsearch for log data.

Would it make sense when designing the alerting system to allow for alerts from log data as well? Maybe not a feature for the first version, but something that can be implemented later.

Also I agree with the idea of splitting the processes. Have Grafana the interface to view and create alerts, have something else handle the alerting. Having the alerting part api based would also allow other tools to interface with it.

+1 to Alerting. Outside DevOps usage, applications built for end users need to provide user defined alerts. Nice to have it in the visualization tool...

+1 this will close the loop - the purpose of getting metrics.

+1 Alerting from Grafana + a Horizontally Scaling Backend from InfluxDB will make them the standard to beat for Metrics Alerting Configurations

+1 I'd love horizontal scaling of the alerting on multiple grafana nodes.

It would be great if one could associate a "debounce" like behavior with an alert. For example, I want to fire an alert only if the defined threshold exceeds X for N minutes.

I have seen this with some of the alerting tools, unfortunately we are currently using Seyren which doesn't appear to provide such an option. We are using Grafana for our dashboard development and are looking forward to pulling the alerting into Grafana as well. Keep up the good work.

We have two use cases:

  • infrastructure team creates alerts through provisioning tools as usual into a common monitoring stack (common cluster checks or system checks in a nagios-friendly system)
  • software developers create app metrics via Grafana

We would love to have a unified alerting system that handles alerts, flap detection, escalation and contacts. That helps us record and correlate events/operations in the same source of truth. A lot of systems have solved the alerting problem. I hope Grafana can do better at this in the long term; in the short term, not reinventing existing systems would be helpful in terms of deliverables.

One suggestion is that Grafana can provide an API for extracting monitoring definitions (alerting state), so third parties can contribute configuration export plugins. This would be ideal for our use case of exporting nagios configuration.

More importantly, I would love to see some integrated anomaly detection solution too!


I agree with @activars. I don't really see why a dashboard solution should handle alerting which is a more or less solved problem by lots of other tools, mostly quite mature. Do one thing and do it well.

IMHO it would make more sense to focus on the _integration_ part.

Example: Define dynamic warn/crit thresholds in grafana (e.g. like in @Dieterbe example above) and provide an API (REST?) that returns the state (normal, warn, crit) of exactly this graph. A nagios, icinga, bosun etc. could request all the "monitoring" enabled graphs (another API feature), iterate through the individual states and do the necessary alerting.

In our case service catalogs and defined actions are the hard part - which service is how business critical, where to send emails to, flapping etc. Also you would not have to worry about user / group management in grafana which most companies already have in a central place (AD, LDAP, Crowd etc.) and integrated with the alerting system.

Also we have to consider that unlike a dashboard solution the quality requirements for an alerting tool can be considered much higher in terms of reliability, resilience, stability etc., which creates (testing) effort that shouldn't be underestimated.

Also what about non-timeseries related checks, like calling a webservice, pinging a machine, running custom scripts...would you want that in grafana as well? I guess the bosun adoption would provide all this but I'm not really familiar with it.

On the other hand I can imagine how a simple alerting system would make a lot of users happy who don't have a good alternative in place, but this could maybe be resolved with some example integration patterns for other alerting tools.

As much as I want Grafana to solve all of my problems, I think falkenbt hit the nail on the head with this one.

An API to expose the mentioned data, some plumbing in bosun, and some integration patterns with common alerting platforms makes a lot of sense.

Congratulations on your new job at raintank @Dieterbe! I have been reading your blog for a while and you have some really sound ideas on monitoring, particularly regarding metrics and its place in alerting. I am confident that you will find a good way to implement alerting in grafana.

As you would probably agree, the people behind Bosun are pretty much doing alerting the right way. What Bosun lacks is really the visualizations. I would like to see Bosun behind the Grafana UI. Combining Grafana's dashboards and Bosun's alerting behind the same interface would make for an awesome and complete monitoring solution.

Also i think it would be a shame to fragment the open source monitoring community further, your ideas on monitoring seem to be really compatible with the ideas of the people behind Bosun. If you would unite i am sure the result would be great.

Where i work we are using Elastic for logs/events and have just begun using InfluxDB for metrics. We have been exploring different solutions for monitoring and are currently leaning towards Bosun. We are already using Grafana for dashboards, but would like to access all our monitoring information through the same interface, it would be great if Grafana could become that interface.

Keep up the great job, and good luck!

On a related tangent, we got the alerting part working by integrating grafana with riemann. Was a nice exercise getting to know the internals of grafana :).

This was easier with riemann as the config is just clojure code. I imagine this integration is going to be harder in Bosun.

Here are a couple of screenshots of it in action
[screenshots of the Grafana + riemann alerting integration]

The changes to the grafana part included adding an "/alerts" and a "/subscriptions" endpoint and having it talk to another little api that sits on top of riemann to do the CRUD.

The nice thing is the fact that the changes in the alert definitions are reflected immediately without having to send a SIGHUP to riemann. So enabling/disabling, or tweaking the time period for state changes, is just a matter of changing it in the UI and having that change propagate itself to riemann.

Still haven't benchmarked this integration but I don't think it's going to be that bad. Will blog about it after I cleanup the code and once it goes live.

The whole reason we did this was because people can just go ahead and set these alerts and notifications from a very familiar UI and not bother us to write riemann configs :).

@sudharsh your implementation sounds really interesting. Are you planning on releasing this to the wild?

lots of good ideas, thanks everyone.
Inspired by some of the comments and @pabloa's https://github.com/pabloa/grafana-alerts project we decided to focus first and foremost on the UI and UX for configuring and managing alerting rules as part of the same workflow of editing dashboards and panels. Grafana would save those rules somewhere and provide easy access to them so that other scripts or tools can fetch the alerting rules.
Perhaps via a file, an API call, a section in the dashboard config, or an entry in the database.
(I like the idea of having it as part of the dashboard definition itself, so that open source projects can come with grafana dashboard json files for them which would have alerting rules included though not necessarily active by default. on the other hand having them in a database seems more robust)
Either way, we want to provide easy access so you can generate configuration for whatever other system you want to use that actually executes the alerting rules and processes the events. (from hereon I'll refer to this as a "handler").
Such a handler could be nagios, sensu, bosun, a tool that you write, or the litmus alert scheduler-executor: a handler that you could compile into grafana and that provides a nice and simple integration backed by bosun. But we really want to make sure you can use whatever system you want.

As long as your handler supports querying the datastore you use. We would start off with simple static thresholds, but later we also want to make it easy to choose reduction functions, boolean expressions between multiple conditions, etc.

@sudharsh that is a very nice approach. I like how your solution can talk directly to a remote API, bypassing the intermediate step described above (of course this does imply it only works for 1 given backend which we try to avoid), and that it can automatically reload the configuration. (you're right, bosun currently does not support it, it might in the future. FWIW the litmus handler does handle this fine and it uses bosun's expression evaluation mechanism). I never really got into riemann much. Mostly I've been concerned about adding such a different language to the stack that not many people understand or can debug when things go wrong. But I'm very curious to learn more about your system and about Riemann's CLJ code. (I'd love it if my suspicions are incorrect)

@dennisjac yes it would be optional.
@elvarb there is a ticket for ES as a datasource. in fact the goal is that if grafana supports rendering data from a given datasource it should also support composing alerting rules for it. As for query execution/querying this of course depends on what handler you decide to use. (for the litmus handler we'll start out with the most popular ones like graphite and influxdb)
@rsetzer : agreed, it's a good thing to be able to specify how long a threshold should be exceeded before we trigger
@falkenbt : I believe many things can be phrased as a timeseries querying problem (for example the pings example). But you're right, some things aren't really timeseries related and those are out of scope for what we're trying to build here. And I think that's OK: we want to provide the best way to configure and manage alerting on timeseries and aim for integration with other systems that are perhaps more optimized for the "misc scripts" case (such as nagios, icinga, sensu, ...). As for concerns such as reliability of delivery, escalations etc, you could hook in a service such as pagerduty.
@activars & @falkenbt does this seem to match your expectation or what do you think could be improved specifically?
@jemilsson thank you! and that's exactly how i see it: bosun is great at alerting but not good at visualization. Grafana is great at visualization and UX but has no alerting. I'm trying to drive a collaboration which will grow over time I think

Does anyone have any thoughts on what kind of context to ship in notifications like emails?
At the very least, the notification should contain a viz of the data you're alerting on, but it should imho be possible to include other related graphs. Here we could use grafana's png rendering backend when generating the notification content. I'm also thinking about leveraging grafana's snapshot feature. like when an alert triggers, take a snapshot of a certain dashboard for context.
and maybe that snapshot (html page) could be included in the email, or that might be a bit too much data/complexity. also the javascript features would be unavailable in mail clients anyway (so you wouldn't be able to zoom on graphs in an email). Perhaps we could link from the email to a hosted dashboard snapshot.

I like the general approach of docker - batteries included, but removable. So a basic alerting implementation that can be swapped out would be a good approach imho.

influxdb will be supported for alerting ? or only graphite ?

One thing I would like to see is the idea of hierarchical alert trees. There are simply too many facets being monitored, and standalone alert states have an unmanageable cardinality. With a hierarchy tree, I can define all these low level alerts which roll up to medium level alerts which roll up to high level ......

As such, each rolled up alert automatically assumes the high severity of all the children below it. In that way, I can get an impression of [and manage] system health accurately with a much lower surface area of analysis.

This is an example I have borrowed from an old document I wrote a while ago. Yes, please chuckle away at the use of the word "Struts". It's OLD ok ? This presents a very simple hierarchy for one server.

[image: example alert hierarchy for a single server]

At some point, the server experiences sustained 75% CPU utilization, so this trips these alerts into a warning state: CPU-# --> CPU --> Host/OS --> System

[image: the hierarchy with the CPU warning rolled up to the top]

If one really applied themselves, one could keep an eye on an entire data center with one indicator. (yeah, not really, but this serves as a thought exercise)

[image: a single top-level indicator for the whole data center]

Why not use graphite-beacon? I think you could merge graphite-beacon, which is very lightweight, with grafana.

@felixbarny I like that terminology. we'll likely adopt that wording.
@JulienChampseix yes the standard handler would/will support influxdb
@nickman that's interesting. it actually falls in line with the end-goal we have in mind, of being able to create very high-level alerts that can include / depend on more fine-grained alerting rules and information. bosun already does this, and long term we want to make this functionality available through a more user-friendly interface, but we have to start simpler than this.
@amirhosseinrajabi looks like a cool project and I think making graphite-beacon into a handler for the alerts configured through the grafana UI would make a lot of sense.

@Dieterbe is it possible to have an update on the current status of the alerting system?
In order to know which systems are compatible (graphite/influxdb)?
Which subscriptions are available? Which alert types are available?
Thanks for your update.

we're currently prototyping the UX/UI. so we're quite a ways from this being usable.

Hi @Dieterbe

are there any updates on the progress of the alerting system?

It would be awesome to have alerting in Grafana! Looking forward to this feature. Any updates now?

@mattttt can you provide an update re your UX work?

Yes, absolutely. Will upload some screens/user flows tomorrow.

We need alerting: UI for rule definition, API for rule definition, and API for alert notifications. Will watch this thread with interest. We have a multi-tenant system and love the Grafana UI and back-end.

Yes, I'm also very interested and impatient for seeing this new feature!
Thanks a lot Matt ! ;)


There are a lot of items falling into place internally, but I didn't want to leave this thread neglected.

Here's one of the mockups of a panel I've been working on. This illustrates the historical health over time, incorporating the status into the tooltip and using existing thresholds defined in the panel config to configure alerting.

In this example, this is alerting on a single query w/ multiple series. Tooltips are extended to show the status at the hover-time.

[image: panel mockup with a threshold line and a green/red alert-status bar below the graph]

_A couple small outstanding questions_: How much information about the alert notification should go into the tooltip, if any - or should this information be accessed elsewhere in a more detailed view? I believe the latter at this time, but it's worth asking aloud.

Configuration, alerting screens, user flows are forthcoming slowly. Lots to come.

@mattttt nice!

Love the green and red line below the chart!

That ties into uptime calculations; would love to be able to see that as a number somewhere. Totals for all queries and for each metric.

About the tooltip are you talking about the stats that come up when you hover over the lines?

Damn @mattttt that looks awesome. I wouldn't even worry about putting anything in the tooltip. The threshold line and alert status health bar at the bottom are more than enough.

Can't wait to see this when it's done!

I am excited to see this is progressing well!

We currently use Grafana+Bosun+OpenTSDB as our monitoring&alerting stack. I definitely agree it would be awesome to have the power of Bosun with the great UX of Grafana.

Here is an example of where Grafana's configuration UX is better than Bosun's:

Context

Our monitoring stack is shared between multiple teams and their services. A different set of services is deployed to different clusters/sites based on project specifications. Each team/service should take responsibility for its own dashboards/alerts.

Comparison

With Grafana's HTTP API teams can PUT their own dashboards when deploying their service. Bosun currently only has a single file for storing configuration; this makes it difficult to share between different teams and across different projects.

@mattttt @torkelo @Dieterbe any idea of a release date for the alerting piece (or a beta release)?

echo ^. Is there a beta or alpha release for this? I am researching alerting solutions, but I would love to have something built into grafana. I could provide a lot of testing feedback.

Alerting feature is still some months in the future, we are still prototyping the UI and thinking about different ways to implement it, but progress should move more quickly in the next 2 months so we will know more then

@mattttt do you intend to make the colors of the historical health bar configurable? Green and red don't really go well with the color blind ;)

Regarding the alerting: I'm quite interested in how this plays out. We've been collecting and visualizing data for a while now, and alerting is something we're currently trying to figure out. Grafana could have a nice place there, especially since the visualizations are already in place. What I do wonder though: how much should Grafana be more aware of 'entities' rather than metric series for alerting? I can imagine myself wanting to automatically toggle a visual state change (ping or http check fails) or manually (doing maintenance) for what in my case would be a server, in addition to metric based alerting.

I'm interested to see where alerting in Grafana goes, but for those of you that need something now, there are nagios plugins like https://exchange.nagios.org/directory/Plugins/System-Metrics/Others/check_graphite_metric/details that can trigger alerts when thresholds are crossed.

@baaym

What I do wonder though: how much should Grafana be more aware of 'entities' rather than metric series for alerting? I can imagine myself wanting to automatically toggle a visual state change (ping or http check fails) or manually (doing maintenance) for what in my case would be a server, in addition to metric based alerting.

this is a good question and also something we've been talking about a bit.
the solution we want to go with for the near (and possibly also long) term is to make grafana not aware of such higher level concepts. i.e. as a user you have the power to set alerts on metric series, and from those alerting rules, alerting outcomes will be generated (likely including attributes or tags from the series names) from which you can then construct whatever entities you please. this is something we have to think about a bit more, but for example

say you set an alert along the lines of movingAverage(cluster1.web-*.cpu.idle,10) < 20 -> warn. this would verify the threshold on all your web servers in the given cluster, and generate alerts for any violations such as movingAverage(cluster1.web-123.cpu.idle,10) is currently 3!.
possibly we could enable you to say "the first field is the cluster name, the 2nd is the hostname" etc, so that alerts can contain nicer information.
but the point is, the alerting _outcome_ contains the information you need to identify which entity is having issues, but it falls outside of grafana's scope. Grafana would be more the source of the configuration of alert rules, and grafana dashboards could be configured to load annotations and what-have-you to visualize state of the alerts, but it wouldn't have a notion of highlevel concepts such as hosts or clusters. I think this is something that could be handled in the alerting handlers

@Dieterbe

There are two type of user/organization concerns when building alerting feature:

  • Startup-like, where they don't generally have time to build their own alerting solution. Everything would rely on Grafana to alert on metrics
  • Established engineering organizations, which have existing alerting tools built in house; alerts for business rules are built based on other granular alerting signals (Grafana would be one of them).

Grafana should work with existing well-established operational practices; having it outside the cycle ignores the goal of alerting - having a clear view of the health of business-critical entities. Alerting is better centralized, to allow building a clear state of the environment. It would be critical to allow power users to use the grafana API (or any other solution) to export alerting rules to other systems.

When saying operational, each alert should optionally contain a documentation/link field to explain the purpose of the alerts & historical behaviour.

@activars i think i agree with all of that. in my view we are taking an approach that fosters plugging grafana into the rest of the environment (mainly thanks to separation of concerns, with pluggable handlers). do you think the proposed design can be improved in any way?

I think @deebs031 makes a good point that hasn't been addressed much "applications built for end users need to provide user defined alerts".
IMHO the grail is self-service metrics-based monitoring; in my case Grafana being the main front end for folks that want to look at metrics, it would make sense to enable them to create alerts for themselves within the same -let's face it AWESOME- UI.
I've personally done Sensu alerting based on metrics but providing it as a self service is really not a piece of cake compared to how seamless it'd be if integrated with Grafana. I've also looked at Cabot because it had visualization capabilities but it hasn't been built with self service in mind so couldn't use it as is.
I'm on the "do one thing well" side but I think in the particular case of self service alerts based on metrics pairing that capability with the metrics visualization layer makes a lot of sense:

  • The user is already familiar with the UI
  • The user is authenticated so she can create alerts for herself or whatever permissions schema that authentication enables
  • The user can see the graphs that are typically quite useful when creating those metric based alerts

slides of my grafanacon presentation about alerting:
http://www.slideshare.net/Dieterbe/alerting-in-grafana-grafanacon-2015
they're kinda hard to understand without context, the video should be online in about a week, i'll post it when it's ready.

we've now started prototyping ways to implement the alerting models/UI/definitions/etc. we have a pretty good idea of the main workflow, though 1 big point we're still trying to figure out is what integration with 3rd party alerting handlers should look like.
our current thinking is that you will be able to use grafana to set thresholds/alerting rules/define notifications and to visualize the historical and current state of the alerting rules.

the assumption is that you want to use your alerting software of choice (bosun/cabot/sensu/nagios/...)
so there would be a separate tool that queries grafana over its http API to retrieve all the alerting rules. that tool could then update your bosun/cabot/sensu/nagios/... configuration, so that you can use your alerting software of choice to run and execute the alerts and send out the notifications.
but we want to be able to properly visualize current and historical state, so either your alerting program would need to be able to call a script or a webhook or something to inform grafana of the new state, or grafana would have to query for it. (which seems yucky, given most tools don't seem to have great APIs)
this is all a bit complicated but it would have to be this way to support people for whom it is important that they stay able to use their alerting software of choice, while using grafana for defining the alerting rules and visualizing the state.
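
To illustrate that flow, here is a minimal sketch (in Go) of what such a bridge tool could look like. The endpoint paths and payload fields are purely hypothetical, since the actual API is still being designed:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Rule mirrors a hypothetical alerting-rule payload exposed by Grafana's HTTP API.
type Rule struct {
	ID        int    `json:"id"`
	Query     string `json:"query"`
	Warn      string `json:"warn"`
	Crit      string `json:"crit"`
	Frequency int    `json:"frequency"` // seconds between evaluations
}

// fetchRules pulls all alerting rules from a hypothetical Grafana endpoint.
func fetchRules(base string) ([]Rule, error) {
	resp, err := http.Get(base + "/api/alerting/rules") // hypothetical endpoint
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var rules []Rule
	if err := json.NewDecoder(resp.Body).Decode(&rules); err != nil {
		return nil, err
	}
	return rules, nil
}

// pushState reports state evaluated by the external handler back to Grafana,
// so current and historical state can be visualized on the dashboards.
func pushState(base string, ruleID int, state string) error {
	body, _ := json.Marshal(map[string]interface{}{"ruleId": ruleID, "state": state})
	resp, err := http.Post(base+"/api/alerting/state", "application/json", bytes.NewReader(body)) // hypothetical endpoint
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	base := "http://localhost:3000"
	rules, err := fetchRules(base)
	if err != nil {
		panic(err)
	}
	for _, r := range rules {
		// Here the bridge would regenerate nagios/bosun/sensu/... configuration from r,
		// let that system evaluate the rule, and later feed the resulting state back:
		fmt.Printf("syncing rule %d (%s, every %ds)\n", r.ID, r.Query, r.Frequency)
		_ = pushState(base, r.ID, "ok")
	}
}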

Is it important to you that you're able to use your current alerting tool of choice? what would be your alerting tool of choice?

the other thing we'd like to do, is also write a simple alert executor ourselves, that queries the grafana api for alerts, schedules and executes them, does the notifications (it would support email, slack, pagerduty, a custom script, and probably a few others) and updates the state in grafana again.
it would be fairly easy to write for us, easy for you to use and we could have great interoperability.

Is the built-in alerting executor (see above) something that you think would be sufficient to handle all the alerting rules you set up in grafana? or do you feel strongly that you want to keep using nagios/bosun/... and that we have to build a bridge between them?

also do you want to be able to use multiple alerting handlers? say built-in + bosun + nagios, or something? which ?

@jaimegago amen ;)

For me number 2 seems better in that you can really minimize the number of things you have to configure for everything to work smoothly. In our current setup we would go with that.

Just so it has been said if everybody else disagrees ;)

Quick edit: Awesome slides. If the ending product comes out looking half as good as that then it's amazing.

+1
I agree that an internal notification handler with these integrations is perfect for the most common use case!

I'll be happy to do beta testing :) and the slides are amazing!

I think @Dieterbe's last post clears things up quite a bit, but I wanted to post this quick diagram to further clarify.

Alerting in Grafana is really two things, the self-service alert config (thanks @jaimegago, couldn't have said it better myself) and the handler itself.

We'll be shipping a Grafana alert handler, but also providing the framework to integrate with your alerting software of choice:

[diagram: Grafana alert configuration feeding the built-in Grafana alert handler as well as external alert handlers]

+1 for building a kind of bridge to other alerting systems (maybe we could think about implementing some generic alerting plugin system :-) )

You could add Prometheus also on the "External Alert Handlers" part. The first Prometheus alertmanager version is in production in several companies and a complete rewrite is currently in the works. SoundCloud might use Grafana to configure alerts, but very certainly only if the Prometheus alertmanager is used as alert handler.

@grobie, good catch, fixed in original comment.

@mattttt @Dieterbe that's great ! Looks like you're on the path of "batteries included but removable" which is IMHO the best of both worlds. Have you already thought about how you are going to pass authorization data to the alert handler? I'm thinking a story like this:
As a Grafana user I'd like to be alerted via _email_ and/or _pagerduty_ and/or _foo_ when (some condition built via Grafana alerting UI) happens.
That user should only be able to send alerts to the notification systems he is authorized for; this is a requirement for self service and will need to be addressed somehow. Since Grafana 2 we have a SQL backend + user authentication/authorization with LDAP integration, so it doesn't seem too far-fetched to have that capability from day one of alerting?
With Sensu (which is the tool I'd be plugging in) passing the alert target (e.g. email address) via the handler should be quite straight forward, can't say about the others.

Hi all,
I am happy to see that this effort is getting pushed forward, since I love the self-service alert configuration approach.

Personally, I don't need a specific alert handler. I would like to see a generic HTTP POST handler that is just triggered as soon as an alert is thrown. I think most admins can quickly build something that is capable of accepting HTTP and then doing whatever they need to do with it (sending it to nagios, riemann, younameit). So I would be happy with an HTTP handler that sends all the information about an alert as JSON data.
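
Something like the following would be enough; the field names are just a guess at what such a JSON payload could carry:

{
  "rule": "high error ratio on site1",
  "state": "critical",
  "previousState": "ok",
  "metric": "stats.site1.errors",
  "value": 42,
  "threshold": 15,
  "timestamp": "2015-10-05T12:34:56Z",
  "dashboardUrl": "http://grafana.example.com/dashboard/db/site1"
}

The receiving side can then decide whether to forward it as a passive check, an event, or anything else.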

Talking about alerting via grafana, are you planning to add something like flapping detection? Or is this something that the external monitoring system should take care of?

Keep up the good work guys!

Cheers

I would like to see a generic HTTP POST handler that is just triggered as soon as an alert is thrown. I think most admins can quickly build something that is capable of accepting HTTP and then doing whatever they need to do with it (sending it to nagios, riemann, younameit)

so if an alert fires (say "web123 has critical cpu idle!, value 1 lower than threshold 15" ) and we do an http post of that data, how would you handle that in nagios? you mean nagios would take it in as a passive service check, and then nagios would send the notifications?

Talking about alerting via grafana, are you planning to add something like flapping detection? Or is this something that the external monitoring system should take care of?

this is also something we need to think about more. This can get messy, and if people use something like pagerduty or flapjack then they can use that to aggregate events/suppress duplicates, so we're looking at whether we can avoid implementing that in the grafana handler, though we may have to. Note also that because you'll be able to set alerts on arbitrary metrics query expressions, you'll have a lot of power to take past data into account in the actual expression, and so you can create a more robust signal in the expression that doesn't change state as often.

so if an alert fires (say "web123 has critical cpu idle!, value 1 lower than threshold 15") and we do an http post of that data, how would you handle that in nagios? you mean nagios would take it in as a passive service check, and then nagios would send the notifications?
Kind of. I'm actually looking forward to grafana alerting to get rid of nagios. Using the HTTP handler, you need to configure passive checks for nagios to be able to submit the results there. But I would like to have grafana as the one source where you can configure alerting. In our case, the people who are allowed to add alerts are the actual sysadmin who would also configure the checks in nagios.

With the http handler grafana would have everything we need for that: a dashboard for real time monitoring, an API, easy alerting configuration and an http handler where we can forward alerts to our internal notification system.

Cheers

Although I can see the logic in this integration strategy, I cannot help but think it's a little bit overkill. From what I understand (and what I could read in the thread), the only reason most Grafana users keep using a standalone alerting technology is that Grafana doesn't offer one. So, wouldn't it be more "lean" to focus first on the Grafana Alerting part, and develop it as a component that communicates with the rest of the stack through the API, so that other contributors would be able to mimic the behaviour and create specific adapters later?

tl;dr: by focusing on building its own "batteries" first, Grafana would have a full-featured alerting system, that can later evolve into a service for integration with 3rd party alerting tools.

Minor concern, if this hasn't been addressed. Traditional alerting systems do not scale well for cloud infrastructure, because resources are very dynamic (provisioned & destroyed). Alerting on metrics should support a templating or grouping feature (with exception overrides, since sometimes workloads are different). A templated or grouped alert should be able to discover new members of the group.

Thanks for the update! In my use case built-in alerting in Grafana is all I would need at this time. I have been patiently impatiently waiting for Grafana alerting.

As I promised in IRC, here is our use case for this:

We have a legacy Rails-app which searches our logs for patterns and has an
HTTP API to answer if a certain pattern has crossed its thresholds and
thus has a status of {OK,WARNING,CRITICAL}.

Threshold can be either:

  • a status of CRITICAL if pattern exists at all.
  • that status is WARNING if pattern is found more than X times
    and status is CRITICAL if found more than Y times.
  • if pattern is less than 1 hour old then the status is OK,
    less than 3 hours status is WARNING and otherwise status is
    CRITICAL.

If I understand this feature correctly, Grafana will support this usage
pattern (via Logstash and Elasticsearch of course) when this feature and
the Elasticsearch data source is fully implemented?
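
For what it's worth, those three threshold styles look like they could be written as simple count/age rules against an Elasticsearch query; a rough pseudo-rule sketch (function names and syntax purely illustrative):

crit if count(pattern, window) > 0                      # pattern occurring at all is critical
warn if count(pattern, window) > X, crit if > Y         # count-based warning/critical
ok if age(newest match) < 1h, warn if < 3h, else crit   # age-based status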

@Dieterbe @mattttt your slides and mock-ups look absolutely amazing! This is truly a game changer.
To me the internal Grafana alert handler would suit our needs the best.
Reasons:

  • Self-service - Very important. Our users said loud and clear they want to create alerts themselves end-to-end inside Grafana.
  • Unified workflow - I want to minimize moving parts not increase them. As @Dieterbe pointed out, a 3rd party alert handler would require at least 4 steps where an internal alert handler would require just 1 (maybe 2 if you need to define notification method for each threshold? - unsure from the presentation).
  • Tight integration and no dependency on 3rd party alerting infrastructure.

A few concerns:

  • What is the frequency of threshold checks?
  • How does it handle a polling frequency that is too fast for it to get the data back? Logged, alerted and queued, or dropped?
  • For scaling, concerned that Grafana may not be able to keep up with the sheer number of checks, fast frequency and especially with latency between datasources that we will likely need to add/scale Grafana servers to support internal alerting. I know this because we need several 3rd party alert handler instances now. In this case how would we be able to seamlessly assign or queue threshold checks among a cluster of Grafana servers, especially if checks are from the same datasource? From a user experience I'd like users to seamlessly create thresholds through a load balanced, cluster of Grafana servers without having users go to a particular "assigned" instance of Grafana for a particular check.
  • For notifications, would this support some kind of plugin architecture so notifications can easily be developed and integrated? In general we need something that can perform HTTP POSTs. This is most common with the likes of PagerDuty, xMatters, VictorOps, Opsgenie, etc. Each one requires a slightly different format, authentication, etc. As previously mentioned in this thread, perhaps a generic HTTP POST handler would work that would send to a custom HTTP service capable of doing whatever you want with it. Alternatively a custom script capability should work too.
  • I assume thresholds could be set and retrieved, and violations fetched, through an API? I think this would be helpful

I think it's ideal to be able to integrate alerting into existing alert systems. There are some tough and ugly problems like flap detection as mentioned that have been dealt with and it seems wasteful to reinvent everything from the start. I'd hate to see this buried under the weight of feature creep.

But I don't think this really needs to be a tight integration into all of these alert handlers. A good, well documented api should allow people familiar with these system to integrate with little effort. So the slide with 'grafana api -> handler' is what looks attractive to me.

Scott

Hi all -- I'm coming in late to this discussion, but I have some expertise on this topic, and I'm the lead developer of one of the tools that has attempted to solve the alerting problem. Our tool, StatsAgg is comparable to programs like bosun. StatsAgg aims to cover flexible alerting, alert suspensions, and notifications & is pretty mature/stable at this point (although the API is not ready).

Anyway, some thoughts on the subject of alerting:

  • Alerting by individual metrics sucks. I work at a company that manages thousands of servers, and having to create/configure/manage separate alerts for each metric series of 'free disk space %' is logistically unfeasible. Enterprise monitoring tools often tie multiple metric-series together with regular expressions (or just wildcarded expressions). StatsAgg was built on the same premise; multiple metric-series are tied together, and then the group of metrics has the alert threshold checks executed against it by a single 'alert'. At scale, this sort of capability is a necessity.
  • If one accepts my previous assertion that an alerting tool should not alert off of individual metrics, then it follows that the tool must have a mechanism to quickly get a list of qualifying metrics & of the metric values. Many tools rely on querying the data-stores to get the list of metrics & metric values, and this solution frankly doesn't work very well at scale. Alerting logic, by its nature, needs to run often (every X seconds, or as each new qualifying datapoint rolls in). The datastores (graphite, opentsdb, influxdb, etc) were just not built to handle constant querying of 'give me the current list of metrics that conform to this pattern' and 'show me the last X values for these Y metrics'. They either don't have the appropriate API/query language, or they simply can't handle the load. To be clear, I'm talking about scales of running alert logic against 10,000 metric-series when there are 10,000,000 available metric-series in the datastore. This is not everyone's use-case, but it is my company's.
  • I've found that tackling the problem via stream processing is the only viable way to address the issues raised by my last bullet point. That's why StatsAgg was built to sit in front of the datastore. The alerting logic can run against the metrics without touching the datastore, and the metrics just pass through to the datastore. The main conceits of this approach are that 1) newly created alerts can't/won't use old/archived metric values for alert evaluation 2) if the stream processing program (ex- StatsAgg) crashes, then datapoints don't make it into the datastore 3) metric values that are needed for alert evaluation are stored in memory, which could be a server stability concern. 4) the stream-processing program must be able to deconstruct & reconstruct the incoming metrics (which InfluxDB hasn't made easy over the last year...). Even with these conceits, this solution has worked very well for my company, and at a very large scale. At times we've had 200,000+ live metric-series, averages of 30k+ incoming metrics/sec, hundreds of alerts that evaluate thousands of metric-series, and a server running StatsAgg that barely breaks a sweat. All the while, the datastore doesn't get queried at all.

Those are the main things that I wanted to mention. There are lots of other important aspects to alerting too (notifications, suspensions, etc), but those solutions are easy to bolt on once one has the architecture of core problem solved. I realize that the scale of our needs are not the same as the average user, but hopefully you all can appreciate this perspective.

I'd like to suggest launching with a notification handler that can send data to Alerta: https://github.com/guardian/alerta

Alerta has a very sane REST API for receiving notifications.

I prefer a lean grafana-only implementation!
I think it's worth re-evaluating after everyone has had experience with this feature in the typically fantastic Grafana UX.

There will be many complex cases and/or custom backend systems people will want to integrate with. You can see many on this thread, mostly open source, but there are so many commercial products out there as well! Don't bother with individual handlers - it will be a rat hole and you will always be in catch-up mode.

I'd strongly advise implementing only two types of handlers. One is definitely HTTP POST; it will be the most versatile and flexible tool. The other one is custom scripting, so users can implement integration with their specific tool of choice. A plugin model is not bad, but it forces you to use a specific plugin language, which is limiting. External scripts are better - as long as you pass all the details to the script, it can be written in any language - shell script, Python, etc.

I'm with @007reader

I agree. As long as common integration methods are provided, a custom implementation can be a separate project or deployment.

For example, the recent CloudWatch release is decent; however, I would love to make it a separate project by only synchronizing selected metrics to alternative storage. It would give us full retention instead of only 2 weeks of data.

hey everyone,
my grafanacon presentation video is online!
it's at https://www.youtube.com/watch?v=C_H2ew8e5OM
i think it will clear up a lot, though as you can see, the specifics of integrations are still to be figured out and were also a topic a lot of people wanted to discuss. (though there was limited time and i asked people to continue the conversation here so everybody can participate)

@simmel yes exactly. you'd use an ES query and set a rule on that.
@activars re grouping and discovery, i think a lot of that will depend on your datasource, but most common requirements should be addressed if you use something like graphite or ES who i know are very good at "auto discovering" previously unseen metrics/series/documents that match the given expression (with wildcards) for graphite or query (for ES). not sure about the other sources. your comment about needing to apply exceptions to rules is a trickier one, we'll probably need to address that at some point but I think that can wait until things are clearer and more settled down. maybe we can avoid needing that somehow.
@mgravlin frequency will be a setting in the rule, dealing with too slow datasources, not sure yet. but don't do it ;-) . also handler dependent. scale-out deployment should be possible, also with included handler so pretty sure we'll look at that. but maybe not a priority for first release. yes, notification plugins will be key and we'll make sure you can use what you want/need. re api: yes :)
@sknolin if i understand correctly, in your view, grafana would do the alert scheduling, execution, trigger notification hooks etc, even when using another system like nagios/bosun. then what exactly would be the role of the external system (nagios/bosun/...) . this seems also similar to what @Crapworks was talking about.
@jds86930 StatsAgg looks quite interesting. I think here also an integration with grafana would make sense. I think stream processing is a valid approach that has a place as an alternative to repeated querying. but the latter is just simpler to get started with and just simpler in general I think. But both should be supported. So yes in grafana you will be able to set up patterns/queries that match a spectrum of data, and potentially cover new series/data as it becomes live. in our view though, you would just leverage whatever functionality your datasource has (for example graphite is pretty good at this with its wildcards, glob expressions etc, and elasticsearch rich data and query model), or if somebody would use grafana+StatsAgg they could just use StatsAgg to solve that. Are you saying grafana itself should do anything specific here? I think if your datasource is not fast enough, solve the datasource problem. get something faster, that has caching for metric metadata, maybe a memory server in front or stream processing. but either way as far as Grafana is concerned, not much would change that I can think of?
@blysik yes looks interesting. so many alerting tools out there we should integrate with :) to be clear, do you like the idea of managing alerting rules and visualizing them in grafana the way it's been presented so far, but you want to use alerta to take care of the notifications? would alerta be the primary place where you go look at the state of your alerts (that seems like a reasonable thing to do), but want to make sure i fully understand how you see the integration.

@007reader , @shanielh , @activars just to be clear, this integration via a generic HTTP post or script, what would be the goal. to tell the external system "there's a new rule, here's the query, the thresholds, frequency, etc. now go please execute it" ? or would grafana be the thing executing the rules and then updating the external systems with new state?

Correct, @Dieterbe. Alerta is shaping up to be our notifications hub, with all sorts of things sending alerts into it: custom scripts, Cabot, Zenoss, vCenter, and maybe Grafana. That gives ops a single place to see all the alerts, and that single place then drives notifications to the on-call engineer.

@Dieterbe I guess I didn't explain well. That's not what I'd want; I don't want grafana doing all that stuff. @Crapworks (that's fun to type) is talking about passive service checks, I'd just use active polling.

So all I would want is an API where I can read grafana alert status. External systems do everything else.

That doesn't mean I wouldn't use it if it somehow developed into a great general alerting tool. It's just what I'd do now.

Scott

@sknolin

So all I would want is an api where I can read grafana alerts status.

how would that status be updated in grafana? what process would be executing alerts and updating the status in grafana?

Each time it's polled grafana updates alert status, with some kind of caching interval to deal with multiple systems polling it.

I see the point that this still requires grafana to do logic for the alerts and present it. So thinking about it, no I don't need alerts of any kind.

I think I could do whatever alerting I needed if I were able to retrieve the current value for a metric on a graph panel. For example, where we derive a rate from the sum of several counter metrics and graph it, it would be nice to poll for the current value with the monitoring system. Maybe that's totally doable now and I'm just being obtuse.

Scott

@Dieterbe The latter :

grafana be the thing executing the rules and then updating the external systems with new state

@Dieterbe I agree that polling the datasource (Graphite, OpenTSDB, etc) using the datasource's native query syntax is simplest/easiest & is probably the quickest way to get some form of alerting natively into Grafana. For a lot of people, this sort of solution will meet their needs, and I think this is the best solution for the initial Grafana implementation (in my opinion). My main point was that there is a ceiling on alert configurability & performance that will be hard to work past with the 'polling the datasource' model.

In terms of directions Grafana could go for long-term alerting solutions, I could see a few options:

  • Work with the datastore maintainers to build faster/better purpose-built APIs for the alerting use-case. I dislike this option because many of those projects move at a slower pace (months to years) & they may not accept some/all of the enhancement requests. They'd probably also want to code in the native language of their datastores, which aren't always fast languages (ex- Graphite in python).
  • Build stream-processing/caching layers for each datastore as separate raintank projects. I think this would ultimately have a better outcome than trying to coax the various datastore maintainers to build solutions for their projects. This would also allow you to continue to expand on the work you're already doing (using the existing datastore query mechanisms). You could even build your own custom APIs into the stream-processing/caching layers that could simplify Grafana's query syntax to the datastore.
  • Stick with the native solution you're working toward & make it work well. 3rd party tools like StatsAgg, bosun, etc will be around for use-cases that are more demanding/specialized/complex. Having Grafana integrate with these tools would definitely be an added benefit for the user, but it would add non-trivial complexity to Grafana. That said, it looks like you may end up doing this anyway (I'm looking at 'Alerting Backends' on slide 35 of your presentation right now). I'm personally open to implementing a Grafana-friendly set of APIs in StatsAgg; we'd just have to figure out what the APIs are & get some API protocol documentation drafted. Feel free to message/email me if you'd like to discuss any of that.

Hi all,

@Dieterbe I just watched your presentation and the stuff looks awesome. I really appreciate that you are trying to build an alerting system in the "right" way, using a lot of community input! Thanks!

I also want to make my point a bit clearer, since I don't think it was obvious what I was trying to say. I _DO NOT_ require grafana to be aware of any other monitoring system as Nagios, Icinga, Bosun, etc. I actually only need this:

  • The awesome user interface you showed off in your presentation, or whatever it looks like when it's completely finished
  • A generic HTTP POST handler (like some other people here also suggested) that is completely configurable (I will give you an example later on)

The event-flow I am thinking of:

  • You visualize your data in grafana
  • You add thresholds for alerting in grafana
  • As soon as a threshold is exceeded, the HTTP POST handler is triggered
  • From that point on, grafana's work is done

Like @mgravlin and @007reader mentioned, most notification and alerting services use HTTP POST, requiring different kinds of data. So the most generic thing I could think of is to let the user define their POST data and headers, so you can feed several systems with one handler, using different templates. Pseudo code example:

"notificator": {
    "httppost": {
        "data": {
            "host": "$hostname",
            "alert": "$alertname",
            "state": "$state"
        },
        "header": {
            "content-type": "application/json"
        }
    }
}

Giving the user enough variables to use here will be powerful enough to feed a ton of backends.

And again, with that kind of handler, any sysadmin with some coding knowledge can build their own HTTP POST receiver and transform the data however they like, for example to feed backends that don't understand HTTP POST.
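
To make the receiver side concrete, here is a minimal sketch in Python (standard library only); the field names (host, alert, state) just mirror the pseudo-config above and are not an actual Grafana payload format:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads((self.rfile.read(length) or b"{}").decode("utf-8"))
        # Transform the notification for a backend that doesn't speak HTTP POST,
        # e.g. append a line that a legacy exec/pager-based system can pick up.
        line = "{0} {1} -> {2}\n".format(
            payload.get("host", "unknown"),
            payload.get("alert", "unknown"),
            payload.get("state", "unknown"))
        with open("/tmp/alerts.log", "a") as f:
            f.write(line)
        self.send_response(204)  # acknowledge so the sender doesn't retry
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9000), AlertReceiver).serve_forever()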

Since this is stateless, it also scales out. Just place a load balancer in front of the backend/api/whatever and you are good to go.

At least, this would solve most/nearly all of my problems ;)

Cheers

Thanks for building this feature. Is there an approximate release date for it?

torkelo said ROUGHLY 3 months on IRC.
If I understood him correctly that's a really rough estimate and should be treated as such.

I am excited for the ability to do alerting with grafana. I think that this is the one feature that is keeping grafana from being the ultimate monitoring tool.

If you have an early alpha/beta release, I would love to test & give early feedback with production data.

++

Me 2 lol

+1

+1
If you have an early alpha/beta release, I would love to test & give early feedback with production data.

+1 me 2

+1

great to see all the +1's but FWIW they're not really needed. we already know it's the most eagerly awaited new feature, and once we have tangible progress, the code will appear in a separate branch that anyone can play with. BTW we're also bringing in more people to work full-time on grafana so stay tuned everyone :)

Yes please, there are 484 people "watching" this issue. Each time you "+1" it, 484 people get email notification. Just subscribe to notification and it'll be indication of your interest to the issue.

+1 >; P

Sorry, I know you guys are working very hard on this. Is there any timeline for a first release?

I would be more than happy with being able to configure alerting metrics and thresholds (either through the web interface or API) and a Grafana cronjob/daemon that checks these metrics and does an HTTP POST with JSON or invokes a script with JSON on stdin. It would be _extremely_ simple for individuals to write a simple python script that passes this information on to Pagerduty, Slack, IRC, SMS, email or whatever.
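
To illustrate how little glue that would take, here is a hedged sketch of such a script (Python 3, standard library only); the payload fields and the Slack webhook URL are placeholders, not a defined Grafana format:

#!/usr/bin/env python3
import json
import sys
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def main():
    alert = json.load(sys.stdin)  # the hypothetical JSON handed over by Grafana
    text = "[{0}] {1}: {2}".format(
        alert.get("state", "unknown"),
        alert.get("name", "unnamed alert"),
        alert.get("message", ""))
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    main()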

While I would highly appreciate the convenience, I don't think it's Grafana's job to tightly integrate with third-party tools, and I would rather see a minimalist implementation sooner than a well fleshed out one later.

I completely agree with @anlutro. Please provide something simple to get started. The most interesting thing for me is to enable people to set simple alerts themselves (self-service). Grafana shouldn't try to replace existing alerting/escalation solutions.

I agree with @anlutro as well. But instead of just providing a simple API, I'd rather have the alerting part able to handle custom plugins that interact with the API. That way the base package could include email, pagerduty and a few others, and the community could add to it as needed, similar to how Logstash plugins are handled now.

+1

Any news on alerting system? Any estimates?

+1

+1
Worth mentioning the hits and hysteresis mechanism used by collectd alerts as a concept to consider.

Have you thought about advanced alerting features such as anomaly detection, correlation detection, root cause detection, etc?

+1. Alerting as a plugin subsystem - this would be the most flexible solution. No need to add so many features inside grafana if there are many projects that can do this better in the backend.

@Dieterbe @torkelo It'd be great to have even a very rough "guesstimate" on this. Personally I have been holding off, since metrics-based self-service alerting is a much needed feature in my case and I'm convinced Grafana is the right user front end for it. The problem is, it's now been 6 months and there has been no ETA update or even a comment by one of you for quite a while, so I'm starting to have "I'm just going to have to hack something" counterproductive thoughts... If I could just know whether it's going to be another 6 months versus a few more weeks, I could make a much better informed decision.

Thanks!

+1

+1

+1

@jaimegago really sorry for not updating here on our progress or lack of progress on this issue. We thought we would have time to spend on this but it always ended being pushed because something with higher priority came in the way.

Back in September I began working on Elasticsearch datasource support, which became the foundation for a datasource-focused 2.5 release. After that, the top rated issue in Grafana since v1 was a table panel, and especially after the Elasticsearch support I felt a small release with a table panel was more important than alerting, so that became the foundation for 2.6.

Lately we have a lot of users and companies wanting to integrate more with Grafana, which has prompted us to work on improving the plugin api and capabilities, and which resulted in another postponement of this issue. I am really sorry that we have communicated this so poorly. We always had the ambition to start SOON, but SOON got pushed again and again :(

But there is hope! We have expanded the full time Grafana focused team, in December @bergquist joined and in February we will be getting reinforcement once again. I can't offer an ETA but this is still a high priority issue and we want to start ASAP. We want this feature to be the headline feature for Grafana 3.0.

@torkelo Thanks for the update; I can't say I am happy, but at least now we know where we stand.

One remaining question is if 2.x will be getting more point releases or if 3.x will be the next release. ;)

@RichiH regarding another point release: not sure, but it is likely that there will be another point release before 3.0, probably in February.

@torkelo Thanks a lot for taking the time to provide a detailed update.

maybe this is already on the roadmap; if not, please consider adding "POST" as one of the notification types,
so we can send the alert to a different system to process it, like Kafka.

+1 for SNMP notifications!

+1 I think this is the biggest missing feature from Grafana to make it a viable (and best-in-class) monitoring tool for production.

+1

Any admin (@Dieterbe?) available to lock commenting on this issue to collaborators only? That way we'll only get interesting content on the feature's advancement, and not those useless +1s...

In case you don't know this feature, here is the link to the ad hoc GitHub doc.

Thanks :heart:

@Mayeu uh, as one of the "non-collaborators" who has contributed more than a +1 to this issue and in other places, I don't think locking this issue to collaborators is the way to go. Just create a smart filter on your email ;-).

I also think the +1's fill a purpose, showing the amount and spread of interest for this (and elsewhere).
What is missing, perhaps, is a +1 button in the UI that would fill the same role, but without all the notifications to all subscribers... so a feature request for @github.

We are drifting off topic and this is the first & last time I will write regarding this.

Anyone interested in this issue should subscribe in the upper right; it will keep you informed and you won't send email to everyone.

As to a voting system to prevent accumulation of +1, see https://github.com/dear-github/dear-github - 27 days stale and no reaction from GitHub.

+1

Any news about this ?

I don't think there is any news yet on this issue. But a good thing about Grafana's next release is:

Grafana will be capable of handling custom apps/plugins. Then we can write our own custom alerting plugins/apps and import them into Grafana. Writing these small apps/plugins will be a quick win while waiting for the big alerting feature.

I like the idea of configuring alerts in the same place as visualizing. Amazing mocks in https://www.youtube.com/watch?v=C_H2ew8e5OM but are there any dates for when it will be released?

nice video, thanks!

some feedback.

i'm happy with the idea of simple linear thresholds, and advanced custom queries

notifiers most helpful:

  • exec - could use something like ssh or sendmail
  • webhooks - the user could stand up a web CGI to pick up webhooks to do things...
  • email - fire a simple email with a json dump of the notification data.
  • plugins... not really needed

an api to pull the state of alerts... feels like a bad idea,
however an api to pull the alert config in a basic json format could be nice.
this json dump could be transformed into something that other systems might find helpful.

Not sure if this is frowned upon or not... Couldn't find a donate link anywhere, but what kind of contribution would be necessary to get this into v3 by the end of the month? We could really use this feature a lot but our resources are tied up ATM.

+1

+1

This is a much needed feature for us here at Work Market.

Is the alerts feature launched?

No

Would it be safe to assume the alerts feature will be released in the summer?

_suspense intensifies_

any news about this?

+1

+1 it would be nice to have it already; people have been waiting a whole year or even more.

:+1: I like it. Thanks for the video and presentation, @Dieterbe. Is it available for testing/early adopters?

@torkelo You announced

We want this feature to be the headline feature for Grafana 3.0

But looking at the 3.0 unreleased branch changelog (1), no mention is made of alerting. Should I start crying, or is the plan still to have alerting as the 3.0 headline feature?

(1) https://github.com/grafana/grafana/blob/master/CHANGELOG.md

We made the decision to work out the plugin system for grafana 3 so that we can release grafana 3, and then start working on alerting, instead of needlessly postponing grafana 3.

@Dieterbe Can't say I am happy, but that does make sense. The obvious follow-up is if there's anything ETA-ish for alerting; and if it's a confirmed and committed feature for 3.1.

Also, as a work-around, http://alerta.io/ does part of what I want Grafana to do; people waiting for this feature might want to give it a try.

Is there a spec for the plugin? Might be good to build something in the community to handle alerting to coincide with the release of version 3?

Beth

@Dieterbe I think it would also be nice to have the ability to create notifications on the client side, for example voice messages on a public monitor with a dashboard, so you don't need to look at the dashboard to know that there is some trouble, like Zabbix sound alerts. For this purpose I wrote simple JavaScript code that scans a particular dashboard and, if there is some trouble, notifies me using the Web Speech API. For now it works great for me.

What about using kapacitor as a backend for alerts? Their scripting language is simple and really powerful. Or what about support for multiple alerting backends, and having an abstraction over that?

Now that 3.0 is out I am really excited to hopefully have alerting in grafana. Alerting will make grafana the ultimate tool.

Hi @Dieterbe,

As I can see from this version https://github.com/raintank/grafana (which you said has the alerting package), the repo is now deprecated and it says all new development is going into https://github.com/raintank/worldping-api. That makes me wonder if this alerting feature is still being developed, or whether the plan has changed to something else (like worldPing, which doesn't look like what we've been discussing here).

Hi @minhdanh, the goal has always been to add alerting "properly" into grafana, not just as a hack in a raintank-specific fork, which is the repo you're referring to (and that only covered a very narrow domain anyway, though it may make sense to reuse some of that code once we start work on a scheduler/executor, which that repo had). So that's why we've been working hard on making grafana so pluggable for the upcoming grafana 3 release (and as a result of that, we were able to move our own needs into a standalone app, which is the worldping-api you're referring to).
It's become very clear that as a first step we should just build the UI to manage the rules from inside your grafana dashboards and panels and expose them through some kind of api, so that plugins can use and execute them. This will be the quickest way to get alerting going. the "batteries included scheduler/executor" would then come later and may reuse some of the code you referred to.

anyway, we'll first do the management UI in grafana and expose the rules through the API, and we'll take it from there.

Thanks @dieterbe.

As always, there's the question of a rough timeline, even if it's only "not before X".

I do understand how annoying this question can be, thus the phrasing in the second part. I hope you understand how frustrating waiting on the other side of the fence can be.

Richard

Sent by mobile; excuse my brevity.

Hi all,

i hope it's ok for raintank to say it here, but we very recently ordered almost a month's worth of dedicated coding hours from raintank to work on alerting. So while this will not result in the final alerting feature just yet, we should see something coming up soon to lay the base for alerting in grafana. If other companies would follow our approach, or individuals also invested some cash into this matter, that should speed up things and priorities even more.

@flyersa, thanks so much for the contribution! How can we put cash down as well?

Jon

Hello all,

I know many are anxious for this feature, and I'm pleased to report that the team has started working on it. We explained our reasons for the delay in the Grafana 3.0 beta announcement blog

We will be releasing alerting in two phases. The first phase will allow users to define their alerts and thresholds within the Grafana UI. Grafana will also expose these alert definitions over an HTTP API to third party schedulers and alerting backends. In the second phase, we will provide the backend service to consume and act on these definitions, for a completely integrated solution.

We hope that the first phase is released in a matter of weeks.

We are trying to balance profitability with speed, and we GREATLY appreciate the commercial support of our customers such as @flyersa. If others want to support development of this feature and Grafana in general, please consider purchasing a support plan. It will help us develop great software that is 100% open source.

We will be working closely with all supported customers as we roll out the feature, and making sure that it meets their needs well.

-raj dutt | ceo/cofounder | raintank

Hi @nopzor1200 ,

Thanks for your update. Do you have an estimate of when alerting will be available?

Obviously, it is impossible to commit to a specific date, but a time frame would be much appreciated (weeks, months etc).

10x!

Hi guys, really excited for this. Here's how I'm envisioning using this; if someone can spot-check that it's a standard/supported pattern I'd appreciate it.

  • Each host I want to monitor emits "Checks". A "Check" consists of:

    • the Hostname

    • the Timestamp

    • the State, which is either 0=OK, 1=WARN or 2=CRITICAL

  • These checks can come from a variety of arbitrary sources (shell scripts + cron, statsd/collectd, Nagios checks, etc.) and will be accumulated in Elasticsearch (a sketch of such a check document follows this list). The same check may have different configurations on different hosts, but this will be opaque to Grafana.
  • I will configure Grafana to connect to Elasticsearch and to alert when any check coming from any host has a State value >= 1.
  • If new hosts join the cluster, there is no configuration required in Grafana; Grafana will simply see any data-point in state 1 or 2, regardless of where it came from.
  • If a host dies suddenly and stops transmitting checks, we need to detect this. To handle this, when a host starts up it will register a master check with "ON" status, and the value will only go to "OFF" when it is stopped normally. This way I can look for any "ON" hosts which have not emitted checks in the last X seconds.
  • In general I will not use threshold-based alerts on time-series data in Grafana. In other words, I will not do "check if CPU Usage > 80%" within Grafana itself; rather, Grafana will receive a "CPU Usage State" check (0 / 1 / 2) and alert in the 1 or 2 state.
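
For illustration only, a check document pushed straight at the Elasticsearch REST API could look roughly like this; the index name, field names and the ES URL are assumptions, not something Grafana prescribes:

import json
import time
import urllib.request

check = {
    "hostname": "web-03",
    "timestamp": int(time.time() * 1000),  # epoch millis
    "check": "cpu_usage_state",
    "state": 2,                            # 0=OK, 1=WARN, 2=CRITICAL
}

req = urllib.request.Request(
    "http://localhost:9200/checks/check",  # POST lets Elasticsearch assign the document id
    data=json.dumps(check).encode("utf-8"),
    headers={"Content-Type": "application/json"})
urllib.request.urlopen(req)

A Grafana panel/alert would then just query for documents with state >= 1.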

Hey @johnnyshields,

That looks pretty good, but instead of "0=OK, 1=WARN or 2=CRITICAL" why not use a standard level definition? The one used by syslog is pretty much a de facto standard for these things:

  • Value -> Severity
  • 0 -> Emergency
  • 1 -> Alert
  • 2 -> Critical
  • 3 -> Error
  • 4 -> Warning
  • 5 -> Notice
  • 6 -> Informational
  • 7 -> Debug

And have a (global?) configuration to tell grafana which level to consider as an alert threshold.

Given this, I would add the following changes to your post:

  • alert when any check coming from any host has State value >= CONFIGURABLE_ALERT_LEVEL.
  • Grafana will simply see any data-point in state >= CONFIGURABLE_ALERT_LEVEL regardless where it came from
  • Grafana will receive a "CPU Usage State" check level and alert if configured accordingly.

@brunorey thanks, makes sense!

Log levels and states are two different things. You could have a 6-Informational log message, but how can anything be in a 6-Informational state?

States of OK, WARN, and CRITICAL are fine, and may be too fine for those who only care about OK and CRITICAL. Adding more states adds confusion unless their meaning is universally understood, and I suggest capping at 3.

Regarding only alerting on "CPU state >= WARN" vs. "CPU > 80%", I submit that some people will want to keep their 3-level states in a time series DB so they can see how the state changed over time. Those people will alert based on their state time series data. Others will want to alert on the raw CPU value being over 80%. The point is, alerting off time series data is the only thing needed.

The reason I'm choosing integer check states rather than using the timeseries data directly is that I want to be able to control what is considered an alert on each node.

For example, worker servers routinely have CPU near 100%, and it's not a problem -- I want them firing full throttle on all cores. But web servers should not have CPU above 20%. So if I were to make a generic CPU > 80% it would be too high for the webs and too low for the workers. (This is just one case).

@johnnyshields I don't understand why you would not use threshold based alerts on time-series data, IMO that's where the really strong (only?) value of adding alerting to grafana/graphite is. Your "checks" style stuff sounds better suited to something simple like monit - have I missed something here?

As explained above, I have lots of servers with different roles and the thresholds are different per server. Ultimately it's a question of whether thresholds are defined within Grafana or on the server itself; I think the server is easier in my case.

In addition, some checks are "yes or no", e.g. is process X running, does a ping to port Y respond, etc.

Understood. Sometimes determining these states is simple (>80%), and sometimes it's complex. When they're complex, some code will determine the levels and send the level into a TS database. That's a common practice, where data is transformed into information. My point is, there's no difference from an alerting sense.

If you need complex rules for your alerts, don't build the rules into the alert engine, build the rules into the TS pipeline to create new TS data, and alert off that.

Simplify the alerting system. Hide complexity in the TS pipeline.

The benefit of creating new TS data in a pipeline vs. a rules based alerting system is it keeps alerts visual and simple for the people setting and getting the alerts. There's a visualization that can be sent via email or sms showing just the thing that's alerted on - even if it's a simple state chart where they see the state went from WARN to CRITICAL 20 minutes ago.

I think if you want to control what's considered alert-worthy on a per-host/role basis you're just as well off adding logic to what's considered WARN and what's considered CRIT as you are adding 8 layers of granularity to the severity of the alert.

Almost all other modern alerting systems seem to have converged on OK/WARN/CRIT, and although that's probably partly due to wanting compatibility with Nagios checks, I think the idea of just wanting to keep it simple is more important. If Grafana does the same, integration with other alerting/monitoring services will be easier. For example, in my case I'd probably end up feeding Grafana alerts to Sensu, which would then send out an email, slack message or whatever. Sensu only has OK/WARN/CRIT, so any more granularity would be wasted.

Agreed, syslog-style alert levels seem like over-engineering. OK, Warn, Crit will likely do the job for most cases.

On alerts thresholds, I'd love to be able to do standard deviation based alerts. They are most useful in practice imo.


Personally I've been looking forward to alerting using existing TS data fed into graphite as input, especially aggregating stats from application metrics (via StatsD) within specified time ranges.

Also, it would be nice to have an option whereby alerts could be triggered at the threshold and at specified intervals past the threshold - e.g. setting an "rpl_delay" alert with threshold 200 and interval 50 would cause alerts at 200, 250, 300 etc. without the need to manually specify additional threshold levels.
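
A rough sketch of that arithmetic (plain Python, purely illustrative), using the rpl_delay numbers from above:

def breached_level(value, threshold, interval):
    """Return the highest alert level crossed (threshold, threshold+interval, ...), or None."""
    if value < threshold:
        return None
    return threshold + interval * int((value - threshold) // interval)

print(breached_level(180, 200, 50))  # None
print(breached_level(205, 200, 50))  # 200
print(breached_level(260, 200, 50))  # 250

A notifier would then only re-send when the returned level changes.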

@johnnyshields I don't get the difference between 1=WARN and 2=CRITICAL. The alert is either triggered or not triggered. You are either above 80% or not above 80%. So I only see two states, 0 and 1.
It would also be nice to have smarter alerting where you can detect that you have been above 80% for 5 minutes straight, so you are not alerted on temporary spikes. Even more advanced are things like moving thresholds, where for example you monitor your website traffic, which slowly increases, say, 1% a month, but all of a sudden you get a 10% spike in traffic in an hour. You would also want to be able to monitor for the opposite, a sudden drop in traffic. Something similar to https://github.com/etsy/skyline since Skyline is defunct.
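
A minimal sketch of the "above the threshold for 5 minutes straight" idea (plain Python, not Grafana code); it assumes the datapoints are (timestamp, value) pairs already fetched from the datasource:

def sustained_breach(datapoints, threshold, window_seconds, now):
    """True only if every sample in the window exceeds the threshold."""
    window = [v for ts, v in datapoints if ts >= now - window_seconds]
    # No data in the window is treated as "no alert" here; a real implementation
    # would probably want a separate "no data" state.
    return bool(window) and all(v > threshold for v in window)

# CPU above 80% for the last 5 minutes
points = [(1000, 85.0), (1060, 91.2), (1120, 88.4), (1180, 93.0), (1240, 86.1)]
print(sustained_breach(points, threshold=80.0, window_seconds=300, now=1300))  # True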

Guys, my post here was not about the precise number of alert states to use--I was asking more generally "will Grafana support enumerated states as an alerting use case?"

Since there is disagreement on the optimal number (Nagios uses 3, syslog uses 7, some people like 2, etc.) Grafana should support an arbitrary number of states.

Just restating what I said before: I believe there should only be two states for each alert, triggered (1) or not triggered (0). If you want to know whether you are getting close to a threshold, then add an additional threshold for the lower value.

The reason for WARN vs. CRITICAL is that the actions you take are different. One group of people and actions are generally notified on WARN, and a different group / different actions on CRITICAL.

That's a pretty valuable differentiation, that I wouldn't want to go away with a 0-1 system.

@lorenwest if you want a different check for a different threshold, make an additional threshold for that separate group,
so each threshold is either 0 or 1.
For example, another reason you would want alerting set up this way is when you want an email when the value is greater than 70% but a page when you are over 80%. Just like how you want separate groups. WARN vs. CRITICAL has too much ambiguity.

@doanerock that makes sense. Allowing an arbitrary number of alerts on any one TS metric or event allows the most flexibility. That simplifies the definition of an alert by not having multiple actions for multiple levels.

so:

  • metrics can have an arbitrary number of states (including decimal / timeseries values)
  • metrics can have multiple alert actions attached to the same metric
  • each alert is a boolean true or false -- it's either triggered or not triggered.

to give an example:

  • I have a certain metric with values 0 = OK, 1 = WARN, 2 = CRITICAL
  • I configure 3 alerts:

    • if value = 1, show a yellow flag in my dashboard

    • if value = 2, show a red flag in my dashboard

    • if value >= 1, send me an email

Hello All,

I don't know if it's the right place to ask about this topic, but I will try anyway regarding the upcoming Grafana alerting module.
In our organization we have all our security alert sensors feeding Logstash/Elasticsearch with events, and we use Yelp/Elastalert to execute alerts on certain patterns with the following criteria:

"Match where there are X events in Y time" (frequency type)
"Match when the rate of events increases or decreases" (spike type)
"Match when a certain field matches a blacklist/whitelist" (blacklist and whitelist type)
"Match on any event matching a given filter" (any type)
"Match when a field has two different values within some time" (change type)

In addition, when alert criteria are detected we execute an external Python script, passing arguments from Elastalert to the script with information like the source/destination IP field and the event and timestamp fields, and our NAC system takes it from there.

Now, looking at Grafana's upcoming alerting module with Elasticsearch as a data source, I wonder if the Grafana alerting module will be able to work the same way as Elastalert and eventually replace it, given the above information?
Please advise

Thanks

I know the grafana team are hard at work at this, and this thread is long but I want to point out that Kapacitor just merged a feature that will greatly ease the development of frontend alert-configuring applications: influxdata/kapacitor#577

As far as I understand, the goal on the Grafana side is to make the alerting backend pluggable (same as how Grafana supports several TSDB stores), but I wanted to mention it in the hope that Kapacitor gets first-class support when Grafana's alerting functionality is released. It looks like a great fit, as are InfluxDB + Grafana.

@thom-nic thanks for the tip Kapacitor is exactly what I'm looking for...

Riemann is also great and very powerful. telegraf -> riemann (alerting) -> influxdb <- grafana

We are making progress in the alerting_definitions branch.

We now have a simple alert rule model that you can define in the UI and via HTTP api. You can fetch rules, rule changes, rule states via HTTP api. Simple alert rule scheduling and query execution and rule execution is also starting to come together.

One thing that is a big question mark for me now is whether the current alert model is too simple and a dead end. By that I mean that extending the alert rule model in the future is going to require extensive changes.

Current rule model:

description
query: #A  (referencing a query in the metrics tab)
aggregation: avg
timerange:  3600   (seconds from now to look back when fetching data)
frequency: 60  (how often to execute alert rule query and evaluation rule)
critLevel: 80
warnLevel: 50

This storage model is both represented in the UI and in the actual database table. My fear is that this simple rule model does not take advantage of time series data well enough. You cannot specify dynamic thresholds (where the thresholds themselves are the results of a query). Of course this could be added later, but it would require a very different rule model and execution engine.

So my idea is to scrap this simple model and to come up with a new more complex and dynamic model that can later support multiple queries for different time ranges.

Simple Query:

"alert": {
   "name": "CPU usage last 5min above 90%",
   "frequency": "1m",      
   "expr": "query(#A, 5m, now, avg)",
   "operator": ">",
   "critLevel": 90,
  },

// now to an alert that uses a dynamic threshold based on its past values

"alert": {
   "name": "CPU usage last 5m is 20% higher compared to last 24hours avg",
   "frequency": "1m",
   "expr": "query(#A, 5m, now, avg) => percentDiff() => query(#A, 1d, now, avg)",
   "operator": ">",
   "critLevel": 20,
  },

Now you might question this by stating that we are reinventing Graphite here, and that expressions like this should be handled by the TSDB. But no TSDB supports calculations with queries over different time ranges (timeShift only shifts the same time span). One problem with dynamic thresholds is how to visualize them. They can also make the alerting rule more divorced from what is actually visualized in the panel.

I am quite unsure how the GAL (Grafana Alerting Language) should look. Should it be just expression chains, where each part can either be a query that returns one or more series (each series aggregated down to one point), followed by an optional subtract or percent function that can compare against another query? The whole expression results in a value that can then be used with the operator and crit/warn levels to get the alert state.

Or should the expression contain the operator and levels?

Another option would be to go full programming language and do:

expr: "
let last5mAvg = query(#A, 5m, now, avg)
let last24hAvg = query(#A, 1d, now, avg)

return percentDiff(last5mAvg, last24hAvg)
"

@torkelo:

  1. are you architecting this as a standalone component? Ultimately you're building a signal processor similar to Kapacitor for InfluxDB** which itself emits a signal (0 = "ok", 1 = "warn", 2 = "crit"). Will it be possible to send this signal somewhere other than Grafana, e.g. A) to Nagios, or B) pipe it back into the DB?
  2. similarly, will grafana have an option to not use your signal engine above, but rather receive a 0/1/2 signal from a third-party source such as a Nagios plugin, of which many already exist in the wild?

** = granted, Kapacitor uses timeseries stream processing whereas yours is a polling-based engine, but it still emits a signal.

Thank you for soliciting input.

My opinion is to keep grafana alerts simple, and the best gauge for simplicity is visualization. If you can't visualize the alert as a line on an existing TS graph, it's too complex.

Leave the complexity to the TS graph. If the alert has greater needs, build another set of TS data based on those needs, and place the alert on a graph of that data.

If you have only one guiding principle it's to require a simple visualization of the alert.

The other issue is "how many alerts should I configure"? This topic has been discussed on this thread, and I'm of the opinion that as soon as you start putting multiple alerts into one alert (warn, error, high warn, low error, etc), you start to lose flexibility. Warnings and errors are different things - they have different levels, different people care about them, and they have different notification methods.

Keep alerts simple, and let people put multiple alerts onto a graph.

I think that #3677 (Generic Transforms on Time Series Query Results) would come in really handy here. With those TSDB independent functions, you can create a complex "alerting graph" where you can use simple fixed value thresholds for warn, crit, etc.

The simple alert rule model would be all that is needed then. The complexity is then "hidden" in the creation and combination of the graphs.

I'm all for keeping it simple. I'm no dev, I'm more light-touch-dev-ops, and I'd like to be able to hand over my Grafana/Graphite platform to my team of admins to manage. This being the case, an alert builder similar to the existing query builder is going to be much easier going. I'm not too fussed if it introduces a load of new instructions; as long as rules can still be constructed the same way queries for graphs currently are, it'll be easy to get to grips with.

tl;dr a whole new language may be overkill and too complex. Building rules with mouse=good.

Short of building a whole new language, I assumed this would largely be a frontend to existing alerting platforms such as Kapacitor, Riemann & Bosun, similar to how Grafana provides a frontend to compose InfluxDB queries; e.g. the heavy lifting is done by a third-party alerting system and Grafana provides the UI. Maybe that's not the case?

IIRC, Grafana wants to go the "batteries included, but removable" way. I.e. it should work standalone with an included alerting engine but should also be pluggable into existing platforms.

I'd say it needs to come with a couple of inbuilt methods - email (supply SMTP host) and WebAPI/Webhook. Then the rest can come with plugins, such as integration into PagerDuty.

@felixbarny could you describe what you mean by pluggable into existing platforms? Of course the alerting notifications will integrate with many existing alerting tools. But having other systems handle alert rule scheduling and execution could be tricky. It would be possible (just read the rules from the Grafana HTTP API), but it would require a lot of code to handle the rule scheduling and execution. We will of course provide an option to only define the rules in Grafana, and let another system constantly read the rules and execute them.

@GriffReborn you are thinking at a different level. Existing alerting backends that I've mentioned _already_ support outputs such as SMTP, PagerDuty, etc:
https://docs.influxdata.com/kapacitor/v0.13//introduction/getting_started/#a-real-world-example
http://riemann.io/api/riemann.pagerduty.html

These products _already_ do complex, dynamic alerting well. What they don't have is a nice visual frontend for configuring and managing alerts, visually identifying which alerts are active, etc. What I would have liked to have is a frontend UI which basically pushes configurations to e.g. your (Grafana-supported) alerting system of choice which then actually does all the work.

@thom-nic I agree. The main focus should be building alerting dashboard that can use existing alert info feeds ("feed agnostic"). Making a Grafana-sponsored lightweight signal-processing engine (ideally as a standalone) should be a secondary concern.

@johnnyshields making new panels that show info from existing alerting backends is easy; anyone who wishes to can do that. What we are trying to do is make it easy for Grafana users to define alerting rules on the metric queries they define in the graph / singlestat panels, and then have an alerting engine in the grafana backend that schedules, executes and evaluates those rules, updates alert state, triggers notifications etc.

I also think the simple model should be sufficient and will also result in having the long-awaited feature in as soon as possible. After all, grafana is for metrics; basic alerting should be sufficient.

@torkelo to be honest, I'm not very familiar with alerting platforms like bosun and I don't know specifically what a proper integration could look like. I was referring to things @Dieterbe said, for example in his Grafanacon presentation: http://de.slideshare.net/Dieterbe/alerting-in-grafana-grafanacon-2015#50

@felixbarny well, that is what we plan to do as well: to have APIs for other alerting backends to use in order to read the rules defined in Grafana. But we will not provide the bridge that reads the alert rules from Grafana and translates them for another rule execution engine.

So one idea we have now is to define simple rules like this

image

But also be able to have dynamic thresholds and compare against another query, or the same query but with a different time range and aggregation.

image

Another complex "forecast" query. Where a query is used to get a trend line, then forecast that forward in time and alert on that.

image

Seems like the best of both worlds. Love that idea! Are the 'Evaluate Against' functions part of Grafana or are they TSDB specific?

@felixbarny they are part of the Grafana alert rule model and will be processed by the Grafana alerting rule evaluation engine.

Will you be able to attach multiple rules to a single graph? I like the simplicity of warn/critical levels in one rule, and some graphs have both high and low thresholds which would either require multiple levels in one alert, or multiple alerts on one graph.

And while I like the complex rule functionality, this can all be achieved by building a different graph and alerting on that graph with a simple rule. The benefit of keeping the complexity out of the alerting system is the history of circumstances causing the rule to fire is kept in the TSDB.

This lets you visualize an alert as a simple horizontal line on a graph, and see how that rule would have (or did) fire over time.

It keeps alerting simple for the average person, complex enough for everyone, and accessible for those who understand things visually.

@lorenwest yes, we will keep things simple and only allow one alert rule per panel. But a rule can use a query that returns many series, which will basically split the rule into multiple rules (so you can have a single rule that will check each server, if the query returns a series per server).

And while I like the complex rule functionality, this can all be achieved by building a different graph and alerting on that graph with a simple rule.

Not sure what you mean here. Another graph does not at all solve the scenario where you want to alert on a query compared to itself over a different time range, or compared to another query entirely (maybe the other query is another data source fetching dynamic thresholds from a database). That scenario cannot be solved in the TSDB or by just splitting the rule into two rules in two separate panels.

But our main goal is to solve the simple case and make that easy and intuitive. We also want to, at least later, support some more complex alerting rules that really take advantage of the fact that you are dealing with TSDB data, and also of the fact that different queries can target different data sources.

I think the point @lorenwest was making is that with alerting rules being simple thresholds, the rules are applied to the data that is being visualized in the graph. So if you overlay the thresholds you can clearly see where in the past the alert would have triggered based on the current thresholds.

With a more complex alerting model there is no longer a visible indicator of where the thresholds would result in an alert.

Sticking with the simple model, you could achieve many of the complex monitoring requirements provided the datasource provides the capability. For the "percent change compared to", you could build a graphite query (a different graph) that compares the current day with the previous one, and then set simple thresholds on that. It is certainly a much more complicated process to create alerts this way, but it does work.

image
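
For example (illustrative only; the metric name is made up), a Graphite target along these lines expresses "current traffic as a percentage of the same time yesterday", and a plain static threshold can then be set on it:

asPercent(site.requests.count, timeShift(site.requests.count, "1d"))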

Glad we're on the same page @torkelo. This fits in with the description in the original post.

I don't fancy creating a whole new alerting platform to tie into Grafana. What I'm hoping for from Grafana alerting is something to replace NewRelic, but with the awesome power that Grafana brings. Being able to trigger an alert (whether email, API or whatever) when one of my graphs hits a threshold... that's GOLD. Life changing stuff.

Even simple threshold alerts would be a nice simple solution.

grafana-threshold-alerting

If you follow this one rule, you'll be golden:

Never allow an alert that can't be visualized by overlaying on a panel.

If you can't visualize it, it's too complex. Build a chart that embodies this complexity, and alert on that chart. This forces us to build visualizations that embody that complexity (a good thing) while keeping it easy for the alert builder (and consumer) to see what they're getting themselves into.

@woodsaj I agree that we want to encourage the link between what you alert on and what you see; that is not something we have ever discussed abandoning. What we are trying to brainstorm is how far single-query static thresholds go: are they good enough for v2 of Grafana Alerting or v3? And to spark a discussion on the limitations in the kind of alert rules that are possible with a single query and static thresholds.

Currently TSDBs are very inflexible in what kind of nested queries you can do (compare a series against itself, for example). Graphite is the only one supporting nested queries. But even Graphite cannot compare two queries that target different time windows (time shift just shifts the same window, not a differently sized time window). But the more I think about this, the more I agree that most of this can be solved in the TSDB query, given it is powerful enough.

The main reason for raising this discussion is to brainstorm how to model the rule: what are the components that make up a rule, and what abstractions does it contain (query, time window, aggregation, levels etc)? How can we possibly support dynamic thresholds in v2, or more feature-rich alert queries that forecast trends into the future? How would the model and rule evaluation engine need to change?

In regards to "Should alerts map to panels" - I think that may be a useful option but would be a bad design constraint, even for v1.

I think one of the more tricky aspects of alerting is scope, and once you start to talk visualization then the problem becomes apparent.

I think of scope as the surface area / depth of a system that an alert covers. So for example, your alerts might be scoped to:

  • Services (application metrics)
  • Whole clusters that make up a service
  • Individual nodes in a cluster
  • Hosts / processes in a cluster
  • Subsystem of processes / applications (middleware metrics)
  • Subsystems of hosts (i.e. disk, cpu) (system metrics)

I don't believe there is a single "correct" answer on what layer one should alert on. Sometimes it depends on teams, importance of the service, general infrastructure (i.e. cloud vs hardware, cluster vs monolith), etc... So given layered scopes, an alerting hierarchy seems like a good idea. But I don't think defining those hierarchies is generally maintainable. It is a lot of work, it changes, and there are often relations that don't make for pretty trees in real-world systems. Google's SRE book agrees:

"""
Google SRE has experienced only limited success with complex dependency hierarchies. We seldom use rules such as, "If I know the database is slow, alert for a slow database; otherwise, alert for the website being generally slow." Dependency-reliant rules usually pertain to very stable parts of our system, such as our system for draining user traffic away from a datacenter. For example, "If a datacenter is drained, then don’t alert me on its latency" is one common datacenter alerting rule. Few teams at Google maintain complex dependency hierarchies because our infrastructure has a steady rate of continuous refactoring.
"""

Also related to scope is the type of alert (i.e. send an email vs log it / show on the dashboard for someone to deal with when they are doing their morning rounds)

So for Grafana my alerts might map to:

  • A Panel
  • A Group of Panels
  • A Dashboard
  • A Group of Dashboards (that I imagine would have drill downs)

Sometimes I will want those alerts to send a notification, other times I will want them just to be a visual indicator somewhere in Grafana at one of the scopes (i.e. crossed threshold, or state changes as annotation markers). It is going to be different for different companies and even different groups/services within a company.

@kylebrandt the whole idea with alerting in Grafana is to tie it to panels and to visualizations, where you can have graphs and panels that visualize metrics with different scopes (like services, clusters, individual hosts), and by having that you can alert on any level or scope.

Not seeing how linking an alert to a panel and to something that can be visualized will stop you from defining alerts on different levels. And of course you will specify per alert what notifications should be used.

@torkelo The decision to alert will always boil down to a (true/false) bool. There could be different levels (warn, crit, etc...) or even fuzzy logic, but in the end it is a boolean decision. Graphing that bool by itself in a single panel won't always be helpful.

So, $metric > $threshold is the most basic alert, and it returns true if the metric exceeds the threshold, of course. That fits nicely into a panel (visualize the metric and visualize the threshold within a panel). But, in order to eliminate alerting noise, the scope and conditions tend to grow beyond that in the majority of cases (when we started to work on Bosun, I thought these cases would be the minority; not so much, it turns out, if you want to control the noise). So you might say something like:

Alert if:

  • CPU is above 80% for X minutes
  • Job A is not running (we know it raises CPU and we don't care) and Job A has not been running for more than an hour
  • Dieter has had more than 3 cups of starbucks in the last 24 hours ( because when he has more he does silly things that raise the CPU and we don't want to alert on those)

So visualizing just the alert (True / False) when there are multiple conditions is not that useful. We need to visualize each condition (and then maybe even some more for supporting info).

Making all those conditions into a new metric doesn't really help with visualization in the moment, because it would just be true/false and what you really need to see is all the underlying info. So instead of visualizing metric + threshold we end up visualizing metric(s) + threshold(s), which could be at different scales.

So in this case, yes, the alert _can_ map to a single panel, but depending on the visualization and the alert there are many cases where that isn't really what one would want. I would want a panel for each of the boolean items that make up the alert, to see which of them tripped - but in order to avoid alert fatigue I only want one notification for the combination of all the conditions.

It seems some sort of alert joiner with simple boolean logic might make this easy:

alert1:
  select: CPU is above 80% for X minutes
  output: null
alert2:
  select: Job A is not running
  output: null
alert3:
  select: Job A has been running for more than an hour
  output: send alert
alert4:
  select: Dieter has had more than 3 cups of starbucks in the last 24 hours
  output: null

(alert joiner does simple true/false logic and perhaps can graph it.)
alert5:
  database: alerts
  select: alert1 & alert2 & !alert4
  output: send alert
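
A minimal sketch of that joiner idea in Go (this is not Grafana code, just an illustration of the boolean combination; the conditions and their results are made up):

package main

import "fmt"

// Condition is one boolean building block of an alert rule,
// e.g. "CPU above 80% for X minutes".
type Condition struct {
	Name string
	Eval func() bool // a real system would run a query and compare against a threshold
}

// joined mirrors alert5 above: alert1 AND alert2 AND NOT alert4.
func joined(a1, a2, a4 Condition) bool {
	return a1.Eval() && a2.Eval() && !a4.Eval()
}

func main() {
	cpuHigh := Condition{"CPU above 80% for X minutes", func() bool { return true }}
	jobNotRunning := Condition{"Job A is not running", func() bool { return true }}
	tooMuchCoffee := Condition{"more than 3 cups in 24 hours", func() bool { return false }}

	// Each condition stays individually visible (its own panel / its own state)...
	for _, c := range []Condition{cpuHigh, jobNotRunning, tooMuchCoffee} {
		fmt.Printf("%-30s %v\n", c.Name, c.Eval())
	}

	// ...but only the joined result drives a single notification.
	if joined(cpuHigh, jobNotRunning, tooMuchCoffee) {
		fmt.Println("send one notification for the combined alert")
	}
}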

@torkelo I pulled the alerting_definitions branch from GitHub and built it according to the instructions. But unfortunately I can't see any "Alerting" tab (presented above) in the graph panel.
In addition, I found "alerting: enabled=false" within the "Server settings" of "Server Administration". Does that affect the alerting feature? Is there any build or runtime flag I should use?
Please advise.

I had a try with the latest code (ebada26b85d8410142c2942ef7ba78b17e88313c), enabled alerting and got the UI.

But got tons of errors

EROR[06-17|14:38:23] Failed to extract alerts from dashboard  logger=alerting.extractor error="missing query.query"
EROR[06-17|14:38:23] Failed to save alerts                    logger=context userId=1 orgId=1 uname=admin error="Failed to extract alerts from dashboard"

I tried with InfluxDB data sources, in proxy and direct mode.

Is this expected?

Yes, it's not ready to test yet.

Ok good to know.

I'll track updates from time to time.
Maybe it's better to wait for this branch to be merged into master, so it's reasonably ready to use?

Yes, we hope to merge it to master around mid July or thereabouts.

Do you have a progress update on this?
Are you still going to hit mid-July?
Having this feature in production ASAP would really be a huge help!

Even a light version with email-only alerting would be so great!
An update on your progress would be great (I need to choose between implementing a custom alert system or relying on Grafana, and I definitely prefer the 2nd option!).
Thank you guys

Winter has come, so will alerting :)

I would consider this a "business requirement" and would advise that it be evaluated at an "enterprise architecture" level. By applying _some_ of the practices and patterns used for Enterprise Software Architecture you will be able to communicate your ideas through agile modeling which in turn promotes a higher quality of understanding for both the stakeholders and the development team.

Before we can start talking technology and architecture secret sauce we need to agree on at least the following:

  1. We think of our features in terms of "Business Process Management (BPM)"; and
  2. We use the "Business Process Modeling Language (BPML)" so we can begin modeling the requirements & implementations in the same place with UML.
  3. We define our architecture with an enterprise level discipline.

Now the fun part! Having extensive experience with monitoring at a global scale I recommend that we take into consideration the following:

  • Leave grafana alone, it is the presentation layer! If you want to add a workflow for modeling & defining rules for generating alerts that is okay, but leave it at that. After all, that is why panels and plugins were implemented, right?
  • Leave the data where it was destined to be. The metrics that are phoned home should be treated as first-class citizens and storing their values persistently is the TOP priority. Whether it is in a cache, documentdb, tsdb, or a sql server, it doesn't matter. Fault tolerance is implied (with the right "technology selections" being made for the architecture, of course).
  • To get us set up for availability + scalability we need to use the right frameworks that were designed specifically for this: meet Service-Oriented Architecture ("SOA"). At a very high level we can transmit and receive events & messages over the "AMQP" protocol. Forget REST & HTTP... for now. Using a message queue server like RabbitMQ or ZeroMQ, we have a distributed, fault-tolerant, highly available communication pipeline that both publishers/senders of data and workers/receivers that process it can use. This is the "enterprise service bus". (Check out this slide deck explaining ZeroMQ.)
  • Use a query language created specifically for disparate, unlinked, composite data models, by using a "Graph Database" and the "SPARQL" query interface:

SPARQL allows users to write queries against what can loosely be called "key-value" data or, more specifically, data that follows the RDF specification of the W3C. The entire database is thus a set of "subject-predicate-object" triples. This is analogous to some NoSQL databases' usage of the term "document-key-value", such as MongoDB.
[..]
SPARQL thus provides a full set of analytic query operations such as JOIN, SORT, AGGREGATE for data whose schema is intrinsically part of the data rather than requiring a separate schema definition. Schema information (the ontology) is often provided externally, though, to allow different datasets to be joined in an unambiguous manner. In addition, SPARQL provides specific graph traversal syntax for data that can be thought of as a graph.
..
https://en.wikipedia.org/wiki/SPARQL

Remember, what Grafana has given us that Nagios never did still boils down to a single point of failure: lack of scalability. Grafana is "fast", as you say, but you are not taking into account the fact that you are only storing and processing time series data -- not the metadata layer(s) as well! We need the semantics of SPARQL and the power of Elasticache + graph database engine(s).

It may sound complex -- and it can easily get way more complex than these two pages -- but I saved you from years of brute force and trial & error and weeded out the noise (i.e. there are 30 design patterns for enterprise architecture, 12 for UML, etc.; we just need to talk about 3 to be able to knock this out -- for now).

This should get the gears turning.. I need to get some sleep (pulled an all nighter) and I will work on Part 2. Feel free to ping me @appsoa on skype or yomateo on IRC.

Some treats in the meantime:

@talbaror Ideally you would capture the NAC's log messages using an agent like with a PIX firewall and have them simply shipped out/replayed over rsyslogd or whatever protocol the event processing server uses.

If you don't have an event processing service setup you can use the rules processing of Snort - Network Intrusion Detector. Ping me if you need help .. I spent 4 years at a security-as-a-service company ;)

Can you integrate anomaly detection like banshee?
With visual markers and alerting.

@torkelo please give us a mark-to-market on the timeline for shipping this?

@johnnyshields I am working on this every day now. It's tricky stuff and I really want to get the fundamentals right so the alert system can evolve and become richer in the future. The current model I am working with is looking really good; I will post an update next week on the new conditions-based alert rule model.

We hope to merge it to master and have it available (behind a feature toggle) within 2 weeks if things go smoothly. We do not have a set date yet for the next version of Grafana; it will be either a 3.2 release in September or a bigger 4.0 release at the end of October.

@torkelo Hope we get the alerting as soon as possible. Waiting for it.
Using grafana for kubernetes.

For other folks who already have statsd/graphite/grafana in place and are just waiting for the Grafana alerting system to be ready for their first alerts, I found a great alternative to use in the meantime, Seyren: https://github.com/scobal/seyren

It integrates easily with PagerDuty, and you can just copy the graph targets that you already have in your grafana dashboards for alerting, specifying the warning and error thresholds.

Looks like the team has been making great progress on the alerting feature. I believe in the "do just one thing but do it well" philosophy, so I'm not sure putting the whole alerting logic inside Grafana is the best idea. Anyway, I just wrote a small Node.js daemon, "flapjack-grafana-receiver", to post Grafana events to Flapjack. I will probably open-source it. Anyone interested?

https://github.com/Charles546/flapjack-grafana-receiver

Progress update!

At least one person has been working full time on alerting since April, progress has not been as quick as we would have liked due to many rewrites. Even though we are aiming for basic alerting features for the initial version we feel that it is important to get the fundamental alert rule model right so we can expand the alert rule definition and alert rule evaluation engine in future releases without a major overhaul.

The goal of starting with very simple alerting has taken us down some dead ends that did not feel right and has required some big rewrites. But we are now back on track and making good progress on a conditions based rule model that we are much happier with.

image

Rule definition

The new alert rule model is composed of one or more conditions. Conditions can be of different types. Right now there is only a query type, but we can later add conditions like time of day, day of week, or, more interestingly, other alert (so you can include the state of another alert rule as a condition).

The query condition is composed of a query and time range, plus a reducer that will take all the data points returned for each series and reduce them down to a single value to be used in the threshold comparison. The reducer could also in the future be a "forecast" that does a linear regression on the data and forecasts a future value.

The evaluation part of the query condition can either be greater than, less than, between etc. You will be able to drag handles in the graph to set thresholds.

The conditions-based model provides lots of exciting possibilities for making alert rules more powerful in the future without a total engine overhaul. The query condition also has pluggable components that will allow for extension (a reducer with params, and an evaluator with params).
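
For illustration only, a rule in this conditions-based model could end up stored on the panel as JSON along these lines (a rough sketch of the model described above; the exact field names in the released version may differ):

"alert": {
  "name": "CPU usage alert",
  "frequency": "60s",
  "conditions": [
    {
      "type": "query",
      "query": { "params": ["A", "5m", "now"] },
      "reducer": { "type": "avg", "params": [] },
      "evaluator": { "type": "gt", "params": [80] }
    }
  ],
  "notifications": []
}

Here "avg" is the reducer that collapses each returned series to a single value, and "gt" with the param 80 is the evaluator; those are the pluggable pieces with params mentioned above.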

Notifications

This past week we have been working on notifications and things are starting to come together!

image

We have, email, webhook and slack notification types. The slack notification is looking pretty good :)
image

Want to help?

You can test & give feedback already; the code lives in the alerting branch, and you also need to enable it in the config file with:

[alerting]
enabled = true

Merge to master

We are very close to merging this to master and continuing the work there. I had hoped to do this before my summer vacation (I was only gone one week), but there are still some minor SQL schema changes I would like to do before merging. Merge to master WILL happen by August 19, I promise :) After that alerting will be in the latest 4.0 nightly build, so it will be easy for you to test and report bugs and feedback.

What remains?

There are a number of features that are missing that we want for a beta release.

  • More reducers & ability to change reducer (only avg now)
  • Email notification looks like crap
  • Lock down schema for webhook
  • Design for alert list page
  • View alert history
  • View alert history as annotations on graph
  • Alert scheduler & engine stability
  • Alert scheduler improvements to spread out load (so alerts are not executed at the same time)
  • Alert scheduler clustering

I am really sorry that this feature is taking so long.

@torkelo please have the ability to put machines in maintenance mode for a set period in the beta.

@torkelo Thanks for the update. From what I can see, this is geared towards having alerting within Grafana. Are you still following the modular course laid out in https://github.com/grafana/grafana/issues/2209#issuecomment-149351263 ?

Also thanks to whoever the hidden elves working on this are. I suspect @Dieterbe , but I don't know.

@RichiH we are not sure how that will work out; we have tried to figure out how to build a system like the one in that comment, but we are not sure how it would work. We are focused now on a strong out-of-the-box alerting experience that can become better over time. Users with an existing alerting handler could potentially disable the alerting executor in Grafana and have Grafana send the alert that needs to be evaluated to another system. It would require a lot of work on third-party systems to implement that integration, though.

@torkelo My thoughts were along the same lines, which is why I decided to ask.

Personally speaking, I care about Prometheus' alerting, but would appreciate nice visual integration with Grafana. I don't care too much where I define rules as long as they are stored and executed by Prometheus.

@bergquist As you will be at promcon, sitting down and talking about possible approaches might make sense. If you want, I will poke Prometheus devs about what time would fit best. There might or might not be some quiet time to sit down on the evening before and/or after cleanup; I can let you know if you want.

Hello @torkelo - this is looking great.

I have just pulled your branch and when I test an alarm for ElasticSearch I get the error

firing false
timeMs 0.225ms
error tsdb.HandleRequest() error Could not find executor for data source type elasticsearch

...does this mean Elasticsearch isn't supported yet? :cry:

p.s in the process output I get this:

EROR[08-04|09:15:00] Alert Rule Result Error                  logger=alerting.engine ruleId=1 error="tsdb.HandleRequest() error Could not find executor for data source type elasticsearch" retry=nil LOG15_ERROR="Normalized odd number of arguments by adding nil"

@Workshop2 we only support graphite for alerting so far but we will support Elasticsearch eventually :) I'll add a better error message for this.

How will the alerting system behave if the query returns no data? Will it trigger an alert by default?
Also, a simple count reducer would be cool which simply returns the number of datapoints returned by a query.

@bergquist I thought that alerting would be transparent with respect to the data source used. How long before we can start previewing/testing the alerting feature on data sources other than Graphite? (I realize no one likes "how long..." questions, sorry.)

@RichiH One option is to create a grafana app like bosun does: https://grafana.net/plugins/bosun-app But that does not enable query / dashboard reuse in a simple way. Let's talk more about it at PromCon. Looking forward to meeting you! :)

No influxdb support initially also?

I didn't know it's specifically bound to graphite :( We are also using influx and elasticsearch ;)

Just initially; we will likely add Prometheus before release, and maybe InfluxDB or Elasticsearch as well. Since the alert scheduling and execution happens in the backend, the request and response code has to be written from scratch (in Go); the frontend data source plugin code (written in JS) cannot be reused.

We are using influx; I think we may forego the grafana integration and use Kapacitor with a simple web front-end for creating and managing alerts.

+1 Alerting + InfluxDB.

It's unfortunate the work we put into building datasource plugins is only useful on the client.

Considering the immediate and long-term work of supporting alerting for different data sources, building a Go plugin architecture, etc., wouldn't it be almost the same amount of work (if not less) to build the alerting server in Node.js, so it could use the existing datasource plugins?

Opinions about Go vs. Node.js aside, this could significantly reduce code duplication for alerting on different data sources.

And if you really don't like Node, I'll bet there's a callout mechanism in Go for loading and executing JS.

+1 Alerting for ElasticSearch

Hi, we have been waiting for the alerting system for... OpenTSDB! Can we hope to get it for OpenTSDB soon? (Maybe when?)

Thanks a lot to the team!

+1 Alerting for ElasticSearch
Would it be possible to execute a script upon alert?

Do you guys have the alerting branch in a docker image yet?

  1. Are alert queries only working for query "A"? Is this hard-coded?
  2. When can we expect a fully working alerting version? (Is the 19th still the target?)
  3. When can we expect Elasticsearch to work with alerting?

Edit:

  1. Can I add more than one alarm rule per graph?
  2. Can I add some information about the alarm to the HTTP message? (dashboard/graph/observed_query/alarm_config/alarm_query/threshold/warn_or_crit/value/observed_timeframe/time_of_occurence)

@DEvil0000

1) You can change to any query you have in the metrics tab.
2) Fully working, depends on what you mean. We plan to merge it to master this week. Then people can start testing the nightly build and give feedback. Alpha release within 2-3 weeks, beta & stable release depends on feedback and how quickly it can become stable
3) Elasticsearch is tricky, requires a lot of code to query and parse response into time series, so will likely come after support for Prometheus and InfluxDB is added

@torkelo
I am new to Elasticsearch, Grafana and Go. I assume you have already searched for clients, but have you seen these?
https://github.com/olivere/elastic
https://github.com/mattbaird/elastigo
Those libs might reduce the effort.

Also thanks to whoever the hidden elves working on this are. I suspect @Dieterbe , but I don't know.

Alerting is now mainly @torkelo and @bergquist (and @mattttt ). I have switched focus to our upcoming graphite backend, https://github.com/raintank/metrictank

I am very glad to see this feature making headway. Would love to have support for OpenTSDB as other alerting solutions (Bosun) would not be user-friendly enough for regular use here.

希望能够在下一个正式版中看到报警,向那些辛苦编码的程序员致敬。 (Hoping to see alerting in the next official release; my respects to the programmers working hard on the code.)

希望能够在下一个正式版中看到报警,向那些辛苦编码的程序员致敬。

@superbool Sorry, I can't read this, and the Google translation was not very helpful.

Merge to master WILL happen by August 19, I promise :)

@torkelo hehe next time I bet. Is there a new date?

Can we expect the alerting for OpenTSDB to be scheduled? We may find (modest) funding to encourage dev.

@DEvil0000 I hope to see the alerting feature published in the next stable Grafana version, and I want to pay tribute to everyone developing the tool.
Sorry, my English is not very good; I hope you can understand my words.

@DEvil0000 The plan was to merge last Friday but due to some unplanned events ( https://twitter.com/torkelo/status/766514688997732352 ) we had to postpone it a little bit :) We still have some minor stuff to do.

@torkelo Congratulations!
@bergquist @torkelo I need alerting with Elasticsearch before October (alpha would be fine for me). Can I help you guys implement it? You just need to provide me some information to have a starting point ;)

The alerting branch has now been merged to master. :raised_hands:

We appreciate all the feedback that we have received from this issue. Thanks to all of you!
For future discussion and feedback, please post in the corresponding alerting issue or create a new one. This helps us organize and prioritize our future work. I'm closing this ticket in favor of the new ones, but feel free to keep up the discussion in this issue.

So what's next?

  • Alpha release (docs and blogpost)
  • Gather feedback from the community.
  • Keep working on the remaining issues for alerting
  • Release Grafana 4.0 with alerting.

Try it out?

  • You have to enable alerting in the config.
  • You can now find alerting in the side menu.
  • You can add an alert by going to a graph panel and selecting the alert tab.
  • Use the _Test alert_ button to verify your alert.
  • To save the alert you just have to save the dashboard.
  • Set up notification on /alerting/notifications to be notified about firing alerts.
  • Add the notifier to an alert in the alert tab.

Current limitations

  • So far we only support graphite.
  • For this release only graph panel has support for alerting.

Example dashboards

You can find example dashboards in the examples folder.
The example dashboards are based on the data from our fake graphite data writer. You can start graphite and the fake-data-writer from our docker-compose files.

cd docker/
./create_docker_compose.sh graphite
docker-compose up

This should only be considered a rough guide and we will add more documentation about alerting in the following weeks.

Happy alerting! :cocktail: :tada:

@bergquist Congratulations.

Is there an issue we can follow about the future of this feature? E.g. ElasticSearch?

There is only an "AND" and not an "OR" in the Alert Conditions to add "is above" OR "is below" in one Panel or is there a other way to support this ?

I think there is a "is outside range"/"is in range" option. But I would also like to see an "or".

Hello all! Thanks very much for your contribution in this useful functionality.

It is really interesting for me, but in many cases I would need an "OR" in the alert conditions, because there's no possibility to create more than one alert on a graph.

I think that without that "OR" I won't be able to create alerts for this kind of graph:

image

Any idea? Are you planning to add an "OR" option?

BR

@jmgonzalezp yes, we hope to support OR as well (not sure about mixing AND and OR yet)

We have 2 remaining design decisions for alerting that we would love some feedback on (Categorization, and Severity/State).

Here is the issue with our current thoughts; we would really appreciate your feedback.
https://github.com/grafana/grafana/issues/6007

Hi all! Thanks for this such great feature in grafana!

I have a question regarding this alerting system. Currently we are using an auto scaling group in AWS for deploying Grafana; would it be a problem if I run Grafana on multiple machines? The problem I'm referring to is: would there be multiple copies of the same alert from multiple Grafana machines, or does Grafana already handle that?

@torkelo I have the same question as @akurniawan. Let's consider this setup: 1 load balancer, 3 Grafana instances behind the load balancer, and 1 MySQL DB which all 3 instances share. How will the Grafana servers handle alerts in this type of setup? Should we enable alerting on only 1 instance, or does Grafana keep track of alerts so that multiple nodes don't check and send the same alerts?

@utkarshcmu @akurniawan alerting within grafana does not support HA yet. Our plan is to add support to partition alerts between servers in the future.

@bergquist Thanks for the answer. :)

@bergquist Any ETA on when InfluxDB support will be added for this?

@thisisjaid judging by this https://github.com/grafana/grafana/milestone/40 it should be here on the 10th.

@Dieterbe Any ETA for alerting support for OpenTSDB?

@sofixa Thanks, should have looked at the roadmap myself, case of not RTFMing. Appreciated nonetheless.

@Dieterbe Any ETA for alerting support for OpenTSDB?

i don't work on alerting anymore. maybe @torkelo or @bergquist can answer.

@torkelo @bergquist

Any ETA for alerting support for OpenTSDB

@LoaderMick @naveen-tirupattur OpenTSDB alerting has been added to Grafana and should be a part of the next release. Also, alerting for OpenTSDB is working in the nightly builds.

Any ETA for alerting support for influxDB and prometheus too?

@nnsaln alerting for both data sources is already in master branch.

I can't seem to get alerting working with OpenTSDB (Grafana v4.0.0-pre1, commit: 578507a). I tested the email system (working), but the alerts just don't fire even when I have a very low threshold. Is there any way to run the queries manually and see the data it is pulling?

alerting

Grafana v4.0.0-pre1 (commit: 9b28bf2)
error tsdb.HandleRequest() error Influxdb returned statuscode invalid status code: 400 Bad Request

@torkelo
Can the 'webhook alert notification' post the alert metric? Is the body JSON or form-encoded?
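
For what it's worth, the webhook notifier in the current nightlies posts a JSON body. The schema is not locked down yet (see the "Lock down schema for webhook" item above), but it looks roughly like the following, with the matched metric values in evalMatches (the URL and values below are placeholders):

{
  "title": "[Alerting] CPU usage alert",
  "ruleId": 1,
  "ruleName": "CPU usage alert",
  "ruleUrl": "https://grafana.example.com/dashboard/db/mydash?panelId=2",
  "state": "alerting",
  "evalMatches": [
    { "metric": "cpu.usage", "value": 92.5, "tags": null }
  ],
  "message": "Optional message defined on the alert rule"
}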

Hi guys, will Grafana support alerting for queries using template variables or is there a target release for this?

All, please try 4.0 beta; if something is missing, open new issues.

I've tried 4.0 beta, but I still get this error:
error tsdb.HandleRequest() error Influxdb returned statuscode invalid status code: 400 Bad Request

I cannot save alert notifications - after I save, the "send to" row becomes blank again.

@nnsaln You're supposed to fill in a notification target there, not an email address. Open the Grafana side menu and hover over the Alerting menu option, then hit the Notifications menu option. There you can set up a notification target that you can use from your alert rules.

Is there any plan to support template variables along with alerting? I do understand that each graph generated by a template variable (or a set of them) corresponds to a different graph, and hence generating an alert against a static value is not correct.

No, there is currently no support for this. Maybe in the far future.

99% of dashboards use template variables. They were designed with template variables to avoid the "dashboard explosion" problem.

Yes, but a generic exploration dashboard is not the same as a dashboard designed for alert rules.

So far there has not been a proposal for how to support template variables in an intuitive/understandable way. What should an alert query with a variable do? Interpolate with the currently saved variable value, or with all values? Should it treat every value as a separate rule and keep state for each one, etc.? Supporting template variables opens up a can of worms of complexity and potentially confusing behavior. It might be added some day if someone comes up with a simple and understandable way.

In the meantime nothing stops you from creating separate alert dashboards. Alerting is new and a huge addition to Grafana. It will evolve over time, but in the short time it has been implemented it has added huge value to Grafana, and thanks to all contributors for that!

+1 Torkel.

It does make alerting fairly complicated.

@bergquist regarding this comment

alerting within grafana does not support HA yet. Our plan is to add support to partition alerts between servers in the future

Is there a ticket to track the progress? Any branch to contribute?

And big thanks for the nice job!

Kern,

<3 grafana.

I was just trying to share thoughts around alerting with template dashboards.

@torkelo @Dieterbe It's awesome to finally have alerting built into Grafana ! What is the recommended way (if any) to create alerts programmatically?

@jaimegago to create alerts programmatically, use the dashboard API; alerts are saved along with a panel & dashboard.

@torkelo How about notification targets (e.g. creating a new email notification via the API)?

Edit: Answering my own question here, I found the api/alert-notifications endpoint. I think it just needs to be documented.

Of course there is an HTTP API for that; just go to the alerting notifications page, add a notification, and check the HTTP API call Grafana makes.
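
As a sketch, creating an email notification channel through that endpoint looks something like this (the hostname, API key and address are placeholders; the settings object differs per notifier type):

curl -X POST https://grafana.example.com/api/alert-notifications \
  -H "Authorization: Bearer <api key>" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "ops-email",
        "type": "email",
        "isDefault": false,
        "settings": { "addresses": "ops@example.com" }
      }'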

@torkelo, is there any API that can be used to create an alert (not an alert notification) programmatically?

@CCWeiZ Alerts are part of the dashboard JSON, so you can only create a dashboard that contains alerts, not alerts on their own.

You can read more about the dashboard api on http://docs.grafana.org/http_api/dashboard/
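
As a rough sketch, creating a dashboard whose panel carries an alert then looks something like the call below (hostname, API key, metric and threshold are placeholders, the dashboard JSON is trimmed to the essentials, and exact panel/alert field names can vary between versions - check the JSON Grafana itself saves):

curl -X POST https://grafana.example.com/api/dashboards/db \
  -H "Authorization: Bearer <api key>" \
  -H "Content-Type: application/json" \
  -d '{
        "dashboard": {
          "id": null,
          "title": "CPU dashboard",
          "rows": [
            {
              "panels": [
                {
                  "id": 1,
                  "type": "graph",
                  "title": "CPU usage",
                  "targets": [ { "refId": "A", "target": "servers.web01.cpu.usage" } ],
                  "alert": {
                    "name": "CPU usage alert",
                    "frequency": "60s",
                    "conditions": [
                      {
                        "type": "query",
                        "query": { "params": ["A", "5m", "now"] },
                        "reducer": { "type": "avg", "params": [] },
                        "evaluator": { "type": "gt", "params": [80] }
                      }
                    ],
                    "notifications": []
                  }
                }
              ]
            }
          ]
        },
        "overwrite": false
      }'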

Is this available: I want to set up an alert for when a value, compared to 3 days ago, is not increasing (say requests: if the value now minus the requests 3 days ago is < 100, then we say there are not many requests). How can I do this?
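
One way to express that with a Graphite data source (assuming a hypothetical metric name stats.requests.count) is to let the query itself compute the difference against 3 days ago with timeShift and diffSeries, and then alert on the result:

Query (refId A):

diffSeries(stats.requests.count, timeShift(stats.requests.count, "3d"))

Alert condition on that query (using the rule editor's wording):

WHEN avg() OF query(A, 5m, now) IS BELOW 100

This fires when the current value is less than 100 requests above what it was 3 days ago.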
