Grafana: ElasticSearch 7.x too_many_buckets_exception

Created on 28 May 2019  ·  54 Comments  ·  Source: grafana/grafana

What happened:
Upgraded to ES 7.x and Grafana 6.2.x. Some panels relying on the ES data source were showing "Unknown elastic error response" in the top left corner.

Query inspector displayed this error:

caused_by:Object
type:"too_many_buckets_exception"
reason:"Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting."
max_buckets:10000

What you expected to happen:
Graph to display 3 hours of data from front end proxy logs stored in ElasticSearch 7.x.

How to reproduce it (as minimally and precisely as possible):
Query a lot of data

Environment:

  • Grafana version: 6.2.1
  • Data source type & version: ES 7.0
  • OS Grafana is installed on: Ubuntu 18.04
  • User OS & Browser: Win10/Chrome
Labels: datasource/Elasticsearch, priority/important-soon, type/feature-request

Most helpful comment

Surely Grafana can do something here.

I've noticed that since Elasticsearch 7.x, Elasticsearch counts the terms aggregation towards the bucket count, rather than just the date histogram. Kibana prevents this error by automatically widening the date histogram resolution when selecting a larger time interval. I found Kibana does this for the visual builder:

Panel time range -> Date histogram resolution
15 minutes -> 10 seconds
30 minutes -> 15 seconds
1 hour -> 30 seconds
4 hours -> 1 minute
12 hours -> 1 minute
24 hours -> 5 minutes
48 hours -> 10 minutes
7 days -> 1 hour

It appears that although Grafana can automatically widen the date histogram interval, it is still making Elasticsearch return too many buckets.

Maybe there could be a way for us to specify time resolutions based on our date picker's time range?

All 54 comments

As the error message from Elasticsearch says, "This limit can be set by changing the [search.max_buckets] cluster level setting." I don't see how Grafana can do anything to resolve this.

To minimize these errors, either change the min time interval at the datasource or panel level, or set min doc count on the date histogram to 1.
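The advice above can be made concrete with some rough bucket arithmetic. A minimal Python sketch (the function and numbers are illustrative, not Grafana's actual code):

```python
# Rough bucket arithmetic behind the error (illustrative only).
# A date histogram creates one bucket per interval step; in ES 7.x each terms
# sub-aggregation multiplies that by its cardinality, which is what pushes
# queries past the search.max_buckets limit.

def estimated_buckets(range_seconds, interval_seconds, terms_cardinality=1):
    """Upper bound on buckets a date_histogram + terms query can return."""
    time_buckets = -(-range_seconds // interval_seconds)  # ceiling division
    return time_buckets * terms_cardinality

# 3 hours at a 1s interval, split by a terms agg with 10 distinct values,
# is far above the 10,000 default:
assert estimated_buckets(3 * 3600, 1, 10) == 108000
# Widening the min time interval to 30s brings it back under the limit:
assert estimated_buckets(3 * 3600, 30, 10) == 3600
```

This is why a wider min time interval fixes one time range but breaks again when the range grows: the bucket count scales linearly with the range.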

Surely Grafana can do something here.

I've noticed that since Elasticsearch 7.x, Elasticsearch counts the terms aggregation towards the bucket count, rather than just the date histogram. Kibana prevents this error by automatically widening the date histogram resolution when selecting a larger time interval. I found Kibana does this for the visual builder:

Panel time range -> Date histogram resolution
15 minutes -> 10 seconds
30 minutes -> 15 seconds
1 hour -> 30 seconds
4 hours -> 1 minute
12 hours -> 1 minute
24 hours -> 5 minutes
48 hours -> 10 minutes
7 days -> 1 hour

It appears that although Grafana can automatically widen the date histogram interval, it is still making Elasticsearch return too many buckets.

Maybe there could be a way for us to specify time resolutions based on our date picker's time range?
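The Kibana-style mapping quoted above can be sketched as a simple lookup table. The thresholds mirror the table in the comment; the function name is an assumption:

```python
# Pick a date histogram resolution from the panel's time range, following the
# Kibana visual-builder table quoted above (illustrative sketch).

RESOLUTION_BY_RANGE = [  # (max range in seconds, interval)
    (15 * 60, "10s"),
    (30 * 60, "15s"),
    (60 * 60, "30s"),
    (4 * 3600, "1m"),
    (12 * 3600, "1m"),
    (24 * 3600, "5m"),
    (48 * 3600, "10m"),
    (7 * 86400, "1h"),
]

def auto_interval(range_seconds):
    """Return the first interval whose range threshold covers the request."""
    for max_range, interval in RESOLUTION_BY_RANGE:
        if range_seconds <= max_range:
            return interval
    return "1h"  # fall back to the coarsest entry for longer ranges

assert auto_interval(3600) == "30s"
assert auto_interval(24 * 3600) == "5m"
```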

I'm guessing that either I'm one of very few experiencing this issue, or not many are running ES 7 yet.

Changing the min doc count to something much higher has little effect, and changing the minimum time interval works fine if you are only looking at an hour of data, but fails as you expand the time range. I also changed the ES setting to 100k, but Grafana is still requesting too fine a time grain.

If there was an option to set not only the minimum time value, but the full time range to histogram resolution it would probably work.

Grafana should be using Elasticsearch's scroll API (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html) for this. Increasing search.max_buckets above 10000 has no effect because Elasticsearch hard-caps it at 10000.

I'm surprised at how underrated this issue is. I faced the same problem; changing the interval on the panel or data source helps. But usually we look at metrics daily and want to see them at small granularity, and we also want to look at metrics weekly/monthly etc. To achieve this I have to change the min interval in the datasource/panel or keep different dashboards with different intervals set; this is not convenient.

It seems more and more people are hitting this problem, so I'm reopening the issue.

I'm not exactly sure, though, that it's as simple as extending the automatic intervals. As far as I understand, this also depends on how many terms aggregations and buckets you get in total, so it's not easy to solve in Grafana.

Some context to why they added the search.max_bucket setting: https://discuss.elastic.co/t/requesting-background-info-on-search-max-buckets-change/130334

To me it sounds like you should still be able to configure search.max_buckets to -1 in ES7, similar to how it behaved by default in ES6, but I haven't had time to confirm this. Please try this out and let me know the result.

Looking at Kibana seems like they still have similar problems in at least some parts: https://github.com/elastic/kibana/issues/36892

One of the commenters suggests:

Run the aggregation via a composite aggregation in order to be able to paginate through results.

Kibana has this related issue open regarding composite aggregations: https://github.com/elastic/kibana/issues/36358

I have never used composite aggregations and currently know too little about them and why they would be a better alternative than the regular aggregations. It also seems composite aggregations are only supported from ES 6.1 onward.
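For illustration, here is a hedged sketch of how a paginated composite-aggregation request could be built, as the quoted suggestion describes. Field names, the page size, and `fixed_interval` (ES 7.2+; older versions use `interval`) are assumptions:

```python
# Sketch: page through date_histogram buckets with a composite aggregation
# instead of returning them all at once (illustrative, not Grafana code).

def composite_page_request(after_key=None, page_size=1000):
    """Build one page of a composite date_histogram request body."""
    agg = {
        "histo": {
            "composite": {
                "size": page_size,
                "sources": [
                    {"time": {"date_histogram": {"field": "@timestamp",
                                                 "fixed_interval": "30s"}}}
                ],
            }
        }
    }
    if after_key is not None:
        # Resume from the last bucket of the previous page.
        agg["histo"]["composite"]["after"] = after_key
    return {"size": 0, "aggs": agg}

first = composite_page_request()
assert "after" not in first["aggs"]["histo"]["composite"]
nxt = composite_page_request(after_key={"time": 1577481480000})
assert nxt["aggs"]["histo"]["composite"]["after"] == {"time": 1577481480000}
```

The client would repeat the request, feeding each response's `after_key` back in, until a page comes back with fewer than `page_size` buckets.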

Just to verify, does changing the max concurrent shard request setting to 5 make this better?

It seems I have the same issue, "Unknown elastic error response".
I have events from 27.05 until now in ES.
If I set the quick range _Last 90 days_ (in Grafana) I get the error, but if I set _Last 30 days_ or _Last 6 months_ there is no error.

@marefr -1 isn't a valid option for the search.max_buckets setting. It returns an error that says it needs a value >= 0.

Setting it to 0 is possible, but then nothing seems to be returned.

I'm facing the same issue with a count panel grouped by two terms and a date histogram.
It works fine up to the last 5 hours; when I attempt to view the last 6 hours it gives the error regarding the 10000 buckets.
I attempted to change the search.max_buckets setting on my cluster to 15000, but then the error said
Must be less than or equal to: [15000] but was [15001], still 1 more than my cluster setting.

Setting the max concurrent shard to 5 did not help.

It does appear that setting a higher min time interval allows the graph to work, but it also groups more points together and reduces the precision of the data. I have the default set to 30s; changing it to 60s lets the last 6 hours work.

Hello everyone. We also faced this problem after starting the migration to version 7.1 of ELK.
Increasing the search.max_buckets value doesn't help much; it always results in an error that it's over the limit.

+1.

Kibana prevents this error by automatically widening the date histogram resolution when selecting a larger time interval. I found Kibana does this for the visual builder:

Grafana does the exact same thing if you set the date histogram interval to auto.

Having the same problem. Setting the date histogram interval to auto does not help. I cannot create a histogram that aggregates data from the last few days, while before the update it was possible to view basically arbitrary time ranges. Interestingly enough, a table panel with exactly the same data source does work.

Having the same problem too.
Grafana - 6.2.5
ES - 7.3.0

Increasing the Min time interval works, but when you increase your time range you must change the Min time interval value again.

Also reporting the same problem as described here. Expanding the time range on a high-resolution (fine-grained) dataset will cause this error. Perhaps Grafana should widen the group-by time window as the user expands the time range, so that the data comes back more aggregated.

If my data has a minimum resolution of milliseconds, there's no need to fetch millions of documents to be displayed in a 3-month chart. Data should be aggregated in ES at the query level.

I'm also seeing this issue in the Explore panel in version 6.4.1 of Grafana.
When choosing a greater time range (more than 1 hour) I get an Unknown Elastic Error Response because the query returns too many buckets to aggregate or display. But that occurs only in the "Logs" tab, not in the "Metrics" tab.

There is also something Grafana could do there to help display the information without having to modify options in Elastic.

Having the same problem too.
Grafana - 6.4.1
ES - 7.4.0

I also think Grafana should use the scroll API.

It may be possible to wrap the date histogram aggregation in a composite aggregation then paginate between the results and combine them client side.

Same problem here.
ES - 7.2.0
Grafana - 6.4.2

Same problem here:
ES - 7.3.2
Grafana - 6.4.2

Same Problem here
ES 7.1
Grafana 6.4.2

Same issue..
ES 7.3.2
Grafana 6.4.0

Same issue.
ES 7.3.0
Grafana 6.4.3

Same issue.
ES 7.1
Grafana 6.4

Same issue in the explore panel
ES 7.4.1
Grafana 6.4.3

Having the same issue with ES 7.4.1 and Grafana both in 6.4.3 and 6.5.0-pre (Docker image grafana/grafana:master) when trying to browse logs in Explore view.

However, there is no problem with the same version of ES and Grafana 6.3.6.

Is there any update on this issue? Grafana 6.4 has introduced handy Logs panel, but this bug makes it impossible to use 6.4+ properly.

Same issue here in Kibana. However and wherever this issue is currently being handled at ES, I feel hamstrung by being limited this way.
A slightly higher bucket limit would give more granular results in timeline visuals etc.
Please, we need a way to control this according to our needs!
@ES: that's our pain, not yours, if we crush our systems!
You can distance yourself from it and point out that it's at our own risk!

@YoullNeverSee Grafana cannot change how Elasticsearch is implemented. Please open an issue at https://github.com/elastic/elasticsearch

But it sure can query ES (version 7.x) differently. There have always been API changes and the requirement for clients to adjust to them. This is such a case. The number of buckets must be kept lower than before, so the data-fetching query has to take that into account relative to the timeframe of the current graph. If you do that manually (change the "granularity" / interval), ES very happily throws some data back at you.

Also hitting this issue! Interested to see what the outcome is

For all those having this problem, I solved it by increasing the bucket limit in Elasticsearch:

curl command:
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'{"persistent" : { "search.max_buckets" : "30000" }}'
Response:

{
  "acknowledged" : true,
  "persistent" : {
    "search" : {
      "max_buckets" : "30000"
    }
  },
  "transient" : { }
}

Hopefully this works for you as well :).

@TomKeur
I did set it to 30000:

curl -s -X GET "https://elastic:pass@domain/_cluster/settings?include_defaults" | jq | grep bucket

"max_buckets": "30000"

Before, I had the default 10000 and was able to see logs for the last 1 hour only.
Now I can see logs from 1, 3, and 6 hours back.
But at 12 hours I get the same error as earlier.

Trying to create too many buckets. Must be less than or equal to: [30000] but was [30001]. This limit can be set by changing the [search.max_buckets] cluster level setting.


@optimistic5 I hate to quote myself, but as I said in https://github.com/grafana/grafana/issues/17327#issuecomment-550161141, there needs to be a mechanism in Grafana to choose a time interval for the buckets relative to the time frame selected for the graph (just what you ran into). And quite honestly it absolutely makes sense for ES to have things this way: why on earth would you create more buckets than you have pixels / datapoints in your graph to show? Doing so would only slow ES down; see https://www.elastic.co/blog/advanced-tuning-finding-and-fixing-slow-elasticsearch-queries for why that "soft limit" on the number of buckets was put in place.

Secondly, that "sampling rate" telling ES how to place things into buckets is already a feature Grafana has: "Min time interval". The only thing lacking is the ability to dynamically adjust it so that the query to ES is very unlikely to exceed a certain number of buckets (if you aggregate over lots of terms, that could still happen). But in most cases, graphing the aggregation of terms is about a limited number of distinct values and their count over time.

@frittentheke
thank you for your answer.

I'm not so deep into the internals of ES and Grafana.
I see that Kibana can show logs from any time interval, and I expect the same from Grafana.
If Kibana can show the logs and Grafana can't, my conclusion is that the problem is in Grafana, not ES.

  1. Will this issue be fixed on the Grafana side?
  2. Do we have a workaround?
  3. If not, what should I write to Elastic to solve this problem?

Thank you.

For all that having this problem, I solved my problem by increasing the bucket size in Elasticsearch (the search.max_buckets curl command above).

Thanks mate, that helped me out!
I think the most common problem is that there is no hint of how to get the right syntax for this PUT request.
In other words: where can users find this information?

Not here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html

So, where the hell does ES tell us how to do this?

Looks like we are not the only ones trying to find the same information; there's an open issue about adding it to the Elasticsearch documentation: https://github.com/elastic/elasticsearch/issues/34209.

The doc here says:

The maximum number of buckets allowed in a single response is limited by a dynamic cluster setting named search.max_buckets. It defaults to 10,000, requests that try to return more than the limit will fail with an exception.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html

So the official answer is to raise the limit up to the maximum stress that you want to put on the database system.

This changelog says the limit was unlimited in the previous version, and the new default is 10,000:
https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#search-max-buckets-cluster-setting

And this community forum post mentions the value for unlimited is -1:
https://discuss.elastic.co/t/requesting-background-info-on-search-max-buckets-change/130334

I don't recommend changing the value to -1, but technically you can.

@adeverteuil that is the correct answer _if_ the user wanted the precision that only an aggregation with 10k+ buckets can give, but I believe that most people seeing this issue neither want that level of detail nor are even asking Grafana for it.

On the Explore page in Metrics mode, you can produce any date histogram of document counts you want, and the default interval value of "auto" will adjust the "interval" parameter to a sane value based on the time range when the underlying code makes the request to grafana-instance.com/api/datasources/proxy/x/_msearch?max_concurrent_shard_requests=xx.

_However_, on the Explore page in Logs mode, if you have enough data (so as not to create null buckets), changing the date range to something like the last 30 days before modifying the query (a typical workflow in Kibana Discover) will produce a _msearch request with this payload:

[
  { "search_type": "query_then_fetch", "ignore_unavailable": true, "index": "my-index-*" },
  {
    "size": 500,
    "query": {
      "bool": {
        "filter": [
          {
            "range": { "@timestamp": { "gte": "epoch_min", "lte": "epoch_max", "format": "epoch_millis" } }
          }
        ]
      }
    },
    "sort": { "@timestamp": { "order": "desc", "unmapped_type": "boolean" } },
    "script_fields": { },
    "docvalue_fields": ["@timestamp"],
    "aggs": {
      "2": {
        "date_histogram": {
          "interval": "1s",
          "field": "@timestamp",
          "min_doc_count": 0,
          "extended_bounds": { "min": "1577481456296", "max": "1577503056296" },
          "format": "epoch_millis"
        },
        "aggs": { }
      }
    }
  }
]

Grafana is properly setting the size constraint (500) to only return the log data that the user is interested in, but the date histogram of document counts at the top of the page uses the aggregation with a 1-second interval and no way to change it to anything higher. The use case for that histogram isn't detailed graphing (so it doesn't need a modifiable interval like panels do) but to give the user an easy way to eyeball rough time ranges to focus on when reading the logs, or to narrow the query time range down.

I believe this issue could be closed (as fixed) by changing the Explore:Logs page to use the same auto-interval logic as Explore:Metrics, since it looks to be the only case where this error is produced without the user specifically asking for the condition that triggered it. If the user sets a value that causes the error, then the Elasticsearch guidance you pointed out of "manually raise the limit if you actually did it on purpose" is the correct response, and there is nothing Grafana can do about it. Kibana will give you the same bucket error when you do something like this on the Visualization page, and modifying the query without the user's knowledge for graph panels via backend magic would be dangerous (changing the meaning of the returned aggregate, causing load problems for your cluster, etc.).
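To make the suggested fix concrete, here is a minimal sketch of auto-interval logic that could replace the hard-coded 1s interval in the Logs payload above. The step ladder and function name are assumptions, not Grafana's actual implementation:

```python
# Sketch: derive the date histogram interval from the time range so the
# query stays under the cluster's bucket limit (illustrative only).

STEPS = [1, 5, 10, 30, 60, 300, 600, 1800, 3600]  # candidate intervals, seconds

def histogram_interval(range_seconds, max_buckets=10000):
    """Smallest step that keeps the date histogram under max_buckets."""
    for step in STEPS:
        if range_seconds / step <= max_buckets:
            return step
    return STEPS[-1]

# The 6-hour Logs query above (21,600s at 1s = 21,600 buckets) would be
# widened to a 5s interval, staying under the 10,000 default:
assert histogram_interval(6 * 3600) == 5
```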

@redNixon that's definitely a bug. Thanks for reporting.

Surely Grafana can do something here. [...] It appears that although Grafana can automatically widen the date histogram interval, it is still making Elasticsearch return too many buckets. Maybe there could be a way for us to specify time resolutions based on our date picker's time range?

Elasticsearch will return whatever you ask it to. The interval is a client-side parameter; the problem with the implementation in Grafana is that it uses the "Date Histogram" aggregation and then doesn't scale the "interval" parameter. The whole concept of Auto isn't supported by the Date Histogram aggregation method in the Search API at all, and it is misleading when this option is presented in Grafana. I believe this confusion occurs because when people choose Auto in Kibana, it dynamically changes the interval size, but perhaps not how you might think.

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-datehistogram-aggregation.html

Looking at the API documentation, the interval is supplied by the client at query time. There is no feature in this part of the search API to "auto" the time interval.

The problem when looking at large time series is that even though you may have < 10000 buckets, those buckets span many large shards, or you are performing terms sub-aggregations along with the date histogram, which adds more total buckets (sub-queries) to the parent aggregation. For me that results in Java OOM errors in Elasticsearch. If your query generates more than 10000 buckets, you will hit the too-many-buckets exception as in the OP. As people have mentioned, if you manually set the Min Time Interval, you basically increase the stability of the query by reducing the total aggregation buckets. While this might work in some limited situations, it is always a trade-off when zooming in to small time periods (very large time buckets reduce the resolution of the visualisation) or zooming out to larger time frames (OOM / too many buckets).

While a solution could be coded into Grafana to scale the time interval to something sensible per quick time-range pick, the obvious solution, in my humble opinion, is to expose the Auto Date Histogram aggregation method of the Elasticsearch Search API in the Group By section in Grafana. This would allow the user to define the max number of time buckets a given visualisation should return, similar to the auto time interval in Kibana. You can check out the examples here.

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-autodatehistogram-aggregation.html

The user is then in control of selecting the maximum time buckets per query, which lets the user control how heavy/detailed each query is and have Elasticsearch scale the buckets over larger time frames. I think this would be a killer feature for the Elasticsearch data source in Grafana and would provide a similar experience to the default date aggregations in Kibana :)
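For reference, a hedged sketch of the request body such an option might send. The field name, bounds, and target bucket count are illustrative assumptions:

```python
# Sketch: an auto_date_histogram request where the caller sets a target bucket
# count and Elasticsearch picks the interval (illustrative, not Grafana code).

def auto_histo_query(gte_ms, lte_ms, max_buckets=200):
    """Build a search body using auto_date_histogram over a time range."""
    return {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": gte_ms, "lte": lte_ms,
                                           "format": "epoch_millis"}}},
        "aggs": {
            "over_time": {
                "auto_date_histogram": {
                    "field": "@timestamp",
                    "buckets": max_buckets,  # ES scales the interval to fit
                }
            }
        },
    }

q = auto_histo_query(1577481456296, 1577503056296)
assert q["aggs"]["over_time"]["auto_date_histogram"]["buckets"] == 200
```

Elasticsearch then reports back the interval it actually chose in the response, so the client knows the resolution of the returned buckets.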

Jumping on the discussion: is there an easy way to change the interval value in a dashboard? The auto method does not work for me, so I would be happy to have a single point where I can change it.

Hi all,

For those who want to get rid of search.max_buckets,

  • Setting the value to -1 doesn't work (you get the error Failed to parse value [-1] for setting [search.max_buckets] must be >= 0 when you try)
  • However, you can set it to the maximum accepted value (2^31-1):
PUT _cluster/settings
{
  "persistent":{
  "search.max_buckets":"2147483647"
  }
}

That effectively disables the setting.

For information, this setting is currently being deprecated (see https://github.com/elastic/elasticsearch/issues/51731 )

For those who want to get rid of search.max_buckets, you can set it to the maximum accepted value (2^31-1); that effectively disables the setting.

This works flawlessly, thanks a lot!

Hi guys,

I have good news for you:
https://github.com/elastic/elasticsearch/pull/46751

According to https://github.com/elastic/elasticsearch/pull/55266

We introduced a new search.check_buckets_step_size setting to
better control how the coordinating node allocates memory when aggregating
buckets. The allocation of buckets is now done in steps, each step
allocating a number of buckets equal to this setting. To avoid an OutOfMemory
error, a parent circuit breaker check is performed on allocation.

I think it would be ideal if Grafana handled the time interval dynamically based on the time range, like Kibana does. If you want per-second values over multiple days, it doesn't make computational sense to request every single second of multiple days from Elasticsearch.

I think it would be ideal if Grafana handled the time interval dynamically based on the time range, like Kibana does. If you want per-second values over multiple days, it doesn't make computational sense to request every single second of multiple days from Elasticsearch.

Exactly that (as I also suggested in my comment above if I may say so :-) ).

@frittentheke @s1sfa I think Grafana shouldn't be responsible for managing the scaling when this feature is already available in the Elasticsearch Search API. We just need to add the auto-date histogram aggregation instead of the regular date histogram aggregation to the Elasticsearch data source in Grafana; then Elasticsearch will scale the buckets according to the time range requested.

@berglh while the new functionality might be helpful, very helpful even, it's not as simple as just "use the right query or function". The auto-interval date histogram aggregation will comfortably create buckets at an interval sensible for drawing the graph. But even when using it, there could be cases (e.g. querying for counts of individual terms) in which Grafana still needs to handle the selected / requested time interval not being queryable without creating too many buckets. But certainly it's best to use as much of the storage backend's functionality as possible to optimize the querying server-side; sorry for not properly diving into the discussion with my last post.

@berglh I like the idea, but I think it would need a bit of testing of Grafana's Elastic query building. I tried to simply swap out date_histogram for auto_date_histogram and it appears not to work with other aggregations, like sum: {'reason': 'The first aggregation in buckets_path must be a multi-bucket aggregation'}. Secondly, with this Elastic approach or not, the ability to scale to a specified interval is pretty important. If I want a per-second rate or something, auto_date_histogram has no parameter to express that, but it does return the interval it used, which would be pretty similar to Grafana just changing the interval on a regular query and then doing the division to get the intended values.

The max_buckets limit being at a low threshold is mostly an Elasticsearch problem, which it looks like they are improving in new versions. But if we think about trying to get one month's worth of per-second data on a graph, some sort of auto scaling needs to exist, whether Grafana makes the decision based on some source parameters, or Elasticsearch's auto_date_histogram is used and Grafana does a calculation on the returned interval to get values in the desired unit, like per second.
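The division mentioned above can be sketched: since auto_date_histogram reports the interval it actually chose, a client could rescale doc counts into per-second rates. The names and the interval parser are illustrative assumptions:

```python
# Sketch: convert bucket doc counts into per-second rates using the interval
# that the server reports back (illustrative, not Grafana code).

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def interval_to_seconds(interval):
    """Parse a simple ES interval string like '30s' or '5m'."""
    value, unit = int(interval[:-1]), interval[-1]
    return value * UNIT_SECONDS[unit]

def per_second_rates(buckets, interval):
    """Divide each bucket's doc_count by the bucket width in seconds."""
    secs = interval_to_seconds(interval)
    return [(b["key"], b["doc_count"] / secs) for b in buckets]

buckets = [{"key": 0, "doc_count": 600}, {"key": 300000, "doc_count": 1200}]
assert per_second_rates(buckets, "5m") == [(0, 2.0), (300000, 4.0)]
```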

@s1sfa Thanks for trying it out :) Just to clarify my position if I wasn't clear, I'm not suggesting we swap it out directly. I think both types of aggregations are useful depending on the case of the visualisation. I just think providing it as an option as a query type for Elasticsearch data source would be useful for people wanting a more Kibana like experience when creating a dashboard in Grafana.

The max_bucket being at a low threshold is mostly an elasticsearch problem which it looks like they are improving in new versions.

I read that the recent improvement in this area is about handling the circuit breaking of long-running queries more reliably to prevent out-of-memory errors. The performance of Elasticsearch has always been improving, increasing stability under larger queries over time, so you are probably right.

Still, I doubt there will never be a condition where a query hits a circuit breaker and returns a different error like "unable to service the query due to exceeding circuit breaker", with the cluster effectively determining that too many buckets is the cause of the issue. These types of problems will probably occur less with solid-state storage; spinning-disk clusters with datasets many times larger than the combined JVM heap of the cluster, or histograms split by a large number of term sub-aggregations, will always run into issues with buckets one way or another.

@frittentheke Giving the user the ability to set a specific integer for the "buckets" parameter of the auto-date histogram query method would let the user tune the graph to the performance characteristics of the dataset and hardware. There is nothing stopping a user from requesting the last 10 years of data and the query still timing out or hitting some other Elasticsearch performance issue; I figure there is only so much hand-holding Grafana can do. I still think there is a benefit: at least we can give the user an option for an auto-interval scaling solution.

It's up to the Grafana community and data source maintainers to decide whether an auto-interval scaling solution should be handled by Grafana, and whether there are any trade-offs with metrics-style aggregations as you pointed out, @s1sfa. I don't have enough experience with this query type to say whether it's even worth implementing; I just read the manual and was voicing an opinion based on that limited information. It reads like an easy win to give the user control over auto-scaling to a sensible bucket limit on a case-by-case basis 😳

I believe this issue is actually closed by this commit: https://github.com/grafana/grafana/pull/21937. You can now set the maximum data points per visualisation, which then automatically calculates the time interval of the aggregation buckets. Between setting your maximum sub-aggregation size limits and the max data points, you get a nicely scaling solution with the aggregation filter. :tada: I am running Grafana latest from Docker Hub, v7.2.0 (efe4941ee3).


@berglh Thanks for bringing this up, and you are right, this should be fixed in the 6.6.2 release with #21937.
I'm closing this issue, if someone is still facing this problem we can reopen it 🙂
