Azure-sdk-for-java: [BUG] Exception when Cosmos DB HTTP response header is larger than 8192 bytes

Created on 29 Oct 2019  ·  28 Comments  ·  Source: Azure/azure-sdk-for-java

Describe the bug
When the continuation token in an HTTP response header from Cosmos DB is larger than 8192 bytes, Netty, which the SDK uses, throws an exception. It seems that this was already fixed for SDK versions <3.0, but the changes were not ported to SDK 3.x: https://github.com/Azure/azure-cosmosdb-java/issues/24

Exception or Stack Trace

io.netty.handler.codec.TooLongFrameException: HTTP header is larger than 8192 bytes.
    at io.netty.handler.codec.http.HttpObjectDecoder$HeaderParser.newException(HttpObjectDecoder.java:829)
    at io.netty.handler.codec.http.HttpObjectDecoder$HeaderParser.process(HttpObjectDecoder.java:821)
    at io.netty.buffer.AbstractByteBuf.forEachByteAsc0(AbstractByteBuf.java:1306)
    at io.netty.buffer.AbstractByteBuf.forEachByte(AbstractByteBuf.java:1286)
    at io.netty.handler.codec.http.HttpObjectDecoder$HeaderParser.parse(HttpObjectDecoder.java:793)
    at io.netty.handler.codec.http.HttpObjectDecoder.readHeaders(HttpObjectDecoder.java:592)
    at io.netty.handler.codec.http.HttpObjectDecoder.decode(HttpObjectDecoder.java:218)
    at io.netty.handler.codec.http.HttpClientCodec$Decoder.decode(HttpClientCodec.java:202)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421)
    ... 12 frames truncated

To Reproduce
To reproduce this, you have to make a request that will lead to a large continuation token. I don't know about the internals of the Cosmos DB, but since the state is mapped into the continuation token and larger databases lead to larger states, a large database may be the prerequisite to reproduce this issue. We experience this issue only with multi-partition queries.

Expected behavior
The accepted size of HTTP request/response headers should always be larger than the maximum that Cosmos DB can potentially produce, whatever that size is. Maybe the 32768 bytes used in the linked issue are sufficient, maybe this is still not enough.
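
To make these numbers concrete, here is a minimal sketch in plain Java that estimates how many bytes a set of response headers occupies on the wire and compares that with a decoder limit. The class and method names are hypothetical, `x-ms-continuation` is assumed to be the header that carries the token, and the arithmetic is an approximation, not Netty's exact accounting.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderSizeEstimate {

    /** Approximate wire size of all headers, counting "Name: value\r\n" per header. */
    static int estimateHeaderBytes(Map<String, String> headers) {
        int total = 0;
        for (Map.Entry<String, String> e : headers.entrySet()) {
            String line = e.getKey() + ": " + e.getValue() + "\r\n";
            total += line.getBytes(StandardCharsets.ISO_8859_1).length;
        }
        return total;
    }

    /** True if the estimated header size stays within the given limit in bytes. */
    static boolean fitsWithin(Map<String, String> headers, int maxHeaderBytes) {
        return estimateHeaderBytes(headers) <= maxHeaderBytes;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Assumption: a 9000-character token stands in for a large continuation token.
        headers.put("x-ms-continuation", "a".repeat(9000));

        System.out.println(estimateHeaderBytes(headers));  // 9021
        System.out.println(fitsWithin(headers, 8192));     // false
        System.out.println(fitsWithin(headers, 32768));    // true
    }
}
```

With a token of this size, the estimate exceeds 8192 bytes but fits within 32768, which matches the failure described in this report.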

Setup

  • OS: Linux Arch with kernel 5.3.7, JRE 11.0.5
  • IDE: IntelliJ
  • Java SDK: version 3.2.2
Client Cosmos Service Attention needs-v4-port v3-item v4-item customer-reported

All 28 comments

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @shurd

@f-paulus I am investigating this.
Since v3.3.0, we have updated netty and reactor-netty versions to use their latest releases.
While I investigate this on the current release, can you please also check on your end whether you can reproduce this on v3.3.1?

FeedOptions.responseContinuationTokenLimitInKb can limit the continuation token size. As a workaround, can you please try setting it to <= 8 KB?

@kushagraThapar I can confirm that the same issue persists on v3.3.1.

@kirankumarkolli Thanks for the hint. I tried it and set the continuation token limit to 5 KB. With this, I got an empty result instead of the expected set of items. After a lot of testing, I'm quite sure that I found another bug: the feed response contained a continuation token together with the empty result list. I filed an issue here: https://github.com/Azure/azure-sdk-for-java/issues/6115

I'm using the latest Spring Data Cosmos (2.1.8 at the moment) and I have the same issue. This version of Spring Data uses SDK 3.1.0. Could you also provide a hotfix for this version? When do you expect to resolve this? Thank you

@f-paulus and @apescione - I am prioritizing this in our next sprint, so we expect this fix to be released around the end of November.
spring-data-cosmosdb release can follow after this gets fixed in Cosmos DB SDK.

@f-paulus I have a potential fix for this issue. Are you able to reproduce it fairly consistently, so that you could help me verify the fix?

@f-paulus I have created this PR: https://github.com/Azure/azure-sdk-for-java/pull/6690
If you have some way to test it, that would be great.
Meanwhile, I am also working with other folks to reproduce the scenario.

@kushagraThapar Thanks for working on this! Unfortunately I cannot build the azure-sdk locally. I get the error "Could not find artifact com.azure:sdk-build-tools:jar:1.0.0" when running Maven.
Therefore I don't see a way to test your fix.

Did you try to reproduce it as I suggested: Running a query against a very large database? For me, it always returns big continuation tokens that do not fit into the 8KB header-size.

@f-paulus Yes, I tried running the query against a large database but couldn't reproduce it. I am trying another route by mocking it, but I am not sure whether that will really test it.

To build the azure-sdk locally, all you have to run is this command from sdk/microsoft-azure-cosmos directory: mvn clean install -Dgpg.skip -f ../../eng/code-quality-reports/pom.xml

@f-paulus I haven't been able to reproduce the large continuation token. It would be really helpful if you could test the change: https://github.com/Azure/azure-sdk-for-java/pull/6690
Feel free to post any questions you have about building the jar locally.

Otherwise I will release the change in the PR.

Any update on this? I'm getting the exact same error on a medium-size DB, and I am also using spring-data-cosmosdb. I've tried to build and test your PR locally without success. Is there any timeline scheduled for the fix?

Let me know if I can do anything to help you speed up the process :)

Thanks for your work!

The fix is pending verification. I have been trying to test it, but can't reproduce the issue locally.
Did you test the PR? What issues did you face?
If you can share some code to reproduce this issue, that would be great.
What version of spring-data-cosmosdb are you using?

As soon as we can verify the PR fix works, I will release this fix as a hotfix, and then within a day or two spring-data-cosmosdb will follow up.

@apescione : My only concern with fixing this on top of 2.1.8 is that this version of the Cosmos DB SDK uses Project Reactor core 3.3.0.RELEASE and reactor-netty 0.9.0.RELEASE, whereas spring-data-cosmosdb 2.1.8 uses older Spring libraries:
<spring.springframework.version>5.1.9.RELEASE</spring.springframework.version>
<spring.data.version>2.1.10.RELEASE</spring.data.version>
These older Spring libraries use older versions of Project Reactor and Reactor Netty.

One solution I can think of is to specify these reactor versions in spring-data-cosmosdb so that it overrides the reactor versions coming from older spring libraries.
Let me know if you know a better way.

@kushagraThapar The issue seems to happen when running paginated queries with a relatively small page size. Right now I'm not able to provide a full example, but in my case the issue happens in a situation like this:

String requestContinuation = null;
int pageSize = 20;
Sort sorting = Sort.by(fieldName).descending();
Criteria queryCriteria = Criteria.getInstance(CriteriaType.STARTS_WITH, fieldName, List.of(value));
DocumentDbPageRequest pageRequest = DocumentDbPageRequest.of(0, pageSize, requestContinuation, sorting);
DocumentQuery query = new DocumentQuery(queryCriteria).with(pageRequest);

Page<MyObject> page = template.paginationQuery(query, MyObject.class, myCollectionName);

Considerations:

  • I'm using the latest version of spring-data-cosmosdb.
  • I'm performing this kind of query on a non-indexed string field.
  • As I stated before, I'm working on a medium-size DB.

I hope that this helps to reproduce the issue.

Thanks :+1:
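
For readers unfamiliar with continuation-based paging, the loop that the snippet above belongs to can be sketched in plain Java. Everything here is hypothetical and stands in for the SDK: the `Page` class, the `fetchPage` method, and the offset-as-token encoding are illustrative assumptions. A real Cosmos DB continuation token is an opaque string generated by the service, not a simple offset.

```java
import java.util.ArrayList;
import java.util.List;

public class ContinuationPaging {

    /** One page of results plus the token for the next page (null when there are no more pages). */
    static final class Page {
        final List<String> items;
        final String continuation;

        Page(List<String> items, String continuation) {
            this.items = items;
            this.continuation = continuation;
        }
    }

    /** Hypothetical stand-in for a paged query; the token here is just the next offset. */
    static Page fetchPage(List<String> data, String continuation, int pageSize) {
        int start = continuation == null ? 0 : Integer.parseInt(continuation);
        int end = Math.min(start + pageSize, data.size());
        String next = end < data.size() ? Integer.toString(end) : null;
        return new Page(new ArrayList<>(data.subList(start, end)), next);
    }

    /** Reads all pages by feeding each returned token into the next request. */
    static List<String> readAll(List<String> data, int pageSize) {
        List<String> all = new ArrayList<>();
        String continuation = null; // the first request carries no token
        do {
            Page page = fetchPage(data, continuation, pageSize);
            all.addAll(page.items);
            continuation = page.continuation;
        } while (continuation != null);
        return all;
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            data.add("doc-" + i);
        }
        System.out.println(readAll(data, 20).size()); // prints 45
    }
}
```

Because the token travels back to the client with every page, its size directly affects the size of the response headers.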

@desh901 Thank you for providing a code sample. I tried this configuration but still can't reproduce the issue on my end. I have a database with over 1 million documents and am paging through them with page sizes of 10, 20, 100 and 300. I can't repro it.

Can you please try building cosmos db SDK locally from this PR: https://github.com/Azure/azure-sdk-for-java/pull/6690

Build instructions:
To build the azure-sdk locally, all you have to run is this command from sdk/microsoft-azure-cosmos directory: mvn clean install -Dgpg.skip -f ../../eng/code-quality-reports/pom.xml

In your spring-data-cosmosdb project, you can take a dependency on the locally built azure-cosmos jar and test the changes.

This will really help us move forward. Meanwhile, I will try the mocking route and see if that works.

Let me know if you have problems building or using azure-cosmos jar locally.

@kushagraThapar I've pulled the PR code. The only directory named microsoft-azure-cosmos is at sdk/cosmos/microsoft-azure-cosmos, and from there I ran the following command: mvn clean install -Dgpg.skip -f ../../../eng/code-quality-reports/pom.xml This generates the sdk-build-tools JAR at /eng/code-quality-reports/target/sdk-build-tools-1.0.0.jar but not the azure-cosmos JAR. Are you able to provide a compiled JAR of the library?

I would then run mvn install:install-file -Dfile=<path-to-jar> to install the dependency in my local Maven repository, and finally add the dependency to my spring-data project with the following syntax:

<dependency>
   <groupId>com.microsoft.azure</groupId>
   <artifactId>azure-cosmos</artifactId>
   <version>version_written_in_the_pom_of_azure_repo</version>
</dependency>

Is that correct, or am I missing something?

Thanks!
Is the process correct?

@kushagraThapar I've finally managed to get it working :)

For everyone else who needs to test the PR locally:

  • Clone the PR repo
  • Go to <repo-dir>/sdk/cosmos/microsoft-azure-cosmos
  • Run mvn clean install -Dgpg.skip -f ../../../eng/code-quality-reports/pom.xml
  • Run mvn clean install -Dgpg.skip -f pom.xml
  • Copy <repo-dir>/sdk/cosmos/microsoft-azure-cosmos/target/azure-cosmos-3.5.0.jar into your project
  • Add the following maven deps:
<dependency>
    <groupId>io.projectreactor</groupId>
    <artifactId>reactor-core</artifactId>
    <version>3.3.0.RELEASE</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-collections4</artifactId>
    <version>4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.7</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>27.0.1-jre</version> 
</dependency>
<dependency>
    <groupId>io.dropwizard.metrics</groupId>
    <artifactId>metrics-core</artifactId>
    <version>4.1.0</version>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
    <version>1.2.0</version>
</dependency>
<dependency>
    <groupId>io.projectreactor</groupId>
    <artifactId>reactor-test</artifactId>
    <version>3.3.0.RELEASE</version> 
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>io.reactivex.rxjava2</groupId>
    <artifactId>rxjava</artifactId>
    <version>2.2.4</version>
</dependency>
<dependency>
    <groupId>io.projectreactor.netty</groupId>
    <artifactId>reactor-netty</artifactId>
    <version>0.9.0.RELEASE</version>
</dependency>
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-cosmos</artifactId>
    <version>3.1.0</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/azure-cosmos-3.5.0.jar</systemPath>
</dependency>
  • Run your project and test if the PR solves your problem.

@kushagraThapar I can confirm that this PR solved my issue. I can now retrieve the results, although the continuation token is still very long. But that is something I can manage on my side.

Thanks for the support!

Valerio

@desh901 Thank you so much for taking the time and effort to check this. I really appreciate it. I am so glad we were able to fix the issue. :)
I will release an azure-cosmos hotfix with the fix today, and will also hotfix it in the latest spring-data-cosmosdb version, which is v2.2.1.M1.

@apescione For version 2.1.8, I will first have to check whether taking a dependency on the latest azure-cosmos hotfix version breaks the code because of the project-reactor version difference.

@kushagraThapar Could you please notify us when this will be applied also to spring-data-cosmosdb?

@desh901 This will be applied to spring-data-cosmosdb by the end of this week and the next version to be released will be 2.2.1.M2.

I will also try to hotfix this to spring-data-cosmosdb v2.1.8

@apescione Regarding spring-data-cosmosdb 2.1.8, we won't be able to fix this issue because of a reactor-netty version mismatch.
The spring-data-cosmosdb v2.1.x release train uses springframework 5.1.x and spring-data v2.1.x releases, which use the reactor-netty 0.8.x release,
whereas the above fix has only been applied on the reactor-netty 0.9.x release.

Thank you @kushagraThapar for the update. How can I manage this issue in version 2.1.8? Is there a workaround? At the moment it is not possible for me to migrate to Spring Boot 2.2.

I've been able to use the azure-cosmos 3.5.0 version on spring 2.1.x versions by adding the following dependencies in the pom.xml file:

<dependency>
    <groupId>io.projectreactor</groupId>
    <artifactId>reactor-core</artifactId>
    <version>3.3.0.RELEASE</version>
</dependency>
<dependency>
    <groupId>io.projectreactor</groupId>
    <artifactId>reactor-test</artifactId>
    <version>3.3.0.RELEASE</version> 
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>io.reactivex.rxjava2</groupId>
    <artifactId>rxjava</artifactId>
    <version>2.2.4</version>
</dependency>
<dependency>
    <groupId>io.projectreactor.netty</groupId>
    <artifactId>reactor-netty</artifactId>
    <version>0.9.0.RELEASE</version>
</dependency>
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-cosmos</artifactId>
    <version>3.5.0</version>
</dependency>

Apart from that, I will migrate to Spring Boot 2.2.x

Thanks @desh901 for the suggestion. @apescione I would try this, as it looks good to me.
Other than this, I am actively working on getting some more fixes out in v2.1.9 and v2.2.1.M2 of spring-data-cosmosdb. I will update this thread once they are out.

@desh901 spring-data-cosmosdb v2.2.1 has been released which contains this fix.

This issue has been fixed on all SDK versions. Closing it now.
