Azure-sdk-for-java: [QUERY] How to alleviate Timeouts in List Blobs operation?

Created on 15 Sep 2020  ·  57 Comments  ·  Source: Azure/azure-sdk-for-java

Query/Question
How to alleviate Timeouts in List Blobs operation?

The timeout is set to 30s, which is the maximum permissible for the Blob Service (as per the Azure documentation). The maximum number of keys per listing (maxResultsPerPage) is the default 5000. The buckets being listed are large, with 100k+ objects.

I know that adding a retry is one possibility, but I would prefer another alternative if there is one.

The timeout exception is given below

Caused by: reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 30000ms in 'flatMap' (and no fallback has been configured)
        at reactor.core.Exceptions.propagate(Exceptions.java:393) ~[observer-3.20.92.jar:na]
        at reactor.core.publisher.BlockingIterable$SubscriberIterator.hasNext(BlockingIterable.java:168) ~[observer-3.20.92.jar:na]
        at reactor.core.publisher.BlockingIterable$SubscriberIterator.next(BlockingIterable.java:198) ~[observer-3.20.92.jar:na]
        at kdc.cloudadapters.adapters.MicrosoftAzureAdapter$AzureListRequest.nextBatch(MicrosoftAzureAdapter.java:566) ~[observer-3.20.92.jar:na]
        ... 9 common frames omitted

Caused by: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 30000ms in 'flatMap' (and no fallback has been configured)
        at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:289) ~[observer-3.20.92.jar:na]
        at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:274) ~[observer-3.20.92.jar:na]
        at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:396) ~[observer-3.20.92.jar:na]
        at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89) ~[observer-3.20.92.jar:na]
        at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73) ~[observer-3.20.92.jar:na]
        at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:117) ~[observer-3.20.92.jar:na]
        at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68) ~[observer-3.20.92.jar:na]
        at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28) ~[observer-3.20.92.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_252]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_252]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[na:1.8.0_252]
       ... 3 common frames omitted

Additional information

A shorter variation of the code used is given below

class AzureList {

    BlobContainerClient container;
    Iterator<PagedResponse<BlobItem>> iterator;
    String continuationToken;

    public AzureList(String accountName, String accountKey, String bucketName) {
        StorageSharedKeyCredential credential = new StorageSharedKeyCredential(accountName, accountKey);
        String endpoint = String.format(Locale.ROOT, "https://%s.blob.core.windows.net", accountName);
        BlobServiceClient serviceClient = new BlobServiceClientBuilder().credential(credential)
                .endpoint(endpoint)
                .buildClient();
        container = serviceClient.getBlobContainerClient(bucketName);
        iterator = getIterator(/*prefix*/ "");  // Current use case is just "" as prefix but can be different in the future
        continuationToken = null;
    }

    private Iterator<PagedResponse<BlobItem>> getIterator(String prefix) {
        ListBlobsOptions options = new ListBlobsOptions().setPrefix(prefix);
        return container.listBlobs(options, continuationToken, Duration.ofSeconds(30L)).iterableByPage().iterator();
    }

    public void iterate() {
        List<BlobItem> blobs;
        do {
            blobs = listBlobs();
            // hand off blob list to different consumer class
        } while (continuationToken != null);
    }

    private List<BlobItem> listBlobs() {
        PagedResponse<BlobItem> pagedResponse = iterator.next();
        List<BlobItem> blobs = pagedResponse.getValue();
        continuationToken = pagedResponse.getContinuationToken();
        return blobs;
    }
}

Why is this not a Bug or a feature Request?
Unsure if it is a Bug or an issue with my local env / my code.

Setup (please complete the following information if applicable):

  • OS: Ubuntu 18.04
  • IDE : IntelliJ 19.1.4
  • SDK: azure-storage-blob v12.7.0

Information Checklist
Kindly make sure that you have added all the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • [x] Query Added
  • [x] Setup information Added

All 57 comments

Hi, @somanshreddy. We are actually already in the midst of investigating the performance of list blobs. A couple things:

  1. Can you describe the behavior of your application in more detail? Do you have any amount of concurrency when you see this issue, or is the app single-threaded? Does this happen every time or intermittently?
  2. Can you try running the application without specifying a timeout? This will really just help us get some more information and correlate it with other issues. If you are able to load an SLF4J binding and turn on debug logging so we can get some logs from the network layer, that would be extra helpful (a rough sketch of turning up client-side HTTP logging follows after this list).
  3. There were some performance improvements around parsing that came out in azure-core 12.8.1. Could you try pinning that in your pom and see if that helps? We haven't yet released a version of storage with this updated core dependency, but we will soon.
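
Regarding point 2, here is a rough sketch of turning up the SDK's own HTTP logging on the client builder, in addition to setting your SLF4J backend's reactor.netty/io.netty loggers to DEBUG. This is only illustrative: it assumes an SLF4J binding such as Logback is on the classpath, and reuses the credential/endpoint setup from the snippet above.

// Sketch only. HttpLogOptions/HttpLogDetailLevel come from com.azure.core.http.policy.
HttpLogOptions logOptions = new HttpLogOptions()
    .setLogLevel(HttpLogDetailLevel.BODY_AND_HEADERS); // BASIC is a lighter-weight alternative

BlobServiceClient serviceClient = new BlobServiceClientBuilder()
    .credential(credential)   // same StorageSharedKeyCredential as in the snippet above
    .endpoint(endpoint)
    .httpLogOptions(logOptions)
    .buildClient();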

Hello @rickle-msft

  1. The app is multi-threaded. But each thread would only access one container. Hence we wouldn't have multiple threads listing the same container.

The issue doesn't happen every time but for large containers, it is pretty frequent. If the entire listing of a container with 1 million objects is counted as one operation, I would say that 1 out of every 3 operations fails with this timeout. And if the LISTing is 'paused' and resumed using the continuation token, then it is more likely that we hit an error.

  2. I thought the timeout was internally capped at 30s by Azure itself? https://docs.microsoft.com/en-us/rest/api/storageservices/setting-timeouts-for-blob-service-operations

Sure, I can enable logging. Can you help me by being a bit more specific as to what type of logging you would prefer / how to enable this logging at the network layer?

  3. What is the 'parsing' that is happening here? Is it an improvement in how the SDK internally parses the LIST response?

Thanks for this information.

  1. Right now, the concern is less over contention on the service end and more about contention on the client end, so it doesn't matter so much that it's the same or different container because we think some client resources are the bottleneck. How many threads would you estimate you have doing listing operations at once? How long is this "pause"? That is interesting that it makes it more likely to error. I'm not sure what to make of that.

  2. The service does still have a timeout of 30s, so the operation will likely still fail, but by removing the timeout parameter, we allow the service to close the connection, which propagates the error in a different way that will hopefully give us more insight into the state of the networking stack. You can follow this guide to enable logging. Enabling debug level logging should be sufficient. That should catch all the logs from the reactor netty layer (which is our http client).

  3. That is correct. The list response comes back as XML, and we parse it into an iterable to return. And the latest core has some performance improvements around this parsing that might help with this.

@rickle-msft

  1. Ever since this issue was first discovered, only one thread has been used, i.e., only one container's listing was initiated. It was restarted whenever a timeout error occurred.

The pause was roughly 30 seconds to 1 minute, I think. 'Pause' here means stopping the thread that is listing and then spawning a new thread to continue listing.

Regarding the pause, I haven't done a side-by-side comparison, nor do I have any concrete numbers to justify the claim that errors are more likely when resuming from a continuation token. It is probably just that a single uninterrupted thread seemed to fail less often than one that was stopped and replaced by a new thread picking up from the continuation token where it had last stopped.

I can probably get back with some numbers but I am more concerned about addressing the single uninterrupted listing since the pause + resume is a secondary feature.

Side note: The same 'Listing' logic for AWS works fine. So I doubt it is due to the bottleneck of client resources. Hopefully the logs reveal more.

  2. Sure, I will try this and get back to you.

  3. Cool. Will try that as well.

Also, since we are still unsure of the exact cause, would reducing the max keys being listed to, say 2500, alleviate the problem a little? Might give that a try as well after capturing the logs for the current listing of 5000 objects.
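
For reference, the page-size change would look roughly like this in the snippet above (untested sketch):

// Sketch: cap each page at 2500 results instead of the default 5000.
ListBlobsOptions options = new ListBlobsOptions()
    .setPrefix(prefix)
    .setMaxResultsPerPage(2500);
iterator = container.listBlobs(options, continuationToken, Duration.ofSeconds(30L))
    .iterableByPage()
    .iterator();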

  1. I see. Thank you for that information. That should hopefully lend itself well to our continued investigation. Apologies if I was unclear here, but by client resources, I meant specifically how the SDK is managing access to the XML parser, nothing to do with your machines themselves : )

I have seen a few cases where reducing the size of the listing has helped this problem, so it is worth a shot. I haven't explored this thoroughly enough to know what size acts as a sort of threshold for success. I also have typically been only getting the first page as that has been sufficient for me to reproduce the problem, so I'm not sure if reducing the size of the pages will still help when running through the whole list. But again it's definitely a good thing to try.

@rickle-msft Oops, sorry for the misunderstanding.

I have captured DEBUG-level logs for the LIST operation called with and without a timeout. They are attached below.

_1. With explicit timeout ( 30 seconds )_

a ) Error via SDK

Exception in thread "main" reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 30000ms in 'flatMap' (and no fallback has been configured)
    at reactor.core.Exceptions.propagate(Exceptions.java:393)
    at reactor.core.publisher.BlockingIterable$SubscriberIterator.hasNext(BlockingIterable.java:168)
    at reactor.core.publisher.BlockingIterable$SubscriberIterator.next(BlockingIterable.java:198)
    at AzureCloudSystem.listBlobs(AzureCloudSystem.java:91)
    at AzureCloudUtil.main(AzureCloudUtil.java:53)
Caused by: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 30000ms in 'flatMap' (and no fallback has been configured)
    at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:289)
    at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:274)
    at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:396)
    at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73)
    at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:117)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

b ) Debug logs from com.azure.core -> https://pastebin.com/YRmmq1uh


_2. Without an explicit timeout ( null )_

a ) Error via SDK

Exception in thread "main" reactor.core.Exceptions$ReactiveException: io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
    at reactor.core.Exceptions.propagate(Exceptions.java:393)
    at reactor.core.publisher.BlockingIterable$SubscriberIterator.hasNext(BlockingIterable.java:168)
    at reactor.core.publisher.BlockingIterable$SubscriberIterator.next(BlockingIterable.java:198)
    at AzureCloudSystem.listBlobs(AzureCloudSystem.java:91)
    at AzureCloudUtil.main(AzureCloudUtil.java:53)
Caused by: io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

b ) Debug logs from com.azure.core -> https://pastebin.com/PrEN8Bzv


I haven't perused the log files myself yet. The error without a timeout seems like it gives some more info, but I have yet to look into it. Just pasted the error as soon as I was able to reproduce it, to keep you updated @rickle-msft

I will try the 12.8.1 azure core and see how things work. And the reduced number of keys as well.

Let me know if there is anything else required in the meantime.

Interesting. The only exception I'm seeing in both cases is an UnsatisfiedLinkError. Are you sure both those logs capture the issue in question? It could also be that I can't see all the content you shared? They do both seem to cut off rather suspiciously and abruptly at 348 lines?

Sorry for the late reply, I was caught up in other work.

The above log files captured the case when the first LIST call itself failed. Hence the abrupt end, I assume. Let me try to reproduce the case when there are a few successful LIST calls followed by one that fails ( this chain uses the continuation token as mentioned in the code snippet ). So maybe there is some other error that would come up in this test case.

Meanwhile, any issues with getting an UnsatisfiedLinkError?

No worries!

Based on the javadocs, I suspect the UnsatisfiedLinkError is coming from the networking layer looking for certain system resources or files and not finding them. I currently believe that it's not too much of a problem because it'll fall back on a default and would error out entirely if it didn't have what it needed. So let's see if we can get some more examples like you were saying and investigate that. If at that time the UnsatisfiedLinkError seems to be contributing to the problem, we can look more into it then.

@rickle-msft
With explicit timeout ( 30 seconds ) -> https://pastebin.com/APnJ7tdj

A call was made at 15:00:33.248. It timed out 30s later, at 15:01:03.265. After this, my application restarted the LISTing operation.

@rickle-msft In the Exception thrown by the service when there is no explicit timeout passed by the client in the LIST API call

Caused by: io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

Isn't this a clear indication that the server itself isn't responding and hence has timed out?

@rickle-msft When there is no explicit timeout, sometimes the server does not respond with a failure. It gets stuck in an infinite wait. Attaching the code and logs below

Code below

public void foo() {
    // setup ..
    PagedResponse<BlobItem> pagedResponse = getNextPage();
    List<BlobItem> objectList = getObjectList(pagedResponse);
    // process the list
}

private PagedResponse<BlobItem> getNextPage() {
    for (int trialNumber = 1; trialNumber <= MAX_SERVER_ERROR_RETRIES; trialNumber++) {
        log.debug("Try #{} for listObjects for bucket={} with current marker={}", trialNumber, container.getBlobContainerName(), continuationToken);
        try {
            return iterator.next();  // iterator is set by ' container.listBlobs(options, continuationToken, /*timeout*/ null).iterableByPage().iterator() '
        } catch (Exception e) {
            log.error("Try #{} failed to list objects for bucket={} with current marker={}", trialNumber, container.getBlobContainerName(), continuationToken, e);
            // sleep
        }
    }
    // throw exception
}

private List<BlobItem> getObjectList(PagedResponse<BlobItem> pagedResponse) {
    List<BlobItem> objects = pagedResponse.getValue();
    log.debug("Listed {} objects for bucket={} with current marker={}", objects.size(), container.getBlobContainerName(), continuationToken);
    return objects;
}

Logs below. The last line was the last logged operation; the thread was stuck on iterator.next(), waiting for the server to error out.

2020-09-30 12:28:35.053 DEBUG fileoperations [bufferedListProvider-pool-63-thread-1] - Try #1 for listObjects for bucket=10m-dataset with current marker=2!108!MDAwMDM2IUZvbGRlcjYvanVseTIyL2Q0L2QzOC9kMTMvZDI0L2YzLnR4dCEwMDAwMjghOTk5OS0xMi0zMVQyMzo1OTo1OS45OTk5OTk5WiE- 
2020-09-30 12:28:37.116 DEBUG fileoperations [bufferedListProvider-pool-63-thread-1] - Listed 5000 objects for bucket=10m-dataset with current marker=2!108!MDAwMDM2IUZvbGRlcjYvanVseTIyL2Q0L2QzOC9kMTMvZDI0L2YzLnR4dCEwMDAwMjghOTk5OS0xMi0zMVQyMzo1OTo1OS45OTk5OTk5WiE-

2020-09-30 12:28:37.118 DEBUG fileoperations [bufferedListProvider-pool-63-thread-1] - Try #1 for listObjects for bucket=10m-dataset with current marker=2!108!MDAwMDM1IUZvbGRlcjYvanVseTIyL2Q0L2Q0L2Q1L2QxMy9mNDUuanBnITAwMDAyOCE5OTk5LTEyLTMxVDIzOjU5OjU5Ljk5OTk5OTlaIQ--
2020-09-30 12:28:39.225 DEBUG fileoperations [bufferedListProvider-pool-63-thread-1] - Listed 5000 objects for bucket=10m-dataset with current marker=2!108!MDAwMDM1IUZvbGRlcjYvanVseTIyL2Q0L2Q0L2Q1L2QxMy9mNDUuanBnITAwMDAyOCE5OTk5LTEyLTMxVDIzOjU5OjU5Ljk5OTk5OTlaIQ--

2020-09-30 12:28:39.226 DEBUG fileoperations [bufferedListProvider-pool-63-thread-1] - Try #1 for listObjects for bucket=10m-dataset with current marker=2!108!MDAwMDM2IUZvbGRlcjYvanVseTIyL2Q0L2Q0Mi9kMTcvZDUyL2Y2LnRhciEwMDAwMjghOTk5OS0xMi0zMVQyMzo1OTo1OS45OTk5OTk5WiE-

@somanshreddy Thank you for this information. Connection reset is an indication that the server has closed the connection, typically as a result of the client being idle for too long. It's not clear to me that it's the same as the timeout, however. I think in the case of this timeout being exceeded for a given operation, the server should actually return a timeout error rather than simply closing the connection. On the other hand, if the server is sitting waiting too long for information from a socket and not getting anything, I think it will then close the socket abruptly and we see this connection reset.

The connection reset is consistent with another case of listing being problematic. One thing we're investigating is that there might be some issues with sync listing specifically. Would it be possible for you to try using the async client and get an iterable from that rather than using the PagedIterable returned by the sync client? The PagedFlux type returned by the async client should have .byPage().toIterable() methods which should give you something that's functionally about the same but hopefully circumvents this iterable type that seems to be having some problems.
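
For example, something along these lines (sketch only, reusing the credential/endpoint/bucketName setup from the snippet earlier in the thread):

// Sketch: list through the async client and consume it as a blocking iterable of pages.
BlobServiceAsyncClient asyncServiceClient = new BlobServiceClientBuilder()
    .credential(credential)
    .endpoint(endpoint)
    .buildAsyncClient();
BlobContainerAsyncClient asyncContainer = asyncServiceClient.getBlobContainerAsyncClient(bucketName);

ListBlobsOptions options = new ListBlobsOptions().setPrefix("");
for (PagedResponse<BlobItem> page : asyncContainer.listBlobs(options).byPage().toIterable()) {
    List<BlobItem> blobs = page.getValue();
    String continuationToken = page.getContinuationToken();
    // hand the page off to the consumer, same as with the sync iterator
}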

On second thought. I think the toIterable method might actually be what causes the problem. It might be best to wait for [this] PR to get merged and shipped and try again at that time.

I have tried a couple of options.

1) Sleep after a timeout, then retry. This is basically a retry of the iterator.next() line. It does not help.

2) Sleep after a timeout, but discard the list iterator and start a new iterator upon failure using container.listBlobs(options, continuationToken, /*timeout*/ null).iterableByPage().iterator(). This has decreased the rate of errors, but they still occur.

3) Sleep after a timeout, but discard the entire BlobServiceClient itself and create a new one to resume listing. I didn't see any difference in the rate of errors from method 2. Both have fewer failures than before on containers with 10 million blobs, but errors still occur.

Can you share the link to the PR that you have mentioned?

@rickle-msft I have tried using the async client to make the ListBlobs API request. As in the case of a sync client with no explicit timeout configured, eventually one of the List requests gets blocked indefinitely.

I see the below line in the Azure Storage Logging ( $logs container )

1.0;2020-10-04T05:12:24.1093529Z;ListBlobs;NetworkError;200;25314;576;authenticated;kompriseqa;kompriseqa;blob;"https://kompriseqa.blob.core.windows.net:443/1m-dataset?prefix=&marker=2%2192%21MDAwMDI1IWF1ZzUvZDMvZDUvZDEvZDE2L2Y0MC5tcDQhMDAwMDI4ITk5OTktMTItMzFUMjM6NTk6NTkuOTk5OTk5OVoh&restype=container&comp=list";"/kompriseqa/1m-dataset";3be05f24-c01e-0094-250c-9a10bf000000;0;49.206.55.100:33489;2019-07-07;581;0;210;1916690;0;;;;;;"Test/1.0 Test/3.4.0 azsdk-java-azure-storage-blob/12.7.0 (1.8.0_181; Linux 4.18.16-1.el7.elrepo.x86_64)";;"d5360de4-e69d-4adc-b58f-101395999201"

The token of interest here is NetworkError, followed by an equally interesting HTTP response status code of 200.

The description of NetworkError according to https://docs.microsoft.com/en-us/rest/api/storageservices/storage-analytics-logged-operations-and-status-messages is
Most commonly occurs when a client prematurely closes a connection before timeout expiration.

The async request does not have a timeout overload, hence there is no TimeoutException. Any guesses as to why the client is closing the connection prematurely?
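
For what it's worth, the only way I can see to impose a client-side per-page bound with the async client would be Reactor's own timeout operator on the page stream (sketch only, reusing the asyncContainer/options from the async sketch above; not something that was tried here):

// Sketch: fail the subscription if no page is emitted within 60s of the previous one.
Iterator<PagedResponse<BlobItem>> pages = asyncContainer.listBlobs(options)
    .byPage()
    .timeout(Duration.ofSeconds(60))
    .toIterable()
    .iterator();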

Side note: Two threads were performing the list calls on two separate containers belonging to two different storage accounts (hence they were using two different 'BlobServiceAsyncClient' objects).

The Service Client's HttpClient was configured as below and added to the builder as builder.httpClient(getHttpClient())

private com.azure.core.http.HttpClient getHttpClient() {
    return new NettyAsyncHttpClientBuilder(HttpClient.create())
                    .connectionProvider(ConnectionProvider.create(/*name*/ "http", /*maxConnections*/ 1000))
                    .build();
}

@somanshreddy Can you try one more thing? I know it may feel like we've been running around this issue for a while, because I've been experimenting with some workarounds while we got this other listing fix out. I apologize for that. Could you try upgrading to blobs 12.9.0-beta.1 and see if that helps? The fix I was looking for was released there, and I'm hopeful that will resolve the issue.

Sorry I might have misspoken. I think core shipped after we did, so you might need to try azure-core v1.9.0 to pick up the fix.

@rickle-msft Hello, I tried it with the recommended core SDK but still hit the same issue. I upgraded the storage SDK to the beta version suggested by you but no luck with that either.

Though, I did notice something in the logs prior to upgrading the SDK. Something called the 'Last HTTP packet' wasn't received. I guess that is why the client was timing out. I will try reproducing this and give you the relevant log lines. Meanwhile, any clue as to why the last HTTP packet wasn't received?

Thanks, @somanshreddy for giving that a try. I'll be happy to take a look at some logs.

@alzimmermsft have you ever seen any logs about the last http packet not being received? I don't think I have seen this error before.

@rickle-msft Oops, sorry. I should have been more explicit. I was observing the logs and compared the logs for a successful request vs the one that timed out. The one that timed out did not have the log line that says 'Received last HTTP packet', which the successful requests had. So I assume that the timeout is due to this particular packet not being received.

Correct, LastHttpContent is a special type of HttpContent in Netty indicating that the inbound response has been completely received. If the client never gets this signal, the connection will remain open, awaiting more data from the remote end.

@alzimmermsft so it sounds like if the service stops sending data, we can't really do much about that other than retry. But theoretically we should be retrying these timeout exceptions, right?

@rickle-msft that is correct. In a recent release of azure-core we added granular timeout handling for sending requests, receiving a response, and receiving the body of a response to our HttpClient implementation libraries. These fail within the request workflow, unlike the blocking timeout set on the entire request, so they trigger a retry.

@alzimmermsft Can you guide @somanshreddy on how to set these timeout options so we can give that a shot?

The following would be an example of setting send, response, and read timeouts to 30 seconds each.

HttpClient httpClient = new NettyAsyncHttpClientBuilder()
    .writeTimeout(Duration.ofSeconds(30))
    .responseTimeout(Duration.ofSeconds(30))
    .readTimeout(Duration.ofSeconds(30))
    .build();

BlobServiceClient serviceClient = new BlobServiceClientBuilder()
    .connectionString(<connection-string>)
    .httpClient(httpClient)
    .buildClient();

Further details on what exactly this does.

The write timeout applies between packets being written to the socket. So if you are uploading a stream and packets are being written at 15-second intervals, the timeout will never be triggered; if at any point there is more than 30 seconds between packets written, the operation is deemed timed out.

Once the write operation has completed, the response timeout begins. A response is deemed received once the headers are returned, so the response timeout is straightforward.

The read timeout is like the write timeout, but applies to packets received from the socket.

Thanks @alzimmermsft , will give this a try

@alzimmermsft Now that we are setting fine-grained timeouts on the client, I assume we should no longer be passing a timeout in the request itself?

container.listBlobs(options, continuationToken, TIMEOUT).iterableByPage().iterator();

@somanshreddy when using fine-grained timeouts it may not be necessary to apply a timeout to the API call itself. The API call timeout still serves a purpose if you want to limit the overall time an API call can take, such as stopping processing if the call (retries included) takes longer than 10 minutes. Looking at your use case, based on this issue, I don't think it will be necessary.

Okay. I had tried the RequestRetryOptions that the ServiceClient has earlier, in the 12.7.0 SDK, but that didn't seem to help. I currently have a simple try/catch retry in place. Do you recommend that I switch back to the inbuilt retry mechanism?

Also, since the issue is with the LastHttpContent, does that mean that an eventual retry would just work? Because sometimes 3-4 retries failed consecutively. So I assume that there was something wrong on the client side since the server should not be failing so many times in a row. So if we keep a persistent retry logic, then can we hope that it would eventually succeed?

@alzimmermsft This is the error with the fine-grained timeouts on the client and no explicit timeout set on the request itself. The retry (not the built-in one, but the custom try/catch retry) worked. This was the case earlier as well, but sometimes even the retry ended up failing.

2020-10-20 20:43:50.490 ERROR fileoperations [bufferedListProvider-pool-28-thread-1] - Try #1 failed to list objects for bucket=10m-dataset
reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Channel read timed out after 30000 milliseconds.
    at reactor.core.Exceptions.propagate(Exceptions.java:393) ~[observer-3.4.0.jar:na]
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:97) ~[observer-3.4.0.jar:na]
    at reactor.core.publisher.Flux.blockLast(Flux.java:2497) ~[observer-3.4.0.jar:na]
    at com.azure.core.util.paging.ContinuablePagedByIteratorBase.requestPage(ContinuablePagedByIteratorBase.java:94) ~[observer-3.4.0.jar:na]
    at com.azure.core.util.paging.ContinuablePagedByIteratorBase.hasNext(ContinuablePagedByIteratorBase.java:53) ~[observer-3.4.0.jar:na]
    at com.azure.core.util.paging.ContinuablePagedByIteratorBase.next(ContinuablePagedByIteratorBase.java:42) ~[observer-3.4.0.jar:na]
    at kdc.cloudadapters.adapters.MicrosoftAzureAdapter$AzureListRequest.getNextPage(MicrosoftAzureAdapter.java:800) [observer-3.4.0.jar:na]
    Suppressed: java.lang.Exception: #block terminated with an error
        at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:99) ~[observer-3.4.0.jar:na]
        ... 15 common frames omitted
Caused by: java.util.concurrent.TimeoutException: Channel read timed out after 30000 milliseconds.
    at com.azure.core.http.netty.implementation.ReadTimeoutHandler.readTimeoutRunnable(ReadTimeoutHandler.java:68) ~[observer-3.4.0.jar:na]
    at com.azure.core.http.netty.implementation.ReadTimeoutHandler.lambda$handlerAdded$0(ReadTimeoutHandler.java:48) ~[observer-3.4.0.jar:na]
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[observer-3.4.0.jar:na]
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:176) ~[observer-3.4.0.jar:na]
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[observer-3.4.0.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[observer-3.4.0.jar:na]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) ~[observer-3.4.0.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[observer-3.4.0.jar:na]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[observer-3.4.0.jar:na]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[observer-3.4.0.jar:na]
    ... 1 common frames omitted

> Do you recommend that I switch back to the inbuilt retry mechanism?

The goal would be minimizing the number of external try/catch statements outside of the SDK itself. There may be situations where operations return non-retryable responses which will bubble up as an error or cases where the operation will exhaust the number of retries allowed and will terminate.
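
For reference, the built-in retries are tuned through RequestRetryOptions on the builder; a sketch (the values are illustrative, not a recommendation for this workload):

// Sketch: exponential backoff, up to 4 tries, 60s per try, retried inside the SDK pipeline.
// RequestRetryOptions/RetryPolicyType come from com.azure.storage.common.policy.
RequestRetryOptions retryOptions = new RequestRetryOptions(
    RetryPolicyType.EXPONENTIAL, /*maxTries*/ 4, /*tryTimeoutInSeconds*/ 60,
    /*retryDelayInMs*/ null, /*maxRetryDelayInMs*/ null, /*secondaryHost*/ null);

BlobServiceClient serviceClient = new BlobServiceClientBuilder()
    .connectionString(<connection-string>)
    .retryOptions(retryOptions)
    .httpClient(httpClient)   // the client with the fine-grained timeouts from above
    .buildClient();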

> Also, since the issue is with the LastHttpContent, does that mean that an eventual retry would just work? Because sometimes 3-4 retries failed consecutively. So I assume that there was something wrong on the client side since the server should not be failing so many times in a row. So if we keep a persistent retry logic, then can we hope that it would eventually succeed?

Yes, if the operation is idempotent a retry should eventually work.

> This is the error with the fine-grained timeouts on the client and no explicit timeout set on the request itself. The retry (not the built-in one, but the custom try/catch retry) worked. This was the case earlier as well, but sometimes even the retry ended up failing.

I'll take a look into this. It may be the case that the exception being returned by the HttpClient is getting wrapped in Reactor's exception type and its cause isn't being inspected for retryability.

@alzimmermsft Do the HTTP client timeouts that we have set (30s) include the retry time as well? Since you mentioned that the API timeout includes the retry time.

@somanshreddy The HTTP client timeouts are on a per-request basis, so they do not include retries. The API timeouts are per operation, so they do include retries.

@somanshreddy, I've taken a look into the exception being returned and not retried. Write and response timeouts will be retried when they occur, since they happen while sending the request and awaiting the response; read timeouts may or may not be retried when they occur.

Read timeouts aren't guaranteed to be retried because consumption of the response body may begin in a different location. Generally, we do not begin reading the body until we've reached our deserialization logic; this happens outside the context of our HttpPipeline, and therefore outside the scope of the RequestRetryPolicy/RetryPolicy that would attempt to reprocess the request.

Given this, for the time being it would be best to retain your external try/catch block. Scenarios where sending the request or receiving the response headers takes longer than expected will be handled by the SDK; the last read getting stuck would need to be caught externally.

I'll be investigating solutions to this issue so that the SDK would be able to handle all three timeout scenarios safely.
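
Something like this would keep that external retry narrow (a sketch, assuming the iterator, options, continuationToken, container, log, and MAX_SERVER_ERROR_RETRIES from the earlier snippets; Exceptions.unwrap is Reactor's helper for unwrapping its ReactiveException wrapper):

// Sketch: only retry when the Reactor-wrapped cause is a timeout, and rebuild the
// page iterator from the saved continuation token before retrying.
for (int trialNumber = 1; trialNumber <= MAX_SERVER_ERROR_RETRIES; trialNumber++) {
    try {
        return iterator.next();
    } catch (RuntimeException e) {
        Throwable cause = reactor.core.Exceptions.unwrap(e);
        if (!(cause instanceof java.util.concurrent.TimeoutException)) {
            throw e; // not the stuck-read case; let the SDK/service error bubble up
        }
        log.warn("Read timed out on try #{}, recreating the page iterator", trialNumber, cause);
        iterator = container.listBlobs(options, continuationToken, /*timeout*/ null)
            .iterableByPage().iterator();
    }
}
// throw after exhausting retries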

Thanks @alzimmermsft @rickle-msft

Also, what is the estimate for 12.9.0-beta.1 to be officially released, i.e., as the general availability version?

@somanshreddy We are hoping to release it as part of our November release, in a week or two.

@rickle-msft @alzimmermsft I have hit this issue again. It worked perfectly fine for containers that have about 10 million objects; sometimes try 1 failed, but try 2 was always successful. But for larger containers with about 30 million objects, the listing failed at around the 24 million mark.

Tried this with the custom timeouts on the HTTP client and a null timeout on the API. SDK versions are given below.

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-storage-blob</artifactId>
            <version>12.9.0-beta.1</version>
        </dependency>
        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-core</artifactId>
            <version>1.9.0</version>
        </dependency>

Stack trace

2020-11-10 22:42:16.738 ERROR fileoperations [bufferedListProvider-pool-336-thread-1] - Try #3 failed to list objects for bucket=large-dataset with current marker=2!108!MDAwMDM3IUZvbGRlcjEvRm9sZGVyMS9hdWc1L2QzL2QwL2Q4L2Y1LnRpZmYhMDAwMDI4ITk5OTktMTItMzFUMjM6NTk6NTkuOTk5OTk5OVoh
reactor.core.Exceptions$ReactiveException: io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
        at reactor.core.Exceptions.propagate(Exceptions.java:393) ~[observer-3.20.111.jar:na]
        at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:97) ~[observer-3.20.111.jar:na]
        at reactor.core.publisher.Flux.blockLast(Flux.java:2497) ~[observer-3.20.111.jar:na]
        at com.azure.core.util.paging.ContinuablePagedByIteratorBase.requestPage(ContinuablePagedByIteratorBase.java:94) ~[observer-3.20.111.jar:na]
        at com.azure.core.util.paging.ContinuablePagedByIteratorBase.hasNext(ContinuablePagedByIteratorBase.java:53) ~[observer-3.20.111.jar:na]
        at com.azure.core.util.paging.ContinuablePagedByIteratorBase.next(ContinuablePagedByIteratorBase.java:42) ~[observer-3.20.111.jar:na]
        at kdc.cloudadapters.adapters.MicrosoftAzureAdapter$AzureListRequest.getNextPage(MicrosoftAzureAdapter.java:1144) [observer-3.20.111.jar:na]


Hi @somanshreddy,

I'm currently looking into the issue, but given the type of error (connection reset by peer), additional information is needed to get closer to the root cause. Is there any chance you could enable more verbose logging in Netty itself? This will help give a better view of the application at the time of failure, e.g. how many open connections are in the Netty connection pool, more details on the failure cause, etc.

I have been able to replicate a connection reset by peer using a custom testing server. From this I was able to validate that connection reset by peer errors will be retried, but in your case you reached the retry limit when the error occurred. A potential cause for this issue is that multiple connections in the Netty connection pool went stale and were told to close by the service due to inactivity. Though, without logs indicating that the connection pool had long-lived, unused connections in it, I'm not able to say definitively whether that is the root cause.

Sure @alzimmermsft. I will try to reproduce this and get back to you. DEBUG level on io.netty would suffice?

> Sure @alzimmermsft. I will try to reproduce this and get back to you. DEBUG level on io.netty would suffice?

Yes, DEBUG should include information about the number of connections active and inactive within the connection pool and contain other information surrounding requests and responses.

@alzimmermsft Since we are still using the try/catch block, we are manually retrying any errors. Here is one such failure that occurred on the 1st attempt; a manual retry (2nd attempt) succeeded.

Exception stack trace from SDK -> https://pastebin.com/raw/5Cf5y7EK

Netty DEBUG logs -> https://pastebin.com/raw/gQyMSneb

The Netty logs cover more than just the listing failure, in case a larger context is needed to understand them. The timestamps of the request and of the exception thrown are present in the former snippet, which should help in locating the corresponding Netty logs.

Hi @somanshreddy, I just merged this PR (https://github.com/Azure/azure-sdk-for-java/pull/17699), which should have the HTTP client eagerly read the response body when we know it will be deserialized. This should reduce the number of occurrences where a TimeoutException or PrematureCloseException is thrown from the SDK, by completing more of the HTTP response consumption within the scope of our retry logic. These changes should be available from Maven after our next SDK release.

@alzimmermsft Now all 3 tries have failed. The logs are captured below. Please let me know if more information is required.

Exception stack trace from SDK -> https://pastebin.com/raw/f4q5gf0u

Netty DEBUG logs -> https://pastebin.com/raw/Hq5UZ3Lv

Hi @somanshreddy, are these logs from using the built source code or released versions? Looking through the logs I'm noticing a case where exceptions are fired after the pipeline has been closed, which should be fixed by a source code change. Additionally, I would have expected the Netty logs to contain connection reset by peer logs as well, similar to this:

128162 [reactor-http-epoll-4] WARN reactor.netty.http.client.HttpClientConnect - [id: 0x982a505e, L:/127.0.0.1:47488 - R:localhost/127.0.0.1:7777] The connection observed an error
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

@alzimmermsft These are logs from 12.9.0 azure-storage-blob. When was the exception thing fixed?

It appears this was fixed by azure-core 1.11.0 and azure-core-http-netty 1.7.0; azure-storage-blob 12.9.0 depends on versions 1.10.0 and 1.6.3, so it won't have the fix. If you include the newer versions of azure-core and azure-core-http-netty in your project directly, they will be used in place of the versions that azure-storage-blob depends on, and this is safe as the newer versions are backward compatible. Once a version of azure-storage-blob that depends on the fixed versions, or newer, is available, you should be able to remove the direct dependencies on azure-core and azure-core-http-netty.

@alzimmermsft @rickle-msft Do you have a time estimate for the next release of the storage SDK? Thanks!

This should be out by February.

@somanshreddy We have released a new GA version of the SDK. Could you please give it a try and see if it addresses your problem? If it does, could you also please close the issue?

Thanks @rickle-msft. I will probably need some time to test this, but I will keep this thread updated with the results.

@somanshreddy How are things looking? Can we close this issue or do you need further support?

@rickle-msft Sorry, we are actually waiting for #17648 to be released in a GA version so that we can test both together. Every SDK upgrade requires a full regression test, so we would prefer to do it just once when both the fixes are out :)

I see. Makes sense. Since Feb was a beta release, I expect March will be a GA release

@rickle-msft Thank you for that info :)

Hi @somanshreddy

Unfortunately, some additional features we added to the latest beta version have prevented us from releasing a stable version of the 12.11.0 library. We will update this thread when we are able to GA the library.

Thanks @gapra-msft

I keep checking the changelog every couple of days hoping for an update :)
Appreciate the info. Thanks!
