S3 Retransmit Rate


We run a service that continuously uploads S3 objects of various sizes (from 1 MB to 100 MB) at a rate of about 100 objects per second. From time to time (roughly 1 in 10,000), an upload takes an extremely long time (for example, 10 seconds for a 5 MB object). The S3 retransmit rate was also consistently about 1.5%. To remove those long-upload outliers, we implemented logic that times out long-running uploads (capped at 3 seconds) and then retries them. As a result, the S3 retransmit rate dropped to 0.5%.

The service is written in Java using AWS SDK for Java 2.25.0, and it runs inside an AWS VPC.
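For context, our timeout-and-retry wrapper is conceptually similar to setting a per-attempt timeout and retry policy on the SDK client, as in this simplified sketch (not our production code; the bucket, key, file path, and retry count are placeholders):

    import java.nio.file.Paths;
    import java.time.Duration;
    import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
    import software.amazon.awssdk.core.retry.RetryPolicy;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class TimedUpload {
        public static void main(String[] args) {
            // Cap each upload attempt at 3 seconds; the SDK retries timed-out attempts.
            S3Client s3 = S3Client.builder()
                    .overrideConfiguration(ClientOverrideConfiguration.builder()
                            .apiCallAttemptTimeout(Duration.ofSeconds(3)) // per-attempt cap
                            .retryPolicy(RetryPolicy.builder().numRetries(3).build())
                            .build())
                    .build();

            s3.putObject(PutObjectRequest.builder()
                            .bucket("example-bucket") // illustrative bucket/key
                            .key("example-key")
                            .build(),
                    RequestBody.fromFile(Paths.get("/tmp/object.bin")));
        }
    }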

I have a few questions:

  • How can we find out what's causing the retransmits?
  • More importantly, how can we explain the rate drop after applying the timeout-and-retry logic?
  • Lastly, did we pick the wrong solution, or is there a better way to handle those outliers?
Kai
asked a month ago · 549 views
2 Answers

Are you uploading the data from an on-premises or other non-AWS origin, or from inside an AWS VPC?

Are you uploading with the AWS CLI and its aws s3 cp or aws s3 sync commands, for example, or are you using some third-party software or your own custom code?

In general, for uploads and downloads performed either with the CLI or with the similar high-level functions available in the AWS SDKs, a very powerful way to compensate for momentary disruptions in network connections and other intermittent phenomena is to set the concurrency attributes to sufficiently high values, while not setting them so high that the excessive concurrency causes the non-S3 side of the transfer to stall under heavy context switching.

For example, if you perform the upload with 64 concurrent requests and have thousands of files to upload, even if a single one of those 64 stalled for some time, the remaining 63 independent connections would probably use up the capacity momentarily left unused by the one that's stalled.

For the CLI, the concurrency settings are documented here: https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html. The key ones are max_concurrent_requests, which you should set to an adequate value as discussed above, and max_queue_size, which you can set to a high value, like 1,000 or 10,000, to have the client prepare subsequent files/chunks for upload while others are already on their way.

The multipart_threshold and multipart_chunksize settings can be used to adjust concurrency significantly when dealing with a small number of large files, but with small files, they don't necessarily matter that much. You can try out different values for optimisation.
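As a rough illustration, those settings go under the s3 section of ~/.aws/config; the values below are examples to tune for your own workload, not recommendations:

    [profile uploads]
    s3 =
      max_concurrent_requests = 64
      max_queue_size = 10000
      multipart_threshold = 64MB
      multipart_chunksize = 16MB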

EXPERT
Leo K
answered a month ago

I see you updated the question with answers to the platform questions. Following up, is your VPC IPv4-only, dual-stack IPv4+IPv6, or IPv6-only?

If it's IPv4-only or dual-stack, have you got a VPC gateway endpoint for S3 in the VPC? Is the VPC endpoint included in the route table attached to the subnet(s) hosting the compute capacity performing the uploads?

In your Java code, are you using S3TransferManager for the file uploads? https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/transfer/s3/S3TransferManager.html. It supports the same style of concurrent multipart upload tuning as the AWS CLI settings I described earlier. There are practical code examples for uploads using the different classes on this documentation page: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/java_s3_code_examples.html
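As a hedged sketch only (the bucket, key, file path, and the throughput/part-size figures are placeholders, and the CRT-based client additionally needs the aws-crt dependency), an upload via S3TransferManager could look roughly like this:

    import java.nio.file.Paths;
    import software.amazon.awssdk.services.s3.S3AsyncClient;
    import software.amazon.awssdk.transfer.s3.S3TransferManager;
    import software.amazon.awssdk.transfer.s3.model.CompletedFileUpload;
    import software.amazon.awssdk.transfer.s3.model.FileUpload;
    import software.amazon.awssdk.transfer.s3.model.UploadFileRequest;

    public class TransferManagerUpload {
        public static void main(String[] args) {
            // CRT-based async client; it derives its parallelism from the target throughput.
            S3AsyncClient s3Async = S3AsyncClient.crtBuilder()
                    .targetThroughputInGbps(5.0)              // placeholder value
                    .minimumPartSizeInBytes(8L * 1024 * 1024) // placeholder value
                    .build();

            S3TransferManager tm = S3TransferManager.builder()
                    .s3Client(s3Async)
                    .build();

            // Upload one file; multipart splitting and concurrency are handled internally.
            FileUpload upload = tm.uploadFile(UploadFileRequest.builder()
                    .putObjectRequest(req -> req.bucket("example-bucket").key("example-key"))
                    .source(Paths.get("/tmp/object.bin"))
                    .build());

            CompletedFileUpload result = upload.completionFuture().join();
            System.out.println("ETag: " + result.response().eTag());
        }
    }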

For your Java runtime, have you ensured that you have ample memory allocated and that garbage collection runs don't coincide with the moments when the uploads stall? I realise this may not be the cause if only some uploads are affected, but it's worth checking to be sure.
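If you want to rule GC pauses in or out, one low-effort check on Java 9 or newer is to enable unified GC logging and compare pause timestamps against your slow-upload logs (heap sizes, paths, and the jar name here are placeholders):

    java -Xms4g -Xmx4g \
         -Xlog:gc*:file=/var/log/uploader/gc.log:time,uptime,level,tags \
         -jar uploader-service.jar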

EXPERT
Leo K
answered a month ago