S3 Retransmit Rate


We run a service that continuously uploads S3 objects of various sizes (from 1 MB to 100 MB) at a rate of about 100 objects per second. From time to time (roughly 1 in 10,000), an upload takes an extremely long time (for example, 10 seconds for a 5 MB object). The S3 retransmit rate was also consistently about 1.5%. To remove those long-upload outliers, we implemented logic that times out long-running uploads (capped at 3 seconds) and then retries them. As a result, the S3 retransmit rate dropped to 0.5%.

The service is written in Java using AWS SDK for Java 2.25.0, and it runs inside an AWS VPC.
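For context, our timeout-and-retry wrapper is conceptually similar to setting a per-attempt timeout and retry policy on the SDK client, as in this simplified sketch (not our production code; the bucket, key, file path, and retry count are placeholders):

    import java.nio.file.Paths;
    import java.time.Duration;
    import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
    import software.amazon.awssdk.core.retry.RetryPolicy;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class TimedUpload {
        public static void main(String[] args) {
            // Cap each upload attempt at 3 seconds; the SDK retries timed-out attempts.
            S3Client s3 = S3Client.builder()
                    .overrideConfiguration(ClientOverrideConfiguration.builder()
                            .apiCallAttemptTimeout(Duration.ofSeconds(3)) // per-attempt cap
                            .retryPolicy(RetryPolicy.builder().numRetries(3).build())
                            .build())
                    .build();

            s3.putObject(PutObjectRequest.builder()
                            .bucket("example-bucket") // illustrative bucket/key
                            .key("example-key")
                            .build(),
                    RequestBody.fromFile(Paths.get("/tmp/object.bin")));
        }
    }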

I have a few questions:

  • How can we find out what's causing the retransmits?
  • More importantly, how can we explain the rate drop after applying the timeout-and-retry logic?
  • Lastly, did we pick the wrong solution, or is there a better way to handle those outliers?
Kai
asked a month ago · 549 views
2 Answers

Are you uploading the data from an on-premises or other non-AWS origin, or from inside an AWS VPC?

Are you uploading with the AWS CLI and its aws s3 cp or aws s3 sync commands, for example, or are you using some third-party software or your own custom code?

In general, for uploads and downloads performed either with the CLI or with the similar high-level functions available in the AWS SDKs, a very powerful way to compensate for momentary disruptions in network connections and other intermittent phenomena is to set the concurrency attributes to sufficiently high values, while not setting them so high that the excessive concurrency causes the non-S3 side of the transfer to stall under heavy context switching.

For example, if you perform the upload with 64 concurrent requests and have thousands of files to upload, even if a single one of those 64 stalled for some time, the remaining 63 independent connections would probably use up the capacity momentarily left unused by the one that's stalled.

For the CLI, the concurrency settings are documented here: https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html. The key ones are max_concurrent_requests, which you should set to an adequate value as discussed above, and max_queue_size, which you can set to a high value, like 1,000 or 10,000, to have the client prepare subsequent files/chunks for upload while others are already on their way.

The multipart_threshold and multipart_chunksize settings can be used to adjust concurrency significantly when dealing with a small number of large files, but with small files, they don't necessarily matter that much. You can try out different values for optimisation.
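As a rough illustration, those settings go under the s3 section of ~/.aws/config; the values below are examples to tune for your own workload, not recommendations:

    [profile uploads]
    s3 =
      max_concurrent_requests = 64
      max_queue_size = 10000
      multipart_threshold = 64MB
      multipart_chunksize = 16MB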

EXPERT
Leo K
answered a month ago

I see you updated the question with answers to the platform questions. Following up, is your VPC IPv4-only, dual-stack IPv4+IPv6, or IPv6-only?

If it's IPv4-only or dual-stack, have you got a VPC gateway endpoint for S3 in the VPC? Is the VPC endpoint included in the route table attached to the subnet(s) hosting the compute capacity performing the uploads?

In your Java code, are you using S3TransferManager for the file uploads? https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/transfer/s3/S3TransferManager.html. It supports the same style of concurrent multipart upload tuning as the AWS CLI settings I described earlier. There are practical code examples for uploads using the different classes on this documentation page: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/java_s3_code_examples.html
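As a hedged sketch only (the bucket, key, file path, and the throughput/part-size figures are placeholders, and the CRT-based client additionally needs the aws-crt dependency), an upload via S3TransferManager could look roughly like this:

    import java.nio.file.Paths;
    import software.amazon.awssdk.services.s3.S3AsyncClient;
    import software.amazon.awssdk.transfer.s3.S3TransferManager;
    import software.amazon.awssdk.transfer.s3.model.CompletedFileUpload;
    import software.amazon.awssdk.transfer.s3.model.FileUpload;
    import software.amazon.awssdk.transfer.s3.model.UploadFileRequest;

    public class TransferManagerUpload {
        public static void main(String[] args) {
            // CRT-based async client; it derives its parallelism from the target throughput.
            S3AsyncClient s3Async = S3AsyncClient.crtBuilder()
                    .targetThroughputInGbps(5.0)              // placeholder value
                    .minimumPartSizeInBytes(8L * 1024 * 1024) // placeholder value
                    .build();

            S3TransferManager tm = S3TransferManager.builder()
                    .s3Client(s3Async)
                    .build();

            // Upload one file; multipart splitting and concurrency are handled internally.
            FileUpload upload = tm.uploadFile(UploadFileRequest.builder()
                    .putObjectRequest(req -> req.bucket("example-bucket").key("example-key"))
                    .source(Paths.get("/tmp/object.bin"))
                    .build());

            CompletedFileUpload result = upload.completionFuture().join();
            System.out.println("ETag: " + result.response().eTag());
        }
    }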

For your Java runtime, have you ensured that you have ample memory allocated and that garbage collection runs don't coincide with the moments when the uploads stall? I realise this may not be the cause if only some uploads are affected, but it's worth checking to be sure.
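If you want to rule GC pauses in or out, one low-effort check on Java 9 or newer is to enable unified GC logging and compare pause timestamps against your slow-upload logs (heap sizes, paths, and the jar name here are placeholders):

    java -Xms4g -Xmx4g \
         -Xlog:gc*:file=/var/log/uploader/gc.log:time,uptime,level,tags \
         -jar uploader-service.jar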

EXPERT
Leo K
answered a month ago