AWS Elastic Disaster Recovery Source Server Lag

0

I have recently begun setting up and using the Elastic Disaster Recovery service. Using Systems Manager, I was able to install the requirements needed for the replication agent on our Ubuntu servers. The initial population of the Source Servers went fine. The default setting for the PIT policy is two days, yet all of my source servers show a Data replication status of "Lag 6 days". I have checked that the replication services are running. I know the network configuration for the subnet and security groups is fine, or I wouldn't have been able to complete the initial sync. What could be causing the lag? Wouldn't a PIT policy of 2 days mean the farthest behind I should expect data loss to be is two days? I have searched many forums and I am unable to determine what is causing the lag to be so large. The EC2 replication instance is the recommended t2.small, and the subnet is dedicated solely to DRS. Searching through the logs has shown nothing of relevance so far, and I have no errors showing from the agent. Any tips would be appreciated.
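
For reference, this is roughly how I've been pulling the replication state and lag reported for each source server (a minimal boto3 sketch; it assumes credentials and region are configured, and the field names are my reading of the DRS API, so treat it as illustrative):

```python
# Minimal sketch: list each DRS source server with its replication state and lag.
# Assumes default AWS credentials/region; field names per my reading of the DRS API.
import boto3

drs = boto3.client("drs")

next_token = None
while True:
    kwargs = {"filters": {}}
    if next_token:
        kwargs["nextToken"] = next_token
    resp = drs.describe_source_servers(**kwargs)
    for server in resp["items"]:
        info = server.get("dataReplicationInfo", {})
        print(server["sourceServerID"],
              info.get("dataReplicationState"),
              "lag:", info.get("lagDuration"),
              "eta:", info.get("etaDateTime"))
    next_token = resp.get("nextToken")
    if not next_token:
        break
```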

4 Answers
1
Accepted Answer

Depending on how fast data is changing on the disks of the source machine, a t2.small could easily be too small to keep up. For the t2 family, the exact network and EBS throughput aren't documented publicly, but for the newer generation instance of equivalent size, t3.small, the sustained network throughput is 128 Mbit/s (16 MB/s) and the sustained EBS throughput is 174 Mbit/s (21.75 MB/s).

You could try switching the replication instance to a larger instance in the t3 family, for example, or a fixed-capacity instance, such as an m6a family instance. You can find the network and EBS throughput specifications for general purpose instances (like t2, t3, and m6a) here: https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html
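If you want to confirm whether the replication server is actually saturating its network before you resize it, a rough boto3 sketch like the one below can pull its recent NetworkIn from CloudWatch. The instance ID is a placeholder, and you'd compare the result against the roughly 16 MB/s t3.small baseline mentioned above.

```python
# Sketch: estimate recent inbound throughput on the DRS replication server.
# The instance ID is a placeholder; assumes default credentials/region.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

REPLICATION_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkIn",
    Dimensions=[{"Name": "InstanceId", "Value": REPLICATION_INSTANCE_ID}],
    StartTime=start,
    EndTime=end,
    Period=300,            # 5-minute buckets
    Statistics=["Sum"],    # total bytes received per bucket
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    mb_per_s = point["Sum"] / 300 / 1_000_000  # bytes per 5 minutes -> MB/s
    print(point["Timestamp"], f"{mb_per_s:.1f} MB/s")
```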

EXPERT
Leo K
answered 2 months ago
  • Thank you for the input. The steps I have taken:

    1. Created a new dedicated subnet in us-east-1a to increase server options
    2. Updated the replication instance from the t2.small to an m7i.large
    3. Deleted the source server
    4. Uninstalled/reinstalled the aws-replication agent on the source server
    5. Added the source server again with the new m7i as the default replication instance

    Now the issue seems to be that the initial sync completes to 100% but then gets stuck on creating a snapshot. Again, there are no errors to indicate why it can't move past the Initial Sync phase completely. I can see the new volumes attached to the replication instance, and I am not sure where I have gone wrong. I have triple-checked the documentation and everything seems to be in place.

0

For the snapshot appearing to get stuck, open the EC2 console and the snapshots view, sort the snapshots in descending order by start time, and check if any are in progress. If they are, that means that permissions are properly configured and creating the snapshot is just taking time. Creating an initial snapshot can take quite a while, particularly if using an inexpensive EBS volume type or if the volume is simply large. The snapshots view in the console will also allow you to track the progress.
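If you'd rather check from code than the console, a rough boto3 equivalent (assuming the snapshots are owned by the same account) would be:

```python
# Sketch: list pending EBS snapshots owned by this account, newest first.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "status", "Values": ["pending"]}],
)

for snap in sorted(resp["Snapshots"], key=lambda s: s["StartTime"], reverse=True):
    print(snap["StartTime"], snap["SnapshotId"], snap["VolumeId"], snap["Progress"])
```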

If no snapshots are in progress and the DRS console still shows the snapshot as pending, check the CloudTrail logs in the region DRS is running in for the event names CreateSnapshot and CreateSnapshots. In the CloudTrail Event history console view, you can modify the column selections to include "Error code" to help you find API calls that failed. Check this view for any events showing an error code, particularly for insufficient permissions or API throttling.
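The same CloudTrail check can be scripted if that's easier; here's a sketch using lookup_events (CloudTrail accepts only one lookup attribute per call, hence the loop):

```python
# Sketch: surface failed CreateSnapshot/CreateSnapshots calls from the last 24 hours.
import json
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

for event_name in ("CreateSnapshot", "CreateSnapshots"):
    resp = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=start,
        EndTime=end,
    )
    for event in resp["Events"]:
        detail = json.loads(event["CloudTrailEvent"])  # full event record as JSON
        if detail.get("errorCode"):
            print(event_name, event["EventTime"],
                  detail["errorCode"], detail.get("errorMessage"))
```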

EXPERT
Leo K
answered 2 months ago
  • Leo, thank you for the continued help.

    Unfortunately, the EC2 snapshots view shows nothing pending, and CloudTrail shows no snapshots in error, nor any record of the API call even being made.

    The last entry for the source server in the CloudTrail Event history is DescribeSourceServers, with no error code.

0

Are you seeing snapshots of the disks attached to the replication instance completing successfully? If the snapshots are completing okay, the DRS console is just not accurately reflecting the stage where the process is getting stuck or taking time. If the issue persists, the quickest route might be to contact AWS Support, who can see the details of what is happening in your account.
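As a sketch, one way to verify that (the replication instance ID is a placeholder) is to list the volumes attached to the replication instance and then the snapshots taken from each:

```python
# Sketch: for each volume attached to the replication instance, list its snapshots.
# The instance ID is a placeholder; assumes the snapshots are owned by this account.
import boto3

ec2 = boto3.client("ec2")

REPLICATION_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

volumes = ec2.describe_volumes(
    Filters=[{"Name": "attachment.instance-id", "Values": [REPLICATION_INSTANCE_ID]}]
)["Volumes"]

for vol in volumes:
    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [vol["VolumeId"]]}],
    )["Snapshots"]
    for snap in sorted(snaps, key=lambda s: s["StartTime"], reverse=True):
        print(vol["VolumeId"], snap["SnapshotId"], snap["State"], snap["Progress"])
```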

EXPERT
Leo K
answered 2 months ago
0

Problem resolved. The issue ended up being VPC endpoints that were added after the initial deployment to provide access to a private subnet. It was a network issue after all, although I do believe the t2.small was also an issue.

I have removed the VPC endpoints for testing and will reconfigure them later.
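
For anyone hitting the same thing, a quick boto3 sketch like this (the VPC ID is a placeholder) will list the endpoints in the staging-area VPC so you can see which services they cover and what state they are in:

```python
# Sketch: list VPC endpoints in the VPC that holds the dedicated DRS staging subnet.
# The VPC ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")

STAGING_VPC_ID = "vpc-0123456789abcdef0"  # placeholder

resp = ec2.describe_vpc_endpoints(
    Filters=[{"Name": "vpc-id", "Values": [STAGING_VPC_ID]}]
)

for endpoint in resp["VpcEndpoints"]:
    print(endpoint["VpcEndpointId"], endpoint["ServiceName"],
          endpoint["VpcEndpointType"], endpoint["State"])
```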

pshute
answered 2 months ago