AWS Elastic Disaster Recovery Source Server Lag

0

I have recently begun setting up and using the Elastic Disaster Recovery service. Using Systems Manager, I was able to install the requirements needed for the replication agent on our Ubuntu servers. The initial population of the Source Servers went fine. The default setting for the PIT policy is two days, yet all of my source servers show a Data replication status of "Lag 6 days". I have checked that the replication services are running. I know the network configuration for the subnet and security groups is fine, or I wouldn't have been able to complete the initial sync. What could be causing the lag? Wouldn't a PIT policy of 2 days mean the farthest behind I should expect data loss to be is two days? I have searched many forums and I am unable to determine what is causing the lag to be so large. The EC2 replication instance is the recommended t2.small, and the subnet is dedicated solely to DRS. Searching through the logs has shown nothing of relevance so far, and I have no errors showing from the agent. Any tips would be appreciated.
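
For reference, this is roughly how I've been pulling the replication state and lag reported for each source server (a minimal boto3 sketch; it assumes credentials and region are configured, and the field names are my reading of the DRS API, so treat it as illustrative):

```python
# Minimal sketch: list each DRS source server with its replication state and lag.
# Assumes default AWS credentials/region; field names per my reading of the DRS API.
import boto3

drs = boto3.client("drs")

next_token = None
while True:
    kwargs = {"filters": {}}
    if next_token:
        kwargs["nextToken"] = next_token
    resp = drs.describe_source_servers(**kwargs)
    for server in resp["items"]:
        info = server.get("dataReplicationInfo", {})
        print(server["sourceServerID"],
              info.get("dataReplicationState"),
              "lag:", info.get("lagDuration"),
              "eta:", info.get("etaDateTime"))
    next_token = resp.get("nextToken")
    if not next_token:
        break
```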

4 Answers
1
Accepted Answer

Depending on how fast data is changing on the disks of the source machine, a t2.small could easily be too small to keep up. For the t2 family, the exact network and EBS throughput aren't documented publicly, but for the newer generation instance of equivalent size, t3.small, the sustained network throughput is 128 Mbit/s (16 MB/s) and the sustained EBS throughput is 174 Mbit/s (21.75 MB/s).

You could try switching the replication instance to a larger instance in the t3 family, for example, or a fixed-capacity instance, such as an m6a family instance. You can find the network and EBS throughput specifications for general purpose instances (like t2, t3, and m6a) here: https://docs.aws.amazon.com/ec2/latest/instancetypes/gp.html
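If you want to confirm whether the replication server is actually saturating its network before you resize it, a rough boto3 sketch like the one below can pull its recent NetworkIn from CloudWatch. The instance ID is a placeholder, and you'd compare the result against the roughly 16 MB/s t3.small baseline mentioned above.

```python
# Sketch: estimate recent inbound throughput on the DRS replication server.
# The instance ID is a placeholder; assumes default credentials/region.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

REPLICATION_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkIn",
    Dimensions=[{"Name": "InstanceId", "Value": REPLICATION_INSTANCE_ID}],
    StartTime=start,
    EndTime=end,
    Period=300,            # 5-minute buckets
    Statistics=["Sum"],    # total bytes received per bucket
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    mb_per_s = point["Sum"] / 300 / 1_000_000  # bytes per 5 minutes -> MB/s
    print(point["Timestamp"], f"{mb_per_s:.1f} MB/s")
```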

EXPERT
Leo K
answered 2 months ago
  • Thank you for the input. The steps I have taken:

    1. Created a new dedicated subnet in us-east-1a to increase server options
    2. Updated the replication instance from the t2.small to an m7i.large
    3. Deleted the source server
    4. Uninstalled/reinstalled the aws-replication agent on the source server
    5. Added the source server again with the new m7i as the default replication instance

    Now the issue seems to be that the initial sync completes to 100% but then gets stuck on creating a snapshot. Again, there are no errors to indicate why it can't move past the Initial Sync phase completely. I can see the new volumes attached to the replication instance, and I am not sure where I have gone wrong. I have triple-checked the documentation and everything seems to be in place.

0

For the snapshot appearing to get stuck, open the EC2 console and the snapshots view, sort the snapshots in descending order by start time, and check if any are in progress. If they are, that means that permissions are properly configured and creating the snapshot is just taking time. Creating an initial snapshot can take quite a while, particularly if using an inexpensive EBS volume type or if the volume is simply large. The snapshots view in the console will also allow you to track the progress.
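If you'd rather check from code than the console, a rough boto3 equivalent (assuming the snapshots are owned by the same account) would be:

```python
# Sketch: list pending EBS snapshots owned by this account, newest first.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "status", "Values": ["pending"]}],
)

for snap in sorted(resp["Snapshots"], key=lambda s: s["StartTime"], reverse=True):
    print(snap["StartTime"], snap["SnapshotId"], snap["VolumeId"], snap["Progress"])
```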

If no snapshots are in progress and the DRS console still shows the snapshot as pending, check the CloudTrail logs in the region DRS is running in for the event names CreateSnapshot and CreateSnapshots. In the CloudTrail Event history console view, you can modify the column selections to include "Error code" to help you find API calls that failed. Check this view for any events showing an error code, particularly for insufficient permissions or API throttling.
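The same CloudTrail check can be scripted if that's easier; here's a sketch using lookup_events (CloudTrail accepts only one lookup attribute per call, hence the loop):

```python
# Sketch: surface failed CreateSnapshot/CreateSnapshots calls from the last 24 hours.
import json
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

for event_name in ("CreateSnapshot", "CreateSnapshots"):
    resp = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=start,
        EndTime=end,
    )
    for event in resp["Events"]:
        detail = json.loads(event["CloudTrailEvent"])  # full event record as JSON
        if detail.get("errorCode"):
            print(event_name, event["EventTime"],
                  detail["errorCode"], detail.get("errorMessage"))
```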

EXPERT
Leo K
answered 2 months ago
  • Leo, thank you for the continued help.

    Unfortunately, the EC2 snapshots view shows nothing pending, and CloudTrail shows no snapshots in error, nor any record of the API call even being made.

    The last entry for the source server in the CloudTrail Event history is DescribeSourceServers, with no error code.

0

Are you seeing snapshots of the disks attached to the replication instance completing successfully? If the snapshots are completing okay, the DRS console is just not accurately reflecting the stage where the process is getting stuck or taking time. If the issue persists, the quickest route might be to contact AWS Support, who can see the details of what is happening in your account.
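As a sketch, one way to verify that (the replication instance ID is a placeholder) is to list the volumes attached to the replication instance and then the snapshots taken from each:

```python
# Sketch: for each volume attached to the replication instance, list its snapshots.
# The instance ID is a placeholder; assumes the snapshots are owned by this account.
import boto3

ec2 = boto3.client("ec2")

REPLICATION_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

volumes = ec2.describe_volumes(
    Filters=[{"Name": "attachment.instance-id", "Values": [REPLICATION_INSTANCE_ID]}]
)["Volumes"]

for vol in volumes:
    snaps = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [vol["VolumeId"]]}],
    )["Snapshots"]
    for snap in sorted(snaps, key=lambda s: s["StartTime"], reverse=True):
        print(vol["VolumeId"], snap["SnapshotId"], snap["State"], snap["Progress"])
```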

EXPERT
Leo K
answered 2 months ago
0

Problem resolved. The issue ended up being VPC endpoints that were added after the initial deployment to provide access to a private subnet. It was a network issue after all, although I do believe the t2.small was also an issue.

I have removed the VPC endpoints for testing and will reconfigure them later.
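
For anyone hitting the same thing, a quick boto3 sketch like this (the VPC ID is a placeholder) will list the endpoints in the staging-area VPC so you can see which services they cover and what state they are in:

```python
# Sketch: list VPC endpoints in the VPC that holds the dedicated DRS staging subnet.
# The VPC ID is a placeholder.
import boto3

ec2 = boto3.client("ec2")

STAGING_VPC_ID = "vpc-0123456789abcdef0"  # placeholder

resp = ec2.describe_vpc_endpoints(
    Filters=[{"Name": "vpc-id", "Values": [STAGING_VPC_ID]}]
)

for endpoint in resp["VpcEndpoints"]:
    print(endpoint["VpcEndpointId"], endpoint["ServiceName"],
          endpoint["VpcEndpointType"], endpoint["State"])
```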

pshute
answered 2 months ago