RDS taking snapshots outside the backup window

I administer an RDS SQL Server instance, with a backup window of 22:00-22:30 UTC and a retention period of 35 days.

I have noticed some peculiarities:

  • There are 55 automatic snapshots currently stored, not 35 as I would expect. 20 of the days seem to have 2 snapshots rather than 1.
  • 19 of those 55 snapshots have start times outside the backup window. If they were a couple of minutes outside I could ignore that, but the times are all over the place, e.g.
17/05/2024 03:01:05
17/05/2024 11:26:10
17/05/2024 14:35:32
20/05/2024 12:36:04
21/05/2024 15:40:30
23/05/2024 17:06:00
24/05/2024 23:05:21
27/05/2024 22:45:26
28/05/2024 09:05:34
29/05/2024 17:50:23
01/06/2024 16:00:29
04/06/2024 12:50:47
06/06/2024 04:40:31
06/06/2024 05:30:47
08/06/2024 07:05:29
08/06/2024 16:40:31
08/06/2024 18:30:40
12/06/2024 16:35:40
17/06/2024 09:11:00
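
One way to separate in-window from out-of-window snapshots is a short script like the one below. It is only a sketch: it assumes the start times have already been exported as `DD/MM/YYYY HH:MM:SS` strings in UTC, and the names `outside_window` and `stamps` are made up for illustration. The last sample value is hypothetical and is included only to show an in-window case.

```python
from datetime import datetime, time

# Backup window from the question: 22:00-22:30 UTC (does not cross midnight).
WINDOW_START = time(22, 0)
WINDOW_END = time(22, 30)


def parse_stamp(stamp: str) -> datetime:
    """Parse a 'DD/MM/YYYY HH:MM:SS' string as listed in the question."""
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")


def outside_window(stamp: str) -> bool:
    """Return True when the snapshot start time is outside the backup window."""
    t = parse_stamp(stamp).time()
    return not (WINDOW_START <= t <= WINDOW_END)


stamps = [
    "24/05/2024 23:05:21",  # from the list above
    "27/05/2024 22:45:26",  # from the list above
    "17/06/2024 09:11:00",  # from the list above
    "01/06/2024 22:05:00",  # hypothetical in-window snapshot, for contrast
]

late = [s for s in stamps if outside_window(s)]
print(late)
```

A tolerance of a few minutes could be added to `WINDOW_END` if snapshots that start slightly late should not be flagged.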

This is a problem because snapshots that start outside the backup window have been interfering with SQL Server native backups that start at other times of the day. I really need all snapshots to start within the backup window. Is there any way I can achieve that?

PS: The time zone of the instance is UTC, and all times quoted above are UTC.

UPDATE 1

I fetched the RDS snapshot list (35 days data), the RDS event log (14 days data), and the SQL error log (7 days data) and compared them.

In the RDS event log, I see 14 ("Backing up DB instance", "Finished DB Instance backup") sequences whose start times occur within the backup window. I also see 4 ("Emergent Snapshot Request", "Backing up DB instance", "Finished DB Instance backup") sequences whose start times occur outside the backup window.

Those 4 emergent event start times are within a minute of the creation times of the 4 most recent automated snapshots that occur outside the backup window. Given the small amount of log data I have to work with, this looks like a good correlation.
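
The "within a minute" comparison can be sketched as follows. This is an illustration rather than the exact procedure used: the function name `match_events_to_snapshots` is made up, and the timestamps in the example are hypothetical placeholders, not values taken from the logs.

```python
from datetime import datetime, timedelta


def match_events_to_snapshots(event_times, snapshot_times,
                              tolerance=timedelta(minutes=1)):
    """Pair each event time with the nearest snapshot time within tolerance."""
    pairs = []
    for ev in event_times:
        nearest = min(snapshot_times, key=lambda s: abs(s - ev), default=None)
        if nearest is not None and abs(nearest - ev) <= tolerance:
            pairs.append((ev, nearest))
    return pairs


# Hypothetical data: one emergent event and two automated snapshots.
events = [datetime(2024, 6, 22, 13, 9, 0)]
snapshots = [
    datetime(2024, 6, 21, 22, 5, 0),   # in-window snapshot, no match expected
    datetime(2024, 6, 22, 13, 9, 40),  # 40 seconds after the event
]

pairs = match_events_to_snapshots(events, snapshots)
print(pairs)
```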

In the SQL error log, I can see evidence of various backup activity succeeding and failing. There are 2 failure sequences, both starting about 5 minutes before the 2 most recent emergent snapshots. The first few events logged for each one are:

2024-06-22 13:04:08.20 Backup      Error: 18210, Severity: 16, State: 1.
2024-06-22 13:04:08.20 Backup      BackupIoRequest::ReportIoError: write failure on backup device '630EE1BE-9DE4-4874-96EA-7308C965B234'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2024-06-22 13:04:08.20 Backup      Error: 3041, Severity: 16, State: 1.
2024-06-22 13:04:08.20 Backup      BACKUP failed to complete the command BACKUP LOG **********. Check the backup application log for detailed messages.
2024-06-22 13:04:08.21 spid133     Error: 18210, Severity: 16, State: 1.
2024-06-22 13:04:08.21 spid133     BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device '630EE1BE-9DE4-4874-96EA-7308C965B234'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2024-06-24 21:54:07.38 Backup      Error: 18210, Severity: 16, State: 1.
2024-06-24 21:54:07.38 Backup      BackupIoRequest::ReportIoError: write failure on backup device 'ACA57043-4F09-4503-A20A-39914C36848E'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2024-06-24 21:54:07.38 Backup      Error: 3041, Severity: 16, State: 1.
2024-06-24 21:54:07.38 Backup      BACKUP failed to complete the command BACKUP LOG **********. Check the backup application log for detailed messages.
2024-06-24 21:54:07.38 spid114     Error: 18210, Severity: 16, State: 1.
2024-06-24 21:54:07.38 spid114     BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device 'ACA57043-4F09-4503-A20A-39914C36848E'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).

These look like bullet point 3 from https://repost.aws/knowledge-center/rds-sql-server-emergent-snapshot-backup:

For Point in Time Recovery (PiTR), RDS uploads transaction log backups every five minutes for DB instances to Amazon Simple Storage Service (Amazon S3). When RDS doesn't take transactional log backups successfully, an Emergent Snapshot is triggered by RDS to mitigate problems during PiTR.
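
To check this against a longer log, the failed transaction log backups can be extracted from the SQL error log text with a small parser. The sketch below assumes the line layout shown in the excerpt above (timestamp, source, message); the names `LINE_RE` and `failed_log_backups` are made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout: 'YYYY-MM-DD HH:MM:SS.ff <source>   <message>'
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2}) (\S+)\s+(.*)$"
)
FAILED_PREFIX = "BACKUP failed to complete the command BACKUP LOG"


def failed_log_backups(lines):
    """Return the timestamps of failed BACKUP LOG commands."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(3).startswith(FAILED_PREFIX):
            hits.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return hits


# Two lines copied from the error log excerpt in the question.
sample = [
    "2024-06-22 13:04:08.20 Backup      Error: 3041, Severity: 16, State: 1.",
    "2024-06-22 13:04:08.20 Backup      BACKUP failed to complete the command "
    "BACKUP LOG **********. Check the backup application log for detailed messages.",
]

failures = failed_log_backups(sample)
print(failures)
```

The resulting timestamps can then be compared with the creation times of the out-of-window snapshots.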

Microsoft documentation for error 18210 says:

While the cause can be varied, ultimately the error is due to a failed IO submission to the Operating System.

I'm not sure where to go from here. I can't look at OS event logs because it's RDS. SQL Server reports the following details about itself:

  • Product: Microsoft SQL Server Web (64-bit)
  • Version: 14.0.3381.3
  • OS: Windows Server 2016 Datacenter (10.0)
  • Memory: 8090 MB
  • Processors: 2
asked 2 months ago
138 views
3 Answers

There can be several reasons for having more than one backup per day. Can you confirm whether any of these happened?

  • AWS Backup Is Configured
  • The RDS Instance was stopped/started
  • RDS/SQL Version Upgrade
  • Instance Change
answered 2 months ago
  • Thanks Gary. To answer your questions:

    • AWS Backup is not configured (the default vault is empty, there are no plans and no jobs).
    • The RDS instance runs 24/7, AFAIK no one has shut it down.
    • The SQL version has not changed.
    • The instance has not been renamed or replaced.

Hi Christian,

This KB article has some interesting guidance about so-called "emergent snapshots": https://repost.aws/knowledge-center/rds-sql-server-emergent-snapshot-backup

Maybe it matches what you see in your use case?

Best,

Didier

answered 2 months ago
  • Thanks Didier. I have done more investigation, and added an "UPDATE 1" section to my question. Do you have any suggestion as to what I could do next?

Hi Hayter,

  • Check RDS Events: Look for events related to snapshots in the RDS Management Console. Events like "Backing up DB instance" signify scheduled backups, while "Emergent Snapshot Request" indicates an automatic snapshot taken outside the window due to pending backups.
  • Review User Activity: If you suspect a manual snapshot, check the RDS console logs for any snapshot creation activity outside the window.
  • Review Backup Window: If you find the scheduled window disruptive, consider adjusting it to a less critical time for your workload.
  • Automate Monitoring: Set up CloudWatch alerts to notify you of unexpected snapshot activity outside the window.
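
The event check can also be scripted instead of done in the console. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. It uses the RDS `describe_events` call; the helper name `emergent_events` and the instance identifier in the usage note are hypothetical, and the message text "Emergent Snapshot" is taken from the event names quoted in this thread rather than from a documented contract.

```python
def emergent_events(rds_client, instance_id, minutes=20160):
    """Return (date, message) pairs for events mentioning an emergent snapshot.

    rds_client is expected to behave like boto3.client("rds").
    minutes=20160 covers 14 days, the event history mentioned in the question.
    Only the first page of results is read; follow the returned 'Marker'
    if the instance produces more events than fit in one response.
    """
    resp = rds_client.describe_events(
        SourceIdentifier=instance_id,
        SourceType="db-instance",
        Duration=minutes,
    )
    return [
        (event["Date"], event["Message"])
        for event in resp.get("Events", [])
        if "emergent snapshot" in event["Message"].lower()
    ]
```

Usage would look like `emergent_events(boto3.client("rds"), "my-sql-instance")`, where `my-sql-instance` stands in for the real DB instance identifier.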

For further details on RDS backups and snapshots, refer to the official AWS documentation: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html

answered 2 months ago