I administer an RDS SQL Server instance, with a backup window of 22:00-22:30 UTC and a retention period of 35 days.
I have noticed some peculiarities:
- There are currently 55 automated snapshots stored, not the 35 I would expect. 20 of the days seem to have 2 snapshots rather than 1.
- 19 of those 55 snapshots have start times outside the backup window. If they were only a couple of minutes outside I could ignore that, but the times are all over the place, e.g.:
17/05/2024 03:01:05
17/05/2024 11:26:10
17/05/2024 14:35:32
20/05/2024 12:36:04
21/05/2024 15:40:30
23/05/2024 17:06:00
24/05/2024 23:05:21
27/05/2024 22:45:26
28/05/2024 09:05:34
29/05/2024 17:50:23
01/06/2024 16:00:29
04/06/2024 12:50:47
06/06/2024 04:40:31
06/06/2024 05:30:47
08/06/2024 07:05:29
08/06/2024 16:40:31
08/06/2024 18:30:40
12/06/2024 16:35:40
17/06/2024 09:11:00
This is a problem because snapshots that start outside the backup window have been interfering with SQL Server native backups that start at other times of the day. I really need all snapshots to start within the backup window. Is there any way I can achieve that?
PS: The time zone of the instance is UTC, and all times quoted above are UTC.
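For anyone who wants to reproduce the out-of-window check, here is a minimal Python sketch. The function and variable names are hypothetical, the three sample timestamps are copied from the list above, and the window boundaries are assumed to be inclusive.

```python
from datetime import datetime, time

# Backup window from the question: 22:00-22:30 UTC (assumed inclusive).
WINDOW_START = time(22, 0)
WINDOW_END = time(22, 30)


def parse_utc(text):
    """Parse a dd/mm/yyyy hh:mm:ss timestamp as listed in the question."""
    return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")


def in_backup_window(created, start=WINDOW_START, end=WINDOW_END):
    """Return True if the snapshot start time falls inside the window."""
    return start <= created.time() <= end


# Three of the snapshot start times listed above.
sample = ["24/05/2024 23:05:21", "27/05/2024 22:45:26", "17/06/2024 09:11:00"]
outside = [s for s in sample if not in_backup_window(parse_utc(s))]

for s in outside:
    print("outside window:", s)
```

Note that this simple comparison only works because the window does not span midnight.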
UPDATE 1
I fetched the RDS snapshot list (35 days of data), the RDS event log (14 days of data), and the SQL error log (7 days of data) and compared them.
In the RDS event log, I see 14 ("Backing up DB instance", "Finished DB Instance backup") sequences whose start times occur within the backup window. I also see 4 ("Emergent Snapshot Request", "Backing up DB instance", "Finished DB Instance backup") sequences whose start times occur outside the backup window.
Those 4 emergent event start times are within a minute of the creation times of the 4 most recent automated snapshots that occur outside the backup window. Given the small amount of log data I have to work with, this looks like a good correlation.
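The matching described above can be sketched roughly as follows. This is an illustration rather than the exact procedure used: the function name and the one-minute tolerance parameter are assumptions, and the timestamps in the example are made up purely to show the behaviour (they are not taken from the real logs).

```python
from datetime import datetime, timedelta


def match_events_to_snapshots(event_starts, snapshot_times,
                              tolerance=timedelta(minutes=1)):
    """Pair each event start with the closest snapshot creation time.

    Returns a list of (event_start, snapshot_time_or_None) tuples, where
    the snapshot is None if nothing lies within the tolerance.
    """
    pairs = []
    for ev in event_starts:
        closest = min(snapshot_times, key=lambda s: abs(s - ev), default=None)
        if closest is not None and abs(closest - ev) <= tolerance:
            pairs.append((ev, closest))
        else:
            pairs.append((ev, None))
    return pairs


# Hypothetical timestamps, for illustration only.
events = [datetime(2024, 1, 1, 13, 9, 30), datetime(2024, 1, 2, 8, 0, 0)]
snapshots = [datetime(2024, 1, 1, 13, 10, 5), datetime(2024, 1, 1, 22, 5, 0)]

for ev, snap in match_events_to_snapshots(events, snapshots):
    print(ev, "->", snap)
```

An event paired with None has no snapshot within the tolerance, which would weaken the correlation.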
In the SQL error log, I can see evidence of various backup activity succeeding and failing. There are 2 failure sequences, both starting about 5 minutes before the 2 most recent emergent snapshots. The first few events logged for each one are:
2024-06-22 13:04:08.20 Backup Error: 18210, Severity: 16, State: 1.
2024-06-22 13:04:08.20 Backup BackupIoRequest::ReportIoError: write failure on backup device '630EE1BE-9DE4-4874-96EA-7308C965B234'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2024-06-22 13:04:08.20 Backup Error: 3041, Severity: 16, State: 1.
2024-06-22 13:04:08.20 Backup BACKUP failed to complete the command BACKUP LOG **********. Check the backup application log for detailed messages.
2024-06-22 13:04:08.21 spid133 Error: 18210, Severity: 16, State: 1.
2024-06-22 13:04:08.21 spid133 BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device '630EE1BE-9DE4-4874-96EA-7308C965B234'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2024-06-24 21:54:07.38 Backup Error: 18210, Severity: 16, State: 1.
2024-06-24 21:54:07.38 Backup BackupIoRequest::ReportIoError: write failure on backup device 'ACA57043-4F09-4503-A20A-39914C36848E'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
2024-06-24 21:54:07.38 Backup Error: 3041, Severity: 16, State: 1.
2024-06-24 21:54:07.38 Backup BACKUP failed to complete the command BACKUP LOG **********. Check the backup application log for detailed messages.
2024-06-24 21:54:07.38 spid114 Error: 18210, Severity: 16, State: 1.
2024-06-24 21:54:07.38 spid114 BackupVirtualDeviceFile::RequestDurableMedia: Flush failure on backup device 'ACA57043-4F09-4503-A20A-39914C36848E'. Operating system error 995(The I/O operation has been aborted because of either a thread exit or an application request.).
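To pull failure sequences like these out of a longer error log, something along the following lines may help. It is a sketch that assumes every error header has the "timestamp, source, Error: N, Severity: N" shape shown above; the regular expression and function name are my own, and the sample lines are copied from the excerpt.

```python
import re
from datetime import datetime

# Matches error header lines such as:
# 2024-06-22 13:04:08.20 Backup Error: 18210, Severity: 16, State: 1.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\s+"
    r"(?P<source>\S+)\s+Error: (?P<number>\d+), Severity: (?P<severity>\d+)"
)


def parse_errors(lines):
    """Return (timestamp, source, error_number) for each error header line."""
    errors = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            errors.append((ts, m["source"], int(m["number"])))
    return errors


log_lines = [
    "2024-06-22 13:04:08.20 Backup Error: 18210, Severity: 16, State: 1.",
    "2024-06-22 13:04:08.20 Backup Error: 3041, Severity: 16, State: 1.",
    "2024-06-22 13:04:08.20 Backup BACKUP failed to complete the command BACKUP LOG **********. Check the backup application log for detailed messages.",
    "2024-06-22 13:04:08.21 spid133 Error: 18210, Severity: 16, State: 1.",
    "2024-06-24 21:54:07.38 Backup Error: 18210, Severity: 16, State: 1.",
    "2024-06-24 21:54:07.38 Backup Error: 3041, Severity: 16, State: 1.",
    "2024-06-24 21:54:07.38 spid114 Error: 18210, Severity: 16, State: 1.",
]

errors = parse_errors(log_lines)
# Collapse to one timestamp (to the second) per failure sequence.
failure_times = sorted({ts for ts, _, _ in errors})

for ts in failure_times:
    print("backup failure sequence at", ts)
```

The resulting failure times can then be compared against the emergent snapshot start times from the RDS event log.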
These failures look like the case described in bullet point 3 of https://repost.aws/knowledge-center/rds-sql-server-emergent-snapshot-backup:
> For Point in Time Recovery (PiTR), RDS uploads transaction log backups every five minutes for DB instances to Amazon Simple Storage Service (Amazon S3). When RDS doesn't take transactional log backups successfully, an Emergent Snapshot is triggered by RDS to mitigate problems during PiTR.
Microsoft documentation for error 18210 says:
> While the cause can be varied, ultimately the error is due to a failed IO submission to the Operating System.
I'm not sure where to go from here. I can't look at OS event logs because it's RDS. SQL Server reports the following details about itself:
- Product: Microsoft SQL Server Web (64-bit)
- Version: 14.0.3381.3
- OS: Windows Server 2016 Datacenter (10.0)
- Memory: 8090 MB
- Processors: 2
Thanks Gary. To answer your questions: