Steve Borba

My notes. I hope they help you; feel free to comment or add to them.

AWS S3 SQL Backup Tierer

I have some servers writing their backup files into S3, and I wanted something that would tier off copies of them to different retention periods. (I know there are other ways to handle this, like tagging, but this is what we decided on.)

This is how I am doing RDS-based backups to S3:

DECLARE @Suffix  AS VARCHAR(100),
        @S3_Path AS VARCHAR(100),
        @dbname  VARCHAR(100),
        @sql     VARCHAR(MAX),
        @Temp    VARCHAR(100)

SET @S3_Path = 'arn:aws:s3:::SOMEBUCKET/RDS/'

-- Build a _YYYYMMDD_HHMM.bak suffix from the current date and time
SELECT @Suffix = REPLACE(CONVERT(date, GETDATE()), '-', '')
SELECT @Temp   = REPLACE(CONVERT(varchar(5), GETDATE(), 108), ':', '')
SET @Suffix = '_' + @Suffix + '_' + @Temp + '.bak'

-- Loop over the user databases (database_id > 5 skips the system databases here)
DECLARE db_cur CURSOR FOR
SELECT name
  FROM sys.databases
 WHERE database_id > 5

OPEN db_cur
FETCH NEXT FROM db_cur INTO @dbname

WHILE @@FETCH_STATUS = 0
BEGIN
  -- rds_backup_database queues an asynchronous backup task to S3 for each database;
  -- progress can be checked afterwards with msdb.dbo.rds_task_status
  SET @sql = 'msdb.dbo.rds_backup_database @source_db_name=''' + @dbname + ''', @s3_arn_to_backup_to=''' + @S3_Path + @dbname + @Suffix + ''''
  EXEC(@sql)
  FETCH NEXT FROM db_cur INTO @dbname
END

CLOSE db_cur
DEALLOCATE db_cur
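
Once those backups land in the bucket, a quick sanity check from anywhere with boto3 access is just to list the prefix. A minimal sketch, assuming the same SOMEBUCKET placeholder and RDS/ prefix as the script above:

import boto3

# List whatever the backup job has written under the RDS/ prefix
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='SOMEBUCKET', Prefix='RDS/')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])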

This is how we are doing EC2-based backups to S3:

Import-Module -Name AWSPowershell

#Put in sources (can include extensions)
$Source = @('Y:\SQLBackups\TapeBackup', 'Z:\SQLBackups\TapeBackup')

$KeyPrefix = 'Daily/'
#$CommandUse = "PS-Native"
$CommandUse = "AWS-CLI"

#Configure destination/s3 options
$S3Options = @{ BucketName  = '<SOMEBUCKET>'
                Region      = 'us-west-1'
                EndpointURL = 'https://bucket.<VPCEID>.s3.us-west-1.vpce.amazonaws.com'
              }

$ErrorActionPreference = "Stop"
ForEach ($Folder in $Source) {
  ForEach ($File in (Get-ChildItem -Recurse $Folder | Where-Object { $_.GetType().Name -eq "FileInfo" })) {
    #Build the S3 key from the path relative to the source folder
    $Key = $File.FullName.Replace($Folder,"").Replace("\","/")
    If ($Key.Substring(0,1) -eq "/") {
      $Key = $Key.Substring(1, $Key.Length - 1)
    }
    $Key = $KeyPrefix + $Key
    #Only upload files that are not already in the bucket
    If (-not (Get-S3Object -Key $Key @S3Options)) {
      #Write-Host $Key ' was not found.  Writing...'
      Try {
        If ($CommandUse -eq "PS-Native") {
          Write-S3Object -File $File.FullName -Key $Key @S3Options
        } ElseIf ($CommandUse -eq "AWS-CLI") {
          #Shell out to the AWS CLI and capture its output and exit code
          $pinfo = New-Object System.Diagnostics.ProcessStartInfo ; $p = New-Object System.Diagnostics.Process
          $pinfo.RedirectStandardError = $true ; $pinfo.RedirectStandardOutput = $true ; $pinfo.UseShellExecute = $false
          $pinfo.FileName = "aws.exe"
          $pinfo.Arguments = "--region "+$S3Options['Region']+" --endpoint-url "+$S3Options['EndpointURL']+" s3 cp """+$File.FullName+""" ""s3://"+$S3Options['BucketName']+"/$Key"""
          $p.StartInfo = $pinfo
          $p.Start() | Out-Null
          $stdout = $p.StandardOutput.ReadToEnd()
          $stderr = $p.StandardError.ReadToEnd()
          $p.WaitForExit()
          $Result = [pscustomobject]@{
            commandTitle = $pinfo.FileName + " " + $pinfo.Arguments
            stdout       = $stdout
            stderr       = $stderr
            ExitCode     = $p.ExitCode
          }
          If ($p.ExitCode -ne 0) {
            Throw $Result.stderr
          }
        } Else {
          Write-Host "Unknown CommandUse"
        }
      } Catch {
        Write-Host $_.Exception.Message
        Throw
      }
    }
  }
}

Now that we have the files in S3, we need something to copy the files around to the proper folders (I might still change this to tagging instead, since that would be easier 🙂). I chose to do this with a Lambda function. Here is the pseudo code for what I wanted:

Set the date, prefix, and suffix for the backup file
  Treat the prefix and suffix as unique identifiers for the file
Check the folder for an existing file with that prefix and suffix
  Run for the Monthly and Yearly
Copy the file to the appropriate folder(s)

Now we need to make something to make that happen, and I was told Python is the industry way to do Lambda (grr, not my native language, but that's OK, I can also find more community support; also, the guy I had helping me didn't want to do it in his non-native language of PowerShell). I got a skeleton of base code from him/the internet, and it worked well enough and had support to keep weeklies, but it was basic, and as I tried expanding it to real life I hit the problems that always happen when you go from simple to real world.

The first problem happened when my files got bigger than 5 GB. I don't exactly remember how I overcame that, but I think it was by switching to the boto3 resource instead of the boto3 client. The next issue came when I needed to copy 800+ GB, and I found TransferConfig and the botocore max connections setting. That mostly fixed it, but I knew there were going to be potential issues when a file was both a monthly and a yearly. So I decided I needed to learn how to do those tasks in parallel (bonus: I learn something new, and the money savings could add up to a whole dollar over the lifetime of the system). Enter Python multiprocessing, and thank you community for already discovering/sharing that Pipe appears to be the least problematic method in Lambda.
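
For reference, here is roughly what that 5 GB hurdle looks like in boto3 terms. This is just a minimal sketch with made-up bucket/key names: copy_object is a single-request copy, which S3 caps at 5 GB, while the managed copy() does a multipart copy under the hood and accepts the same TransferConfig used in the final code below.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
# Hypothetical source object, just for illustration
source = {'Bucket': 'SOMEBUCKET', 'Key': 'Daily/BigDb_20230101_013000.bak'}

# Single-request copy: errors out once the object is over 5 GB
#s3.copy_object(CopySource=source, Bucket='SOMEBUCKET', Key='Monthly/BigDb_20230101_013000.bak')

# Managed copy: switches to multipart copies for large objects and honors TransferConfig
s3.copy(source, 'SOMEBUCKET', 'Monthly/BigDb_20230101_013000.bak',
        Config=TransferConfig(max_concurrency=150))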

I am not going to dig up and share the old, problematic versions; here is the final product:

print('Loading function')
import urllib.parse
import boto3
import re
import botocore
from multiprocessing import Process, Pipe
from boto3.s3.transfer import TransferConfig

#Setup common variables
concurrency = 150
s3 = boto3.client('s3', config=botocore.config.Config(max_pool_connections=concurrency) )
s3Resource = boto3.resource('s3', config=botocore.config.Config(max_pool_connections=concurrency) )
config = TransferConfig(max_concurrency=concurrency)

def lambda_handler(event, context):
    # Get/store object information from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    copy_source = {
        'Bucket': bucket,
        'Key': key
    }
    #Using a second bucket for this use case
    bucket2 = 'aces-backup-tier'
    bucketobj = s3Resource.Bucket(bucket2)


    try:
        #Find the prefix and suffix of the file based on the middle being _YYYYMMDD_HHMMSS
        res = re.search(r'_\d{8}_\d{6}', key)
        year_start  = 'Yearly/' + key[:res.start() + 5]
        month_start = 'Monthly/' + key[:res.start() + 7]
        file_end    = key[res.start() + 16:]

        #Assume action of copy until a file with the same prefix and suffix exists
        monthly_copy = True
        for obj in bucketobj.objects.filter(Prefix=month_start):
            if obj.key.endswith(file_end):
                monthly_copy = False

        #Assume action of copy until a file with the same prefix and suffix exists
        yearly_copy = True
        for obj in bucketobj.objects.filter(Prefix=year_start):
            if obj.key.endswith(file_end):
                yearly_copy = False

        #Spawn the yearly copy job in its own process, with a Pipe so it can report errors back
        if yearly_copy:
            print("Yearly backup")
            destination_key = "Yearly/" + key
            parent_conn, child_conn = Pipe()
            p = Process(target=copy_thread, args=(child_conn, copy_source, bucket2, destination_key))
            p.start()

        #Do the actual monthly copy job here (in this process)
        if monthly_copy:
            print("Monthly backup")
            destination_key = "Monthly/" + key
            s3Resource.meta.client.copy(copy_source, bucket2, destination_key, Config=config)

        #If yearly, rejoin the spawned process and surface any error it sent back
        if yearly_copy:
            p.join()
            if parent_conn.poll():
                child_error = parent_conn.recv()
                if child_error is not None:
                    raise Exception(child_error)

        return key
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

def copy_thread(conn, copy_source, bucket, destination_key):
    #Runs in the spawned process; reports success (None) or the error text back over the Pipe
    try:
        s3Resource.meta.client.copy(copy_source, bucket, destination_key, Config=config)
        conn.send(None)
    except Exception as e:
        print(e)
        print('Error copying object {} to bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(destination_key, bucket))
        conn.send(str(e))
    finally:
        conn.close()
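
If I ever do make the switch to the tagging idea mentioned above, the Lambda could shrink to just tagging the incoming object and letting lifecycle rules filtered on those tags handle the retention. A rough sketch, reusing the s3 client and the bucket/key from the handler above (the tag keys are hypothetical, nothing I actually have deployed):

# Tag the object instead of copying it; lifecycle rules filtered on these
# (hypothetical) tags would then own the retention periods.
s3.put_object_tagging(
    Bucket=bucket,
    Key=key,
    Tagging={'TagSet': [
        {'Key': 'retention-monthly', 'Value': 'true'},
        {'Key': 'retention-yearly', 'Value': 'true'},
    ]}
)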
