S3 Cross Region Migration


In this post I will go through the steps to do an S3 cross region migration, or to start replication if you want to keep the buckets synced.

What will we use today?

  1. AWS Cross Region Replication
  2. Python + Boto
  3. A simple script
  4. ???
  5. Profit

AWS Cross Region Replication

Amazon Web Services launched a wonderful service called CRR (Cross Region Replication) around 4 years ago. Cross-region replication (CRR) enables automatic, asynchronous copying of objects across buckets in different AWS Regions.

Lovely, no? We just go into our console, follow the simple steps to turn on CRR, and we are done. Buuuut there is a little problem with our plan, my dear reader. If you dig a little deeper into the documentation, you will see this dreadful line under:

What is replicated?

Objects created after you add a replication configuration, with exceptions described in the next section.

Which means that none of your existing data will be replicated! This doesn’t matter if you are starting from zero, but if you already have a lot of data it is far from ideal.

So the rest of this post is about how to make CRR think that your old files are actually new files, so that CRR replicates them.

So before you move on, please enable CRR, and make sure you add a lifecycle rule to delete previous versions of your S3 objects or your storage will grow quite fast.
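If you prefer to set up that cleanup rule in code instead of the console, here is a minimal sketch of what it could look like with boto3. The bucket name, rule ID and one-day retention are placeholders, so adapt them to your own setup.

# A sketch of a lifecycle rule that expires previous (noncurrent) object versions
import boto3

s3 = boto3.client('s3')  # uses your default AWS credentials
s3.put_bucket_lifecycle_configuration(
    Bucket='Your_Bucket',  # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-old-versions',  # placeholder rule name
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},  # apply to the whole bucket
            # Delete noncurrent versions one day after they are replaced
            'NoncurrentVersionExpiration': {'NoncurrentDays': 1},
        }]
    },
)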

Python + Boto

For this tomfoolery you will need to install Python and the boto3 library. (These links should take you straight to the installation instructions, in case you don’t have them already.)

You can run the script on either your laptop or an external server (e.g. EC2); that depends on you and the volume of data.

With boto3 we can talk to AWS directly from Python. The first step is to give the script the necessary permissions, and for that you need a pair of access keys. Please follow the AWS tutorials to get a nice pair of keys.

A simple script

So the trick to fool CRR is quite simple. What the script does is add a piece of metadata to each object in the bucket: a key called CRR stamped with the day and time the script runs. Since S3 metadata can’t be edited in place, this means copying each object onto itself with the new metadata.

Because the metadata changed, CRR treats the object as new and replicates it! So you only need to run this script once per bucket and voilà, CRR will do the rest.

There are three places where you need to add your own information in the script.

1. The access keys mentioned above

# boto3 is the AWS SDK for Python
import boto3
# Start a session with your keys
session = boto3.Session(
    aws_access_key_id='AWS_SERVER_PUBLIC_KEY',
    aws_secret_access_key='AWS_SERVER_SECRET_KEY',
)

2. The bucket name

# The bucket that you want to copy
bucket = "Your_Bucket"

3. The metadata you wish to add
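This can be as simple as a timestamp under a key called CRR; the key name and format here are just an example, so use whatever makes sense to you.

from datetime import datetime

# The metadata the script will stamp on every object
new_metadata = {'CRR': datetime.now().strftime('%Y-%m-%d %H:%M:%S')}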

And that is all you need! So, at last, here is the script:
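Or, to be exact, a minimal sketch of it: it walks every object in the bucket, copies each one onto itself with the new CRR metadata, and prints the metrics mentioned in the Profit section. The key name, timestamp format and output messages are assumptions, so treat it as a starting point.

# Re-stamp every object in a bucket so CRR treats it as new
import time
from datetime import datetime

import boto3

# 1. The access keys mentioned above
session = boto3.Session(
    aws_access_key_id='AWS_SERVER_PUBLIC_KEY',
    aws_secret_access_key='AWS_SERVER_SECRET_KEY',
)
s3 = session.client('s3')

# 2. The bucket that you want to copy
bucket = 'Your_Bucket'

# 3. The metadata you wish to add
new_metadata = {'CRR': datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

start = time.time()
objects_modified = 0
bytes_modified = 0

# Page through every object in the bucket
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        # Copy the object onto itself, replacing its metadata. This creates a
        # new version of the object, which CRR then picks up and replicates.
        # Note: REPLACE also resets headers like ContentType unless you pass them again.
        s3.copy_object(
            Bucket=bucket,
            Key=obj['Key'],
            CopySource={'Bucket': bucket, 'Key': obj['Key']},
            Metadata=new_metadata,
            MetadataDirective='REPLACE',
        )
        objects_modified += 1
        bytes_modified += obj['Size']

# The metrics mentioned in the Profit section
elapsed = time.time() - start
print('Objects modified: {}'.format(objects_modified))
print('GBs modified: {:.2f}'.format(bytes_modified / 1024 ** 3))
print('Execution time: {:.0f} seconds'.format(elapsed))

One caveat: copy_object only works for objects up to 5 GB, so anything larger would need a multipart copy, which this sketch does not attempt.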

???

To run it, just save it on the computer you wish to use and, in a terminal, run


python my_script.py

Profit

At the end, if you left my metrics in, you should be greeted with how many objects were modified, how many GBs that was, and the execution time. For really big buckets I do recommend removing those parts of the code.

Just so you have an idea: in my actual use of this script it took around 42 seconds per GB on average.

And zero data loss, haha, just in case you were wondering.