AWS Solution: Counting S3 usage per user

Written by Michael Anckaert
Posted in AWS, Cloud, Development

In this article I'll walk through a possible architecture for counting S3 usage per user. A possible use case could be that you have a requirement to keep track of how much data each of your users is storing in S3.


Solution overview

The diagram below shows the architecture of the solution.

Solution Diagram

Every S3 create or remove event will trigger the execution of a Lambda function. This function updates the usage for that particular user in DynamoDB. Our solution requires files to be stored in the S3 bucket under a particular prefix, so we can determine the user from the S3 event: a file (or object, in S3 parlance) must be uploaded under a key of the form user-identifier/filename.ext. Our Lambda function will extract the first part of the object key as the user identifier, as shown in the short sketch below.
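
A minimal Python sketch of that convention (the key shown is illustrative):

object_key = "user-123/invoices/2021-04.pdf"
user_id = object_key.split("/")[0]  # -> "user-123"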

The code for this solution can be found in this GitHub repository: https://github.com/MichaelAnckaert/aws-solution-count-s3-usage. In the rest of this post I'll walk you through the entire solution.

Terraform setup

We will be using Terraform to set up this solution. For simplicity, I've put the entire solution in a single Terraform module.

Have a look at the file main.tf. After setting up some Terraform and AWS specific options, we create our S3 bucket. This is basic code, the kind you would see in the simplest of Terraform examples.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "bucket" {
  bucket = "sinax-test-bucket1"
  acl    = "private"

  tags = {
    Environment = "Dev"
  }
}

AWS Lambda setup

Next we create our Lambda function, add an execution role and attach the AWS managed policy that is required for basic Lambda usage.

The IAM role bucket-usage-role that we define first is the Lambda execution role. This is the role the Lambda function assumes when it runs so it can talk to other AWS services.

Next we attach an AWS managed policy, AWSLambdaBasicExecutionRole (despite its name, it is a policy, not a role). It grants the basic permissions every function needs, such as writing CloudWatch logs.

Finally we create the Lambda function itself. We pass the filename of the deployment package that contains the Lambda source code. We'll use Python for our Lambda function, so we specify the matching runtime.

resource "aws_iam_role" "lambda_role" {
  name = "bucket-usage-role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "attach_lambda_role" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_lambda_function" "lambda" {
  filename      = "lambda_function_payload.zip"
  function_name = "bucket-usage"
  role          = aws_iam_role.lambda_role.arn
  handler       = "lambda_function.lambda_handler"

  source_code_hash = filebase64sha256("lambda_function_payload.zip")

  runtime = "python3.8"

  tags = {
    Environment = "Dev"
  }
}

The next step is configuring our S3 bucket to invoke our Lambda function. This is done by creating an aws_lambda_permission resource, which allows S3 to call the function. Then we configure an aws_s3_bucket_notification resource, which tells S3 to trigger our Lambda function for the events we have specified.

resource "aws_lambda_permission" "allow_bucket" {
  statement_id  = "AllowExecutionFromS3Bucket"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.lambda.arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.bucket.arn
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = aws_s3_bucket.bucket.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.lambda.arn
    events              = ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"]
  }

  depends_on = [aws_lambda_permission.allow_bucket]
}
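
Once terraform apply has run, you can verify the wiring end to end by uploading and deleting a test object that follows the key convention. Below is a quick sketch using boto3; the key and body are illustrative.

import boto3

s3 = boto3.client("s3")

# Uploading under "user-identifier/filename" fires an ObjectCreated:Put event,
# which should invoke the bucket-usage Lambda function.
s3.put_object(
    Bucket="sinax-test-bucket1",
    Key="user-123/hello.txt",
    Body=b"hello world",
)

# Deleting the object again fires an ObjectRemoved:Delete event.
s3.delete_object(Bucket="sinax-test-bucket1", Key="user-123/hello.txt")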

DynamoDB setup

The last missing piece of our solution is DynamoDB.

In our solution we need two tables: one for storing the total bucket usage per user, and a second for storing file details. An ideal design would use a single table holding the total number of bytes used by each user. The S3 ObjectCreated event includes the size of the uploaded object, so we could simply add that to the stored total. Unfortunately, the S3 ObjectRemoved event doesn't include the object size, so we need a way to determine how many bytes to subtract from the user's total. That is what the second table is for: when a user uploads a 100-byte file we record its size in the Files table and add 100 bytes to their total, and when they delete that file we look up its size in Files and subtract it again. The reason we store a running total per user, rather than doing a sum aggregation whenever we need this information, is simply speed and throughput: a sum aggregation in DynamoDB requires a scan across all records, a costly and inefficient operation!

Let's set up the two tables we will be needing. I won't go into too much detail; DynamoDB is a powerful but complex service, and correctly designing your data model and application can be hard. Note that since DynamoDB is schemaless we only need to define our keys; all other attributes can be added by our application as it stores data in the tables.

The first table, Users, stores a UserId and the total number of bytes used by that user. The second table, Files, stores the UserId, the FileName and the size of each file.

resource "aws_dynamodb_table" "user_table" {
  name           = "Users"
  billing_mode   = "PROVISIONED"
  read_capacity  = 20
  write_capacity = 20
  hash_key       = "UserId"

  attribute {
    name = "UserId"
    type = "S"
  }

  tags = {
    Environment = "dev"
  }
}

resource "aws_dynamodb_table" "file_table" {
  name           = "Files"
  billing_mode   = "PROVISIONED"
  read_capacity  = 20
  write_capacity = 20
  hash_key       = "UserId"
  range_key      = "FileName"

  attribute {
    name = "UserId"
    type = "S"
  }

  attribute {
    name = "FileName"
    type = "S"
  }

  tags = {
    Environment = "dev"
  }
}
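
With these tables in place, reading a user's current usage is a single key lookup rather than an aggregation. Here is a minimal boto3 sketch, assuming the Size attribute our Lambda function maintains; the user id is illustrative.

import boto3

dynamodb = boto3.client("dynamodb")

# Fetch the running total for a single user.
response = dynamodb.get_item(
    TableName="Users",
    Key={"UserId": {"S": "user-123"}},
)

item = response.get("Item")
total_bytes = int(item["Size"]["N"]) if item else 0
print(f"user-123 is currently storing {total_bytes} bytes")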

The last missing piece in our Terraform module is setting up the DynamoDB permissions so our Lambda function can perform the required actions on our tables. Rather than granting blanket DynamoDB access, the policy below allows only the actions the function actually uses (PutItem, DeleteItem and UpdateItem) on the two tables we just created, and we attach it to the Lambda execution role.

resource "aws_iam_policy" "dynamodb_policy" {
  name        = "dynamodb-bucket-usage-policy"
  description = "Allow read and write access to the dynamodb table containing bucket usage statistics"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "dynamodb:**"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "attach_dynamodb_policy" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = aws_iam_policy.dynamodb_policy.arn
}

Lambda function

Let's have a look at the code of our Lambda function. As with the Terraform module, I've split the file into sections and will go over each one as we build up our mental model of the code. The full source on GitHub gives a complete picture that you can easily scroll through.

The Lambda handler function is not that complex. Its purpose is to loop over all records passed to the Lambda function. This is required (and often forgotten by developers new to Lambda) because a single Lambda invocation can receive multiple events. Suppose two files are uploaded to our S3 bucket at the same time; it's quite possible that a single invocation of our Lambda function receives two ObjectCreated events. So for every event (a record) the handler extracts the relevant information and calls the Python function that handles that type of event.
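
For reference, a single record looks roughly like this, shown as a Python dict and trimmed to the fields we actually use (the values are illustrative; note that the object key arrives URL-encoded, which is why the handler unquotes it):

example_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "sinax-test-bucket1"},
                "object": {"key": "user-123/invoice.pdf", "size": 1048576},
            },
        }
    ]
}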

import json
import urllib.parse

import boto3

# DynamoDB client used by the functions below to track usage.
dynamodb = boto3.client("dynamodb")


def lambda_handler(event, context):
    # Uncomment the line below to debug the event received
    # print("Received event: " + json.dumps(event, indent=2))

    for record in event['Records']:
        event_name = record['eventName']
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'], encoding='utf-8')
        size = record['s3']['object'].get('size', None)
        print(f"Got a {event_name} event for object '{key}' in bucket '{bucket}'. Reported size is {size or 'unknown'}")
        if event_name == "ObjectCreated:Put":
            save_object_size(key, size)
        elif event_name == "ObjectRemoved:Delete":
            remove_object_size(key)

Let's take a look at our two workhorse functions, save_object_size and remove_object_size. They follow the same structure, but save_object_size creates an item in the Files table while remove_object_size deletes one.

Both functions call the update_usage function, which updates the total usage for a user. You will also notice that we extract the user_id from the object key, as described in the solution overview above.

A notable detail in remove_object_size is that it uses ReturnValues="ALL_OLD" on delete_item, a DynamoDB feature that returns the item that was just deleted. So by deleting an item from our table we also get back its attributes, including the size of the object, which we then pass on (negated) to the update_usage function.

def save_object_size(object_key, size):
    user_id = object_key.split("/")[0]

    response = dynamodb.put_item(
        TableName="Files",
        Item={
            "UserId": {"S": user_id},
            "FileName": {"S": object_key},
            "Size": {"N": str(size)},
        },
    )

    update_usage(user_id, str(size))


def remove_object_size(object_key):
    user_id = object_key.split("/")[0]

    response = dynamodb.delete_item(
        TableName="Files",
        Key={"UserId": {"S": user_id}, "FileName": {"S": object_key}},
        ReturnValues="ALL_OLD",
    )

    size = int(response["Attributes"]["Size"]["N"])

    update_usage(user_id, str(0 - size))

The final piece of code in our Lambda function is the update_usage function. It receives a user identifier and the number of bytes by which that user's usage has changed. If a user has used an extra 50 bytes, we pass 50 in the size_change parameter; if the user has freed 100 bytes, we pass -100 to adjust the total downwards.

The code uses the update_item operation to modify an item in our DynamoDB table. A handy feature of DynamoDB is update expressions, which let us change one or more attributes based on an expression. In the code below, we add a value to the Size attribute of the item selected by the given user_id. Because Size is a reserved word in DynamoDB expressions, we refer to it through an expression attribute name.

def update_usage(user_id, size_change):
    response = dynamodb.update_item(
        TableName="Users",
        Key={
            "UserId": {"S": user_id},
        },
        # "Size" is a reserved word in DynamoDB expressions, so alias it.
        UpdateExpression="ADD #size :size",
        ExpressionAttributeNames={"#size": "Size"},
        ExpressionAttributeValues={":size": {"N": size_change}},
    )

Conclusion

In this solution we combined a number of different AWS services and made use of best practices such as Infrastructure as Code (IaC). The final code in the GitHub repository isn't perfect: for simplicity I have omitted error handling, and there are plenty of hard-coded values that should ideally be passed in as environment (or Terraform) variables. I'll leave these as an exercise for the reader.