Build a Slack Clone with an API Gateway WebSockets API

See the full example on GitHub

In a previous blog post I walked through the process of creating a Next.js application with a very simple WebSocket connection for streaming updates from the server. In that post, I explained that the task of running your own WebSocket-capable server might be a burden for some people, especially if you have a mostly serverless deployment model.

So I got to thinking, "Is there a way to run a WebSocket-based service in a serverless way?" There are SaaS vendors that offer real-time messaging as a service, like Ably and Pusher. But I was especially interested in AWS' API Gateway WebSockets API product, which offers a serverless WebSocket broker with a lot of options for developers to customize how those WebSocket connections fit into the rest of their application.

I decided to build a clone of Slack to demonstrate some real-world communication scenarios using only API Gateway, Lambda, and DynamoDB.

Overview

Several browser windows, each running the Slack clone demo

The demo consists of a front-end Next.js app that makes a WebSocket connection to an API Gateway WebSockets API and uses that connection to stream updates about new messages, typing indicators, and which users have an active connection.

The WebSockets API forwards every request to one of two Lambda functions. One handles when clients connect and disconnect, and the other handles all other application events. Dividing this work between two Lambdas was an arbitrary choice I made; you might choose to split work between more functions if each of your API operations has different scaling needs. Conversely, you might choose to keep API operations combined into a single Lambda to improve your cold-start ratio.

Each Lambda stores state in a DynamoDB table, which holds both long-term information about channels and users and stateful information about the ephemeral WebSocket connections that are currently open. Depending on the API operation, these Lambdas will also publish events to relevant WebSocket connections. They handle this inbound and outbound communication directly; a more advanced application might choose to decouple these tasks by publishing an event to an event bus, then letting a separate task handle the outbound communications.

Create the infrastructure

TL;DR: do everything in this section by running npm ci && npm run publish in the demo repo.

I've created a couple CloudFormation templates to provision everything this demo needs, but I'll walk through some of the resources one-by-one to explain them.

  1. Prep step: Create an S3 bucket for Lambda assets.

    These resources aren't really part of the demo itself, but they make deploying it easier. I created an S3 bucket into which I can write Lambda code as ZIP files, and recorded the name of the bucket in AWS Systems Manager Parameter Store.

    S3Bucket:
      Type: AWS::S3::Bucket
      Properties:
        AccessControl: Private
        VersioningConfiguration:
          Status: Enabled
        LifecycleConfiguration:
          Rules:
            - Id: DeleteOldVersions
              Status: Enabled
              NoncurrentVersionExpirationInDays: 30
              AbortIncompleteMultipartUpload:
                DaysAfterInitiation: 7
    
    Parameter:
      Type: AWS::SSM::Parameter
      Properties:
        Name: !Sub /${ProjectName}/s3-bucket
        Type: String
        Value: !Ref S3Bucket
    
  2. Build and upload the Lambda code.

    Bundle, zip, and upload the code for each of the Lambda functions we're about to create.

    npm -w packages/test-connect run build && npm -w packages/test-connect run upload
    npm -w packages/test-default run build && npm -w packages/test-default run upload
    

    These npm scripts use esbuild to bundle the handler code into a single file, compress that file into a .zip file, then upload the .zip file to S3. When bundling with esbuild, I specify --external:@aws-sdk* to omit the AWS SDKs from the bundle, because those packages are already globally available as part of the nodejs20.x Lambda runtime.

  3. Create the DynamoDB table.

    This table uses pk and sk as its partition key and sort key fields, respectively. This accommodates the single-table design pattern, which I'll explain more later. I'll also create an inverted index to support a variety of relational queries. (In the inverted index, sk is the partition key and pk is the sort key, thus inverting the main index pattern)

    DynamoDBTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: !Ref ProjectName
        BillingMode: PAY_PER_REQUEST
        KeySchema:
          - AttributeName: pk
            KeyType: HASH
          - AttributeName: sk
            KeyType: RANGE
        GlobalSecondaryIndexes:
          - IndexName: sk-index
            KeySchema:
              - AttributeName: sk
                KeyType: HASH
            Projection:
              ProjectionType: ALL
          - IndexName: conn
            KeySchema:
              - AttributeName: conn
                KeyType: HASH
            Projection:
              ProjectionType: KEYS_ONLY
        AttributeDefinitions:
          - AttributeName: pk
            AttributeType: S
          - AttributeName: sk
            AttributeType: S
          - AttributeName: conn
            AttributeType: S
        TimeToLiveSpecification:
          AttributeName: ttl
          Enabled: true
    

    I've also created an index on the conn attribute to support pruning connection records when a user disconnects, and a time to live on the ttl attribute to prune messages that are over a day old. I'm only doing the TTL thing to avoid growing storage costs for the demo; you could easily allow DynamoDB to store multiple terabytes worth of messages, or implement a tiered storage system that migrates older messages to S3 when they reach a certain age.
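    The TTL math is simple: DynamoDB expects the ttl attribute to be an epoch timestamp in seconds, after which the item becomes eligible for deletion. A sketch of a one-day retention helper (messageTtl is a name I made up for illustration):

```javascript
// Hypothetical helper for computing the ttl attribute described above.
// DynamoDB's TTL feature expects an epoch timestamp in *seconds* and deletes
// the item some time after that timestamp has passed.
const ONE_DAY_SECONDS = 24 * 60 * 60;

function messageTtl(nowMs = Date.now()) {
  return Math.floor(nowMs / 1000) + ONE_DAY_SECONDS;
}

// A message created at epoch millisecond 1703806209644 expires a day later.
console.log(messageTtl(1703806209644)); // 1703892609
```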

  4. Create the (empty) API Gateway.

    There's not a lot to do to make an API Gateway on its own. This is only an empty API and production stage, and I'll associate more resources with it later.

    ApiGateway:
      Type: AWS::ApiGatewayV2::Api
      Properties:
        Name: !Ref ProjectName
        ProtocolType: WEBSOCKET
        RouteSelectionExpression: "$request.body.event"
    
    ApiGatewayStage:
      Type: AWS::ApiGatewayV2::Stage
      Properties:
        ApiId: !Ref ApiGateway
        StageName: production
        AutoDeploy: true
        DefaultRouteSettings:
          ThrottlingRateLimit: 10
          ThrottlingBurstLimit: 100
        StageVariables:
          DYNAMODB_TABLE_NAME: !Ref DynamoDBTable
    

    I'm configuring the API Gateway to parse each incoming request as JSON and to use the event property of that message to route requests. I'm also configuring the deployment stage to limit clients to about 10 requests per second, while allowing temporary bursts above that limit. I'm pretty sure that limit is applied per connection and not to all connections in aggregate.

  5. Create an IAM role for the Lambda handlers.

    Both Lambdas will share the same IAM role, which has policies that allow it to write its logs to CloudWatch Logs (a basic Lambda requirement), write traces to AWS X-Ray, and to do the necessary read/write operations in DynamoDB and API Gateway.

    LambdaRole:
      Type: AWS::IAM::Role
      Properties:
        RoleName: !Sub ${ProjectName}-lambdas
        AssumeRolePolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Principal:
                Service:
                  - lambda.amazonaws.com
              Action:
                - sts:AssumeRole
        Policies:
          - PolicyName: "LambdaBasicExecution"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: Allow
                  Action:
                    - logs:CreateLogGroup
                  Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
                - Effect: Allow
                  Action:
                    - logs:CreateLogStream
                    - logs:PutLogEvents
                  Resource:
                    - !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/${ProjectName}*:*"
                - Effect: Allow
                  Action:
                    - xray:PutTraceSegments
                    - xray:PutTelemetryRecords
                  Resource: "*"
          - PolicyName: "ResourceAccess"
            PolicyDocument:
              Version: "2012-10-17"
              Statement:
                - Effect: Allow
                  Action:
                    - dynamodb:Query
                    - dynamodb:Scan
                    - dynamodb:GetItem
                    - dynamodb:PutItem
                    - dynamodb:UpdateItem
                    - dynamodb:DeleteItem
                    - dynamodb:BatchGetItem
                    - dynamodb:BatchWriteItem
                  Resource:
                    - !GetAtt DynamoDBTable.Arn
                    - !Sub "${DynamoDBTable.Arn}/index/*"
                - Effect: Allow
                  Action:
                    - execute-api:Invoke
                    - execute-api:ManageConnections
                  Resource: !Sub "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${ApiGateway}/*/*/*"
    
  6. Create the Lambda handlers and supporting resources.

    I'll create a log group with a conservative retention policy (again, trying to control costs) and then create the Lambda function itself. After that, I'll create a resource policy to allow API Gateway to invoke the Lambda, and register the Lambda function as an "integration" with API Gateway, so it can be used with a variety of routes.

    ConnectLogGroup:
      Type: AWS::Logs::LogGroup
      Properties:
        LogGroupName: !Sub "/aws/lambda/${ProjectName}-connect"
        RetentionInDays: 7
    
    ConnectLambda:
      Type: AWS::Lambda::Function
      Properties:
        FunctionName: !Sub ${ProjectName}-connect
        Handler: index.handler
        Runtime: nodejs20.x
        Architectures:
          - arm64
        TracingConfig:
          Mode: Active
        MemorySize: 1800
        Role: !GetAtt LambdaRole.Arn
        Code:
          S3Bucket: !Ref AssetBucketName
          S3Key: connect.zip
        Environment:
          Variables:
            TABLE_NAME: !Ref DynamoDBTable
        LoggingConfig:
          LogGroup: !Ref ConnectLogGroup
    
    ConnectPermission:
      Type: AWS::Lambda::Permission
      Properties:
        Action: lambda:InvokeFunction
        FunctionName: !Ref ConnectLambda
        Principal: apigateway.amazonaws.com
        SourceArn: !Sub "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${ApiGateway}/*/*"
    
    ConnectIntegration:
      Type: AWS::ApiGatewayV2::Integration
      Properties:
        ApiId: !Ref ApiGateway
        IntegrationType: AWS_PROXY
        IntegrationUri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${ConnectLambda.Arn}/invocations"
        IntegrationMethod: POST
        PayloadFormatVersion: "1.0"
    

    There's a lot going on there:

    • I'm manually creating the log group so that I can control its retention policy. If I didn't do this, AWS would create a log group automatically, but that log group would be configured to store all its logs forever and would not be deleted after I delete the other resources.
    • I'm using the arm64 (AKA "Graviton") architecture for the better price-performance, and because I have no x86-specific dependencies.
    • I'm setting the function memory to 1,800 MB in order to be slightly above the magic threshold of 1,769 MB at which Lambdas have 1 vCPU of compute capacity.
    • The wildcards when configuring API Gateway permissions are a little opaque to me; GitHub Copilot helped me write them, and I'm honestly not sure how I would have arrived at the correct values on my own.
  7. Register the Lambdas with API Gateway routes.

    The ConnectIntegration resource I created before doesn't actually modify the API Gateway API; it only indicates that the Lambda is available as an integration. I'll make routes for each API operation that I want to associate with one of the Lambdas.

    ConnectRoute:
      Type: AWS::ApiGatewayV2::Route
      Properties:
        ApiId: !Ref ApiGateway
        RouteKey: $connect
        Target: !Sub integrations/${ConnectIntegration}
    
    DisconnectRoute:
      Type: AWS::ApiGatewayV2::Route
      Properties:
        ApiId: !Ref ApiGateway
        RouteKey: $disconnect
        Target: !Sub integrations/${ConnectIntegration}
    
    SendMessageRoute:
      Type: AWS::ApiGatewayV2::Route
      Properties:
        ApiId: !Ref ApiGateway
        RouteKey: sendMessage
        Target: !Sub integrations/${DefaultIntegration}
    

    For each route, RouteKey corresponds to either the $connect or $disconnect defaults built into API Gateway, or to the value of the event property in the request body.

DynamoDB data model

The application uses a single DynamoDB table that leverages the single-table design pattern. In single-table design, you overload the DynamoDB partition and sort keys to allow them to represent multiple collections of items, as well as relationships between items. I do this by including additional information about the type of item being represented in the pk and sk columns. For example, I have a record like {"pk": "ROOM#general", "sk": "ROOM"} that represents a chat room itself, and additional records like {"pk": "ROOM#general", "sk": "USER#<userId>"} that represent each user that has joined the room. The general pattern that I followed is that records like {"pk": "THING#<id>", "sk": "THING"} represent the thing itself along with its metadata, and records like {"pk": "THING#<id>", "sk": "RELATION#<id>"} represent an item related to the original thing.
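That key convention is easy to capture in a few small builders. This is a sketch with names I made up; the demo repo may organize things differently:

```javascript
// Hypothetical key builders following the THING#<id> convention above.
const roomItemKey = (roomId) => ({ pk: `ROOM#${roomId}`, sk: "ROOM" });
const roomMemberKey = (roomId, userId) => ({ pk: `ROOM#${roomId}`, sk: `USER#${userId}` });
const roomConnKey = (roomId, connId) => ({ pk: `ROOM#${roomId}`, sk: `CONN#${connId}` });
const roomMessageKey = (roomId, ts) => ({ pk: `ROOM#${roomId}`, sk: `MESSAGE#${ts}` });

console.log(roomMemberKey("general", "abcdefghijkl"));
// { pk: 'ROOM#general', sk: 'USER#abcdefghijkl' }
```

Centralizing key construction like this keeps the overloaded pk/sk encoding in one place instead of scattering template strings across every handler.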

Following the single-table design pattern, I have two generic columns named pk and sk that correspond to the partition key and sort key, respectively. I also created an inverted index named sk-index that allows me to query on the sk column directly, which helps me make a variety of queries to list the items related to another item.

As of this writing, I have a fairly large mistake in my data model. The main purpose of single-table design, according to Alex DeBrie's writing, is to avoid making multiple serial calls to get a whole set of related data (this is basically DynamoDB's version of SQL joins). Unfortunately, that's exactly what my app is doing, because I'm only storing foreign keys in most records and then making additional queries to look up the related items.

If I get around to it, I'll fix this by including more data about the related entities in the item itself.

Rooms

Much of the data model revolves around rooms, which I use to represent both channels and DMs. I also need to know which users, WebSocket connections, and messages are associated with each room.

pk            sk                     conn             message        ttl         user
ROOM#general  ROOM
ROOM#general  USER#abcdefghijkl
ROOM#general  CONN#Q000000000        CONN#Q000000000
ROOM#general  MESSAGE#1703806209644                   Hello, world!  1703892609  USER#abcdefghijkl

This model supports the following access patterns:

  • List rooms (SELECT * FROM "table"."sk-index" WHERE sk = 'ROOM')
  • List all users in a room (SELECT * FROM "table" WHERE pk = 'ROOM#general' AND begins_with(sk, 'USER#'))
  • List all WebSocket clients subscribed to a room (SELECT * FROM "table" WHERE pk = 'ROOM#general' AND begins_with(sk, 'CONN#'))
  • List all messages in a room (SELECT * FROM "table" WHERE pk = 'ROOM#general' AND begins_with(sk, 'MESSAGE#'))
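In SDK terms (rather than PartiQL), each begins_with pattern becomes a Query key condition. A sketch of the input for "list all messages in a room" (listMessagesQuery is a name I made up; the parameter shape follows the DynamoDB Query API):

```javascript
// Sketch: build the Query input for "list all messages in a room".
// In the real Lambda, the table name would come from the environment.
function listMessagesQuery(roomId, tableName) {
  return {
    TableName: tableName,
    KeyConditionExpression: "pk = :pk AND begins_with(sk, :prefix)",
    ExpressionAttributeValues: {
      ":pk": `ROOM#${roomId}`,
      ":prefix": "MESSAGE#",
    },
  };
}

console.log(listMessagesQuery("general", "serverless-slack-clone"));
```

Swapping the ":prefix" value for "USER#" or "CONN#" yields the other two room queries from the list above.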

Direct messages are rooms just like any other. Their IDs are generated by sorting and then concatenating the user IDs of each member in the conversation, like ROOM#<userId1>#<userId2>. I'm not currently enforcing any permissions to prevent third parties from reading the messages in a DM. šŸ˜¬
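That ID scheme can be expressed as a one-liner; sorting first makes the room ID the same regardless of which user starts the conversation (dmRoomId is a name I made up, and the user IDs below are placeholders):

```javascript
// DM room IDs: sort the two user IDs, then concatenate, per the scheme above.
// Sorting makes the ID independent of argument order.
function dmRoomId(userA, userB) {
  return ["ROOM", ...[userA, userB].sort()].join("#");
}

console.log(dmRoomId("mmmmmmmmmmmm", "abcdefghijkl"));
// ROOM#abcdefghijkl#mmmmmmmmmmmm
```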

Users

Users are persistent entities, and I use their records to track things like presence (the last time they were seen) and rate limiting (to ensure they aren't spamming messages). The user collection is also a handy place to record a list of the WebSocket connections open for that user so I can notify the user of certain events.

pk                 sk               handle  presence    sendMessageLimit  sendMessageTtl
USER#abcdefghijkl  USER             alice   1703806209  58                1703806269
USER#abcdefghijkl  CONN#Q000000000

This model supports the following access patterns:

  • Get details about a specific user (SELECT * FROM "table" WHERE pk = 'USER#abcdefghijkl' AND sk = 'USER')
  • List all users (SELECT * FROM "table"."sk-index" WHERE sk = 'USER')
  • List all WebSocket connections for a given user (SELECT * FROM "table" WHERE pk = 'USER#abcdefghijkl' AND begins_with(sk, 'CONN#'))

Lambda handlers

All of the business logic of the demo is handled by two Lambda functions. One handles when WebSocket clients connect and disconnect, and the other handles every other message sent over an active connection.

Handling connections

API Gateway has two built-in events, $connect and $disconnect, to handle new connections and terminating connections, respectively. I have the Lambda at packages/test-connect handle both of these. The event payload includes event.requestContext.eventType which is either CONNECT or DISCONNECT, so the function is able to handle each type differently.

The connection Lambda does extremely primitive authentication: it looks for any value in either an Authorization header or an ?authorization query string. It does no validation to determine whether that is a valid authorization token. The only purpose of this "authorization" is to identify which user has connected. After all, it's just a demo!
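A sketch of that lookup (identifyUser is a hypothetical name; the event shape follows API Gateway's proxy payload for WebSocket $connect requests):

```javascript
// Accept any value found in the Authorization header or the ?authorization
// query parameter, and treat it as the user's identity. Header casing can
// vary, so check both common spellings.
function identifyUser(event) {
  const headers = event.headers ?? {};
  const query = event.queryStringParameters ?? {};
  return headers.Authorization ?? headers.authorization ?? query.authorization ?? null;
}

const connectEvent = {
  requestContext: { eventType: "CONNECT", connectionId: "Q000000000" },
  queryStringParameters: { authorization: "abcdefghijkl" },
};
console.log(identifyUser(connectEvent)); // "abcdefghijkl"
```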

When a user connects, the connection Lambda adds items to DynamoDB with the user's connection ID for every room the user has joined. This makes it easier for other event handlers to quickly identify all the connections that are currently subscribed to any given room.

When a user disconnects, the Lambda looks for every record of that connection's ID and deletes it from DynamoDB so that future events stop being sent to it. (Trying to send messages to a closed connection fails pretty gracefully, but it's not very efficient to try to send messages that we know will fail)

Handling incoming messages

After a user connects, they are able to send and receive messages on the open connection. Each of these messages looks something like {"event": "sendMessage", "detail": {} }. The event property is required so that API Gateway will accept and route the request to the Lambda, and the contents of the detail property vary depending on the value of event.

Some of the message types (like listRooms and listUsers) follow a request/response cycle and would work fine in a REST API; I just didn't want to create a second API for this purpose. Each of the event types is just handled by a big switch statement:

  • joinRoom: Creates a record indicating that a user would like to subscribe to messages in a room. Notifies other users in that room that a new user has joined.
  • listMessages: Lists all messages in a room. Used to retrieve the message history when a user first opens a room in their app.
  • listUsers: Lists all users and their presence status. Used to retrieve the current presence status for all users.
  • sendMessage: Sends a message to a room. Notifies all users in that room of the received message.
  • updatePresence: Updates the timestamp at which the user was last actively connected. The app sends this message every minute or so as long as it's connected. Other users may observe this value and show the user as "away" if the presence timestamp is too old. Notifies all users in all rooms of the received message.
  • userTyping: Indicates that a user is composing a message. Notifies all users in the relevant room of the received message.
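In shape, that switch statement looks something like this (handler bodies stubbed out; the real handlers read and write DynamoDB and post to other connections):

```javascript
// Sketch of the default handler's dispatch logic. Each case would call into
// a real implementation; here each just returns a label so the shape is clear.
function dispatch(body) {
  const { event, detail } = JSON.parse(body);
  switch (event) {
    case "joinRoom":       return { handled: "joinRoom", room: detail.room };
    case "listMessages":   return { handled: "listMessages", room: detail.room };
    case "listUsers":      return { handled: "listUsers" };
    case "sendMessage":    return { handled: "sendMessage", room: detail.room };
    case "updatePresence": return { handled: "updatePresence" };
    case "userTyping":     return { handled: "userTyping", room: detail.room };
    default:
      throw new Error(`unknown event: ${event}`);
  }
}

console.log(dispatch('{"event": "joinRoom", "detail": {"room": "general"}}'));
// { handled: 'joinRoom', room: 'general' }
```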

Sending messages to clients

Because my Lambda functions aren't actually handling WebSockets directly, they don't communicate back to clients via a direct method like WebSocket.send(). Instead, API Gateway exposes an HTTP endpoint that my application can POST a message to, and API Gateway will forward that message to the client. This endpoint has a URL like

https://{{apiDomain}}/{{stage}}/@connections/{{connectionId}}

Each message includes the client's connection ID at event.requestContext.connectionId, so sending responses directly back to the client is pretty straightforward. However, to broadcast a message to other clients (like when notifying them of a new message), I have to look up the relevant connection IDs from DynamoDB.
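Every incoming event also carries the pieces needed to assemble that URL, so a small helper can build it from the request context (callbackUrl is a name I made up; domainName and stage are real fields on requestContext):

```javascript
// Build the @connections callback URL from fields API Gateway includes on
// every WebSocket event's requestContext.
function callbackUrl({ domainName, stage }, connectionId) {
  return `https://${domainName}/${stage}/@connections/${encodeURIComponent(connectionId)}`;
}

const requestContext = {
  domainName: "abc123.execute-api.us-east-1.amazonaws.com", // hypothetical API domain
  stage: "production",
};
console.log(callbackUrl(requestContext, "Q000000000"));
// https://abc123.execute-api.us-east-1.amazonaws.com/production/@connections/Q000000000
```

The connection ID is URL-encoded because API Gateway connection IDs can contain characters like "=".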

The @connections endpoint is secured by IAM authorization, meaning I have to sign each request with AWS Signature v4 using the aws4 npm package.

Test the WebSocket API via the CLI

This is an optional step, but it was a very helpful way for me to understand exactly what the WebSocket API was doing, and to debug the behavior of each Lambda handler as I was developing them. It requires the wscat utility.

wscat --connect $(aws ssm get-parameter --name /serverless-slack-clone/websocket-url --query Parameter.Value --output text)
> {"event": "joinRoom", "detail": {"room": "general"} }

wscat lets you type and send messages on a WebSocket connection and to see responses emitted by the API. In this demo, every message has event and detail properties, in which event indicates what type of operation is being done and detail includes open-ended details about the operation. The available event types are defined in the demo repo.

Cost

the AWS bill for this demo

I built this demo over the course of about a week, and the whole thing cost me exactly $0.10. Most of that cost came from a few times where I turned on an automated chat bot and sent myself and three other WebSocket clients 1,500 messages/minute for several minutes. (I implemented rate limiting shortly after)

In this demo, my Lambda usage fell under the AWS Free Tier and so cost $0. Without the free tier, my Lambda usage for this demo would cost an additional $0.12.

As with many serverless products, this is free/cheap to run at small scale, but could run into surprise costs at large scale. For WebSockets APIs, API Gateway charges for both connection minutes and messages sent. If you had 1,000 clients connected 24/7, your cost for simply keeping those connections open would be $10.95 per month. If you add to that a fairly conservative 3,000 messages sent per minute (say, if each client was sending a ping every 20 seconds), then the cost quickly climbs to $142.35 per month, including the connection costs.

Although you only pay for what you use, the price you pay scales directly with how heavily the service is used. This could be problematic if you have a large number of users or a very chatty application. If real-time communication is the core of your business, you might be better off building a more serverful system that's less elastic but more predictable.

ā† All Posts