For HashiTalks 2021 I was able to present a Terraform project that managed the training and serving of NLP models.
Built in AWS and using ECS, S3, DynamoDB, SQS, Lambda, and EventBridge, the project provides a way to do automated containerized NLP model training. You can queue a model for training by describing the model you want to train and putting that message on an SQS queue. An EventBridge Rule triggers a Lambda function that consumes from the queue and initiates an ECS task to train the model per the given specification. When complete, the trained model is uploaded to S3 and details about the model are stored in DynamoDB.
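As a rough illustration of the trigger wiring described above, the following Terraform is a minimal sketch, not the project's actual configuration. It assumes the EventBridge rule runs on a schedule and that the Lambda function is defined elsewhere. The resource names, the queue name, the five-minute rate, and the two input variables are all hypothetical.

```hcl
# Hypothetical inputs: the Lambda function that drains the queue is assumed
# to be defined elsewhere and passed in by ARN and name.
variable "trainer_lambda_arn" {
  type = string
}

variable "trainer_lambda_name" {
  type = string
}

# Queue that receives one message per model to train (name is an assumption).
resource "aws_sqs_queue" "training_requests" {
  name                       = "nlp-training-requests"
  visibility_timeout_seconds = 300
}

# EventBridge rule that fires on a schedule (the rate is an assumption).
resource "aws_cloudwatch_event_rule" "poll_training_queue" {
  name                = "poll-training-queue"
  schedule_expression = "rate(5 minutes)"
}

# Point the rule at the Lambda function.
resource "aws_cloudwatch_event_target" "invoke_trainer" {
  rule = aws_cloudwatch_event_rule.poll_training_queue.name
  arn  = var.trainer_lambda_arn
}

# Allow EventBridge to invoke the Lambda function.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = var.trainer_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.poll_training_queue.arn
}
```

The Lambda function's execution role would also need permission to read from the queue and to run the ECS task; that part is omitted here.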
Model serving is handled by a second set of Terraform scripts that create an ECS service and task. Once the service and task start, the model can be used for named-entity recognition over a REST interface.
Fargate was not used for model training or serving because it does not support GPU instances. Using ECS with the EC2 launch type allows us to use a GPU-enabled EC2 instance. AWS SageMaker was not used for a couple of reasons. The first is that any hosted service is typically a trade-off between control and convenience. By doing it ourselves we have full control over the process. The second reason is that there is a lot more to learn and play with by rolling our own!
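To make the GPU point concrete, here is a minimal sketch of an ECS task definition that requests a GPU on the EC2 launch type. It is an assumption about how such a task could be declared, not the project's actual definition. The family, container name, image, and memory value are hypothetical.

```hcl
# Task definition for the EC2 launch type that reserves one GPU.
resource "aws_ecs_task_definition" "train_model" {
  family                   = "nlp-model-training" # hypothetical name
  requires_compatibilities = ["EC2"]

  container_definitions = jsonencode([
    {
      name      = "trainer"                      # hypothetical name
      image     = "example/nlp-trainer:latest"   # hypothetical image
      memory    = 8192                           # MiB, an assumed value
      essential = true

      # Ask ECS to place this container on an instance with a free GPU.
      resourceRequirements = [
        {
          type  = "GPU"
          value = "1"
        }
      ]
    }
  ])
}
```

For placement to succeed, the cluster needs at least one registered container instance of a GPU instance type, typically launched from the GPU variant of the ECS-optimized AMI.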
Because the two sets of Terraform scripts need to share values between them, Systems Manager Parameter Store was used. To elaborate, the model serving scripts need some details of the ECS cluster and S3 bucket created by the model training scripts. Those values are written to Parameter Store by the training scripts and then read by the serving scripts.
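The hand-off through Parameter Store could look something like the two fragments below. This is a sketch under assumptions: the parameter paths, resource names, and bucket name are hypothetical, and the real scripts may share different or additional values. The first fragment belongs to the training configuration and publishes the values.

```hcl
# Training configuration: create the shared resources and publish their names.
resource "aws_ecs_cluster" "models" {
  name = "nlp-models" # hypothetical name
}

resource "aws_s3_bucket" "models" {
  bucket = "example-nlp-models-bucket" # hypothetical name
}

resource "aws_ssm_parameter" "ecs_cluster_name" {
  name  = "/nlp-models/ecs-cluster-name" # hypothetical path
  type  = "String"
  value = aws_ecs_cluster.models.name
}

resource "aws_ssm_parameter" "models_bucket" {
  name  = "/nlp-models/s3-bucket" # hypothetical path
  type  = "String"
  value = aws_s3_bucket.models.bucket
}
```

The second fragment belongs to the serving configuration, which is applied separately and reads the same paths.

```hcl
# Serving configuration: look up the values published by the training scripts.
data "aws_ssm_parameter" "ecs_cluster_name" {
  name = "/nlp-models/ecs-cluster-name"
}

data "aws_ssm_parameter" "models_bucket" {
  name = "/nlp-models/s3-bucket"
}

locals {
  # Terraform treats these values as sensitive, so they are redacted in plan output.
  ecs_cluster_name = data.aws_ssm_parameter.ecs_cluster_name.value
  models_bucket    = data.aws_ssm_parameter.models_bucket.value
}
```

Because the serving configuration only knows the parameter paths, the training configuration has to be applied first so the parameters exist when they are read.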
There are a lot of ways this work can be extended, and its purpose was not to provide an all-in-one turnkey solution but instead a project that can easily be adapted to your needs. Other combinations of AWS services could be used, so the right choice depends on which services you are comfortable with and which you are able to use or are already using.
Check out the code on GitHub. Just follow the steps in the README to build and deploy it.