paka is a versatile LLMOps tool that simplifies the deployment and management of large language model (LLM) apps with a single command.
- Cloud-Agnostic Resource Provisioning: paka aims to avoid cloud vendor lock-in. It currently supports EKS, with plans to expand to more cloud services.
- Optimized Model Execution: Designed for efficiency, paka runs LLMs on CPUs and Nvidia GPUs. Model replicas are auto-scaled based on CPU usage, request rate, and latency.
- Scalable Batch Job Management: paka excels in managing batch jobs that dynamically scale out and in, catering to varying workload demands without manual intervention.
- Seamless Application Deployment: With support for running LangChain and LlamaIndex applications as functions, paka can scale applications to zero and back up, and uses rolling updates to ensure no downtime.
- Comprehensive Monitoring and Tracing: Built-in support for metrics collection via Prometheus and Grafana, along with tracing through Zipkin.
Current runtime inference is done through the awesome llama.cpp and llama-cpp-python projects.
vLLM support is coming soon.
Each model is run in a separate model group. Each model group can have its own node type, replicas, and autoscaling policies.
Applications are deployed as serverless containers using Knative. However, users can also deploy their applications to native cloud offerings, such as Lambda or Cloud Run.
An optional Redis broker can be provisioned for Celery jobs. Job workers are automatically scaled based on the queue length.
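As a rough illustration of scaling workers on queue length, the sketch below computes a worker count proportional to the backlog and clamps it to a range. This is not paka's actual scaling code; the function name `desired_workers` and the parameters `jobs_per_worker`, `min_workers`, and `max_workers` are hypothetical names chosen for this example.

```python
import math


def desired_workers(queue_length: int, jobs_per_worker: int,
                    min_workers: int = 0, max_workers: int = 10) -> int:
    """Return a worker count proportional to the backlog, clamped to bounds.

    Hypothetical helper for illustration only; paka's real scaler may
    use a different rule.
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # One worker for every `jobs_per_worker` queued jobs, rounded up.
    wanted = math.ceil(queue_length / jobs_per_worker)
    # Never go below the floor or above the ceiling.
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    for backlog in (0, 5, 42, 500):
        print(backlog, desired_workers(backlog, jobs_per_worker=5))
```

With an empty queue and `min_workers=0`, the result is zero workers, which is what makes scale-to-zero possible for idle job queues.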
A vector store is a key-value store for embeddings. Paka supports provisioning Qdrant.
Paka comes with built-in support for monitoring and tracing. Metrics are collected via Prometheus and Grafana, and tracing is done through Zipkin. Users can also enable Prometheus Alertmanager for alerting.
Paka supports continuous deployment with rolling updates to ensure no downtime. An application can be built, pushed to a container registry, and deployed with a single command.
Application and job code is built using buildpacks, so there is no need to write a Dockerfile. However, users still need to have a Docker runtime installed.
Install the paka CLI
pip install paka
Create a cluster.yaml
aws:
  cluster:
    name: example
    region: us-west-2
    nodeType: t2.medium
    minNodes: 2
    maxNodes: 4
  modelGroups:
    - nodeType: c7a.xlarge
      minInstances: 1
      maxInstances: 3
      name: llama2-7b
      resourceRequest:
        cpu: 3600m
        memory: 6Gi
      autoScaleTriggers:
        - type: cpu
          metadata:
            type: Utilization
            value: "50"
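To give a feel for what a CPU utilization target of 50 with 1 to 3 instances could mean in practice, here is a minimal sketch. It assumes the trigger follows the standard Kubernetes Horizontal Pod Autoscaler rule (desired = ceil(current replicas × current utilization / target utilization)) and ignores the tolerance band and stabilization windows a real autoscaler applies. The function `desired_replicas` is a hypothetical name and not part of paka.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_instances: int, max_instances: int) -> int:
    """Apply the HPA-style proportional rule, clamped to the instance range.

    Illustration only: assumes paka's CPU trigger behaves like the
    standard Kubernetes Horizontal Pod Autoscaler.
    """
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    # Values mirror the example config: target 50%, 1 to 3 instances.
    print(desired_replicas(1, 90, 50, min_instances=1, max_instances=3))
    print(desired_replicas(2, 90, 50, min_instances=1, max_instances=3))
    print(desired_replicas(3, 10, 50, min_instances=1, max_instances=3))
```

Under this assumption, one replica at 90% CPU would grow to two, two replicas at 90% would be capped at the configured maximum of three, and three nearly idle replicas would shrink back to one.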
Provision the cluster
paka cluster up -f cluster.yaml -u
Change to the application directory and add a `Procfile` and a `.cnignore` file. In `Procfile`, add the command to start the application. For example, for a Flask app, it would be `web: gunicorn app:app`. In `.cnignore`, add the files to ignore during build.
To pin the version of the language runtime, add a `runtime.txt` file with the version number. For example, for Python, it could be `python-3.11.*`.
For a Python application, a `requirements.txt` file is required.
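The Procfile example above starts `gunicorn app:app`, which means Gunicorn imports a module named `app` and serves the WSGI callable named `app` inside it. The sketch below is a minimal `app.py` that satisfies that contract using only the standard library, as a stand-in for the Flask app mentioned above; the response text is an arbitrary placeholder. With this file, `requirements.txt` would need to list `gunicorn`.

```python
# app.py: a minimal WSGI application that `gunicorn app:app` can serve.
# A Flask application object would fill the same role; plain WSGI is used
# here only to keep the example free of third-party imports.


def app(environ, start_response):
    """Respond to every request with a short plain-text body."""
    body = b"hello from paka\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # WSGI expects an iterable of byte strings.
    return [body]
```

Running `gunicorn app:app` locally in the same directory is a quick way to confirm the Procfile command works before deploying.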
To deploy the application, run `paka function deploy --name <function_name> --source <source_path> --entrypoint <Procfile_command>`. For example:
paka function deploy --name langchain-server --source . --entrypoint serve
Tear down the cluster
paka cluster down -f cluster.yaml
Contributing
- Open a PR
- Format and lint code with `make lint`
- Run tests with `make test`
Dependencies
- Docker daemon
- pack CLI (https://buildpacks.io/docs/for-platform-operators/how-to/integrate-ci/pack/)
- Pulumi CLI (https://www.pulumi.com/docs/install/)
- AWS CLI and credentials for the AWS deployment
# Make sure aws credentials and cli are set up. Your aws credentials should have access to the following services:
# - S3
# - ECR
# - EKS
# - EC2
aws configure
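Before provisioning, it can be handy to confirm that the command-line tools listed above are on the `PATH`. The following sketch does that with the standard library. The helper name `missing_tools` is hypothetical, and the check only finds the executables: it does not verify that the Docker daemon is running or that the AWS credentials are valid.

```python
import shutil

# Executable names for the tools listed above.
REQUIRED_TOOLS = ("docker", "pack", "pulumi", "aws")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```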