LLMOps tool designed to simplify the deployment and management of large language model (LLM) applications

pip install paka==0.1.8


Welcome to Paka


paka is a versatile LLMOps tool that simplifies the deployment and management of large language model (LLM) apps with a single command.

Paka Highlights

  • Cloud-Agnostic Resource Provisioning: paka is designed to avoid cloud vendor lock-in. It currently supports Amazon EKS, with plans to expand to more cloud services.
  • Optimized Model Execution: paka runs LLMs on CPUs and Nvidia GPUs, and auto-scales model replicas based on CPU usage, request rate, and latency.
  • Scalable Batch Job Management: paka manages batch jobs that scale out and in dynamically to match workload demand, without manual intervention.
  • Seamless Application Deployment: paka runs LangChain and LlamaIndex applications as functions that scale to zero and back up, with rolling updates to ensure no downtime.
  • Comprehensive Monitoring and Tracing: paka has built-in support for metrics collection via Prometheus and Grafana, along with tracing through Zipkin.

Runtime Inference

Current runtime inference is done through the awesome llama.cpp and llama-cpp-python projects.

vLLM support is coming soon.

Each model runs in a separate model group. Each model group can have its own node type, replica count, and autoscaling policies.

Serverless Containers

Applications are deployed as serverless containers using Knative. Users can also deploy their applications to native cloud offerings such as Lambda or Cloud Run.

Batch Jobs

An optional Redis broker can be provisioned for Celery jobs. Job workers are scaled automatically based on the queue length.
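
As a rough illustration of queue-length-based scaling, the sketch below computes a worker count from the number of queued jobs. It is not paka's implementation: the function name `desired_workers`, the `jobs_per_worker` target, and the `max_workers` cap are all hypothetical, chosen only to show the shape of such a policy.

```python
import math


def desired_workers(queue_length: int, jobs_per_worker: int = 10,
                    max_workers: int = 5) -> int:
    """Hypothetical policy: one worker per `jobs_per_worker` queued jobs."""
    if queue_length <= 0:
        return 0  # an empty queue needs no workers
    # Round up so that a partially filled batch still gets a worker,
    # then cap the result at the configured maximum.
    return min(max_workers, math.ceil(queue_length / jobs_per_worker))
```

With these assumed defaults, 25 queued jobs map to 3 workers, and an empty queue scales the workers down to zero.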

Vector Store

A vector store is a key-value store for embeddings. Paka supports provisioning Qdrant.


Monitoring

Paka comes with built-in support for monitoring and tracing. Metrics are collected with Prometheus and visualized in Grafana, and tracing is done through Zipkin. Users can also enable Prometheus Alertmanager for alerting.

Continuous Deployment

Paka supports continuous deployment with rolling updates to ensure no downtime. An application can be built, pushed to a container registry, and deployed with a single command.


Application and job code is built using buildpacks, so there is no need to write a Dockerfile. However, the Docker runtime must still be installed.

Paka CLI Reference

Install the paka CLI

pip install paka

Provision a cluster

Create a cluster.yaml

    name: example
    region: us-west-2
    nodeType: t2.medium
    minNodes: 2
    maxNodes: 4
    - nodeType: c7a.xlarge
      minInstances: 1
      maxInstances: 3
      name: llama2-7b
        cpu: 3600m
        memory: 6Gi
        - type: cpu
            type: Utilization
            value: "50"
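
To see how a CPU trigger with type Utilization and value "50" could play out, here is a minimal sketch of the replica formula used by the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current * currentUtilization / targetUtilization), clamped to a minimum and maximum. Whether paka applies exactly this rule, and whether minInstances and maxInstances act as the replica bounds, are assumptions; the function name `desired_replicas` is hypothetical, and the autoscaler's tolerance band is left out for brevity.

```python
import math


def desired_replicas(current: int, cpu_utilization: float,
                     target: float = 50.0,
                     min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Hypothetical helper: HPA-style replica count for a CPU target.

    Defaults mirror the example config above (value "50", 1 to 3 instances).
    `cpu_utilization` is the average utilization across replicas, in percent.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    raw = math.ceil(current * cpu_utilization / target)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))
```

Under these assumptions, one replica at 90% CPU scales out to 2, two replicas at 20% CPU scale in to 1, and three replicas at 95% CPU stay at the maximum of 3.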

Provision the cluster

paka cluster up -f cluster.yaml -u

Deploy an application

Change to the application directory and add a Procfile and a .cnignore file. In the Procfile, add the command that starts the application. For example, for a Flask app, it would be web: gunicorn app:app. In .cnignore, list the files to ignore during the build.

To pin the version of the language runtime, add a runtime.txt file with the version number. For example, for Python, it could be python-3.11.*.

A Python application also requires a requirements.txt file.
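
As a sketch of what the Procfile command web: gunicorn app:app expects, the file below defines a module-level WSGI callable named app in app.py. It uses a plain WSGI function instead of Flask to stay dependency-free; a Flask application object would be exposed under the same name in the same way. The response text is a placeholder, and requirements.txt would need to list gunicorn (and Flask, if used).

```python
# app.py
def app(environ, start_response):
    """Minimal WSGI callable; `gunicorn app:app` imports this module
    and serves the object named `app`."""
    body = b"Hello from paka\n"  # placeholder response body
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

The first half of app:app is the module name (app.py) and the second half is the callable inside it, so renaming either one means updating the Procfile to match.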

To deploy the application, run `paka function deploy --name <function_name> --source <source_path> --entrypoint <Procfile_command>`. For example:

paka function deploy --name langchain-server --source . --entrypoint serve

Destroy a cluster

paka cluster down -f cluster.yaml


Contributing

  • Open a PR
  • Format and lint code with make lint
  • Run tests with make test


# Make sure AWS credentials and the AWS CLI are set up. Your AWS credentials should have access to the following services:
# - S3
# - ECR
# - EKS
# - EC2
aws configure