This guide will show you how to quickly get started with using Predibase to deploy and prompt LLMs. We’ll walk through setting up your environment, running inference, and using streaming responses.

Prerequisites

  1. Sign up for a Predibase account
  2. Navigate to the Settings page and click Generate API Token
  3. Set up your environment and install the Python SDK:
pip install -U predibase
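Rather than hard-coding your token in source files, you can export it as an environment variable. (This assumes the SDK reads `PREDIBASE_API_TOKEN` when no `api_token` argument is passed; check the SDK docs to confirm.)

```shell
# Store the token in the environment instead of in code
export PREDIBASE_API_TOKEN="<PREDIBASE API TOKEN>"
```

With the variable set, you would initialize the client with `pb = Predibase()`.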

Create and prompt a private deployment

Let’s start by deploying a model and running inference.

from predibase import Predibase, DeploymentConfig

# Initialize the client with your API token
pb = Predibase(api_token="<PREDIBASE API TOKEN>")

# Create a deployment
deployment = pb.deployments.create(
    name="my-qwen3-8b",
    config=DeploymentConfig(base_model="qwen3-8b")
)

# Generate text
response = pb.deployments.client("my-qwen3-8b").generate(
    "What is a Large Language Model?",
    max_new_tokens=50
)
print(response.generated_text)
# "(LLM) Explained\n\nA large language model (LLM) is a type of artificial intelligence..."

Prompt a shared endpoint

For quick experimentation, you can use our shared endpoints (available to SaaS users only).

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE API TOKEN>")

# Get a list of available models
available_models = pb.deployments.list()

# Connect to a shared deployment
client = pb.deployments.client("qwen3-8b", max_new_tokens=32)

# Generate text
response = client.generate(
    "What are some popular tourist spots in San Francisco?"
)
print(response.generated_text)

Note that instruction- and chat-tuned models often expect special tokens (such as [INST] for Llama-2-style models) before and after the prompt; formatting your input with the model's chat template improves response quality. See Chat Templates for details.
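As an illustration, here is a minimal sketch of Llama-2-style prompt formatting. `format_llama2_prompt` is a hypothetical helper, not part of the SDK, and the exact template varies by model family (Qwen models use a different one), so prefer the model's official chat template in practice.

```python
def format_llama2_prompt(user_message: str, system_prompt: str = "") -> str:
    """Wrap a user message in Llama-2-style [INST] tags (hypothetical helper)."""
    # The optional system prompt goes inside a <<SYS>> block before the message
    sys_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    return f"<s>[INST] {sys_block}{user_message} [/INST]"

prompt = format_llama2_prompt("What are some popular tourist spots in San Francisco?")
print(prompt)
# <s>[INST] What are some popular tourist spots in San Francisco? [/INST]
```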

Stream responses

For longer responses, you might want to stream the tokens as they’re generated:

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE API TOKEN>")
client = pb.deployments.client("qwen3-8b")

# Stream tokens as they're generated
for response in client.generate_stream(
    "What are some popular tourist spots in San Francisco?", max_new_tokens=256
):
    if not response.token.special:
        print(response.token.text, end="", flush=True)

All examples above use the Python SDK for simplicity. A REST API is also available if you prefer making direct HTTP calls. See our Chat Completions API for details.
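As a sketch of what a direct HTTP call might look like, the example below builds an OpenAI-compatible Chat Completions request. The URL pattern and `<TENANT>` placeholder are assumptions here; consult the Chat Completions API docs for your deployment's exact endpoint.

```python
import json

# Placeholder endpoint -- substitute your tenant ID and deployment name
# (the exact URL pattern is an assumption; see the REST API docs).
url = (
    "https://serving.app.predibase.com/<TENANT>"
    "/deployments/v2/llms/my-qwen3-8b/v1/chat/completions"
)
headers = {
    "Authorization": "Bearer <PREDIBASE API TOKEN>",
    "Content-Type": "application/json",
}
payload = {
    "model": "my-qwen3-8b",
    "messages": [
        {"role": "user", "content": "What is a Large Language Model?"}
    ],
    "max_tokens": 50,
}
body = json.dumps(payload)

# Send with any HTTP client, e.g.:
# import requests
# r = requests.post(url, headers=headers, data=body)
# print(r.json()["choices"][0]["message"]["content"])
```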
