This guide will show you how to quickly get started with using Predibase to
deploy and prompt LLMs. We’ll walk through setting up your environment, running
inference, and using streaming responses.
Let’s start by deploying a model and running inference.
```python
from predibase import Predibase, DeploymentConfig

# Initialize the client with your API token
pb = Predibase(api_token="<PREDIBASE API TOKEN>")

# Create a deployment
deployment = pb.deployments.create(
    name="my-qwen3-8b",
    config=DeploymentConfig(base_model="qwen3-8b")
)

# Generate text
response = pb.deployments.client("my-qwen3-8b").generate(
    "What is a Large Language Model?",
    max_new_tokens=50
)
print(response.generated_text)
# "(LLM) Explained\n\nA large language model (LLM) is a type of artificial intelligence..."
```
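Beyond `max_new_tokens`, `generate()` accepts additional sampling parameters. The sketch below assumes `temperature` is supported (it is a standard LoRAX-style generation option); check the API reference for the full list:

```python
# A minimal sketch: temperature is assumed to be a supported generation
# parameter (standard in LoRAX-style APIs); consult the API reference for
# the full set of options.
response = pb.deployments.client("my-qwen3-8b").generate(
    "What is a Large Language Model?",
    max_new_tokens=100,
    temperature=0.7,  # higher values produce more varied output
)
print(response.generated_text)
```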
For quick experimentation, you can use our shared endpoints, which are available to SaaS users only.
```python
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE API TOKEN>")

# Get a list of available models
available_models = pb.deployments.list()

# Connect to a shared deployment
client = pb.deployments.client("qwen3-8b", max_new_tokens=32)

# Generate text
response = client.generate(
    "What are some popular tourist spots in San Francisco?"
)
print(response.generated_text)
```
Instruction- and chat-tuned models expect special tokens (like [INST]) before and after the prompt, and including them can noticeably improve response quality. See Chat Templates for details.
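For example, a Llama-2/Mistral-style template wraps the user message in [INST] tags; this is plain string formatting on the client side. The snippet below is illustrative only: Qwen models use a different template, so check the Chat Templates page for the format your deployment expects.

```python
# Illustrative only: [INST] tags are the Llama-2/Mistral convention, not
# the Qwen template. Reuses the shared-deployment client from above.
user_message = "What are some popular tourist spots in San Francisco?"
prompt = f"[INST] {user_message} [/INST]"

response = client.generate(prompt, max_new_tokens=128)
print(response.generated_text)
```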
For longer responses, you might want to stream the tokens as they’re generated:
```python
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE API TOKEN>")

client = pb.deployments.client("qwen3-8b")

# Stream tokens as they're generated
for response in client.generate_stream(
    "What are some popular tourist spots in San Francisco?",
    max_new_tokens=256
):
    if not response.token.special:
        print(response.token.text, end="", flush=True)
```
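If you also need the complete response once streaming finishes, you can accumulate the chunks while printing them, a small variation on the example above:

```python
# Collect streamed tokens into a single string while printing them as they arrive.
chunks = []
for response in client.generate_stream(
    "Write a short travel itinerary for a weekend in San Francisco.",
    max_new_tokens=256
):
    if not response.token.special:
        chunks.append(response.token.text)
        print(response.token.text, end="", flush=True)

full_text = "".join(chunks)
```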
All examples above use the Python SDK for
simplicity. A REST API is also available if you prefer making direct HTTP
calls. See our Chat Completions
API for details.
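For reference, a direct HTTP call looks roughly like the sketch below. The URL pattern, tenant placeholder, and payload shape are assumptions based on common Predibase/LoRAX conventions, so confirm them against the API reference before use.

```python
import requests

API_TOKEN = "<PREDIBASE API TOKEN>"
TENANT_ID = "<YOUR TENANT ID>"  # assumption: the serving endpoint is tenant-scoped

# Assumed URL pattern; verify against the API reference.
url = f"https://serving.app.predibase.com/{TENANT_ID}/deployments/v2/llms/qwen3-8b/generate"

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "inputs": "What is a Large Language Model?",
        "parameters": {"max_new_tokens": 50},
    },
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```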