
Predibase supports two Python clients for prompting deployments:
  1. OpenAI Client (use this for the chat messages input format; see the sketch after this list).
  2. Predibase Client.
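
A minimal sketch of option 1, assuming Predibase's OpenAI-compatible endpoint pattern; the tenant ID, token, and base URL below are placeholders, so confirm the exact URL for your deployment in the Predibase docs or UI:

from openai import OpenAI

# Assumed base-URL pattern for the OpenAI-compatible endpoint; substitute
# your tenant ID and deployment name.
client = OpenAI(
    api_key="<PREDIBASE_API_TOKEN>",
    base_url="https://serving.app.predibase.com/<tenant-id>/deployments/v2/llms/qwen3-8b/v1",
)

completion = client.chat.completions.create(
    model="qwen3-8b",  # deployment name; an adapter ID can also be passed here
    messages=[{"role": "user", "content": "What is your name?"}],
    max_tokens=100,
)
print(completion.choices[0].message.content)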

Initialize Client

Initialize a LoRAX client for running inference on a deployment.
pb.deployments.client(
    deployment_ref: str,
    force_bare_client: Optional[bool] = False,
    serving_url_override: Optional[str] = None
) -> LoRAXClient
Parameters
  • deployment_ref: str - Name of the deployment to prompt
  • force_bare_client: bool, optional, default False - When False, the SDK runs a sub-process that queries the Predibase API and prints a helpful message if the deployment is still scaling up; this is useful for experimentation and notebooks. Set True in production to avoid the additional checks.
  • serving_url_override: str, optional, default None - Override the default URL used to prompt deployments. Only used for direct-ingress VPC deployments. The available VPC endpoints for a direct-ingress deployment can be found in the Configuration tab for a deployment in the Predibase UI.
Returns
  • LoRAXClient - Client object for running inference
Example
client = pb.deployments.client("qwen3-8b", force_bare_client=True)
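
For a direct-ingress VPC deployment, point the client at the private endpoint. A hedged sketch; the URL below is a placeholder for the endpoint shown in the deployment's Configuration tab:

# Hypothetical endpoint; copy the real value from the Configuration tab.
client = pb.deployments.client(
    "qwen3-8b",
    force_bare_client=True,
    serving_url_override="https://<your-direct-ingress-endpoint>",
)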

Generate Text

Generate text from a prompt using the deployed model. This is equivalent to the OpenAI completions endpoint.
Chat templates are not applied to the prompt by the SDK generate commands. For chat use-cases, either use the OpenAI-compatible chat completions endpoint or apply the chat template to the prompt yourself (see the chat-template sketch after the examples below).
client.generate(
    prompt: str,
    adapter_id: Optional[str] = None,
    adapter_source: Optional[str] = None,
    api_token: Optional[str] = None,
    max_new_tokens: int = 20,
    best_of: Optional[int] = None,
    repetition_penalty: Optional[float] = None,
    return_full_text: bool = False,
    seed: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    truncate: Optional[int] = None,
    response_format: Optional[Union[Dict[str, Any], ResponseFormat]] = None,
    decoder_input_details: bool = False,
    details: bool = True
) -> Response
Parameters
  • prompt: str - Input text to generate from
  • adapter_id: str, optional - Adapter ID to apply to the base model (e.g. "adapter-name/1"); can include a checkpoint (e.g. "adapter-name/1@7")
  • adapter_source: str, optional - Where to load the adapter from: "hub", "local", "s3", or "pbase"
  • api_token: str, optional - Token used to access private adapters
  • max_new_tokens: int - Maximum number of tokens to generate
  • best_of: int - Generate best_of sequences and return the one with the highest log-probability
  • repetition_penalty: float - Penalty applied to repeated tokens (1.0 means no penalty)
  • return_full_text: bool - If True, prepend the original prompt to the generated text
  • seed: int - Random seed for reproducible sampling
  • stop_sequences: List[str] - Stop generation when any of these sequences is produced
  • temperature: float - Softmax temperature for sampling
  • top_k: int - Keep only the k highest-probability tokens for sampling
  • top_p: float - Use nucleus sampling to keep the smallest set of tokens whose cumulative probability ≥ top_p
  • truncate: int - Truncate input tokens to this length before generation
  • response_format: Dict[str, Any] | ResponseFormat, optional - Schema describing a structured format (e.g. a JSON object) to impose on the output (see the structured-output sketch after the examples below)
  • decoder_input_details: bool - Return log-probabilities and IDs for the decoder's input tokens
  • details: bool - Return log-probabilities and IDs for all generated tokens
Returns
  • GenerationResponse - Object containing the generated text and metadata
Examples
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Basic prompting
client = pb.deployments.client("qwen3-8b")
print(client.generate("What is your name?").generated_text)

# Using an adapter with max_new_tokens
client = pb.deployments.client("qwen3-8b")
print(client.generate("hello", adapter_id="news-summarizer-model/1", max_new_tokens=100).generated_text)

# Using a specific adapter checkpoint
client = pb.deployments.client("qwen3-8b")
# Prompts using the 7th checkpoint of adapter version `news-summarizer-model/1`.
print(client.generate("hello", adapter_id="news-summarizer-model/1@7", max_new_tokens=100).generated_text)

Generate Embeddings

Generate an embedding vector for input text.
pb.embeddings.create(
  model: str,     # Embedding model deployment
  input: str      # Input text
) -> Response
Parameters
  • model: str - Embedding model deployment
  • input: str - The text to generate embeddings for
Returns
  • Response - Embedding response object; the vector for the input text is available as response.data[0].embedding (a list of floats)
Example
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Generate embeddings for your data
text = "Generate embeddings using your dedicated deployment."
response = pb.embeddings.create(model="my-embedding-model", input=text)
print(f"Generated {len(response.data[0].embedding)}-dimensional embedding")

# Process a batch of documents
documents = [
    "First document for embedding",
    "Second document with different content",
    "Third document to process in batch"
]
batch_embeddings = [pb.embeddings.create(model="my-embedding-model", input=doc).data[0].embedding for doc in documents]
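
A common follow-on is similarity search over these vectors. A small sketch computing cosine similarity with NumPy (an illustration, not part of the Predibase SDK):

import numpy as np

# Cosine similarity between the first two document embeddings.
a = np.array(batch_embeddings[0])
b = np.array(batch_embeddings[1])
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {similarity:.4f}")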