The Predibase Python SDK supports two clients for prompting deployments:

  1. Predibase Client
  2. OpenAI Client
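
The sections below document the Predibase client. For the OpenAI client, a minimal sketch is shown here; the base URL pattern and tenant placeholder are assumptions, so copy the exact OpenAI-compatible serving URL from your deployment's page in the Predibase UI.

from openai import OpenAI

# Assumption: the base URL below is illustrative only; use the exact
# serving URL shown for your deployment in the Predibase UI.
client = OpenAI(
    api_key="<PREDIBASE_API_TOKEN>",
    base_url="https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/qwen3-8b/v1",
)

completion = client.chat.completions.create(
    model="qwen3-8b",  # may be the deployment name or an adapter ID; check your deployment docs
    messages=[{"role": "user", "content": "What is your name?"}],
    max_tokens=100,
)
print(completion.choices[0].message.content)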

Initialize Client

Initialize a LoRAX client for running inference on a deployment.

pb.deployments.client(
    deployment_ref: str,
    force_bare_client: Optional[bool] = False,
    serving_url_override: Optional[str] = None
) -> LoRAXClient

Parameters

  • deployment_ref: str - Name of the deployment to prompt
  • force_bare_client: bool, optional, default False - When False, the SDK runs a subprocess that queries the Predibase API and prints a helpful message if the deployment is still scaling up, which is useful for experimentation and notebooks. Set it to True in production to avoid these additional checks.
  • serving_url_override: str, optional, default None - Override the default URL used to prompt deployments. Only used for direct-ingress VPC deployments. The available VPC endpoints for a direct-ingress deployment can be found in the Configuration tab for a deployment in the Predibase UI.

Returns

  • LoRAXClient - Client object for running inference

Example

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
client = pb.deployments.client("qwen3-8b", force_bare_client=True)
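
For a direct-ingress VPC deployment, override the serving URL. The endpoint below is a placeholder; copy the real value from the deployment's Configuration tab in the Predibase UI.

# Placeholder endpoint: replace with the VPC endpoint listed in the
# deployment's Configuration tab.
client = pb.deployments.client(
    "qwen3-8b",
    serving_url_override="https://<your-vpc-endpoint>",
)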

Generate Text

Generate text from a prompt using the deployed model.

client.generate(
    prompt: str,
    adapter_id: Optional[str] = None,
    adapter_source: Optional[str] = None,
    api_token: Optional[str] = None,
    max_new_tokens: int = 20,
    best_of: Optional[int] = None,
    repetition_penalty: Optional[float] = None,
    return_full_text: bool = False,
    seed: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    truncate: Optional[int] = None,
    response_format: Optional[Union[Dict[str, Any], ResponseFormat]] = None,
    decoder_input_details: bool = False,
    details: bool = True
) -> Response

Parameters

  • prompt: str - Input text to generate from
  • adapter_id: str, optional - Adapter ID to apply to the base model (e.g. "adapter-name/1"); can include a checkpoint (e.g. "adapter-name/1@7")
  • adapter_source: str, optional - Where to load the adapter from: "hub", "local", "s3", or "pbase"
  • api_token: str, optional - Token used to access private adapters
  • max_new_tokens: int, default 20 - Maximum number of tokens to generate
  • best_of: int, optional - Generate best_of sequences and return the one with the highest log-probability
  • repetition_penalty: float, optional - Penalty applied to repeated tokens (1.0 means no penalty)
  • return_full_text: bool, default False - If True, prepend the original prompt to the generated text
  • seed: int, optional - Random seed for reproducible sampling
  • stop_sequences: List[str], optional - Stop generation when any of these sequences is produced
  • temperature: float, optional - Softmax temperature for sampling
  • top_k: int, optional - Keep only the k highest-probability tokens for sampling
  • top_p: float, optional - Use nucleus sampling to keep the smallest set of tokens whose cumulative probability ≥ top_p
  • truncate: int, optional - Truncate input tokens to this length before generation
  • response_format: Dict[str, Any] | ResponseFormat, optional - Schema describing a structured format (e.g. a JSON object) to impose on the output
  • decoder_input_details: bool, default False - Return log-probabilities and IDs for the decoder's input tokens
  • details: bool, default True - Return log-probabilities and IDs for all generated tokens

Returns

  • Response - Object containing the generated text and, when details=True, token-level metadata

Examples

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
client = pb.deployments.client("qwen3-8b")

# Basic prompting
print(client.generate("What is your name?").generated_text)

# Using an adapter with max_new_tokens
print(client.generate("hello", adapter_id="news-summarizer-model/1", max_new_tokens=100).generated_text)

# Using a specific adapter checkpoint: prompts with the 7th checkpoint
# of adapter version news-summarizer-model/1.
print(client.generate("hello", adapter_id="news-summarizer-model/1@7", max_new_tokens=100).generated_text)

Generate Embeddings

Generate a vector embedding for a given input text.

pb.embeddings.create(
    model: str,     # Embedding model deployment
    input: str      # Input text
) -> Response

Parameters

  • model: str - Embedding model deployment
  • input: str - The text to generate embeddings for

Returns

  • Response - Embedding response object; the vector is available at response.data[0].embedding as a list of floats

Example

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Generate embeddings for your data
text = "Generate embeddings using your dedicated deployment."
response = pb.embeddings.create(model="my-embedding-model", input=text)
print(f"Generated {len(response.data[0].embedding)}-dimensional embedding")

# Process a batch of documents
documents = [
    "First document for embedding",
    "Second document with different content",
    "Third document to process in batch"
]
batch_embeddings = [pb.embeddings.create(model="my-embedding-model", input=doc).data[0].embedding for doc in documents]
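
The returned vectors can be compared directly. A minimal sketch of cosine similarity in pure Python, reusing documents and batch_embeddings from the batch example above:

import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank the remaining documents by similarity to the first one
query_vec = batch_embeddings[0]
for doc, vec in zip(documents[1:], batch_embeddings[1:]):
    print(f"{cosine_similarity(query_vec, vec):.3f}  {doc}")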