
Predibase supports two Python clients for prompting deployments:
  1. OpenAI Client (use this for the chat messages input format; see the sketch after this list).
  2. Predibase Client.
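
A minimal sketch of option 1, assuming Predibase's OpenAI-compatible endpoint pattern; the tenant ID, token, and base URL below are placeholders, so confirm the exact URL for your deployment in the Predibase docs or UI:

from openai import OpenAI

# Assumed base-URL pattern for the OpenAI-compatible endpoint; substitute
# your tenant ID and deployment name.
client = OpenAI(
    api_key="<PREDIBASE_API_TOKEN>",
    base_url="https://serving.app.predibase.com/<tenant-id>/deployments/v2/llms/qwen3-8b/v1",
)

completion = client.chat.completions.create(
    model="qwen3-8b",  # deployment name; an adapter ID can also be passed here
    messages=[{"role": "user", "content": "What is your name?"}],
    max_tokens=100,
)
print(completion.choices[0].message.content)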

Initialize Client

Initialize a LoRAX client for running inference on a deployment.
pb.deployments.client(
    deployment_ref: str,
    force_bare_client: Optional[bool] = False,
    serving_url_override: Optional[str] = None
) -> LoRAXClient
Parameters
  • deployment_ref: str - Name of the deployment to prompt
  • force_bare_client: bool, optional, default False - When False, the SDK runs a sub-process that queries the Predibase API and prints a helpful message if the deployment is still scaling up; this is useful for experimentation and notebooks. Set True in production to avoid the additional checks.
  • serving_url_override: str, optional, default None - Override the default URL used to prompt deployments. Only used for direct-ingress VPC deployments. The available VPC endpoints for a direct-ingress deployment can be found in the Configuration tab for a deployment in the Predibase UI.
Returns
  • LoRAXClient - Client object for running inference
Example
client = pb.deployments.client("qwen3-8b", force_bare_client=True)
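
For a direct-ingress VPC deployment, point the client at the private endpoint. A hedged sketch; the URL below is a placeholder for the endpoint shown in the deployment's Configuration tab:

# Hypothetical endpoint; copy the real value from the Configuration tab.
client = pb.deployments.client(
    "qwen3-8b",
    force_bare_client=True,
    serving_url_override="https://<your-direct-ingress-endpoint>",
)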

Generate Text

Generate text from a prompt using the deployed model. This is equivalent to the OpenAI completions endpoint.
Chat templates are not applied to the prompt by the SDK generate commands. For chat use-cases, either use the OpenAI-compatible chat completions endpoint or apply the chat template to the prompt yourself (see the chat-template sketch after the examples below).
client.generate(
    prompt: str,
    adapter_id: Optional[str] = None,
    adapter_source: Optional[str] = None,
    api_token: Optional[str] = None,
    max_new_tokens: int = 20,
    best_of: Optional[int] = None,
    repetition_penalty: Optional[float] = None,
    return_full_text: bool = False,
    seed: Optional[int] = None,
    stop_sequences: Optional[List[str]] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    truncate: Optional[int] = None,
    response_format: Optional[Union[Dict[str, Any], ResponseFormat]] = None,
    decoder_input_details: bool = False,
    details: bool = True
) -> Response
Parameters
  • prompt: str - Input text to generate from
  • adapter_id: str, optional - Adapter ID to apply to the base model (e.g. "adapter-name/1"); can include a checkpoint (e.g. "adapter-name/1@7")
  • adapter_source: str, optional - Where to load the adapter from: "hub", "local", "s3", or "pbase"
  • api_token: str, optional - Token used to access private adapters
  • max_new_tokens: int - Maximum number of tokens to generate
  • best_of: int - Generate best_of sequences and return the one with the highest log-probability
  • repetition_penalty: float - Penalty applied to repeated tokens (1.0 means no penalty)
  • return_full_text: bool - If True, prepend the original prompt to the generated text
  • seed: int - Random seed for reproducible sampling
  • stop_sequences: List[str] - Stop generation when any of these sequences is produced
  • temperature: float - Softmax temperature for sampling
  • top_k: int - Keep only the k highest-probability tokens for sampling
  • top_p: float - Use nucleus sampling to keep the smallest set of tokens whose cumulative probability ≥ top_p
  • truncate: int - Truncate input tokens to this length before generation
  • response_format: Dict[str, Any] | ResponseFormat, optional - Schema describing a structured format (e.g. a JSON object) to impose on the output (see the structured-output sketch after the examples below)
  • decoder_input_details: bool - Return log-probabilities and IDs for the decoder's input tokens
  • details: bool - Return log-probabilities and IDs for all generated tokens
Returns
  • GenerationResponse - Object containing the generated text and metadata
Examples
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Basic prompting
client = pb.deployments.client("qwen3-8b")
print(client.generate("What is your name?").generated_text)

# Using an adapter with max_new_tokens
client = pb.deployments.client("qwen3-8b")
print(client.generate("hello", adapter_id="news-summarizer-model/1", max_new_tokens=100).generated_text)

# Using a specific adapter checkpoint
client = pb.deployments.client("qwen3-8b")
# Prompts using the 7th checkpoint of adapter version `news-summarizer-model/1`.
print(client.generate("hello", adapter_id="news-summarizer-model/1@7", max_new_tokens=100).generated_text)

Generate Embeddings

Generate an embedding vector for input text.
pb.embeddings.create(
  model: str,     # Embedding model deployment
  input: str      # Input text
) -> Response
Parameters
  • model: str - Embedding model deployment
  • input: str - The text to generate embeddings for
Returns
  • Response - Embedding response object; the vector for the input text is available as response.data[0].embedding (a list of floats)
Example
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Generate embeddings for your data
text = "Generate embeddings using your dedicated deployment."
response = pb.embeddings.create(model="my-embedding-model", input=text)
print(f"Generated {len(response.data[0].embedding)}-dimensional embedding")

# Process a batch of documents
documents = [
    "First document for embedding",
    "Second document with different content",
    "Third document to process in batch"
]
batch_embeddings = [pb.embeddings.create(model="my-embedding-model", input=doc).data[0].embedding for doc in documents]
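
A common follow-on is similarity search over these vectors. A small sketch computing cosine similarity with NumPy (an illustration, not part of the Predibase SDK):

import numpy as np

# Cosine similarity between the first two document embeddings.
a = np.array(batch_embeddings[0])
b = np.array(batch_embeddings[1])
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {similarity:.4f}")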