Querying Models
SDK operations for prompting your deployed models
Predibase supports various Python SDK methods for prompting deployments.
Initialize Client
Initialize a LoRAX client for running inference on a deployment.
Parameters
- deployment_ref: str - Name of the deployment to prompt
- force_bare_client: bool, optional, default False - When False, the SDK runs a sub-process which queries the Predibase API and prints out a helpful message if the deployment is still scaling up. This is useful for experimentation and notebooks. Use True for production to avoid these additional checks.
- serving_url_override: str, optional, default None - Override the default URL used to prompt deployments. Only used for direct-ingress VPC deployments. The available VPC endpoints for a direct-ingress deployment can be found in the Configuration tab for a deployment in the Predibase UI.
Returns
- LoRAX Client - Client object for running inference
Example
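A minimal sketch of client initialization, assuming the SDK entry point is the `Predibase` class and using placeholder token and deployment names:
```python
from predibase import Predibase

# Authenticate with your Predibase API token (placeholder shown).
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Default client: runs extra readiness checks, handy for notebooks and experimentation.
lorax_client = pb.deployments.client("my-deployment")

# Production client: skip the readiness checks.
prod_client = pb.deployments.client("my-deployment", force_bare_client=True)
```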
Generate Text
Generate text from a prompt using the deployed model.
Parameters
- prompt: str - Input text to generate from
- adapter_id: str, optional – Adapter ID to apply to the base model (e.g. "adapter-name/1"); can include a checkpoint (e.g. "adapter-name/1@7")
- adapter_source: str, optional – Where to load the adapter from: "hub", "local", "s3", or "pbase"
- api_token: str, optional – Token used to access private adapters
- max_new_tokens: int – Maximum number of tokens to generate
- best_of: int – Generate best_of sequences and return the one with the highest log-probability
- repetition_penalty: float – Penalty applied to repeated tokens (1.0 means no penalty)
- return_full_text: bool – If True, prepend the original prompt to the generated text
- seed: int – Random seed for reproducible sampling
- stop_sequences: List[str] – Stop generation when any of these sequences is produced
- temperature: float – Softmax temperature for sampling
- top_k: int – Keep only the highest-probability k tokens for sampling
- top_p: float – Use nucleus sampling to keep the smallest set of tokens whose cumulative probability ≥ top_p
- truncate: int – Truncate input tokens to this length before generation
- response_format: Dict[str, Any] | ResponseFormat, optional – Schema describing a structured format (e.g. a JSON object) to impose on the output
- decoder_input_details: bool – Return log-probabilities and IDs for the decoder’s input tokens
- details: bool – Return log-probabilities and IDs for all generated tokens
Returns
- GenerationResponse - Object containing the generated text and metadata
Examples
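Hedged sketches using a LoRAX client obtained as in the previous example; the prompt text and adapter name are placeholders:
```python
# Basic generation against the base model.
resp = lorax_client.generate(
    "What is machine learning?",
    max_new_tokens=128,
    temperature=0.7,
)
print(resp.generated_text)

# Apply a fine-tuned adapter hosted on Predibase (optionally pin a checkpoint, e.g. "adapter-name/1@7").
adapter_resp = lorax_client.generate(
    "What is machine learning?",
    adapter_id="adapter-name/1",
    adapter_source="pbase",
    max_new_tokens=128,
)
print(adapter_resp.generated_text)
```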
Generate Embeddings
Generate a vector embedding for a piece of text.
Parameters
- model: str - Embedding model deployment
- input: str - The text to generate embeddings for
Returns
- list[float] - Vector embedding for the text
Example
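A minimal sketch; the method path (pb.embeddings.create) is an assumption inferred from the model/input parameters above, and the deployment name is a placeholder:
```python
# Assumed call shape: an embedding model deployment plus the text to embed.
embedding = pb.embeddings.create(
    model="my-embedding-deployment",
    input="Predibase serves fine-tuned and base LLMs.",
)

# Per the Returns section, the result is the vector embedding for the text.
print(len(embedding))
```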