

llm_deployment.generate(prompt, options)

This method allows you to query the specified deployment in your Predibase environment.

Note that prompt is passed directly to the LLM without any prompt formatting, so when querying a fine-tuned LLM you must pass in the full prompt, including the template used during training.
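As a minimal sketch of what "the full prompt" means in practice, the snippet below wraps a raw instruction in a template before it would be passed to generate(). The template text and the build_prompt helper are hypothetical; substitute the exact template your model was fine-tuned with.

```python
# Hypothetical template: replace with the exact text used during fine-tuning.
TRAINING_TEMPLATE = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the same template the model saw in training."""
    return TRAINING_TEMPLATE.format(instruction=instruction)


prompt = build_prompt("Write a function that returns the Fibonacci sequence.")
# `prompt` is the string you would pass as the first argument to generate().
```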


best_of: Optional [int]

Generates best_of sequences at the same time and returns the one with the highest overall log probability over the entire sequence.
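The selection rule can be illustrated with a small sketch. It assumes each candidate is a pair of text and per-token log probabilities; pick_best and the sample data are hypothetical and are not part of the Predibase SDK.

```python
from typing import List, Tuple


def pick_best(candidates: List[Tuple[str, List[float]]]) -> str:
    """Return the text whose token log probabilities sum to the highest value."""
    best_text, _ = max(candidates, key=lambda c: sum(c[1]))
    return best_text


# Made-up candidates: (generated text, log probability of each generated token).
candidates = [
    ("def fib(n): ...", [-0.2, -0.1, -0.4]),   # total is about -0.7
    ("fibonacci is ...", [-0.9, -0.3, -0.5]),  # total is about -1.7
]
best = pick_best(candidates)
```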

decoder_input_details: Optional [bool]

Return the token logprobs and ids of the input prompt tokens.

details: Optional [bool]

Return the token logprobs and ids of the generated tokens.

do_sample: Optional [bool]

Whether or not to use sampling; use greedy decoding otherwise. Defaults to false.

max_new_tokens: Optional[int] (default: None)

The maximum number of new tokens to generate, ignoring the number of tokens in the input prompt. If not set, will default to the max context window for the deployment minus the number of tokens in your prompt.


We'd still recommend setting max_new_tokens for non-instruction-tuned models since they are inclined to keep generating tokens.
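The default described above is simple arithmetic. The sketch below assumes you know the deployment's context window and the prompt's token count; effective_max_new_tokens is a hypothetical helper, not an SDK function.

```python
from typing import Optional


def effective_max_new_tokens(
    context_window: int,
    prompt_tokens: int,
    max_new_tokens: Optional[int] = None,
) -> int:
    """Return the explicit limit if set, else the room left in the context window."""
    remaining = context_window - prompt_tokens
    if remaining < 0:
        raise ValueError("prompt is longer than the context window")
    if max_new_tokens is None:
        return remaining
    return max_new_tokens


# A 4096-token window with a 96-token prompt leaves 4000 tokens by default.
default_limit = effective_max_new_tokens(4096, 96)
explicit_limit = effective_max_new_tokens(4096, 96, max_new_tokens=256)
```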

repetition_penalty: Optional [float64]

The parameter for repetition penalty. 1.0 means no penalty. Default: 1
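One common formulation of a repetition penalty divides positive logits and multiplies negative logits of already-seen tokens by the penalty. Whether the deployment uses exactly this rule is an assumption; the sketch only shows why 1.0 means no penalty.

```python
from typing import Dict, Iterable


def apply_repetition_penalty(
    logits: Dict[str, float],
    seen: Iterable[str],
    penalty: float = 1.0,
) -> Dict[str, float]:
    """Make already-seen tokens less likely; a penalty of 1.0 changes nothing."""
    adjusted = dict(logits)
    for tok in set(seen):
        if tok in adjusted:
            score = adjusted[tok]
            # Dividing a positive score or multiplying a negative one both lower it.
            adjusted[tok] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = {"a": 2.0, "b": -1.0, "c": 0.5}
penalized = apply_repetition_penalty(logits, seen=["a", "b"], penalty=2.0)
unchanged = apply_repetition_penalty(logits, seen=["a", "b"], penalty=1.0)
```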

return_full_text: Optional [bool]

Whether to prepend the prompt to the generated text.

seed: Optional [float64]

The seed to use for the random number generator. If not provided, will default to a random seed.

stop: Optional [List[str]]

Stop generating tokens if a member of stop_sequences is generated.
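The effect of stop sequences can be approximated client-side as cutting the text at the earliest match. This is only a sketch: truncate_at_stop is a hypothetical helper, and it drops the stop sequence itself, which may differ from what the deployment returns.

```python
from typing import List


def truncate_at_stop(text: str, stop_sequences: List[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at position 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


generated = "def fib(n):\n    pass\n\n# end\nmore text"
trimmed = truncate_at_stop(generated, ["# end", "###"])
```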

temperature: float (default: 0.1)

Temperature is used to control the randomness of predictions. Higher values increase diversity and lower values increase determinism. Setting a temperature of 0 is useful for testing and debugging.
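To see why lower temperatures are more deterministic, the sketch below scales logits by the temperature before a softmax. Treating a temperature of 0 as a greedy pick of the top token is an assumption of this sketch, not a statement about the deployment.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumption: temperature 0 means all probability on the top token.
        top = logits.index(max(logits))
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.1)  # sharply peaked on the first token
hot = softmax_with_temperature(logits, 2.0)   # flatter, more diverse
```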

top_k: Optional [int]

Top-k is a sampling method where the k highest-probability vocabulary tokens are kept and the probability mass is redistributed among them.
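A minimal sketch of the filtering step, using a toy distribution over string tokens; top_k_filter is a hypothetical helper written for illustration.

```python
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize their probabilities."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.2}
filtered = top_k_filter(probs, k=2)  # "c" is dropped; "a" and "b" share the mass
```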

top_p: Optional [float]

Top-p (aka nucleus sampling) is an alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. For example, 0.2 corresponds to only the tokens comprising the top 20% probability mass being considered.
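The nucleus can be sketched as the smallest set of top-ranked tokens whose cumulative probability reaches top_p. top_p_filter below is a hypothetical helper over a toy distribution, not the deployment's implementation.

```python
from typing import Dict


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
narrow = top_p_filter(probs, 0.2)  # only the top token is needed to cover 20%
wider = top_p_filter(probs, 0.7)   # the two top tokens are needed to cover 70%
```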

truncate: Optional [int]

The number of tokens to truncate the output to. If not provided, defaults to the user's default truncate setting.

typical_p: Optional [float]

If set to a float < 1, the smallest set of the most locally typical tokens with probabilities that add up to typical_p or higher is kept for generation. See Typical Decoding for Natural Language Generation for more information.

watermark: Optional [bool]


A list of Predibase GeneratedResponse objects


result = llm_ft.generate("can you give me a function to return the fibonacci sequence", options={
'temperature': 0.1,
'details': True,