Why does ChatGPT not generate the same text every time?

Non-determinism of ChatGPT

LLMs have seen remarkable advancements, with models like GPT-3 from OpenAI setting new standards in language understanding and generation. However, while these models have demonstrated exceptional capabilities, they also exhibit a curious phenomenon known as non-determinism.

Non-determinism refers to the fact that generating text using these models may not yield the same output every time, even when provided with the same input prompt. This article aims to explore the sources of non-determinism in ChatGPT and shed light on its impact and why this behaviour occurs.

Today we were discussing this topic in Maxpool and it led to interesting findings.

The Need for Determinism

In various scenarios, it is imperative that the output of a language model remains consistent. Any production environment where test cases are crucial requires deterministic behaviour. Researchers need determinism to quantify model performance on benchmarks and to reproduce their results. Any variation in the generated text could skew the results and make it difficult to compare models or track their progress over time.

A recent paper, LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation, explores the impact of non-determinism in ChatGPT. It finds that only a small portion (21.1%) of papers from the past 2 years take the non-determinism threat into consideration when designing their experiments.

Sources of Non-determinism

Several factors contribute to the non-deterministic behaviour observed in ChatGPT. These factors include:

1. Temperature

Temperature is a parameter used during text generation that controls the randomness of the output. It rescales the logits before the softmax, changing how peaked the resulting distribution over candidate tokens is and therefore how diverse the generated text will be. A higher temperature value leads to a more diverse output, while a lower value makes the output more deterministic.

Interestingly, because the logits are divided by the temperature during sampling, a temperature of exactly 0 is mathematically undefined, so implementations either reject it or substitute a very small positive number; this is reportedly how OpenAI handles temperatures at or near 0. The impact of temperature on the output distribution shows how even a small adjustment can significantly influence the generated text.

Setting the temperature parameter to zero in OpenAI does not guarantee deterministic behaviour

Info from this Stack Overflow question.

Figure: a categorical distribution at low temperature (left) and high temperature (right); the heights of the bars correspond to token probabilities.
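To see why this matters in practice, here is a minimal sketch using the OpenAI Python SDK (the model name, prompt, and number of calls are illustrative): repeated calls with the temperature pinned to 0 can still return more than one distinct completion.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

outputs = set()
for _ in range(5):
    # Identical prompt and parameters on every call, temperature pinned to 0.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name
        messages=[{"role": "user", "content": "Complete this sentence: The sky is"}],
        temperature=0,
        max_tokens=5,
    )
    outputs.add(response.choices[0].message.content)

# Despite temperature=0, this set may contain more than one distinct completion.
print(outputs)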

import numpy as np

# Maps token ids back to words; assumes a fitted Keras TextVectorization
# layer named `text_vectorization` is available in scope.
tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))

def sample_next(predictions, temperature=1.0):
    # Rescale the predicted probabilities by the temperature (must be > 0):
    # a low temperature sharpens the distribution, a high one flattens it.
    predictions = np.asarray(predictions).astype("float64")
    predictions = np.log(predictions) / temperature
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    # Draw one sample from the rescaled categorical distribution and
    # return the index of the sampled token.
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)
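As a quick illustration of what the temperature does to the distribution (the probabilities below are made up, not real model output):

# Hypothetical next-token probabilities over a 4-word vocabulary.
probs = np.array([0.6, 0.25, 0.1, 0.05])

for t in (0.2, 1.0, 2.0):
    scaled = np.exp(np.log(probs) / t)
    print(t, np.round(scaled / scaled.sum(), 3))

# A low temperature concentrates almost all of the mass on the top token,
# while a high temperature flattens the distribution, so repeated sampling
# diverges more often.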

2. Batching

Batching is a technique used to improve the efficiency of model training and inference by processing multiple inputs simultaneously. However, it can introduce non-determinism at generation time: inputs of different lengths must be padded to a common length, and which requests happen to be grouped together changes both the padding and the order of the floating-point operations inside the batch. Different batching and padding configurations can therefore lead to variations in the final output.
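A small, self-contained illustration of the numerical side of this (plain NumPy, nothing ChatGPT-specific): floating-point addition is not associative, so summing the same values grouped differently, which is effectively what different batch shapes do, can give slightly different results.

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100_000).astype(np.float32)

full_sum = x.sum()                                     # one big reduction
chunked_sum = x.reshape(100, 1000).sum(axis=1).sum()   # reduce in chunks first

print(full_sum, chunked_sum)
# The two results typically differ in the last bits; inside a model, such tiny
# differences can flip the choice between two near-tied tokens.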

3. Length

The same study finds that the length of the coding instructions is negatively correlated with both the syntactic and structural similarity of the generated code, as well as with its average correctness. This implies that longer coding instructions may lead to higher levels of non-determinism and lower code quality.

4. Nucleus Sampling (Top-p)

Nucleus sampling, also known as top-p sampling, is a method used to control the diversity of generated text. Instead of sampling from the full vocabulary, the model samples only from the smallest set of tokens whose cumulative probability reaches the chosen top-p threshold. This method allows for a balance between deterministic and diverse outputs.
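A minimal NumPy sketch of how top-p sampling can be implemented (the function name and toy distribution are made up for illustration):

import numpy as np

def top_p_sample(probs, p=0.8, rng=None):
    # Sample a token index from the smallest set of tokens whose cumulative
    # probability mass reaches p (the "nucleus").
    rng = rng if rng is not None else np.random.default_rng()
    order = np.argsort(probs)[::-1]                # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalise inside the nucleus
    return rng.choice(nucleus, p=nucleus_probs)

# Toy distribution: "blue", "clear", "overcast", "falling", "green".
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
print(top_p_sample(probs, p=0.8))  # only the first three tokens can ever be picked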

5. Random seed

A random seed provides the initial state for the randomisation used during sampling, so the same seed reproduces the same sequence of random draws and hence the same token selections. If a specific random seed is set, such as 42, the model starts from a predetermined point in its random number sequence. For instance, if the prompt is "The sky is" and the top-p value is 0.8, nucleus sampling will consistently choose among the tokens that make up 80% of the cumulative probability mass, such as "blue", "clear", and "overcast", and with the same seed it will produce the same sequence of completions on every run.
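Continuing the hypothetical top_p_sample sketch from the previous section, fixing the generator's seed makes the draws reproducible, while a fresh generator does not:

# Same toy distribution as before.
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

first = top_p_sample(probs, p=0.8, rng=np.random.default_rng(42))
second = top_p_sample(probs, p=0.8, rng=np.random.default_rng(42))
assert first == second                   # identical seed, identical draw

unseeded = top_p_sample(probs, p=0.8)    # fresh entropy: may differ run to run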

6. PyTorch and CUDA

The underlying frameworks used to run models like ChatGPT, such as PyTorch and CUDA, can also introduce non-determinism. PyTorch offers both deterministic and non-deterministic implementations for certain operations, and some CUDA kernels are non-deterministic by default, so the same input may produce slightly different outputs. This behaviour can impact the overall determinism of the model. You can read more about them here.
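For models you run yourself, PyTorch exposes switches that trade speed for reproducibility. A minimal sketch (note that fully deterministic CUDA execution may also require setting the CUBLAS_WORKSPACE_CONFIG environment variable, and some operations have no deterministic implementation at all):

import torch

torch.manual_seed(42)                        # fix the RNG state
torch.use_deterministic_algorithms(True)     # raise an error on non-deterministic ops
torch.backends.cudnn.benchmark = False       # stop cuDNN from auto-tuning kernels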

Conclusion

The non-determinism observed in ChatGPT is a result of various factors, including temperature, sampling methods, batching, and the underlying frameworks. While non-determinism may not be entirely eliminated, careful tuning of parameters such as temperature, top-p, and random seeds can help mitigate its effects.


This was written by Pratik Bhavsar, a member of Maxpool, a data science community for discussing real ML problems.

Come join.maxpool.ai 🙌

We are a community of working professionals with a passion for LLM & search.
