A Stanford University study found that ChatGPT, the popular AI chatbot created by OpenAI, has become worse at solving math problems. Between March and June, the chatbot’s performance on specific tasks fluctuated significantly. The study examined two versions of the technology, GPT-3.5 and GPT-4, focusing on tasks such as math problem solving, answering sensitive questions, software code generation, and visual reasoning. According to the Fortune story, the study identified a phenomenon known as “drift,” in which the technology’s ability to perform specific tasks changes unpredictably over time. GPT-4’s accuracy on math problems dropped dramatically, from 97.6 percent in March to 2.4 percent in June, while the data showed GPT-3.5 moving in the opposite direction, improving over the same period.
Similar swings appeared when the models were asked to generate code and to complete visual reasoning tests. James Zou, a Stanford computer science professor and one of the study’s authors, expressed surprise at the magnitude of the change given ChatGPT’s sophistication. “When we are tuning a large language model to improve its performance on certain tasks, that can actually have a lot of unintended consequences, which might actually hurt this model’s performance on other tasks,” Zou told Fortune. “There are a lot of interesting interdependencies in how the model responds to things, which can lead to some of the worsening behaviors we saw.” Reacting to ChatGPT’s deteriorating arithmetic ability, one Reddit user quipped that getting dumber with age is the most human-like thing it can do.
The inconsistency in results was not a matter of the model simply being wrong at certain tasks. Rather, it arose because when researchers sought to improve the model’s performance on certain tasks, the changes had unintended consequences for other parts of the model, resulting in unexpected behavior.
The results show that these models can shift over time, a phenomenon referred to as “drift.” Because of drift, the models perform inconsistently across tasks. The study underscores the importance of periodically assessing the performance of these language models, so that any problems arising from drift can be detected and corrected quickly and the models kept working as intended.
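As a rough illustration of what such periodic assessment might look like in practice, the Python sketch below runs a fixed benchmark against a model and logs a date-stamped accuracy score. Everything in it, including the query_model placeholder and the sample questions, is a hypothetical stand-in rather than anything taken from the study itself.

    """Minimal sketch of a recurring accuracy check for a language model.

    Assumptions (not from the study): query_model is a hypothetical
    placeholder for whatever client call reaches the model under test,
    and the benchmark is a fixed list of (prompt, expected answer) pairs.
    """
    from datetime import date

    # A fixed benchmark: the same questions on every run, so scores
    # from different dates are directly comparable.
    BENCHMARK = [
        ("What is 17 * 24? Answer with the number only.", "408"),
        ("Is 17077 a prime number? Answer yes or no.", "yes"),
    ]

    def query_model(prompt: str) -> str:
        """Hypothetical placeholder: replace with a real call to the model
        being monitored. The canned reply just keeps the sketch runnable."""
        return "yes" if "prime" in prompt else "408"

    def run_benchmark() -> float:
        """Return the fraction of benchmark questions answered correctly."""
        correct = 0
        for prompt, expected in BENCHMARK:
            reply = query_model(prompt)
            if expected.lower() in reply.strip().lower():
                correct += 1
        return correct / len(BENCHMARK)

    if __name__ == "__main__":
        # A date-stamped score lets drift show up as a trend across runs
        # rather than as a one-off surprise.
        print(f"{date.today().isoformat()}: accuracy = {run_benchmark():.1%}")

Keeping the benchmark fixed is the key design choice here: because the questions never change between runs, a drop in the logged score points to the model drifting rather than to the test itself moving.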
The Stanford University study sheds light on the challenges that drift poses for AI language models such as ChatGPT, and it underscores the need for further investigation and transparency to ensure that such systems perform consistently and reliably across a range of tasks.