ChatGPT continues to be a phenomenon of modern technology. Since it exploded onto the scene, acquiring one million users in just five days, it has grown continuously, generating 1.63 billion visits in February 2024. Many businesses are still trying to assess how they might use it to streamline operations, generate marketing content and source information, balancing the benefits against the risks it poses. But how does it fare when it comes to data and analytics? Can it be trusted?
Last year, fifty-five ran an analysis of how ChatGPT responded when we asked the model a list of common questions our consultants receive about Google Analytics 4. At the time, it could answer only a third of the questions to an acceptable standard, getting responses entirely wrong in 50% of cases, most often because GPT-3 was relying on out-of-date information.
Since then, a lot has moved on in the generative AI space. Google entered the game with Bard (since rebranded as Gemini) to compete with OpenAI's explosively popular chatbot, and more recently, Anthropic's model Claude has been making waves in the industry. It had become critical for OpenAI to update its model to keep up with the competition, so when OpenAI finally launched its upgraded model, GPT-4 (which remains available only in ChatGPT Plus), we wondered how this would affect its accuracy.
Correct responses more than doubled, but misleading and incorrect responses remain concerning
The responses from the GPT-4 model were much more likely to be correct. Whilst only 30% of answers were correct in the previous study, the new model's success rate was more than double, with 66% of answers being correct. Many of the questions answered wrongly due to out-of-date training data were resolved, and the new model also corrected factually incorrect statements, such as the claims that GA4 can backfill data from before the first date of tracking and that it can 'automatically measure conversions through machine learning'.
However, one in three responses was still 'semi-correct' or incorrect, showing that we are still nowhere near ChatGPT becoming a reliable member of data marketing teams. Many of the answers that were wrong previously have improved but are still not fully correct, or are factually correct but missing key points, which could leave a user with an incomplete understanding of the topic. For example, when asked about the role of different attribution capabilities in GA4, GPT-4 gave a detailed and correct account of how data-driven attribution can be used for extensive analysis, but failed to mention that users can still choose a last-click attribution model. In some cases, the latest iteration of ChatGPT also gave a perfect answer to a question but then added unnecessary information that is misleading.
The new model is far from infallible. It failed to include the nuance that Universal Analytics 360 properties would keep processing data until July 2024, which, combined with the assertion that 14 months is the maximum data retention setting (true only for standard GA4 properties; 360 properties allow up to 50 months), suggests that GPT-4 doesn't fully grasp the difference between standard and 360 properties. GPT-4 also offered some misinformation around GA4's data processing, suggesting that page view hits were measured differently in UA and GA4, and that improved cross-device tracking could be a plausible reason for purchase event differences between the two platforms.
Longer responses require ‘prompt engineering’
When running questions through the model, it became clear that GPT-4 gave significantly longer responses than its predecessor: GPT-3's responses averaged 165 words, whilst GPT-4's averaged 443 words. The new model gives an extremely thorough answer every time, going into a lot of detail and often bringing in unrelated aspects. It was certainly clear that the new training data and generally improved model allowed ChatGPT to navigate the various nuances of GA4 and give plenty of guidance to users on how to get started. But was the guidance more reliable than that given by the previous model?
The level of detail covered by the responses was extremely impressive, and in some cases relevant caveats added useful nuance. But often, the useful answer to the question was either buried alongside explanations of the broader differences between UA and GA4, or missing altogether. Whilst GPT-4 could make a convincing politician, such responses aren't much use to data analysts looking for guidance on a new way of operating Google Analytics, and wading through them could prove more time-consuming.
You can reduce the length of the model's answers by prompting it carefully, if you know what you are doing. When re-running the study on a subset of questions with a prompt constraining response length, we found that the average word count dropped back down to 123 words, with no impact on the accuracy of responses. This example of 'prompt engineering' shows how using generative AI techniques in projects can start to add real value.
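For those curious what this looks like in practice, here is a minimal sketch of a length-constrained prompt using the OpenAI Python SDK. The model name, word limit and instruction wording below are illustrative assumptions, not the exact prompts used in our study:

```python
# Minimal sketch: constraining response length with a system prompt,
# via the OpenAI Python SDK (v1.x). The word limit and instruction
# wording are illustrative, not the exact prompt from our study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a Google Analytics 4 expert. Answer in no more than "
    "150 words, and only cover what the question actually asks. "
    "Do not compare UA and GA4 unless explicitly asked to."
)

def ask(question: str) -> str:
    """Send one GA4 question and return the model's answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0,  # reduce run-to-run variation when benchmarking
    )
    return response.choices[0].message.content

print(ask("What is the maximum data retention period in GA4?"))
```

Constraining scope and length in the system prompt, rather than trimming the output afterwards, keeps the answer coherent whilst cutting the padding.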
Still missing the human factor
While generative AI models are continually improving, there is still a limit to what they can achieve by themselves. But what these models lack in infallibility, they make up for in speed, scale and cost-effectiveness, which is why the industry is already seeing generative AI used in lower-risk use cases, such as virtual assistants, or in controlled, low-stakes automation processes.
Given the right parameters, and with some course-correction processes in place, our experience has shown that human intervention on top of this powerful new technology results in a formidable force. As tangible use cases arise where generative AI can change the game, the ability of a system to monitor pain points and generate ever-improving results will separate the businesses simply using generative AI from those truly using it to its full potential.