ChatGPT has emerged as a remarkable language model capable of understanding and generating human-like text. So, we wondered how good it was at making logical connections, especially when it comes to tricky puzzles that combine culture, trivia, and sometimes puns with incomplete words or phrases. To find out, we ran a 60-day experiment on the new NYT Connections puzzle, which premiered on June 12. We had one mission: to test the AI's puzzle-solving abilities and track its accuracy during the first 60 days of the NYT Connections beta, from June 12 to August 10.
For those not yet familiar with NYT Connections, it's a daily word puzzle. Players must sort 16 words into four groups of four words each by discovering the common thread that links the words in each group.
In this article, we present a comprehensive log of ChatGPT's performance during the 60-day challenge, highlighting its successes and occasional missteps. We invite you to join us as we explore the potential of AI language models and uncover fascinating insights from this unique experiment, all to answer one question: Can ChatGPT solve NYT's Connections?
Curious and determined, we embarked on this thrilling 60-day puzzle-solving adventure by crafting the perfect AI-friendly templated prompt for the ever-capable ChatGPT.
It took us a few iterations to craft a reliable daily prompt that gave enough W Questions (what, why, how, and where) to guide the language model.
Here is the daily prompt we used:
We considered two factors when assessing whether ChatGPT solved the Connections puzzle on a given day: first, whether it grouped the words correctly according to their commonalities, and second, whether it guessed (more or less) the correct category name for each group.
On June 12th, for example, ChatGPT sorted all 16 words into the correct four groups, but one of its category names differed slightly from the official one: it labeled rain, hail, sleet, and snow as Weather Phenomena, whereas Connections called the category WET WEATHER. Our experiment still counted an answer as correct if the groupings were right, even if the category name was not.
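The grading rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily how the results were tallied in the experiment: the function name `count_correct_groups` and the dictionary layout are hypothetical, and only the WET WEATHER group from June 12 is shown.

```python
def count_correct_groups(proposed, official):
    """Count proposed groups whose four words exactly match an official group.

    Both arguments map a category name to a list of words. Category names
    are ignored (hypothetical rule mirroring the text above), and words are
    compared case-insensitively.
    """
    official_sets = {
        frozenset(word.lower() for word in words)
        for words in official.values()
    }
    return sum(
        frozenset(word.lower() for word in words) in official_sets
        for words in proposed.values()
    )


# Only this one group is taken from the article; a full puzzle has four.
official = {"WET WEATHER": ["HAIL", "RAIN", "SLEET", "SNOW"]}
proposed = {"Weather Phenomena": ["rain", "hail", "sleet", "snow"]}

print(count_correct_groups(proposed, official))  # 1
```

Comparing sets of words rather than category labels is what lets "Weather Phenomena" count as a match for WET WEATHER.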
In the spirit of sharing ChatGPT's performance, we created these interactive visuals; here, you'll see its triumphs and blunders.
Our findings are ordered by date, starting with the puzzle's debut on June 12th. We then documented how efficient the AI language model was at grouping the words into their respective categories according to the NYT Connections difficulty levels. In the last column, we gave ChatGPT a solving score out of 100%, depending on how many times it correctly grouped all 16 words daily - you can find the percentage of correct answers by hovering over the purple bar.
If you're curious, use the search box to find out how well ChatGPT did at solving a particular group, word, or puzzle date.
So, on how many days did ChatGPT manage to solve the puzzle? Here's a bar graph showing on how many days of the 60-day period ChatGPT found all four groups and on how many days it missed one or more.
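For readers who want to reproduce this kind of tally, here is a minimal Python sketch of the per-day score and the solved/missed count described above. It assumes the daily score is the share of that day's four groups found, which is one plausible reading of our scoring; the function names and the sample log are hypothetical, and the numbers are made up for illustration, not results from the experiment.

```python
def daily_score(correct_groups, total_groups=4):
    """Percentage of the day's groups that were found (assumed definition)."""
    return 100 * correct_groups / total_groups


def summarize(log):
    """Split a {date: correct_group_count} log into solved and missed days."""
    solved = sum(1 for count in log.values() if count == 4)
    return {"days": len(log), "solved": solved, "missed": len(log) - solved}


# Made-up sample log: date label -> number of groups found that day.
sample_log = {"day 1": 4, "day 2": 2, "day 3": 3}

print(daily_score(3))         # 75.0
print(summarize(sample_log))  # {'days': 3, 'solved': 1, 'missed': 2}
```

A day counts as solved only when all four groups are found, matching the split shown in the bar graph.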
So, where did ChatGPT falter, and on which colored categories? This interactive map shows ChatGPT's errors grouped by the four difficulty levels of NYT Connections. The purple and blue categories are the hardest, and the green and yellow ones are the easiest.
As you can see, the harder the difficulty level, the more ChatGPT struggled to find a common thread to solve the category. Hover over each category, and you'll see the date of the puzzle and the four words that make it up.
If you prefer to look at each color individually, simply use the filter in the left-hand corner and use the dropdown menu or click directly on the color you want to see.
While the AI language model usually managed to find commonalities between the yellow and green groups, we noticed it struggled a lot more with the purple and blue categories on average. These were some of the trickiest categories for ChatGPT to solve:
The majority of these were from the purple and blue groups; however, sometimes, a sneaky green or yellow tripped ChatGPT up, too, especially if it was trivia or a brand name.
Did ChatGPT showcase its linguistic brilliance in solving the NYT Connections puzzle? We will leave that for you to decide.
Regardless, we can all agree that it was an interesting experiment. From triumphant victories to intriguing challenges, this 60-day journey revealed the remarkable potential of AI-driven puzzle-solving and offered a glimpse of what language and artificial intelligence might achieve in the future.