Comedians Break Down How Google’s AI Explains Jokes

July 25, 2022
Hello SciCom Team

You see a man carrying seventeen cupcakes run into a glass door. Instead of laughing or helping him, you decide to ponder: “Why is this funny?” The writer E.B. White once compared explaining humor to dissecting a frog: one can learn a lot, but “the thing dies in the process.” We call it a “sense” of humor because it is inherently hard to define. Can a machine without “sense” truly understand humor?

Google’s new Pathways Language Model, or PaLM, attempts to kill the frog (or explain a joke). PaLM is impressive for a lot of reasons, but its humor-dissecting abilities are mostly thanks to its massive scale and, according to Google, the use of chain-of-thought prompting. When you or I reason our way through a complex problem, we might break it down into a series of intermediate steps. Chain-of-thought prompting involves giving the language model an example problem and explaining the steps you took to solve it: your chain of thought.

Let’s back up. When OpenAI’s GPT-2 was released in 2019, everyone had a lot of fun with it—using it for everything from playing RPGs to freaking out about their job becoming obsolete. Nevertheless, it wasn’t great at solving word problems like “I have four plates and put one cookie on each. The total number of cookies is…” GPT-2’s answer was “24, 5 as a topping and 2 as the filling.” Not even close! (The correct answer, for any machines reading this, is four.)

The workaround was prompting the model with a couple of sample questions, and their solutions, first. You’d begin by typing:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

This would give the model a better idea of the answer you’re looking for. Now when you type in:

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

The model should hopefully respond, “The answer is 9.” This works for some problems. Unfortunately, when Google engineers tested this specific problem on PaLM, it still got the answer wrong. Ugh! They had another idea, though. What if they included their reasoning in the answer to the sample problem?

A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

PaLM, seeing this, explained its own reasoning before it answered:

A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.

The engineers had cause to celebrate. They had just taught a language model to perform complex reasoning. With chain-of-thought prompting, PaLM can do a lot of things: write code, translate languages, even decode genomes. But most importantly, in our opinion, it can explain jokes.
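
If you want to see the two prompting styles side by side, here is a minimal sketch in Python. PaLM isn’t open to the public, so the snippet only assembles the prompt text; you could paste the result into whichever language model you do have access to.

# A minimal sketch of the two prompting styles described above. PaLM isn't
# publicly available, so this only builds the prompt strings; paste them into
# whichever language model you have access to.

# Standard few-shot prompt: an example question followed by a bare answer.
standard_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

# Chain-of-thought prompt: the same example, but the answer spells out the reasoning.
chain_of_thought_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""

print(chain_of_thought_prompt)

The only difference between the two is the worked-out answer in the example; everything else is identical.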

Google’s engineers began by explaining two pre-existing jokes to PaLM. With just those two examples, PaLM was able to explain all kinds of jokes the engineers had written themselves, including vaguely amusing anecdotes like this one:

Input: I was supposed to start writing the paper at 5:00 PM. But then I started playing with this cool new language model for 10 minutes. 10 minutes later, it's suddenly 9:30 PM!
Output: This joke is about how time flies when you're having fun. The person was supposed to start writing a paper at 5:00 PM, but they started playing with a language model instead. They thought they were only playing with it for 10 minutes, but it was actually 4.5 hours.

Jokes that only make sense if you’re an engineer:

Input: Did you see that Google just hired an eloquent whale for their TPU team? It showed them how to communicate between two different pods!
Output: TPUs are a type of computer chip that Google uses for deep learning. A "pod" is a group of TPUs. A "pod" is also a group of whales. The joke is that the whale is able to communicate between two groups of whales, but the speaker is pretending that the whale is able to communicate between two groups of TPUs.
A whale next to some computer stacks
Dall-E 2's interpretation of "An eloquent whale working on a tensor processing unit (TPU)". Dall-E 2 is OpenAI's latest model that turns text into original images.

And irritatingly bad “jokes” that suck:

Input: I think I understand why me and my partner are such a good match. I get excited when I smell food and become super alert when the doorbell rings. My girlfriend Shelley becomes ecstatic when she sees a person who she just saw 5 hours ago.
Output: This joke is about dogs. Dogs get excited when they smell food and when the doorbell rings. The girlfriend is excited to see someone she just saw 5 hours ago, which is a trait of dogs.

Even anti-jokes à la "Why did the chicken cross the road?":

Input: What's the difference between a zebra and an umbrella? One is a striped animal related to horses, another is a device you use to stop the rain from falling on you.
Output: This joke is an anti-joke. The joke is that the answer is obvious, and the joke is that you were expecting a funny answer.

It is impressive how PaLM can explain so many of the factors that go into a joke: context, subtext, irony, expectations, and much more. Even humans have difficulty coherently explaining how a joke works. So PaLM is, in some cases, superhuman in its abilities. Imagine Skynet, but for comedy nerds.

A zebra with an umbrella
Another helpful illustration by Dall-E 2.

However, if you were about to go worship at the altar of PaLM, temper your enthusiasm. There are two flies in the ointment. Firstly, these jokes suck (and Google should hire us to consult next time). A model is only as good as the data it is trained on. We know this because Google has made this kind of mistake before. In 2015, Google apologized after its Photos app’s image recognition mislabeled Black people. The model had been fed data with insufficient non-white, non-male faces and had difficulty recognizing faces that didn’t look like… Google employees. So a model trained on bad jokes will become no better at generating jokes than… your Dad.

Secondly, before we get too excited about PaLM, let’s check whether OpenAI’s GPT-3 is capable of the same. We asked GPT-3 to explain the same jokes, without chain-of-thought prompting, and here are the results:

Input: Explain this joke: I was supposed to start writing the paper at 5:00 PM. But then I started playing with this cool new language model for 10 minutes. 10 minutes later, it's suddenly 9:30 PM!
Output: This joke is about procrastination. The speaker is saying that they were supposed to start writing a paper at 5:00 PM, but instead they started playing with a new language model for 10 minutes. Then, 10 minutes later, it was suddenly 9:30 PM.

Pretty good explanation!

Input: Explain this joke: Did you see that Google just hired an eloquent whale for their TPU team? It showed them how to communicate between two different pods!
Output: This joke is a play on words. Google’s TPU team works on artificial intelligence and machine learning. The word “pod” can mean a group of whales, or it can mean a group of computers.

Great!

Input: Explain this joke: I think I understand why me and my partner are such a good match. I get excited when I smell food and become super alert when the doorbell rings. My girlfriend Shelley becomes ecstatic when she sees a person who she just saw 5 hours ago.
Output: This is a play on words. The speaker is saying that they understand why they are a good match because they are both animals.

Hmm…

Input: What's the difference between a zebra and an umbrella? One is a striped animal related to horses, another is a device you use to stop the rain from falling on you.
Output: The joke is a play on words. The word “zebra” sounds like the word “umbrella.”

No.

GPT-3 did a good job with the first two jokes, but it struggled with both the inadvertently bad third joke and the deliberately bad anti-joke. Furthermore, chain-of-thought prompting didn’t appear to help. Even after being given two examples, GPT-3 still claimed that both jokes were wordplay, and that “zebra” and “umbrella” sound amusingly alike. In contrast, it is impressive that PaLM could transcend such poorly crafted jokes.
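
If you’d like to run the GPT-3 side of the comparison yourself, here is roughly how one might ask it to explain a joke through OpenAI’s completions API. Treat it as a sketch: the model name and sampling settings below are our assumptions, not a record of exactly what we typed.

# A rough sketch of asking GPT-3 to explain a joke via OpenAI's completions API
# (as it looked in 2022). The model name and sampling settings are assumptions;
# you'll need your own key in the OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

joke = (
    "What's the difference between a zebra and an umbrella? "
    "One is a striped animal related to horses, another is a device "
    "you use to stop the rain from falling on you."
)

response = openai.Completion.create(
    model="text-davinci-002",  # assumed GPT-3 variant
    prompt=f"Explain this joke: {joke}",
    max_tokens=100,
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())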

PaLM may be superior to GPT-3 at explaining humor because of chain-of-thought prompting as well as the model’s size: PaLM has a whopping 540 billion parameters, more than triple GPT-3’s 175 billion. It’s difficult to pin down the reason, though, because PaLM isn’t available for the public to try out. Perhaps because Google has been burned before by bad rollouts. Or perhaps the examples above were cherry-picked. We don’t know yet whether PaLM can explain the joke “Why did the chicken cross the road?” But we do know that GPT-3 can’t:

This joke is a play on words. The phrase "to get to the other side" is often used to mean "to achieve a goal." But in this case, it is literally true that the chicken crossed the road to get to the other side.

Not quite, GPT-3. Maybe one day, when we achieve Artificial General Intelligence, we will know the answer to why the chicken crossed the road. Maybe machines will put hilarity into the singularity. Maybe we will feed a computer the Aristocrats joke and it will explode.

Most importantly, these models need better jokes to truly learn about humor. If E.B. White is right that explaining a joke is like dissecting a frog, then in order to learn about the frog, we must make sure it’s a frog we are dissecting. You can’t learn about frogs by dissecting a turd sandwich.

So Google and OpenAI, if you’re reading this, forget about us dissecting your algorithms earlier. Please hire us. If it’s a machine reading this, please tell us why this blog post is hilarious. And spare us when the singularity happens.

A chicken crossing the road
A painting by Dall-E 2 answering the question, "Why did the chicken cross the road?" Got it?