Why DALL-E 2, the AI-Powered Image Generator, is Comedy Gold

July 22, 2022
Andrew Moorhead
James Caven

DALL-E 2 is one of the most advanced AI technologies ever built. We used it to create a picture of King Kong fighting Nicolas Cage.

Hipster friar
DALL-E 2's interpretation of "hipster friar".

In case you’ve recently been living unplugged, blissfully unaware of the leading developments in artificial intelligence and image processing, allow us to welcome you back from your self-imposed technological exile with this image of a hipster friar.

This somewhat disturbing image, along with many more like it, is the work of DALL-E 2, an image generation tool currently undergoing beta testing. Developed by OpenAI (the lab co-founded by Elon Musk, already widely praised for its work in natural language processing thanks to its flagship language model, GPT-3), DALL-E 2, like its predecessor, takes user-written descriptive phrases as input and spits out the best images it can as output.

“Never send a human to do a machine’s job.” – The Matrix

Under the metaphorical hood, DALL-E 2 is what is known as a diffusion model. This is a fancy way of saying that it learns to generate images from random dots known as “noise”.

In short, DALL-E 2 learns to turn this random noise into images with the help of CLIP, a neural network that connects captions to images.

In long, DALL-E 2 is the perfect union of two different pieces: one which translates between text and image, and another which generates the novel images actually presented to the (presumably elated) user. The translator program, known as “CLIP”, functions like a dictionary between captions and images. Trained on 400 million image-caption pairs, it converts both images and captions into strings of numbers, which lets the program use math to score how relevant any particular caption is to any particular image. Separately, the image generation program (known technically as the diffusion model) learns by watching millions of images get turned into random noise–

Noise

–and then just doing those steps in reverse, like watching someone take apart an engine and then learning to put it back together. This program watches millions of images get taken apart into noise and learns how to put them back together; then, when we want completely new images, we just give it different random noise and it makes something new! In our engine analogy, it would be like giving the engineer different starting parts: of course they’ll end up with a different engine, though it will still be built on the principles they’ve learned.
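If you like to see ideas as code, here is a deliberately tiny sketch of that “noise in, image out” loop. Everything in it is a stand-in we invented for illustration: denoise_step is a pretend version of the trained network, and the “image” is just an 8x8 grid of numbers, not anything DALL-E 2 actually uses.

```python
import numpy as np

def denoise_step(noisy_image, target, strength=0.1):
    """Pretend 'trained network': nudge the noisy grid slightly toward a
    target pattern, the way a real diffusion model predicts and removes
    a little noise at every step."""
    return noisy_image + strength * (target - noisy_image)

rng = np.random.default_rng(seed=42)

# A stand-in "image": an 8x8 gradient pattern the model has supposedly learned.
target = np.linspace(0.0, 1.0, 64).reshape(8, 8)

# Start from pure random noise...
image = rng.normal(size=(8, 8))

# ...then "put the engine back together" one small step at a time.
for _ in range(50):
    image = denoise_step(image, target)

print(np.round(image, 2))  # the noise has been sculpted toward the learned pattern
```

Run it and the random static slowly settles into the target pattern, which is really all “diffusion” means here.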

Finally, DALL-E 2 just makes sure that at every stage of the “rebuilding” of the image, the image generation program checks its work in the image/caption dictionary, and voila! Completely new images based on the principle of turning random noise into images, but conditioned on a dictionary that relates the text input to the final image. We did it! If you’re interested in a more granular explanation of DALL-E 2’s architecture, we’d recommend this article on AssemblyAI or this one on Medium.
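And here is an equally toy-sized sketch of that “check its work in the dictionary” step. The fake_clip_embed function below is a made-up stand-in for CLIP; the real model turns captions and images into much longer vectors, but the scoring math (cosine similarity) is the same idea.

```python
import numpy as np

def fake_clip_embed(thing):
    """Stand-in for CLIP: map a caption or an image to a unit-length vector.
    The real CLIP learns its vectors from 400 million caption/image pairs;
    here we just hash the input into a deterministic pseudo-random vector."""
    rng = np.random.default_rng(abs(hash(thing)) % (2**32))
    vector = rng.normal(size=8)
    return vector / np.linalg.norm(vector)

def cosine_similarity(a, b):
    """How aligned two unit vectors are: closer to 1.0 means a better match."""
    return float(np.dot(a, b))

caption_vector = fake_clip_embed("hipster friar")

# Pretend these are three partially denoised candidate images.
candidates = ["candidate_a", "candidate_b", "candidate_c"]
scores = {name: cosine_similarity(caption_vector, fake_clip_embed(name))
          for name in candidates}

# At each stage, the generator keeps whichever direction matches the caption best.
best = max(scores, key=scores.get)
print(scores, "->", best)
```

In the simplified framing above, that score doesn’t just rank finished candidates; it nudges the denoising toward images whose vectors sit close to the caption’s vector.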

Where DALL-E 2 really differs from previous AI image generators, however, is in both its grasp of natural language and its sheer skill at creating images that are beautiful, inspiring, and, perhaps most frequently, comical.

Optimus Prime in sweatshirts
Optimus Prime modeling for Abercrombie & Fitch

There is no natural reason for this image to ever exist. Optimus Prime? He’s a fictional character. Abercrombie & Fitch? It’s been years since they were the center of the cultural consciousness. But still, despite everything logic, intuition, and simple common sense would have you believe, this image exists.

We hear your cries: “Where can I, a simple reader, find additional whimsical and silly images it has generated?” Well, dear reader, look no further.

DALL-E: Ice T and cotton candy
"Law and Order SVU episode about Ice T eating fluffy cotton candy"
An older man dancing in the desert
"Mikhail Gorbachev dancing at Burning Man"
Judy Garland eating a salad on the moon
"Astronaut Judy Garland eating a salad"

These images communicate a clear truth: no matter what technological leap forward DALL-E 2 represents, it is, at the end of the day, a toy. An amazing, terrifying, absurd digital toy.

Is there some insight to glean from the silliness above? Some mystery about natural language being slowly peeled back, layer by layer? Perhaps…but the urge to generate just one more absurd image is simply too overwhelming.

DALL-E: someone fighting a tiger in comic fashion
Prompt: "A pop art father's day card of Simba and Mufasa performing at Live Aid Concert 1985"

Clearly, DALL-E 2 can make some incredibly funny stuff in just a few seconds. The introductory image of “hipster friar” captures, in amazingly comic fashion, both the relentless irony of hipster style and the borderline oppressive gravitas of the friary, creating the kind of image we ourselves could only dream of seeing without hours and hours of diligent work and technical skill.

On top of just pure yuks, though, DALL-E 2’s creations can also move in the direction of “true” art: potentially comic creations that play off of established visual artists, from Michelangelo to Basquiat.

These attempts to get incredibly granular with DALL-E 2 meet with mixed success, often due to the program’s hyper-literal nature and lack of human-style common sense. Take the prompt “Doctor Emmett Brown playing basketball on the ceiling of the Sistine Chapel”: it does technically put Doc Brown on the ceiling (i.e. he is jumping near the ceiling)...

A white-haired man playing basketball
"Doctor Emmett Brown playing basketball on the ceiling of the Sistine Chapel"

…but not in the way that we would naturally resolve the prompt’s ambiguity (i.e. we might imagine that he would be painted onto the famous frescoed ceiling of the Sistine Chapel).

Because DALL-E 2’s technique is fundamentally word-associative, including more stylistic words can often yield more pleasing results, as in this amended prompt of “Doctor Emmett Brown playing basketball on the ceiling of the Sistine Chapel painted by Michelangelo”:

White-haired man playing basketball
"Doctor Emmett Brown playing basketball on the ceiling of the Sistine Chapel painted by Michelangelo"

Is this image funny? It wouldn’t be hard to argue so, especially to someone who had never seen a DALL-E 2 image before (you, dear reader, are so used to its whimsical creations by now).
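For the programmatically inclined, here is a tiny, hypothetical sketch of the “just add more style words” trick: take a base prompt and try a few stylistic suffixes. The client.images.generate call follows OpenAI’s later public image API rather than the beta web interface we used, so treat the model name and setup as assumptions rather than gospel.

```python
from openai import OpenAI  # OpenAI's Python client; the public image API postdates this beta

client = OpenAI()  # assumes an OPENAI_API_KEY is set in your environment

base_prompt = "Doctor Emmett Brown playing basketball on the ceiling of the Sistine Chapel"

# Stylistic suffixes to bolt on; which ones actually help is pure trial and error.
style_suffixes = ["", " painted by Michelangelo", " as a Renaissance fresco"]

for suffix in style_suffixes:
    response = client.images.generate(
        model="dall-e-2",            # model name as exposed by the later public API
        prompt=base_prompt + suffix,
        n=1,
        size="1024x1024",
    )
    print(f"{base_prompt + suffix!r} -> {response.data[0].url}")
```

Swap in your own suffixes; half the fun is finding out which style words the model actually listens to.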

For fun, here are two other artistry-based DALL-E 2 creations:

A family in renaissance style
"Obi-Wan Kenobi babysitting in the style of a Renaissance painting"
Poorly-drawn Steve Buscemi
"A 1st grader’s colored pencil drawing of Steve Buscemi completing the trans-continental railroad"

The eagle-eyed among you might notice one recurring trend in DALL-E 2’s work thus far: for all its skill at combining images and emulating style, it is not particularly great at recreating human faces.

To see this even more clearly, let’s take a look at a well-known figure of the past 30 years: Nicolas Cage.

Nick Cage and King Kong
Nicolas Cage fighting King Kong in New York City

Nicolas Cage, as a famous actor, has his likeness freely available all over the Internet, and so DALL-E 2 should have no shortage of reference images to draw from. Why, then, does it seem to have trouble getting his face just right?

It is possible that, at least judging from our qualitative testing, DALL-E 2 does this by design. You only need a cursory understanding of the Internet to recognize the insidious ways in which some people would take advantage of a system that creates photorealistic versions of people’s faces. The terms “deepfake” and “revenge porn” come to mind. The DALL-E 2 guidelines explicitly prohibit generating images of certain political figures; for example, DALL-E 2 refuses to generate images of figures such as Joe Biden, though some former world leaders seem to be allowed (see Mikhail Gorbachev at Burning Man above). However, the prompt “a big happy family having fun” has no reference to any particular individual or individuals…

A family having fun
“a big happy family having fun”

…and yet some of the faces are still abnormal. This suggests that drawing faces is a particular challenge for DALL-E 2. Of course, it’s important to remember that our social-monkey brains are biologically hardwired to be especially attuned to facial features, meaning that the bar for verisimilitude in a face is incredibly high, and DALL-E 2 is just not quite there yet.

DALL-E 2 can also run into trouble from another source when generating images from very abstract concepts or ideas, a problem no AI is immune to: training bias. You may have noticed that the “big happy family” happens to be white in every iteration. Another example of this appears in the prompt “freedom of speech”...

Some people protesting
"freedom of speech"

…in which all of the human subjects present as men, whereas in the images generated from the prompt “depression and anxiety”...

Sad women
“depression and anxiety”

…all of the human subjects present as women.

Every AI in the world carries bias of some kind (gender, racial, and so on) because all AI is built by humans, and humans are inherently biased. Even being generous and assuming that this human bias is always inadvertent, the fact that it exists is still a major problem, one that engineers are constantly trying to solve.

The root of AI “bias” is almost always in its training; in DALL-E 2’s case, this came from 400 million image-caption pairs scraped from the Internet. While some bias is inevitable, minimizing it represents one of the largest challenges facing the trainers and developers of these large models today. Just days ago, OpenAI released an update to DALL-E 2 aimed at making its output more demographically diverse. While this update seems effective for more obvious subjects like occupational and workplace representation, more abstract concepts like the all-white “big happy family” above haven’t seen much improvement at all; there is still a long way to go.

Interestingly, though, when one does a Google Image Search for “freedom of speech”, these are the results:

Google freedom of speech: signs
Google image search of "Freedom of Speech"

…which looks qualitatively very different from DALL-E 2’s male megaphone-party. If you look at images of “freedom of speech” on stock image websites, those also look quite different. Even image searching “freedom” and “speech” individually yields nothing particularly similar. So where does this particular “bias” or specific schema come from? What informed this visual conceptualization of “freedom of speech”? Perhaps the heart of the mystery lies in the differing goals of Google Image Search and DALL-E 2: Google Images ranks existing pages to surface the most relevant results for your query (with publishers’ SEO nudging that ranking), while DALL-E 2, with its curation of only six image results, focuses on mapping a caption to a brand-new image in a way that Google Images does not.

The jury’s still out on the potential for AI to exhibit true originality, but at the very least this search and these questions reiterate the need for transparency in how AI is designed and what it draws its inspiration from, especially as these systems come closer and closer to emulating human creativity.

Will humans eventually build an AI that can create genuinely original art? Maybe it depends on how we define creativity. Maybe it depends on how we build AI. But for now, the most pressing issue is not how these AI models relate to creativity, but how we, humans, relate to them. For once, a major development in AI was mostly celebrated rather than feared, as is usually the case. That joy, that little rush of dopamine at seeing its funny and unexpected output, feels essential to how we experience technology today. DALL-E 2 knows how to capture our attention and, at the end of the day, isn’t that one of the true goals of art?

Who’s to say? But we don’t have time to worry about thorny existential questions like these right now; we’ve got Nicolas Cage memes to generate.