For most of the history of photography, if you had a black and white photograph and wanted to see it in color, you had to add the pigment with a paintbrush. Recently, though, digital methods, and now artificial intelligence-driven methods, have made colorization available to everyone. This is a victory for technological progress! Or is it?

As a recent Twitter thread demonstrated, AI colorization algorithms don’t quite get it right. If you take an existing color image, convert it to grayscale, and use an AI colorization algorithm to repaint it, the re-colorized image looks dull in comparison. To make matters worse, in the photograph below, the re-colorization also lightens the woman’s skin tone.

As an AI researcher interested in history, I find this issue troubling. Beyond the obvious problems with lightening a woman’s skin, photographs have a lot of power over how we imagine and feel about the world, and seeing the past with dulled color makes it look dead.

Colorization is hard for computers because it is *ill-posed,* meaning that there are multiple color images which are equally “correct” for the same grayscale version. The woman’s dress could be blue, but it could be another color, and there’s no information in the grayscale pixels to indicate which, so an algorithm has to take an informed guess. Rather than write lots of rules, we use machine learning to build a statistical model from data about which colors most likely occur.
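To make the ill-posedness concrete, here is a minimal sketch in Python. It assumes the grayscale conversion is a weighted sum of the red, green, and blue channels using the common ITU-R BT.601 luma weights; the article's own conversion method is not stated, and the two pixel values are hypothetical.

```python
# Luma weights from the ITU-R BT.601 convention (an assumption; the
# grayscale conversion actually used for the photo is not specified).
LUMA_WEIGHTS = (0.299, 0.587, 0.114)

def to_gray(rgb):
    """Collapse an (R, G, B) pixel with 0-255 channels to one gray level."""
    return round(sum(w * c for w, c in zip(LUMA_WEIGHTS, rgb)))

red_dress = (200, 30, 30)    # hypothetical "red dress" pixel
blue_dress = (30, 80, 218)   # hypothetical "blue dress" pixel

gray_red = to_gray(red_dress)
gray_blue = to_gray(blue_dress)
print(gray_red, gray_blue)
```

Under these assumptions, the two very different colors land on the same gray level, so nothing in that single pixel tells a colorizer which one to pick.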

The colorization algorithm in question is Jason Antic’s DeOldify, which is available for anyone to try out. It uses a sophisticated image-generating model called a Generative Adversarial Network, and Antic’s algorithm for training it works pretty well, resulting in a reliable image colorizer that produces realistic-looking images. But it isn’t perfect; the colors are dulled.

Contrary to what you might think, this problem isn’t happening because there are more white people in some historical photo dataset, or because those photos have more beige colors. In fact, the model wasn’t trained on historical photos at all! It was trained on the ImageNet dataset, put together by researchers at Stanford in 2009 with Flickr photos. While ImageNet likely contains more white people than people of color, there’s another source of bias as well.

The model takes color images which have been converted to grayscale and converts them back, trying to minimize an AI-measured “perceptual distance” between its colorized versions and the originals. Under this sort of metric, a very different reconstruction will be penalized more heavily than a slightly different one. If two colors are equally likely for an object based on the pixels, the algorithm will hedge its bets and choose something in the middle, which isn’t too far from either possibility, and that leads to beige colors. The developer was aware of this limitation and created another version of his model that produces less safe, more “artistic” colorizations.
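The hedging effect can be illustrated with a toy calculation. This sketch is not DeOldify's actual loss: it swaps in plain squared error in RGB space as a stand-in for the learned perceptual distance, and the two candidate colors and their 50/50 probabilities are hypothetical.

```python
import colorsys

def expected_sq_error(guess, candidates, probs):
    """Average squared RGB distance from a guess to each possible true color."""
    return sum(
        p * sum((g - c) ** 2 for g, c in zip(guess, cand))
        for cand, p in zip(candidates, probs)
    )

def hedged_guess(candidates, probs):
    """The probability-weighted mean, which minimizes expected squared error."""
    return tuple(
        sum(p * cand[i] for cand, p in zip(candidates, probs))
        for i in range(3)
    )

def saturation(rgb):
    """HSV saturation of an RGB color with channels in [0, 1]."""
    return colorsys.rgb_to_hsv(*rgb)[1]

# Two hypothetical, equally likely colors for the same gray dress.
blue = (0.1, 0.2, 0.8)
orange = (0.9, 0.5, 0.1)
candidates, probs = [blue, orange], [0.5, 0.5]

middle = hedged_guess(candidates, probs)
print("hedged color:", middle)
print("expected error, hedged:", expected_sq_error(middle, candidates, probs))
print("expected error, blue:  ", expected_sq_error(blue, candidates, probs))
print("saturation, hedged vs. blue:", saturation(middle), saturation(blue))
```

In this toy setup the in-between color scores a lower expected error than committing to either vivid option, yet it is far less saturated than both. A real perceptual metric behaves differently in detail, but the incentive to play it safe is the same.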

At left: original of Alfred T. Palmer, “Operating a hand drill at Vultee-Nashville, woman is working on a ‘Vengeance’ dive bomber” (1943) via the Library of Congress; center: the same photo, converted to grayscale by the author; right: the grayscale image colorized by the DeOldify AI colorization algorithm. Notice the color differences from the original.

While the assumption that safe color choices are better than wrong choices is usually reasonable, it doesn’t work as well for human-made artifacts like clothes. Humans love colors; they look good; we give them meaning and value, and they look even better when we juxtapose them in certain ways. On some objects, having incorrect, but equally vibrant colors might be a better choice than beige, even though they’re more “distant.” A human colorist might know which color schemes make more sense for a photo given its place and time, but AI can’t really understand history and culture like humans do.

This problem is a fantastic example of how algorithmic bias emerges. The programmer didn’t make a clearly biased decision, but because we’d like colorization algorithms to work on all kinds of images and produce colors which are less wrong, we end up lightening dark skin and dulling the past.

So, should we stop using AI colorization? Not necessarily. No recoloring is truly neutral, and colorization algorithms are a powerful tool: seeing one’s ancestor in color can be profoundly affecting. However, colorization, alongside other image generation technologies like deepfakes, raises troubling questions about the trustworthiness of digital images. In a post-truth United States, it seems even photographs of the past aren’t safe.

Samuel Goree

Sam Goree is a data scientist, amateur musician and design enthusiast. He is currently pursuing a PhD at Indiana University looking at ways to reconnect machine learning to the arts and humanities. Check...