Controllable AI Mosaics

April 4, 2025

Experiments, thoughts, and an initial demo around controllable and human-creativity-centered AI art creation.

TL;DR

I want creating AI art to feel like making art. Skip the background and jump to the demo.

Vision

In April 2025, it seems like AI image generation is solved. OpenAI and Google have released models (GPT-4o, Gemini) that handle image editing and follow prompts with remarkable proficiency.

However, something fundamental is missing. Artists largely reject AI art; the r/ArtistHate subreddit has amassed 24,000 members who share this sentiment.

Generating red lanterns from MNIST digits

What if we approached it differently? What if creating AI art felt more like using a paintbrush than typing a command?

That's what I'm exploring: methods that allow for expressive and controllable AI art creation. In this first demo, individual components, which can be brush strokes, collage elements, or virtually any other creative primitive, are placed on a canvas. The artist sets a "target" with an image or phrase, and the components' parameters are adjusted through backpropagation to match that target.

My friend Trevor, generated out of parts of Riley
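Conceptually, each placed component boils down to a small bundle of learnable parameters. Here's an illustrative sketch in PyTorch; the class and attribute names are placeholders of my own, not the demo's actual code:

```python
import torch

class Patch:
    """A single mosaic component: a fixed source texture plus learnable
    placement parameters. Illustrative sketch, not the demo's actual code."""
    def __init__(self, texture: torch.Tensor):
        self.texture = texture                              # fixed patch image (e.g. an MNIST digit)
        self.position = torch.rand(2, requires_grad=True)   # (x, y) on the canvas
        self.rotation = torch.zeros(1, requires_grad=True)  # radians
        self.scale = torch.ones(1, requires_grad=True)
        self.color = torch.rand(3, requires_grad=True)      # RGB tint

    def parameters(self):
        return [self.position, self.rotation, self.scale, self.color]

# A differentiable renderer composites all patches into one image, so a loss
# computed against the target image or prompt produces gradients for every
# patch parameter: loss.backward() followed by optimizer.step().
```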

Unlike diffusion models, which work at the pixel level, this approach is controllable at the element level. Colors, positions, and rotations can each be locked or allowed to evolve. It's possible to pause mid-creation, adjust parameters, draw alongside the model, or add new elements. Beyond the target itself, computational aesthetic losses can be stacked on top to enforce artistic rules, such as preventing overlap, ensuring balanced distribution, or preserving negative space.
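One way locking and stacking could work, sketched here under my own assumptions rather than taken from the project's code, is to hand the optimizer only the attributes the artist has left unlocked and to fold each aesthetic rule in as a weighted penalty term:

```python
# Collect only the unlocked attributes; anything left out is frozen in place.
# `locked_position`, `locked_rotation`, and `locked_color` are hypothetical flags.
trainable = []
for patch in patches:
    if not patch.locked_position:
        trainable.append(patch.position)
    if not patch.locked_rotation:
        trainable.append(patch.rotation)
    if not patch.locked_color:
        trainable.append(patch.color)

# Aesthetic rules stack onto the target loss as weighted penalty terms:
# total_loss = target_loss \
#            + w_overlap * overlap_penalty(patches) \
#            + w_balance * balance_penalty(patches)
```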

Demo

Live Demo

The project's code is open source and available on GitHub:

Mosaic Interface - https://mosaic.mechifact.com/

Mosaic controls:

  • Place patches manually
  • Customize position and color properties
  • Select input library (patches of animals, handwritten digits, shore glass, or fruit collections)
  • Set the canvas background color
  • Control generation parameters, such as the number of patches and optimization steps

The system then uses these settings to optimize the patches toward the target, which can be either a text prompt (scored with CLIP) or an uploaded reference image:

Mosaic using an image target

Christmas tree created with CLIP guidance using the phrase "christmas tree, photorealistic":

Mosaic using CLIP with a text prompt target "christmas tree, photorealistic"
Library of fruit "patches" that make up the Christmas tree
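As a rough sketch of the two target modes (the variable names, the resize step, and the omission of CLIP's input normalization are my simplifications, not the actual backend):

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP, https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep weights in fp32 so the rendered canvas can be encoded directly

def target_loss(rendered, text_prompt=None, reference_image=None):
    """rendered: (1, 3, H, W) canvas produced by the differentiable renderer."""
    if reference_image is not None:
        # Image target: plain pixel-wise MSE against the uploaded reference.
        return F.mse_loss(rendered, reference_image)
    # Text target: maximize CLIP similarity between the canvas and the prompt.
    tokens = clip.tokenize([text_prompt]).to(device)
    text_feat = clip_model.encode_text(tokens)
    resized = F.interpolate(rendered, size=(224, 224), mode="bilinear", align_corners=False)
    image_feat = clip_model.encode_image(resized)
    return -F.cosine_similarity(image_feat, text_feat).mean()
```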

The interface is built with Next.js, using Konva for the interactive canvas.

The backend is based on DeepMind's Neural Visual Grammars and Dual Encoders research, with several enhancements: an MSE loss function for image-based targets alongside the original CLIP text prompts, and a switch to the AdamW optimizer for faster image generation.
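The optimization loop itself follows the familiar PyTorch pattern; `render`, `patches`, `target_loss`, and the hyperparameters below are placeholders, not the backend's actual names:

```python
params = [p for patch in patches for p in patch.parameters()]
optimizer = torch.optim.AdamW(params, lr=0.05)

for step in range(num_steps):
    optimizer.zero_grad()
    canvas = render(patches)          # differentiable compositing of all patches
    loss = target_loss(canvas, text_prompt=prompt, reference_image=ref)
    loss.backward()                   # gradients reach every unlocked parameter
    optimizer.step()
```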

Robot, generated from animal shapes

The model is containerized with Cog and deployed on Replicate for API access. I love Replicate, but the cold boot is killing me, so I'm planning to explore other solutions, like Lightning AI and Fly.io Machines.
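For context, a Cog deployment wraps the model in a `Predictor` class roughly like the sketch below; the input names and the `load_mosaic_model` helper are hypothetical, not the deployed API:

```python
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self):
        # Load CLIP and the patch libraries once per container, so each API
        # call only pays for the optimization itself.
        self.model = load_mosaic_model()  # hypothetical helper

    def predict(
        self,
        prompt: str = Input(description="Text target for the mosaic"),
        num_patches: int = Input(default=200),
        steps: int = Input(default=250),
    ) -> Path:
        image = self.model.generate(prompt, num_patches=num_patches, steps=steps)
        out_path = Path("/tmp/mosaic.png")
        image.save(out_path)
        return out_path
```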

I'm continuing to improve both the interface and the underlying model, with plans to add more control options and improve generation speed in future updates.

Fundamentals & Inspiration

While tools like ControlNet offer some ability to guide diffusion-based AI art, they typically require extensive pre-training and provide only limited control.

I'm interested in solutions that offer real artistic agency without requiring specialized pre-training on existing artwork. The breakthrough that makes this possible is differentiable vector graphics rasterization (diffvg), released in 2020. It lets a machine learning model compute a loss on rendered pixels while editing the underlying vector graphic parameters directly, because gradients flow from the raster output back to the shapes.

Source: https://github.com/BachiLi/diffvg
Vector graphic to raster image, modified back with differentiable rasterization, Source: https://people.csail.mit.edu/tzumao/diffvg/
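diffvg implements this for general vector graphics; the toy below (plain PyTorch, not diffvg's API) shows the core idea with a single soft-edged circle whose parameters receive gradients from a pixel loss:

```python
import torch

def soft_circle(center, radius, color, size=64, sharpness=40.0):
    """Render one soft-edged circle so that gradients on the output pixels
    flow back to center, radius, and color. Toy illustration only."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, size), torch.linspace(0, 1, size), indexing="ij"
    )
    dist = torch.sqrt((xs - center[0]) ** 2 + (ys - center[1]) ** 2)
    alpha = torch.sigmoid(sharpness * (radius - dist))   # soft inside/outside test
    return alpha.unsqueeze(0) * color.view(3, 1, 1)      # (3, size, size) image

center = torch.tensor([0.3, 0.6], requires_grad=True)
radius = torch.tensor(0.2, requires_grad=True)
color = torch.tensor([0.8, 0.1, 0.1], requires_grad=True)

target = torch.zeros(3, 64, 64)
loss = ((soft_circle(center, radius, color) - target) ** 2).mean()
loss.backward()  # center.grad, radius.grad, and color.grad are now populated
```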

Two papers in particular shaped my thinking about what's possible in this space:

CLIPDraw was my first encounter with using AI to control individual brushstrokes rather than pixels

CLIPDraw, Source: https://arxiv.org/pdf/2106.14843

Image-Space Collage and Packing illustrates the possibilities of stacking multiple loss functions to achieve precise control. The authors demonstrate how combining a target shape constraint with a non-overlap constraint can create visually compelling arrangements.

Source: https://szuviz.github.io/pixel-space-collage-technique/
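In the same spirit, a non-overlap term can be approximated by penalizing patch centers that come closer than their combined radii; this is a rough sketch, not the paper's formulation:

```python
import torch

def overlap_penalty(centers: torch.Tensor, radii: torch.Tensor) -> torch.Tensor:
    """centers: (N, 2) patch positions, radii: (N,) bounding radii.
    Returns a penalty that grows as circular patch bounds overlap."""
    diff = centers.unsqueeze(0) - centers.unsqueeze(1)     # (N, N, 2) pairwise offsets
    dist = (diff.pow(2).sum(dim=-1) + 1e-8).sqrt()         # pairwise distances (eps avoids a zero-distance gradient singularity)
    min_dist = radii.unsqueeze(0) + radii.unsqueeze(1)     # allowed separation per pair
    overlap = torch.relu(min_dist - dist)                  # positive only where patches overlap
    mask = ~torch.eye(len(radii), dtype=torch.bool)        # drop self-pairs
    return overlap[mask].sum()

# Stacked objective in the spirit of the paper:
# loss = shape_target_loss + lambda_overlap * overlap_penalty(centers, radii)
```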

Next Steps

First, I want to recreate the PyTorch graphics effects in the front-end. The current Konva canvas produces inconsistencies between the interface preview and the final output as more patches are added, and performance degrades with complex compositions, so I'm planning to build this implementation on WebGPU.

Next, I'll be creating stackable loss functions with modular text-to-loss function controls that allow users to combine multiple constraints. This will enable specifying conceptual directions ("the NYC subway underwater") alongside technical constraints ("don't let elements overlap"), providing precise creative direction through natural language.

Generate a Mosaic