TL;DR
I want creating AI art to feel like making art. Skip the background and jump to the demo.
Vision
In April 2025, it seems like AI image generation is solved. OpenAI's GPT-4o and Google's Gemini handle image editing and follow prompts with remarkable proficiency.
However, something fundamental is missing. Artists largely reject AI art; the r/ArtistHate subreddit has amassed 24,000 members who share this sentiment.

What if we approached it differently? What if creating AI art felt more like using a paintbrush than typing a command?
That's what I'm exploring: methods that allow for expressive and controllable AI art creation. In this first demo, individual components (brush strokes, collage elements, or virtually any creative primitive) are placed on a canvas. Artists set a "target" with an image or phrase, and the components are modified through backpropagation to match that target.
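To make that concrete, here's a minimal PyTorch sketch of the idea. It is not the project's actual renderer: toy Gaussian blobs stand in for patches, and the target is a flat color image. The point is that every element property is a tensor, the canvas is rendered differentiably, and a pixel-space loss backpropagates into those properties.

```python
import torch

H = W = 128
N = 64  # number of "patches", approximated here as soft colored blobs

# Element parameters the artist can lock or let evolve.
positions = torch.rand(N, 2, requires_grad=True)   # normalized (x, y) in [0, 1]
colors = torch.rand(N, 3, requires_grad=True)      # RGB in [0, 1]

def render(positions, colors, sigma=0.03):
    """Differentiably splat each element onto the canvas as a Gaussian blob."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
    )
    grid = torch.stack([xs, ys], dim=-1)                          # (H, W, 2)
    d2 = ((grid[None] - positions[:, None, None]) ** 2).sum(-1)   # (N, H, W)
    weights = torch.exp(-d2 / (2 * sigma**2))                     # soft footprint per element
    canvas = (weights[..., None] * colors[:, None, None]).sum(0)
    return canvas.clamp(0, 1)                                     # (H, W, 3)

target = torch.zeros(H, W, 3)
target[:, :, 0] = 1.0                                             # toy target: a solid red canvas

optimizer = torch.optim.AdamW([positions, colors], lr=0.05)
for step in range(300):
    optimizer.zero_grad()
    loss = (render(positions, colors) - target).pow(2).mean()     # pixel-space MSE
    loss.backward()                                               # gradients flow to element params
    optimizer.step()
```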

Unlike diffusion models, which work at the pixel level, this approach stays controllable at the element level. Colors, positions, and rotations can each be locked or allowed to evolve. It's possible to pause mid-creation, adjust parameters, draw alongside the model, or add new elements. Beyond the target itself, you can also stack computational aesthetic losses that enforce artistic rules, such as preventing overlap, ensuring balanced distribution, or preserving negative space.
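Continuing the sketch above (still a simplification, not the app's real code): locking a property just means excluding its tensor from the optimizer, and an aesthetic rule is just another differentiable term added to the loss.

```python
# Lock colors but let positions evolve: leave frozen tensors out of the optimizer.
colors = colors.detach()                                # colors no longer receive gradients
optimizer = torch.optim.AdamW([positions], lr=0.05)

def balance_loss(positions):
    """Encourage a balanced distribution: keep the centroid near the canvas center."""
    return (positions.mean(dim=0) - 0.5).pow(2).sum()

for step in range(300):
    optimizer.zero_grad()
    canvas = render(positions, colors)
    loss = (canvas - target).pow(2).mean()              # match the target
    loss = loss + 0.1 * balance_loss(positions)         # plus a stacked aesthetic rule
    loss.backward()
    optimizer.step()
```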
Demo
The project's code is open source and available on GitHub:

Mosaic controls:
- Place patches manually
- Customize position and color properties
- Select input library (patches of animals, handwritten digits, shore glass, or fruit collections)
- Set the canvas background color
- Control generation parameters, such as number of patches and optimization steps
The system uses these parameters to match the target, which can be either a text prompt (using CLIP) or an uploaded reference image:

Christmas tree created with CLIP guidance using the phrase "christmas tree, photorealistic":


The interface is built with Next.js, using Konva for the interactive canvas.
The backend is based on DeepMind's Neural Visual Grammars and Dual Encoders research, with several enhancements: an MSE loss for image-based generation alongside CLIP text prompts, and an updated optimizer (AdamW) for faster image generation.
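As a rough sketch of how the two target modes differ (simplified, using the openai/CLIP package and skipping CLIP's usual input normalization and augmentations), the only thing that changes between text-guided and image-guided generation is the loss term:

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()          # keep everything in fp32 for simplicity

def clip_text_loss(canvas, prompt):
    """Cosine distance between the rendered canvas and a text prompt in CLIP space."""
    image = canvas.permute(2, 0, 1).unsqueeze(0)                # (H, W, 3) -> (1, 3, H, W)
    image = F.interpolate(image, size=224, mode="bilinear")     # CLIP's input resolution
    image_features = clip_model.encode_image(image.to(device))
    text_features = clip_model.encode_text(clip.tokenize([prompt]).to(device))
    return 1 - F.cosine_similarity(image_features, text_features).mean()

def mse_image_loss(canvas, reference):
    """Plain pixel-space MSE against an uploaded reference image."""
    return (canvas - reference).pow(2).mean()
```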

The model is containerized with Cog and deployed on Replicate for API access. I love Replicate, but the cold boot is killing me, so I'm planning to explore other solutions, like Lightning AI and Fly.io Machines.
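Calling the deployed model from Python looks roughly like this; the model identifier and input names below are placeholders, not the real ones:

```python
import replicate

# Placeholder model reference and input names -- check the repo for the real ones.
output = replicate.run(
    "your-username/patch-mosaic:version-hash",
    input={
        "prompt": "christmas tree, photorealistic",
        "num_patches": 64,
        "optimization_steps": 500,
    },
)
print(output)   # typically a URL (or list of URLs) pointing at the generated image
```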
I'm continuing to improve both the interface and the underlying model, with plans to add more control options and improve generation speed in future updates.
Fundamentals & Inspiration
While tools like ControlNet offer some ability to guide diffusion-based AI art, they typically require extensive pre-training and provide only limited control.
I'm interested in solutions that offer real artistic agency without requiring specialized pre-training on existing artwork. The breakthrough that makes this possible is the "Differentiable Rasterizer for Vector Graphics," released in 2020. The method lets a model compute losses on rendered pixels while editing vector graphic parameters directly, because gradients flow from the pixels back through the rasterizer to the vectors.
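In practice that looks something like the following, a minimal example based on the diffvg library that accompanies the paper (adapted from its documented usage): shape parameters are ordinary tensors, the rasterizer produces pixels, and gradients flow back through the render call.

```python
import torch
import pydiffvg

canvas_w, canvas_h = 256, 256

# One circle whose radius, center, and fill color are all optimizable tensors.
radius = torch.tensor(40.0, requires_grad=True)
center = torch.tensor([128.0, 128.0], requires_grad=True)
color = torch.tensor([0.2, 0.6, 0.3, 1.0], requires_grad=True)

circle = pydiffvg.Circle(radius=radius, center=center)
group = pydiffvg.ShapeGroup(shape_ids=torch.tensor([0]), fill_color=color)

scene_args = pydiffvg.RenderFunction.serialize_scene(canvas_w, canvas_h, [circle], [group])
img = pydiffvg.RenderFunction.apply(canvas_w, canvas_h, 2, 2, 0, None, *scene_args)

# Any pixel-space loss now backpropagates into the vector parameters.
img.mean().backward()
print(radius.grad, center.grad, color.grad)
```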



Two papers in particular shaped my thinking about what's possible in this space:
CLIPDraw was my first encounter with using AI to control individual brushstrokes rather than pixels.

Image-Space Collage and Packing illustrates the possibilities of stacking multiple loss functions to achieve precise control. The authors demonstrate how combining a target shape constraint with a non-overlapping constraint could create visually compelling arrangements.
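The non-overlap idea reduces to a simple differentiable penalty. This isn't the paper's exact formulation, just the gist: push apart any two element centers that sit closer than a minimum distance, and add the term to whatever target loss is in play.

```python
import torch

def non_overlap_loss(positions, min_dist=0.05):
    """Soft penalty that grows as any two element centers come closer than min_dist."""
    d = torch.cdist(positions, positions)                        # (N, N) pairwise distances
    off_diag = ~torch.eye(positions.shape[0], dtype=torch.bool)  # ignore self-comparisons
    return torch.relu(min_dist - d[off_diag]).pow(2).mean()

# Stacked with any target loss, e.g.:
# loss = target_loss + 5.0 * non_overlap_loss(positions)
```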

Next Steps
First, I want to recreate the PyTorch graphics effects in the front-end. The current Konva canvas library creates inconsistencies between the interface view and final output as more patches are added. Since performance currently degrades with complex compositions, I'm planning to build this implementation with WebGPU.
Next, I'll build stackable loss functions with modular text-to-loss controls that let users combine multiple constraints. This will enable specifying conceptual directions ("the NYC subway underwater") alongside technical constraints ("don't let elements overlap"), providing precise creative direction through natural language.
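One possible shape for that API, reusing the loss sketches from earlier (all names hypothetical; this is a design sketch, not the implementation):

```python
# Hypothetical registry mapping user-facing control names to differentiable loss terms.
LOSS_REGISTRY = {
    "clip":       lambda canvas, params, text: clip_text_loss(canvas, text),
    "no_overlap": lambda canvas, params, text: non_overlap_loss(params["positions"]),
    "balance":    lambda canvas, params, text: balance_loss(params["positions"]),
}

def build_loss(spec):
    """spec: list of (name, weight, text) tuples assembled from the user's stacked controls."""
    def total_loss(canvas, params):
        return sum(w * LOSS_REGISTRY[name](canvas, params, text) for name, w, text in spec)
    return total_loss

loss_fn = build_loss([
    ("clip", 1.0, "the NYC subway underwater"),
    ("no_overlap", 5.0, None),
])
```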