Work completed as part of MIT Advances in Computer Vision
This work introduces a novel approach for editing Neural Radiance Fields (NeRF) scenes with text prompts. A central hypothesis underpinning this work is that if we can consistently style images while preserving their geometry, we should be able to effectively edit NeRF models by manipulating their source images, creating novel scenes.
Through a series of experiments, we explore the potential of the proposed method. This paper presents a detailed analysis of our experimental setup, results, and findings that illustrate the potential of the method and connect it to the broader NeRF editing landscape.
Diffusion code is accessible here and NeRF code is accessible here.
Imagine being able to describe changes to a 3D scene using just text - "make this room look abandoned" or "transform this landscape into a winter wonderland." That's the challenge I tackled in this project, where I developed a novel approach to editing Neural Radiance Fields (NeRF) scenes using text prompts and diffusion models.
The core idea was to combine two powerful AI technologies: NeRF, which creates detailed 3D scenes from photos, and ControlNet, which enables precise image editing with text. By adding cross-attention layers to the ControlNet pipeline, I created a system that could consistently modify images while preserving the geometric information needed for 3D reconstruction.
This project merged concepts from computer vision, 3D graphics, and machine learning, requiring me to:
- Design and implement a custom diffusion pipeline combining ControlNet with cross-attention
- Experiment with various approaches to maintain geometric consistency across image modifications
- Optimize the balance between creative transformation and structural preservation
- Develop methods for evaluating the quality and consistency of the resulting 3D scenes
Related Work
Much work has been done with similar goals and outcomes. The primary work that this project is inspired by is Instruct-NeRF2NeRF, a paper that shows the ability to modify specific objects in a NeRF scene with text [1]. Related work has been done applying stylization to rendered NeRF scenes [2][3][4], using GANs to increase the quality of a final model [5][6][7], and object-specific scene modification with text and parameter control [8][9][10].
Methodology
This section starts with the final iteration of the experiments, in which a cross-attention layer is added to the Hugging Face Stable Diffusion ControlNet pipeline [11]. This method produces more consistent images than approaches that rely only on ControlNet and a fixed seed.
We take photographs that are known to produce successful NeRF models, modify them with text prompts, and train new NeRF models from the modified images. After that, other experiments and results are shown, including image modification with color and image stylization.
ControlNet Pipeline with Attention
The approach discussed here was inspired by work on video generation and text-prompt-based diffusion editing for videos. The architecture shown in Figure 1 was inspired by Text2Video-Zero [12], a proposed method for zero-shot video generation. Videos are a natural place to look for inspiration because they share many of the same requirements as NeRF model generation: for a video to be successfully modified with Stable Diffusion, it needs to be consistent from frame to frame and keep the objects in the scene identifiable and the same shape as in the original.
Architecture: ControlNet with Canny Edges applied to 106 images
As shown in Figure 1, we approached this problem by combining ControlNet with a Cross-attention layer. ControlNet provides a way to keep the structure of the images consistent. The Cross-attention layer serves to style images consistently as shown in Figure 2. Images are modified in batches of 8 at a time for performance reasons.
A comparison of ControlNet with and without a Cross-attention layer
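To make the batch-wise consistency concrete, the sketch below shows one way such a cross-frame attention processor could be written against the diffusers attention-processor interface, in the spirit of Text2Video-Zero: every image in a batch reuses the self-attention keys and values of the first image in its group. The class name, the simplified handling of attention masks, and the interface details are assumptions for illustration, not the exact code used in this project.

```python
# Hedged sketch of a cross-frame attention processor (names and interface
# are assumptions based on diffusers' AttnProcessor convention).
import torch


class CrossFrameAttnProcessor:
    """Make every image in a batch attend to the first image's self-attention
    keys/values, so the whole batch is styled consistently."""

    def __init__(self, batch_size: int = 8):
        self.batch_size = batch_size  # number of images edited together

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None):
        is_cross_attention = encoder_hidden_states is not None
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states

        query = attn.to_q(hidden_states)
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        if not is_cross_attention:
            # Replace each frame's keys/values with those of the first frame
            # in its group (groups arise from classifier-free guidance).
            seq_len, dim = key.shape[1], key.shape[2]
            groups = key.shape[0] // self.batch_size
            key = key.view(groups, self.batch_size, seq_len, dim)
            value = value.view(groups, self.batch_size, seq_len, dim)
            key = key[:, :1].expand(-1, self.batch_size, -1, -1).reshape(-1, seq_len, dim)
            value = value[:, :1].expand(-1, self.batch_size, -1, -1).reshape(-1, seq_len, dim)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        hidden_states = attn.to_out[0](hidden_states)  # linear projection
        hidden_states = attn.to_out[1](hidden_states)  # dropout
        return hidden_states
```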
The Stable Diffusion model used is runwayml/stable-diffusion-v1-5, with specific ControlNet models lllyasviel/sd-controlnet-depth used for depth and lllyasviel/sd-controlnet-canny used for canny edges.
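A minimal sketch of assembling that pipeline with diffusers is shown below. The scheduler choice, dtype, prompt, and variable names are illustrative assumptions, and `CrossFrameAttnProcessor` refers to the sketch above.

```python
# Hedged sketch of the ControlNet pipeline setup; exact arguments are assumptions.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Swap in the cross-frame attention processor so all 8 images in a batch
# share the first image's self-attention keys/values.
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=8))

# canny_images: a list of 8 canny-edge conditioning images (PIL) from the
# NeRF training set; the prompt below is only an example.
generator = torch.Generator("cuda").manual_seed(0)
edited = pipe(
    prompt=["make this room look abandoned"] * 8,
    image=canny_images,
    generator=generator,
).images
```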
Performance is evaluated by how well NeRF scenes form after 4,500 training iterations, using a combination of PSNR score and human observation.
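For reference, the PSNR component of that evaluation is the standard definition; a minimal version for images scaled to [0, 1] might look like the following (the project's exact evaluation script is not reproduced here).

```python
# Standard PSNR for images in [0, 1]; shown for completeness only.
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    mse = torch.mean((pred - target) ** 2)
    return -10.0 * torch.log10(mse)
```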
Canny Edges
ControlNet with canny edges provides a way to retain the structure of the original images. We experimented with the fidelity of these canny edge images.
Canny edges with less detail (left) to more detail (right).
Lower Detail
- Low_threshold: 125
- High_threshold: 200
Higher Detail
- Low_threshold: 10
- High_threshold: 125
Canny Edge ControlNet outputs with high and low detail
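A small sketch of how such conditioning images can be produced with OpenCV, using the thresholds listed above, is shown below; the file names and preprocessing details are assumptions.

```python
# Hedged sketch of canny-edge conditioning images for ControlNet.
import cv2
import numpy as np
from PIL import Image

def canny_condition(image_path: str, low: int, high: int) -> Image.Image:
    """Return a 3-channel canny-edge image suitable as ControlNet input."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                     # single-channel edge map
    return Image.fromarray(np.stack([edges] * 3, axis=-1))  # replicate to RGB

higher_detail = canny_condition("frame_000.png", low=10, high=125)
lower_detail = canny_condition("frame_000.png", low=125, high=200)
```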
Depth Maps
Depth maps were created from the images using the MiDaS method [13].
Midas depth map output
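One lightweight way to produce MiDaS-style depth maps is via the Hugging Face depth-estimation pipeline; the checkpoint below is an assumption standing in for whichever MiDaS variant was actually used.

```python
# Hedged sketch of MiDaS-style depth maps via a DPT checkpoint.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")
depth = depth_estimator(Image.open("frame_000.png"))["depth"]   # PIL image
depth.convert("RGB").save("frame_000_depth.png")                # 3 channels for ControlNet
```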
Combining Models
We attempted to combine ControlNet models. While possible, we ran into limitations with MultiControlNet’s ability to handle batch image processing. However, initial results are promising, as shown in Figure 5 below.
Combination of ControlNet Depth and Canny Edge
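For reference, diffusers' MultiControlNet support takes a list of ControlNets and one conditioning image per model. The sketch below shows the general idea; the conditioning scales, prompt, and image variable names are illustrative assumptions.

```python
# Hedged sketch of combining canny and depth ControlNets (MultiControlNet).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# One conditioning image per ControlNet, with per-model weights (illustrative).
result = pipe(
    "make this room look abandoned",
    image=[canny_image, depth_image],
    controlnet_conditioning_scale=[0.7, 0.3],
).images[0]
```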
NeRF
Our NeRF model, implemented in PyTorch, consists of an input layer, multiple hidden layers with optional skip connections, and an output layer that handles view direction information when provided. The model uses ReLU activation functions and outputs RGB values and opacity (alpha) for each point in the scene. The architecture is based on the implementation described by McGough (2022) in the article “It's NeRF From Nothing: Build A Complete NeRF with PyTorch” [14].
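A condensed sketch of that architecture is shown below; the layer counts, widths, and encoding dimensions follow common vanilla-NeRF defaults and are assumptions rather than the exact values used here.

```python
# Hedged sketch of the NeRF MLP described above; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeRF(nn.Module):
    def __init__(self, d_input=60, d_viewdirs=24, d_hidden=256, n_layers=8, skip=(4,)):
        super().__init__()
        self.skip = skip
        self.layers = nn.ModuleList(
            [nn.Linear(d_input, d_hidden)]
            + [nn.Linear(d_hidden + (d_input if i in skip else 0), d_hidden)
               for i in range(1, n_layers)]
        )
        self.output = nn.Linear(d_hidden, 4)                  # RGB + alpha, no view dirs
        self.alpha_out = nn.Linear(d_hidden, 1)               # opacity branch
        self.feature = nn.Linear(d_hidden, d_hidden)
        self.view_branch = nn.Linear(d_hidden + d_viewdirs, d_hidden // 2)
        self.rgb_out = nn.Linear(d_hidden // 2, 3)

    def forward(self, x, viewdirs=None):
        h = x
        for i, layer in enumerate(self.layers):
            if i in self.skip:
                h = torch.cat([h, x], dim=-1)                 # skip connection
            h = F.relu(layer(h))
        if viewdirs is None:
            return self.output(h)                             # (RGB, alpha)
        alpha = self.alpha_out(h)
        h = torch.cat([self.feature(h), viewdirs], dim=-1)    # inject view direction
        h = F.relu(self.view_branch(h))
        rgb = self.rgb_out(h)
        return torch.cat([rgb, alpha], dim=-1)
```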
Baseline Outcomes
Outcomes: The experiments below are based on 4,500 iterations.
Baseline outcomes
Image Stylization Techniques
Inspired by the methods in “StylizedNeRF: Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning” [3], a style image was naively applied to the training images using neural style transfer, following the technique outlined in “A Neural Algorithm of Artistic Style” (Gatys et al., 2015) [15].
Sample Content Image and Stylization Image applied
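The losses used in these stylization experiments follow Gatys et al.: a content loss, a Gram-matrix style loss, and a total-variation term whose weight was one of the knobs varied in an attempt to preserve geometry. The sketch below shows the general form under assumed weights and feature dictionaries; it is not the exact training code.

```python
# Hedged sketch of the Gatys-style losses; weights and feature keys are assumptions.
import torch
import torch.nn.functional as F

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    # feats: (batch, channels, h, w) activations from a VGG layer
    b, c, h, w = feats.shape
    flat = feats.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def stylization_loss(gen_feats, content_feats, style_feats, image,
                     content_weight=1.0, style_weight=1e4, tv_weight=1e-4):
    # gen_feats/style_feats map style-layer names to activations;
    # "content" indexes the single content layer.
    content_loss = F.mse_loss(gen_feats["content"], content_feats["content"])
    style_loss = sum(
        F.mse_loss(gram_matrix(gen_feats[name]), gram_matrix(style_feats[name]))
        for name in style_feats
    )
    # Total-variation term discourages high-frequency artifacts in the output image.
    tv_loss = (image[..., 1:, :] - image[..., :-1, :]).abs().mean() \
            + (image[..., :, 1:] - image[..., :, :-1]).abs().mean()
    return content_weight * content_loss + style_weight * style_loss + tv_weight * tv_loss
```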
Outcomes: The geometry of the initial image was tested with a high-pass filter, and the variation loss and the relative weights of the style and content images were adjusted to preserve geometry. These experiments were not successful, and I plan to move on to other methods.
Content Image with Style Applied
High Pass Filter Before and After Stylization
NeRF Scene after 4,500 iterations
Experimental Results
These methods were able to create consistent changes across images. We saw slight improvement in NeRF scene quality as the detail of the ControlNet output increased, even if those details were not entirely consistent across images. Overall, however, we were not able to create high-quality NeRF scenes from images modified with ControlNet.
ControlNet Canny Edges
ControlNet with Canny Edges created consistency, but NeRF training did not reach the quality we had hoped for.
Image output and NeRF training with ControlNet Canny Edges
ControlNet Depth Map
Depth maps do not provide the level of detail or the retention of original geometry that Canny Edges do. The model overfits to the Cross-attention layer, creating disfigured images.
Image output and NeRF training with ControlNet Depth
Combining ControlNets
The proof of concept experiment combining depth map and canny edge ControlNet inputs with Automatic1111 led to highly detailed and structurally correct images. However, these were saved out of order without enough time to correct them before the paper deadline.
Image output combining ControlNet models, proof of concept
Discussion
Limitations
The method discussed here places a number of limitations on the ways a NeRF scene can be modified. Because it depends on the structural cues of canny edges and depth maps derived from the source photographs, changes to the actual form of objects, or extreme changes in texture, are not possible.
ControlNet with Canny Edges and Depth, Prompt “bulldozer, covered in large and sharp spikes, prickly texture like a cactus” doesn’t show consistent change in surface texture.
Additionally, when relying only on text prompts and ControlNet, fine-grained control, such as the colors or details of specific elements, is very difficult.
Conclusion
In this paper, we addressed a number of possible ways to create novel NeRF scenes through image manipulation, the main one being a method that uses ControlNet to consistently modify images before training a new NeRF scene. This work was more difficult than initially expected; many days were spent getting ControlNet pipelines working. However, the early results are promising, and it was an interesting exercise.
It is also difficult to add new layers to existing ControlNet pipelines. I was not able to find any previous work that uses cross-attention while combining ControlNet models, so hopefully the code produced here can provide a starting point for making that possible.
References
[1] Haque, A., Tancik, M., Efros, A. A., Holynski, A., & Kanazawa, A. (2023). Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. arXiv preprint arXiv:2303.12789. Retrieved from https://arxiv.org/abs/2303.12789
[2] Zhang, K., Kolkin, N., Bi, S., Luan, F., Xu, Z., Shechtman, E., & Snavely, N. (2022). ARF: Artistic Radiance Fields. In Proceedings of the European Conference on Computer Vision (ECCV).
[3] Huang, Y.-H., He, Y., Yuan, Y.-J., Lai, Y.-K., & Gao, L. (2022). StylizedNeRF: Consistent 3D Scene Stylization as Stylized NeRF via 2D-3D Mutual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Chen, X., Zhang, Q., Li, X., Chen, Y., Feng, Y., Wang, X., & Wang, J. (2022). Hallucinated neural radiance fields in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 12943-12952).
[5] Cai, S., Obukhov, A., Dai, D., & Van Gool, L. (2022). Pix2NeRF: Unsupervised Conditional p-GAN for Single Image to Neural Radiance Fields Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3981-3990).
[6] Chong, M. J., & Forsyth, D. (2021). GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!). arXiv preprint arXiv:2106.06561. Retrieved from https://arxiv.org/abs/2106.06561
[7] Or-El, R., Luo, X., Shan, M., Shechtman, E., Park, J. J., & Kemelmacher-Shlizerman, I. (2022). StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 13503-13513).
[8] Wang, C., Chai, M., He, M., Chen, D., & Liao, J. (2021). CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields. arXiv preprint arXiv:2112.05139. Retrieved from https://arxiv.org/abs/2112.05139
[9] Yuan, Y.-J., Sun, Y.-T., Lai, Y.-K., Ma, Y., Jia, R., & Gao, L. (2022). NeRF-Editing: Geometry Editing of Neural Radiance Fields. arXiv preprint arXiv:2205.04978. Retrieved from https://arxiv.org/abs/2205.04978
[10] Raj, A., Kaza, S., Poole, B., Niemeyer, M., Mildenhall, B., Ruiz, N., Zada, S., Aberman, K., Rubenstein, M., Barron, J., Li, Y., & Jampani, V. (2023). DreamBooth3D: Subject-Driven Text-to-3D Generation. arXiv preprint arXiv:2303.13508. Retrieved from https://arxiv.org/abs/2303.13508
[11] Hugging Face. (n.d.). Text-to-Image Generation with ControlNet Conditioning. Retrieved May 16, 2023, from https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/controlnet
[12] Khachatryan, L., et al. (2023). Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators. arXiv preprint arXiv:2303.13439. Retrieved from https://arxiv.org/abs/2303.13439
[13] Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2019). Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer. arXiv preprint arXiv:1907.01341.
[14] McGough, M. (2022). It's NeRF From Nothing: Build A Complete NeRF with PyTorch. Retrieved from https://towardsdatascience.com/its-nerf-from-nothing-build-a-vanilla-nerf-with-pytorch-7846e4c45666
[15] Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A Neural Algorithm of Artistic Style. arXiv preprint arXiv:1508.06576v2. Retrieved from https://arxiv.org/abs/1508.06576