Diffusion Model Explained: The Core Mechanisms Driving High-Quality 3D Content Generation (AI+3D Product Manager Note S2E05)

In the field of AI, Diffusion models have become one of the core technologies for generating high-quality 3D content. From the brilliance of 2D images to the new continent of 3D creation, diffusion models are capable of generating not only realistic 3D models but also creative generation based on text descriptions. This article will provide an in-depth analysis of how diffusion models work, explore their application paths in 3D content generation, and how product managers can use this technology to drive product innovation.

Introduction: From the brilliance of 2D images to the new world of 3D creation

In the previous note (S2E04), we took a deep dive into NeRF (Neural Radiance Field), a revolutionary technology, and how it can “remember” light to reconstruct and render 3D scenes in high fidelity. NeRF’s prowess lies in its unparalleled realism and ability to accurately reproduce existing scenes, making it pivotal in the realm of 3D reconstruction and digitization. However, when we talk about AI’s “creativity”, especially creating from scratch based on high-level semantics (such as text), another protagonist takes center stage: the Diffusion Model.

If NeRF is more like a highly skilled “digital photographer and restorer”, dedicated to faithfully restoring reality, the Diffusion model is more like an imaginative “digital sculptor”, able to shape never-before-seen forms out of chaos. It was the great success of the Diffusion model in the text-to-image field (which gave rise to phenomenal applications such as Midjourney, Stable Diffusion, and DALL-E) that truly ignited the AIGC wave and quickly “radiated” its powerful generative capabilities into the 3D field.

Understanding how Diffusion models work and how they can be cleverly applied to drive the generation of 3D content is crucial for product managers. This is not only related to whether we can grasp the pulse of the core and most active technology in the current AI generative 3D field, but also directly affects our evaluation of relevant tool capabilities, the definition of product functions, and the judgment of future technology trends.

This note (S2E05) will focus on demystifying the Diffusion model for product managers. Together, we’ll explore:

What exactly is the core idea of the Diffusion model? How does it achieve the orderly generation from “chaos” to “order”?

How can powerful 2D image diffusion models be leveraged to create 3D objects? (i.e., the mystery of Score Distillation Sampling)

What about the other path of diffusion directly on 3D data?

From a product perspective, what are the unique advantages and challenges of Diffusion-based 3D generation methods?

Our goal is to empower product managers to not only know “what” the Diffusion model is but also “why” it works, so as to make smarter, more forward-thinking decisions in product practice.

1. The core idea of the Diffusion Model: the orderly generation from “chaos” to “order”

When you first encounter the Diffusion model, the mathematical principles may seem obscure. But its core idea is very intuitive, even with a hint of philosophical meaning. It simulates a process of going from order to disorder and then learning how to restore order from disorder.

1. An intuitive metaphor: the melting and reshaping of ice sculptures

Imagine we have a crystal clear ice sculpture (representing a clear, high-quality sample of data, such as a picture or a 3D model).

  • Forward Process: We place the ice sculpture in a gradually warming environment and watch it melt little by little at a fixed rate. After many timesteps, the ice sculpture eventually melts completely into a shapeless puddle of water (representing pure, random noise). This “melting” process is known, fixed, and irreversible: we know exactly how much the ice sculpture has melted at each point in time.
  • Reverse Process: Now, the real challenge comes. Can we learn a “magic” that can reverse this water back to its original ice sculpture form? This is the “magic” that the Diffusion model is trying to learn. It observes the melting process of countless different ice sculptures and trains a neural network to learn the steps of “reverse melting”. At each point in time, the network looks at the current “semi-melted” ice water mixture and predicts what it should look like “last second” (i.e., how to “freeze back” a little bit).
  • Create new ice sculptures: Once this “reverse melting” magician (the neural network) is trained, we can give it a random puddle of water (pure noise) and let it perform its “freezing” magic step by step. After the same number of steps, it can “reshape” this random water into a new, structurally complete, and detailed ice sculpture: one it has never seen before, yet one that conforms to the characteristics and beauty that all “ice sculptures” share.

This metaphor, while not entirely precise, vividly reveals the core of the Diffusion model: the ability to “create” (generate samples) from “nothing” (pure noise) by learning the inverse of a controlled “destruction” (noise) process.

2. Forward Process: Orderly “Destruction”

Mathematically, this process is called the Diffusion Process. It defines a Markov chain that starts from an original, clean data sample x_0 (e.g., a picture) and gradually adds a small amount of Gaussian noise to it over T discrete time steps.

a. Gradual Noise Increase:

At each time step t, the data x_t is generated from the data x_{t-1} at the previous moment by adding Gaussian noise with mean 0 and variance β_t, i.e. q(x_t | x_{t-1}) = N(x_t; sqrt(1−β_t)·x_{t-1}, β_t·I). Each β_t is a preset small constant that usually increases with t, and the sequence {β_1, …, β_T} is called the variance schedule.

b. Process certainty:

The entire forward noising process is fixed and involves no learning. Given x_0 and the variance schedule, we can accurately calculate the distribution of the noisy data x_t at any moment. An important mathematical property is that x_t can be expressed directly in terms of x_0, namely x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε with ᾱ_t = ∏_{s=1}^{t} (1−β_s), without any step-by-step calculation, which greatly simplifies training.

c. Final Status:

When the timestep T is large enough (e.g. T=1000), the final x_T will be infinitely close to a standard isotropic Gaussian distribution, i.e. pure random noise, independent of the original data x_0.

The significance of this forward process is that it creates a large number of “semi-finished” data samples x_t with varying degrees of noise, together with their correspondence to the original clean data x_0, providing perfect training samples for learning the reverse process.
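To make the forward process concrete, here is a minimal PyTorch-style sketch of the closed-form noising step described above. It is only an illustration under simple assumptions (a linear variance schedule, image-shaped tensors); the helper name q_sample is ours, not from any particular library.

```python
import torch

# Illustrative linear variance schedule {beta_1, ..., beta_T} (values follow common DDPM defaults).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Closed-form forward noising: x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1 - alpha_bar_t)*eps."""
    sqrt_ab = alpha_bars[t].sqrt().view(-1, 1, 1, 1)                 # broadcast over image dims
    sqrt_one_minus_ab = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise

# Example: noise a batch of "clean images" at random timesteps.
x0 = torch.randn(8, 3, 64, 64)          # stand-in for clean training data
t = torch.randint(0, T, (8,))
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)              # increasingly noisy as t grows
```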

3. Reverse Process: Learn the art of “repairing”

This is the core learning task of the Diffusion model. Our goal is to train a neural network ε_θ (with parameters θ, usually a U-Net-like architecture) to learn the inverse of the forward process.

a. Learning Objectives:

In theory, the reverse process requires a posterior probability p(x_{t-1} | x_t) that is difficult to calculate directly. But mathematical derivation shows that if β_t is small enough, each reverse step can also be approximated as a Gaussian distribution. Further, it can be shown that training the neural network to directly predict the noise ε that was added to produce x_t is an equivalent and more stable learning objective.

b. Training process:

1. Randomly draw a clean sample x_0 from the dataset.

2. Randomly select a time step t (from 1 to T).

3. Using the closed-form forward formula, compute the noisy sample x_t at moment t directly from x_0, recording the true noise ε that was added.

4. Feed the noisy sample x_t and the time step t into the neural network ε_θ to obtain the predicted noise ε_θ(x_t, t).

5. Calculate the difference between the predicted noise ε_θ(x_t, t) and the true noise ε (usually the mean squared error, L2 loss).

6. Use gradient descent to optimize the network parameters θ to minimize this loss.

7. Repeat the above steps until the network can accurately predict the noise contained in any noisy sample at any time step.
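The training loop above fits in a few lines of code. The sketch below is a hedged illustration rather than a production implementation: it reuses the q_sample helper and schedule from the previous snippet and assumes a U-Net-like model(x_t, t) that returns a noise prediction.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, optimizer, T=1000):
    """One DDPM-style training step: predict the noise added to x0 and minimize the L2 error."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # step 2: random timestep per sample
    eps = torch.randn_like(x0)                                  # the true noise to be added
    x_t = q_sample(x0, t, eps)                                  # step 3: closed-form forward noising
    eps_pred = model(x_t, t)                                    # step 4: network predicts the noise
    loss = F.mse_loss(eps_pred, eps)                            # step 5: L2 loss against the true noise
    optimizer.zero_grad()
    loss.backward()                                             # step 6: gradient descent on theta
    optimizer.step()
    return loss.item()
```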

c. Generate new samples (samples):

Once the network ε_θ is trained, we can start with pure random noise x_T, iterate T times, and denoise step by step:

From t=T to t=1, at each time step t:

Input the current noisy data x_t into the network to get the predicted noise ε_θ(x_t, t).

Based on the predicted noise, a specific update formula is used to compute the slightly cleaner data x_{t-1} of the previous time step from x_t. This update usually also injects a small amount of randomness to increase the diversity of generated samples.

After T iterations, the final x_0 is a new, high-quality sample generated by the model.
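For completeness, here is a minimal sketch of that sampling loop, using the standard DDPM update rule with the common choice σ_t² = β_t. The model interface and the schedule tensors (betas, alphas, alpha_bars) are the same assumptions as in the earlier snippets.

```python
@torch.no_grad()
def ddpm_sample(model, shape, T=1000):
    """Start from pure noise x_T and denoise step by step down to x_0."""
    x = torch.randn(shape)                                    # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)                          # predicted noise at step t
        # DDPM mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t)
        x = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)     # extra randomness for diversity
    return x                                                  # a new sample resembling the training data
```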

4. Guidance: From “Random Creation” to “On-Demand Generation”

The above process describes an Unconditional Generation model that generates random samples that match the distribution of training data, but we have no control over what it generates. To achieve “text-to-image” or “text-to-3D”, we need to introduce Conditional Generation, or Guidance.

a. Early Methods (Classifier Guidance):

An early approach is to train an additional classifier and, at each denoising step, use the classifier’s gradient to “guide” the generation process toward the target category (e.g., “cat”).

b. Classifier-Free Guidance (CFG):

This is the most mainstream and effective guidance method, proposed by Ho & Salimans in 2022. The idea is very ingenious:

Training Phase: During training, the conditional information c (such as the embedding of a text prompt) is dropped (set to empty, i.e., unconditional) with a certain probability (e.g., 10%). In this way, the same denoising network ε_θ learns both the unconditional prediction ε_θ(x_t, t) and the conditional prediction ε_θ(x_t, t, c).

Sampling Phase: At generation time, we compute both the unconditional and the conditional noise predictions. The final noise used for the update is the unconditional prediction plus a weighted multiple of the “difference between the conditional and unconditional predictions”:

ε_final = ε_unconditional + w × (ε_conditional − ε_unconditional)

where w is the Guidance Scale.

Intuitive understanding: (ε_conditional − ε_unconditional) can be seen as “the pure direction contributed by the conditional information c”. By adjusting w, we control how strongly the generation process complies with the conditioning. The larger w is, the more closely the result matches the condition (but possibly at the cost of some diversity and realism); the smaller w is, the more room the model has to “play freely”. CFG is one of the key technologies that allows models like Midjourney and Stable Diffusion to generate images that fit the text description.
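The CFG combination itself is just a couple of tensor operations. The sketch below assumes (hypothetically) a denoising network whose conditioning argument can be set to None for the unconditional branch; in practice, libraries such as Hugging Face diffusers expose the same idea through a guidance_scale parameter.

```python
def cfg_noise(model, x_t, t, cond, w=7.5):
    """Classifier-free guidance: blend unconditional and conditional noise predictions."""
    eps_uncond = model(x_t, t, None)    # prediction with the condition dropped
    eps_cond = model(x_t, t, cond)      # prediction with the text embedding c
    # eps_final = eps_uncond + w * (eps_cond - eps_uncond); larger w follows the prompt more strictly
    return eps_uncond + w * (eps_cond - eps_uncond)
```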

5. Representative techniques/models/tools/cases/literature and discussions

Core Papers:

[Source: Denoising Diffusion Probabilistic Models – https://arxiv.org/abs/2006.11239]

Classifier-Free Guidance Paper:

[Source: Classifier-Free Diffusion Guidance – https://arxiv.org/abs/2207.12598]

Excellent Visual Explanations:

[Source: What are Diffusion Models? (Lilian Weng) – https://lilianweng.github.io/posts/2021-07-11-diffusion-models/]

U-Net Architecture Papers:

[Source: U-Net: Convolutional Networks for Biomedical Image Segmentation – https://arxiv.org/abs/1505.04597]

Stable Diffusion Implementation Reference:

[Source: High-Resolution Image Synthesis with Latent Diffusion Models – https://arxiv.org/abs/2112.10752]

2. From 2D brilliance to 3D creation: the main application path of diffusion models in 3D generation

After successfully applying the Diffusion model to 2D image generation, researchers naturally set their sights on the more challenging field of 3D. However, translating successful 2D experiences directly into 3D presents significant challenges:

  • Curse of Dimensionality: Representations of 3D data, such as a 128×128×128 voxel grid, are far higher-dimensional than 2D images (such as 512×512) and require significant computational and memory resources.
  • Training data scarcity: Compared with the hundreds of millions of annotated images on the Internet, high-quality, large-scale, and diverse 3D model datasets are still relatively scarce.
  • Structural complexity: Three-dimensional objects have complex topologies and spatial relationships, and learning to generate these structures directly is much harder for a model than generating a pixel grid.

To address these challenges, researchers have explored two main technological paths.

1. Path 1: Direct 3D Diffusion

The idea of this path is most straightforward: apply the core mechanism of the diffusion model to a representation of a certain 3D data and train a diffusion model specifically designed to generate 3D data.

a. Voxel-based Diffusion:

Principle: Discretize 3D space into a regular voxel grid, where each voxel can store occupancy information (0 or 1) or richer features (e.g., SDF values, colors). The diffusion model is then trained to generate these voxel grids.

Pros and cons: The concept is simple and easy to combine with convolutional networks (2D convolutions extended to 3D). However, the main drawback is that compute and memory costs grow cubically with resolution, making it difficult to generate high-resolution, detailed models. The generated models also tend to have a pronounced “blocky” feel.

b. Point-Cloud-based Diffusion:

Principle: Represent a 3D object as an unordered collection of three-dimensional points (a point cloud), and train a diffusion model to generate the coordinates of these points.

Representative work: OpenAI’s Point-E is a prime example. It generates in two stages: first, a diffusion model quickly produces a low-resolution point cloud (e.g., 1024 points) from text or image input; then, another, more powerful diffusion model, conditioned on the low-resolution point cloud, generates a higher-resolution point cloud (e.g., 4096 points).

Pros and cons: Point cloud representations are relatively flexible and not tied to a fixed resolution grid, and Point-E generates very quickly (only a few seconds on a GPU). However, the point cloud itself carries no surface topology information, so reconstructing a high-quality surface mesh from the generated points is an additional, challenging step, and the detail of the final model is usually limited.

c. Implicit Function Parameter-based Diffusion:

Principle: This approach is more ingenious. It first trains an autoencoder that encodes a 3D model into a low-dimensional implicit-function representation (a compact feature vector), from which the decoder reconstructs the 3D model (usually as an SDF or a similar form). A diffusion model is then trained to generate this low-dimensional, compact feature vector.

Representative work: OpenAI’s Shap-E is a standout example. It runs diffusion directly in the latent space of implicit-function parameters, enabling the generation of higher-quality, more detailed 3D models than Point-E while remaining very fast.

Pros and cons: It combines the power of implicit representations with the generative advantages of diffusion models; inference is fast and generation quality is relatively good. However, training such autoencoders and diffusion models still requires a large amount of 3D data, and the quality ceiling of the generated models is limited by the autoencoder’s reconstruction ability.

d. “Direct 3D Diffusion” from the Product Manager’s Perspective:

Core Benefits: Fast. Because the model generates the target 3D representation directly, the inference (sampling) process is often much faster than the optimization methods described below, and may even enable near-real-time generation. This is attractive for use cases that require fast response and interactive generation.

Core Challenges: Quality and data dependency. Currently, models generated by direct 3D diffusion are generally not comparable to top optimization-based methods in geometric detail and texture realism. What’s more, they depend heavily on large-scale, high-quality 3D training datasets, and obtaining and processing such data is a significant engineering and cost challenge.

2. Path 2: Knowledge Distillation via 2D Diffusion

This path is currently the mainstream method capable of generating the highest quality text-to-3D results. The core idea is that instead of training an expensive and difficult 3D diffusion model from scratch, it is better to cleverly use an already trained and extremely powerful 2D text-to-image diffusion model (such as Imagen, Stable Diffusion) as a “universal art critic” or “source of knowledge” to guide the optimization process of a 3D representation. This process is known as knowledge distillation.

a. Score Distillation Sampling (SDS) Details:

Score Distillation Sampling (SDS), proposed by Google Research in the DreamFusion paper, is a key algorithm for enabling this distillation of knowledge. We can think of it as an iterative “sculpting” process:

Step 1: Prepare the “stone”. We start by initializing a differentiable three-dimensional representation. This “stone” can be a NeRF, an SDF (signed distance function) field, or a differentiably rendered mesh representation. The key is that, through a differentiable rendering process, we can obtain a 2D image of this 3D representation from any viewpoint.

Step 2: Find an “art master”. We bring in a powerful pre-trained 2D text-to-image diffusion model as our “master artist”. The master has seen a huge number of images and texts and knows what “a red Ferrari sports car” should look like from all angles.

Step 3: Start the “sculpt” cycle.

Choose a random angle: In each optimization iteration, we randomly select a virtual camera’s viewing angle (azimuth, pitch, distance, etc.).

Take a “snapshot”: From this random perspective, our current 3D “stone” is rendered into a 2D image through differentiable rendering.

Ask the master to “comment”: We show this rendered “snapshot” to the “master artist” (the 2D diffusion model) along with our creative goal, such as the text prompt “a red Ferrari sports car”. We add noise to the “snapshot” and then ask the master: “If you wanted to make this noisy snapshot look more like a red Ferrari sports car, how should it be modified (denoised)?”

Get the “sculpting” direction: The “master artist” gives a “review”, a gradient that points toward a “better image” (this gradient is called a score). It tells us which pixels of the current “snapshot” should be lightened, darkened, or reddened to better match the text description.

“Chisel a little”: We backpropagate this “comment” (gradient) from the 2D image space to the parameters of our 3D “stone” (the differentiable 3D representation) and make a small update, so that the next time the “stone” is “photographed” from the same angle, the photo pleases the “master” a little more.

Step 4: Repeat tens of thousands of times. We repeat the third step over and over, “taking pictures” and “asking the master to comment and carve” from thousands of different random viewpoints. After a long period of iterative optimization, the initially shapeless “stone” is gradually carved into a three-dimensional model that looks like “a red Ferrari sports car” from every viewpoint.
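As a hedged illustration of one “sculpting” iteration, the sketch below strings the four steps together. Everything here is an assumption for readability, not DreamFusion’s actual code: render_fn, sample_random_camera, and diffusion.predict_noise are hypothetical placeholders for a differentiable renderer, a camera sampler, and a frozen pretrained 2D text-to-image model, and the time-step weighting term w(t) from the paper is omitted.

```python
import torch

def sds_step(render_fn, theta, diffusion, text_emb, optimizer):
    """One Score Distillation Sampling iteration (conceptual sketch)."""
    camera = sample_random_camera()              # choose a random viewpoint (hypothetical helper)
    image = render_fn(theta, camera)             # differentiable "snapshot" of the 3D representation
    t = torch.randint(20, 980, (1,))             # random noise level, avoiding the extremes
    eps = torch.randn_like(image)
    x_t = q_sample(image, t, eps)                # noise the rendering (helper from the earlier sketch)
    with torch.no_grad():                        # the 2D "master artist" stays frozen
        eps_pred = diffusion.predict_noise(x_t, t, text_emb)
    grad = eps_pred - eps                        # the "comment": direction toward images matching the prompt
    # Surrogate loss whose gradient w.r.t. theta equals grad * d(image)/d(theta).
    loss = (grad.detach() * image).sum()
    optimizer.zero_grad()
    loss.backward()                              # backpropagate the comment into the 3D parameters
    optimizer.step()                             # "chisel a little"
```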

b. Representative work and evolution:

  • DreamFusion (Google): Pioneered the SDS method, using Imagen as the 2D prior and NeRF as the 3D representation, demonstrating stunning generation quality.
  • Magic3D (NVIDIA): Improves on DreamFusion with a multi-stage optimization strategy (from low to high resolution) and more efficient 3D representations such as sparse voxel grids, significantly improving generation speed and quality.
  • Fantasia3D, Text2Tex, etc.: Focus on improving the quality and editability of generated textures.
  • ProlificDreamer, VSD (Variational Score Distillation): Theoretical improvements to the SDS algorithm itself, designed to address the insufficient diversity and over-saturation SDS can cause and to produce more realistic, diverse results.

c. “Knowledge distillation” from the perspective of product managers:

Core Benefits: High quality without relying on 3D data. This is its biggest advantage. It leverages the knowledge of powerful T2I models trained on internet-scale 2D image data to generate the most detailed and semantically accurate text-to-3D results available today, in theory without any 3D training data.

Core Challenges: Extremely slow and computationally expensive. This optimization-based approach requires hours or more of iterative optimization for a single model, at considerable GPU cost, which makes it hard to use in applications that need quick responses or real-time interaction. It also faces challenges such as 3D consistency (the Janus problem), controllability, and output usability.

3. Representative techniques/models/tools/cases/literature and discussions

Direct 3D diffusion representative:

[Source: Point-E: A System for Generating 3D Point Clouds from Complex Prompts – https://arxiv.org/abs/2212.08751]

Implicit function parameter diffusion:

[Source: Shap-E: Generating 3D Assets with Text or Images – https://arxiv.org/abs/2305.02463]

Knowledge Distillation (SDS) representative:

[Source: DreamFusion: Text-to-3D using 2D Diffusion – https://arxiv.org/abs/2209.14988]

SDS Methodology Improvements:

[Source: Magic3D: High-Resolution Text-to-3D Content Creation – https://arxiv.org/abs/2211.10440]

Variational Score Distillation:

[Source: ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation – https://arxiv.org/abs/2305.16213]

3. Diffusion model from the perspective of product managers: advantages, challenges and productization thinking

After understanding the two main technical paths for Diffusion models in the field of 3D generation, as product managers, we need to take a step back, look at the opportunities and challenges presented by this technology from a higher level, and think about how to effectively productize it.

1. Key Advantages

Diffusion-based 3D generation technology, particularly through the SDS path, demonstrates several significant advantages over previous methods:

a. The ceiling for generative quality and diversity is extremely high:

Diffusion models themselves outperform earlier generative models such as GANs in fidelity and diversity. By leveraging powerful 2D T2I models as priors, SDS-style methods can generate 3D models with stunning detail, complex geometry, and realistic textures, with a quality ceiling far exceeding previous techniques. The generated results also show rich diversity.

b. Strong semantic understanding and creative skills:

They inherit the deep natural-language understanding of large language models and T2I models, and can capture abstract concepts, artistic styles, and complex combinations in prompts and transform them into three-dimensional forms. This enables real “idea generation” rather than mere geometric reconstruction.

c. Flexibility and Scalability:

The core mechanism of the Diffusion model is flexible: it can be applied to a variety of data representations and can easily incorporate various forms of conditional guidance (text, images, sketches, segmentation maps, etc.). This leaves huge room for future functional expansion and multimodal interactive product design.

d. Not dependent on 3D training data (SDS path):

A revolutionary advantage of the SDS path is that, in theory, it eliminates the need for paired 3D training datasets, which significantly lowers the barrier of data acquisition and allows the model to generate objects that have never appeared in any training set or do not even exist in reality.

2. Significant Challenges

Opportunities and challenges always coexist, and Diffusion models face a number of urgent challenges in 3D applications:

a. The sharp contradiction between speed and cost:

This is currently the biggest obstacle to productization. SDS-based approaches, while high-quality, have generation (optimization) times measured in hours and high GPU resource consumption, making them hard to fit most user scenarios that require rapid iteration or immediate feedback. Direct 3D diffusion methods are fast but often of less-than-satisfactory quality. Finding a balance among quality, speed, and cost is a core challenge for all related products.

b. The problem of controllability is still prominent:

Although Classifier-Free Guidance provides a level of semantic control, users of 3D models need much more control than that. It is still very difficult to accurately control size, scale, topology, connections between parts, materials in specific areas, and so on. The current generation process remains largely “unpredictable”.

c. “Last Mile” Issues for Output Availability:

This is a problem common to both paths. The final 3D output (whether a mesh extracted from a NeRF/SDF or a directly generated mesh/point cloud) often has serious topology errors, missing UVs, or cluttered geometry that cannot be used directly in professional animation, gaming, or industrial design pipelines. Making it “usable” can require so much “manual post-processing” that it is sometimes more time-consuming than remodeling from scratch.

d. “Overfitting” and “Bias” Risks:

SDS-style methods can sometimes “over-optimize” to match 2D renderings, resulting in unnatural, flattened “sticker-like” artifacts in the generated 3D model. At the same time, they fully inherit the data biases of the 2D T2I models they rely on, potentially generating content with stereotypes or cultural biases.

3. Productization Thoughts & Opportunities

Faced with these advantages and challenges, product managers can think about productization opportunities from the following perspectives:

a. Precise targeting of target users and application scenarios:

  • Rapid Prototyping and Concept Design: For professional users such as designers and artists, provide a tool that sacrifices some quality and control but can quickly generate a large number of creative prototypes. The core value lies in “inspiring” and “accelerating iteration”.
  • UGC and Personalized Entertainment: For consumer users, provide a simple, fun tool for personalized 3D content (such as avatars, props, and decorations). The core of the product is being “fun” and “easy to share”, with relatively low requirements for technical usability.
  • Professional asset production lines: For teams that need the output in production, the product must focus on solving the “last mile” problem. Generative features alone have limited value; providing a one-stop pipeline from generation to automated post-processing (e.g., AI retopology, automatic UV unwrapping, and material optimization) is the real core value.

b. Innovative Interaction Design for Controllability:

  • Advanced Prompt Engineering Interfaces: Design interfaces that guide users to write structured, more precise prompts.
  • Iterative, Multimodal Editing: Allow users to modify and regenerate specific parts of the model through sketches, masks, or more specific instructions during or after generation.
  • Interpretability and Parametric Control: Explore exposing relatively interpretable parameters that significantly affect the generated results, so advanced users can fine-tune them.

c. Optimizing workflows and managing user expectations:

  • Asynchronous Workflow Design: For time-consuming generation tasks, the product must be designed around an asynchronous mode, optimizing the waiting experience with queue systems, task-management interfaces, completion notifications, and so on.
  • Transparent Cost and Time Estimates: Clearly tell users how long a generation task is expected to take and how much it will cost (if billed by resource) for different quality levels and sizes.
  • Tiered Services and Pricing: Offer different tiers of service, such as a fast but lower-quality preview (possibly using direct 3D diffusion) and a time-consuming but high-quality final version (using SDS), with different pricing strategies.

d. Exploration of hybrid technology paths:

Combining the strengths of different technology paths can be an effective product strategy. For example, use a fast direct 3D diffusion method (such as Shap-E) to generate a base shape, let the user make initial modifications on top of it, and finally refine the details and textures with the slower but higher-quality SDS method.

4. Representative techniques/models/tools/cases/literature and discussions

Commercial Product Cases:

[Source: Luma AI Genie – Text to 3D Generation Platform – https://lumalabs.ai/genie]

AI+3D Creation Tool Analysis:

[Source: Meshy AI – Text & Image to 3D Generation – https://www.meshy.ai/]

Interaction Design Research:

[Source: Human-AI Collaboration in Creative Applications – https://arxiv.org/abs/2204.02883]

Automated Post-Processing Technology:

[Source: Neural Mesh Simplification – https://openaccess.thecvf.com/content/CVPR2022/papers/Potamias_Neural_Mesh_Simplification_CVPR_2022_paper.pdf]

AIGC Product Design Guidelines:

[Source: Design Guidelines for AI-Generated Content Systems – https://dl.acm.org/doi/10.1145/3544548.3581368]

Conclusion: Harness the engine of creativity and find value in challenges

The Diffusion model, with its powerful generative ability to nurture order out of chaos, has undoubtedly become one of the core engines driving the current wave of AI+3D creativity. We dissected its ingenious mechanism of mastering the art of “creation” by learning the inverse of a “destruction” process, and explored its two main application paths in the 3D field: “direct diffusion”, which pursues speed and relies on 3D data, and “distillation optimization” (SDS), which pursues quality by cleverly reusing 2D knowledge.

We see that this technological path is full of exciting possibilities: it responds to human linguistic creativity with unprecedented quality and diversity, dramatically lowers the barrier to entry for 3D creation, and offers hope for solving bottlenecks in content production at scale. But at the same time, we must face the serious challenges it still carries as an emerging technology: the tension between speed and cost, the problem of controllability, and the “last mile” of making outputs directly “production-ready”.

For us as product managers, harnessing the powerful creative engine of the Diffusion model means striking a delicate balance between opportunity and challenge. Our task is not just to showcase its amazing generative capabilities, but to deeply understand the technical trade-offs behind them and to design products, build workflows, and manage user expectations around its limitations. The real value of a product often lies in how it solves the toughest problems for users: How can generation be made faster and more controllable? How can the output be made more “usable”? How do we design an interactive experience that best inspires and guides users’ creativity?

Understanding the Diffusion model as a powerful generation tool, alongside our earlier look at NeRF as a powerful reconstruction tool, gives us a more complete picture of the two pillars of current AI+3D technology. In the next note (S2E06), we will step back from specific generative algorithms and explore a more basic but equally important topic: the characteristics of the different 3D data representations (Mesh, Voxel, Point Cloud, SDF, etc.) that underpin these AI models, and how the choice among them profoundly affects both AI model performance and product design direction.

End of text