LayoutDiT: Exploring Content-Graphic Balance in Layout Generation with Diffusion Transformer

¹Tsinghua University  ²Tencent  ³HKUST
*Equal Contribution · Corresponding Author
Teaser Image

(a) Given a background image, our method generates a reasonable and visually appealing layout, which can be rendered into a polished brand logo and advertising text. (b) Layouts generated by SOTA methods suffer from issues such as blocking key image areas and elements overlapping one another. In contrast, our approach produces a layout with a well-structured graphic design that integrates seamlessly with the image content.

Abstract

Layout generation is a foundational task in graphic design, requiring the integration of visual aesthetics with the harmonious expression of content. However, existing methods still struggle to generate precise and visually appealing layouts, exhibiting issues such as blocked key content, overlapping elements, undersized elements, and spatial misalignment. We find that these methods overlook the crucial balance between learning content-aware and graphic-aware features. This oversight limits their ability to model the graphic structure of layouts and to generate reasonable arrangements. To address these challenges, we introduce LayoutDiT, an effective framework that balances content and graphic features to generate high-quality, visually appealing layouts. Specifically, we first design an adaptive factor that optimizes the model's awareness of the layout generation space, balancing its performance on both content and graphic aspects. Second, we introduce a graphic condition, the saliency bounding box, to bridge the modality gap between images in the visual domain and layouts in the geometric parameter domain. In addition, we adopt a diffusion transformer as the backbone, whose strong generative capability ensures the quality of layout generation. Benefiting from the properties of diffusion models, our method excels in constrained settings without introducing additional constraint modules. Extensive experiments demonstrate that our method achieves superior performance in both constrained and unconstrained settings, significantly outperforming existing methods.
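The saliency bounding box mentioned above represents the image's salient region in the same geometric parameter space as the layout elements. As a minimal sketch (not the paper's released code), such a box might be derived from a saliency map by thresholding and taking the tight box around the surviving pixels; the threshold value here is an assumption for illustration.

```python
import numpy as np

def saliency_bbox(saliency, thresh=0.5):
    """Hypothetical saliency-bounding-box extraction.

    Thresholds a 2D saliency map and returns the tight box around the
    salient pixels in normalized (x, y, w, h) coordinates, i.e. the same
    geometric parameter domain used for layout elements.
    """
    ys, xs = np.nonzero(saliency >= thresh)
    if len(xs) == 0:
        # No salient region found: return an empty box.
        return (0.0, 0.0, 0.0, 0.0)
    h, w = saliency.shape
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    return (float(x0 / w), float(y0 / h),
            float((x1 - x0) / w), float((y1 - y0) / h))
```

For example, a 4x4 map whose center 2x2 region is salient yields the normalized box (0.25, 0.25, 0.5, 0.5).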

Method

Architecture Image

Overview of our framework. The inputs are Gaussian noise \(l_T\), image \(I\) with its saliency map \(S\), and saliency bounding box \(B\). The layout encoder and decoder serve as the main backbone, both composed of a series of transformer blocks. Image features \(F_I\) and box features \(F_B\) are extracted by the image encoder and bounding box encoder respectively, and are incorporated into the backbone through cross-attention modules. The CGBFP module takes \(F_I\) and layout representations \(F_L\) as inputs to predict a balance factor \(\omega\), which modulates the cross-attention interactions. Finally, the framework generates high-quality and visually appealing layouts \(l_0\).
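One way to picture how the predicted balance factor \(\omega\) could modulate the cross-attention interactions is as a convex weighting between the content branch (image features \(F_I\)) and the graphic branch (box features \(F_B\)). The sketch below is an illustrative NumPy simplification under that assumption, not the paper's implementation: single-head attention, no learned projections, and `balanced_fusion` standing in for the backbone's modulated cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: layout tokens attend to
    # condition tokens (image or box features).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def balanced_fusion(F_L, F_I, F_B, omega):
    # Hypothetical balance: omega in [0, 1] weights the content branch
    # (image features F_I) against the graphic branch (box features F_B),
    # added residually to the layout representations F_L.
    content = cross_attention(F_L, F_I, F_I)
    graphic = cross_attention(F_L, F_B, F_B)
    return F_L + omega * content + (1.0 - omega) * graphic
```

With `omega = 1.0` only the content branch contributes, and with `omega = 0.0` only the graphic branch does; in the actual framework \(\omega\) is predicted per sample by the CGBFP module rather than set by hand.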

Experiments

We present the comparison results of our method with other baselines, including:

  • GAN-Based models: CGL-GAN[1], DS-GAN[2]
  • VAE-Based models: ICVT[3]
  • Diffusion-Based models: LayoutDM[4] (equipped with an identical image encoder to handle input images)
  • RAG-Based models: RALF[5]

More comparison results can be found in our paper.

Unconstrained Generation
Visual-Comparison Image
Constrained Generation (by CGB-DM)
Constrain Image

More Comparisons

References

[1] Min Zhou, Chenchen Xu, Ye Ma, Tiezheng Ge, Yuning Jiang, and Weiwei Xu. Composition-aware graphic layout GAN for visual-textual presentation designs. arXiv preprint arXiv:2205.00303, 2022.

[2] Hsiao Yuan Hsu, Xiangteng He, Yuxin Peng, Hao Kong, and Qing Zhang. PosterLayout: A new benchmark and approach for content-aware visual-textual presentation layout. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.

[3] Yunning Cao, Ye Ma, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, and Yuning Jiang. Geometry aligned variational transformer for image-conditioned layout generation. In ACM Int. Conf. Multimedia, 2022.

[4] Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. LayoutDM: Discrete diffusion model for controllable layout generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.

[5] Daichi Horita, Naoto Inoue, Kotaro Kikuchi, Kota Yamaguchi, and Kiyoharu Aizawa. Retrieval-augmented layout transformer for content-aware layout generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2024.

BibTeX

@misc{li2024cgbdmcontentgraphicbalance,
      title={CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model},
      author={Yu Li and Yifan Chen and Gongye Liu and Jie Wu and Yujiu Yang},
      year={2024},
      eprint={2407.15233},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.15233},
}