LayoutDiT: Exploring Content-Graphic Balance in Layout Generation with Diffusion Transformer

¹Tsinghua University  ²Tencent  ³HKUST
*Equal Contribution · Corresponding Author
Teaser Image

(a) Given a background image, our method generates a reasonable and visually appealing layout, which can be rendered into a polished brand logo and advertising text. (b) Layouts generated by SOTA methods suffer from issues such as blocking key image areas and elements overlapping one another. In contrast, our approach produces a layout with a well-structured graphic design that integrates seamlessly with the image content.

Abstract

Layout generation is a foundational task in graphic design, requiring the integration of visual aesthetics with the harmonious expression of content. However, existing methods still struggle to generate precise and visually appealing layouts, exhibiting issues such as blocked key content, overlapping elements, undersized elements, and spatial misalignment. We find that these methods overlook the crucial balance between learning content-aware and graphic-aware features. This oversight limits their ability to model the graphic structure of layouts and to generate reasonable arrangements. To address these challenges, we introduce LayoutDiT, an effective framework that balances content and graphic features to generate high-quality, visually appealing layouts. Specifically, we first design an adaptive factor that optimizes the model's awareness of the layout generation space, balancing its performance on both content and graphic aspects. Second, we introduce a graphic condition, the saliency bounding box, to bridge the modality gap between images in the visual domain and layouts in the geometric parameter domain. In addition, we adopt a diffusion transformer as the backbone, whose strong generative capability ensures the quality of layout generation. Benefiting from the properties of diffusion models, our method excels in constrained settings without introducing additional constraint modules. Extensive experiments demonstrate that our method achieves superior performance in both constrained and unconstrained settings, significantly outperforming existing methods.
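The saliency bounding box mentioned above represents the image's salient region in the same geometric parameter space as the layout elements. As a minimal sketch (not the paper's released code), such a box might be derived from a saliency map by thresholding and taking the tight box around the surviving pixels; the threshold value here is an assumption for illustration.

```python
import numpy as np

def saliency_bbox(saliency, thresh=0.5):
    """Hypothetical saliency-bounding-box extraction.

    Thresholds a 2D saliency map and returns the tight box around the
    salient pixels in normalized (x, y, w, h) coordinates, i.e. the same
    geometric parameter domain used for layout elements.
    """
    ys, xs = np.nonzero(saliency >= thresh)
    if len(xs) == 0:
        # No salient region found: return an empty box.
        return (0.0, 0.0, 0.0, 0.0)
    h, w = saliency.shape
    x0, x1 = xs.min(), xs.max() + 1
    y0, y1 = ys.min(), ys.max() + 1
    return (float(x0 / w), float(y0 / h),
            float((x1 - x0) / w), float((y1 - y0) / h))
```

For example, a 4x4 map whose center 2x2 region is salient yields the normalized box (0.25, 0.25, 0.5, 0.5).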

Method

Architecture Image

Overview of our framework. The inputs are Gaussian noise \(l_T\), image \(I\) with its saliency map \(S\), and saliency bounding box \(B\). The layout encoder and decoder serve as the main backbone, both composed of a series of transformer blocks. Image features \(F_I\) and box features \(F_B\) are extracted by the image encoder and bounding box encoder respectively, and are incorporated into the backbone through cross-attention modules. The CGBFP module takes \(F_I\) and layout representations \(F_L\) as inputs to predict a balance factor \(\omega\), which modulates the cross-attention interactions. Finally, the framework generates high-quality and visually appealing layouts \(l_0\).
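One way to picture how the predicted balance factor \(\omega\) could modulate the cross-attention interactions is as a convex weighting between the content branch (image features \(F_I\)) and the graphic branch (box features \(F_B\)). The sketch below is an illustrative NumPy simplification under that assumption, not the paper's implementation: single-head attention, no learned projections, and `balanced_fusion` standing in for the backbone's modulated cross-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: layout tokens attend to
    # condition tokens (image or box features).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def balanced_fusion(F_L, F_I, F_B, omega):
    # Hypothetical balance: omega in [0, 1] weights the content branch
    # (image features F_I) against the graphic branch (box features F_B),
    # added residually to the layout representations F_L.
    content = cross_attention(F_L, F_I, F_I)
    graphic = cross_attention(F_L, F_B, F_B)
    return F_L + omega * content + (1.0 - omega) * graphic
```

With `omega = 1.0` only the content branch contributes, and with `omega = 0.0` only the graphic branch does; in the actual framework \(\omega\) is predicted per sample by the CGBFP module rather than set by hand.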

Experiments

We present the comparison results of our method with other baselines, including:

  • GAN-Based models: CGL-GAN[1], DS-GAN[2]
  • VAE-Based models: ICVT[3]
  • Diffusion-Based models: LayoutDM[4] (equipped with an identical image encoder to handle input images)
  • RAG-Based models: RALF[5]

More comparison results can be found in our paper.

Unconstrained Generation
Visual-Comparison Image
Constrained Generation (by CGB-DM)
Constrain Image

More Comparisons

References

[1] Min Zhou, Chenchen Xu, Ye Ma, Tiezheng Ge, Yuning Jiang, and Weiwei Xu. Composition-aware graphic layout GAN for visual-textual presentation designs. arXiv preprint arXiv:2205.00303, 2022.

[2] Hsiao Yuan Hsu, Xiangteng He, Yuxin Peng, Hao Kong, and Qing Zhang. PosterLayout: A new benchmark and approach for content-aware visual-textual presentation layout. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.

[3] Yunning Cao, Ye Ma, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, and Yuning Jiang. Geometry aligned variational transformer for image-conditioned layout generation. In ACM Int. Conf. Multimedia, 2022.

[4] Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, and Kota Yamaguchi. LayoutDM: Discrete diffusion model for controllable layout generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2023.

[5] Daichi Horita, Naoto Inoue, Kotaro Kikuchi, Kota Yamaguchi, and Kiyoharu Aizawa. Retrieval-augmented layout transformer for content-aware layout generation. In IEEE Conf. Comput. Vis. Pattern Recog., 2024.

BibTeX

@misc{li2024cgbdmcontentgraphicbalance,
      title={CGB-DM: Content and Graphic Balance Layout Generation with Transformer-based Diffusion Model},
      author={Yu Li and Yifan Chen and Gongye Liu and Jie Wu and Yujiu Yang},
      year={2024},
      eprint={2407.15233},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.15233},
}