High-fidelity font synthesis for CJK languages

zi2zi-JiT: Font Synthesis with Pixel Space Diffusion Transformers

Chinese version (中文版)

Overview

zi2zi-JiT is a conditional variant of JiT (Just image Transformer) designed for Chinese font style transfer. Given a source character and a style reference, it synthesizes the character in the target font style.

The architecture, illustrated above, extends the base JiT model with three components:

Content Encoder: a CNN that captures the structural layout of the input character, adapted from FontDiffuser.

Style Encoder: a CNN that extracts stylistic features from a reference glyph in the target font.

Multi-Source In-Context Mixing: instead of conditioning on a single category token as in the original JiT, font, style, and content embeddings are concatenated into a unified conditioning sequence.
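The in-context mixing step can be illustrated with a small shape-level sketch. This is not the repository's code: the function name `build_conditioning_sequence`, the token counts, the embedding width, and the ordering of the three sources are all assumptions chosen for illustration. It only shows how one font embedding and two sets of encoder tokens could be joined along the token axis into a single conditioning sequence.

```python
import numpy as np

def build_conditioning_sequence(font_emb, style_tokens, content_tokens):
    """Concatenate font, style, and content embeddings along the token axis.

    Shapes (all hypothetical):
      font_emb:       (batch, dim)             one embedding per font id
      style_tokens:   (batch, n_style, dim)    style encoder output
      content_tokens: (batch, n_content, dim)  content encoder output
    """
    # Turn the single font embedding into a length-1 token sequence.
    font_token = font_emb[:, None, :]  # (batch, 1, dim)
    # The ordering font -> style -> content is an assumption.
    return np.concatenate([font_token, style_tokens, content_tokens], axis=1)

# Random stand-ins for the learned embeddings and the encoder outputs.
rng = np.random.default_rng(0)
batch, dim = 2, 8
font_emb = rng.normal(size=(batch, dim))
style_tokens = rng.normal(size=(batch, 4, dim))
content_tokens = rng.normal(size=(batch, 4, dim))

cond = build_conditioning_sequence(font_emb, style_tokens, content_tokens)
print(cond.shape)  # (2, 9, 8)
```

In this sketch the single class token of the original JiT is replaced by a sequence of 1 + 4 + 4 = 9 conditioning tokens per sample, which the transformer could then attend to alongside the image patches.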

Training

Two model variants are available, JiT-B/16 and JiT-L/16, both trained for 2,000 epochs on a corpus of more than 400 fonts (70% simplified Chinese, 20% traditional Chinese, 10% Japanese), totalling over 300k character images. For each font, at most 800 characters are used for training.
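The per-font cap can be sketched as follows. The text states only the limit of 800 characters per font, not how those characters are chosen, so the seeded random sampling here is an assumption, and `cap_characters` is a hypothetical helper rather than a function from the repository.

```python
import random

MAX_CHARS_PER_FONT = 800  # cap stated in the text

def cap_characters(chars, cap=MAX_CHARS_PER_FONT, seed=0):
    """Return at most `cap` characters from one font's character set.

    Random sampling without replacement is an assumption; the source
    does not say how the characters are selected.
    """
    chars = list(chars)
    if len(chars) <= cap:
        return chars  # small fonts are used in full
    return random.Random(seed).sample(chars, cap)

# A made-up font covering 3,000 CJK Unified Ideographs from U+4E00 onward.
large_font = [chr(0x4E00 + i) for i in range(3000)]
small_font = [chr(0x4E00 + i) for i in range(500)]

selected_large = cap_characters(large_font)
selected_small = cap_characters(small_font)
print(len(selected_large), len(selected_small))  # 800 500
```

A fixed seed keeps the selection reproducible across runs, which matters if the same subset should be used when resuming training.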
