ICL is the most significant emergent capability of LLMs, and combined with RLHF it is the sole reason LLMs became prevalent: no training is required, so end users can interact with a powerful model at near-zero learning cost.
People have been using image generation models in significantly different ways than LLMs. While tuning-free approaches like InstantID seem promising, most in-production image models require finetuning to consistently produce certain styles or objects. Companies like civit.ai and liblib.art are built on this reality.
IC-LoRA represents a new UI paradigm for image models. In-context image generation can be formulated as: given *n* input images and *(n + m)* instructions acting as conditions for generation, generate *m* output images.
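Written out a bit more explicitly (the notation here is mine, not the paper's), with reference images $x_1, \dots, x_n$, instructions $c_1, \dots, c_{n+m}$, and generated outputs $y_1, \dots, y_m$:

$$
(y_1, \dots, y_m) \sim p_\theta\big(\,\cdot \mid x_1, \dots, x_n,\; c_1, \dots, c_{n+m}\big)
$$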
Note that this is a very general formulation, similar to how you would formulate single/multi-reference video generation. In fact, a recent work on image editing (https://arxiv.org/pdf/2412.07774) formulates image generation and editing tasks as discontinuous frame generation.
The input conditions and output slots are concatenated into one image, with the output slots completely masked during denoising. In my experiments, base DiT models already show some in-context generation capability:
This is likely because the DiT architecture explicitly models attention among image patches, which is intuitively what makes in-context generation work. Thus, a small amount of finetuning is enough to fully surface the in-context generation capability of DiT models. I was able to reproduce the movie-shot generation example in the paper with just 10 images and 1 epoch of finetuning.
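To make the concatenation-plus-masking setup above concrete, here is a minimal sketch of how such a composite sample and its denoising mask could be assembled, assuming a simple horizontal panel layout (the layout, function name, and file names are my own, not from the paper or its codebase):

```python
import numpy as np
from PIL import Image

def build_in_context_sample(condition_paths, n_outputs, panel_size=(512, 512)):
    """Concatenate condition panels and blank output slots into one wide image,
    plus a binary mask marking the output slots (the region to be denoised)."""
    w, h = panel_size
    panels = [Image.open(p).convert("RGB").resize(panel_size) for p in condition_paths]
    n_inputs = len(panels)
    n_total = n_inputs + n_outputs

    # One wide canvas: condition panels on the left, grey placeholders for the output slots.
    canvas = Image.new("RGB", (w * n_total, h), color=(128, 128, 128))
    for i, panel in enumerate(panels):
        canvas.paste(panel, (i * w, 0))

    # Mask is 1 only over the output slots; the condition pixels stay untouched.
    mask = np.zeros((h, w * n_total), dtype=np.uint8)
    mask[:, n_inputs * w:] = 1
    return canvas, mask

# e.g. two reference shots as conditions, two new shots to generate (file names are placeholders)
composite, mask = build_in_context_sample(["shot_01.png", "shot_02.png"], n_outputs=2)
```

Whether the panels are laid out horizontally or in a grid is just a layout choice; what matters is that condition and output patches end up in the same attention context.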
The pixel-level in-context detail preservation is also stunning:
This definitely reminded me of finetuning LLMs to turn user-provided ideas into stories: the model learns to preserve the key ideas provided as input conditions, while also using the rules (e.g., narrative structures) encoded in the SFT data to expand those simple ideas into well-written stories.
So the question is: how far could we go? I’ve started building a dataset to see if Flux.1 can learn a more complex image-translation capability: not just preserving the condition pixels, but transforming them into the output image according to some complex visual rules. Until next time!