---
title: My UNO App
sdk: gradio
emoji: π
colorFrom: red
colorTo: red
---
<h3 align="center">
<img src="assets/logo.png" alt="Logo" style="vertical-align: middle; width: 40px; height: 40px;">
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
</h3>
<p align="center">
<a href="https://github.com/bytedance/UNO"><img alt="Build" src="https://img.shields.io/github/stars/bytedance/UNO"></a>
<a href="https://bytedance.github.io/UNO/"><img alt="Build" src="https://img.shields.io/badge/Project%20Page-UNO-yellow"></a>
<a href="https://arxiv.org/abs/2504.02160"><img alt="Build" src="https://img.shields.io/badge/arXiv%20paper-UNO-b31b1b.svg"></a>
<a href="https://huggingface.co./bytedance-research/UNO"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=Model&color=orange"></a>
<a href="https://huggingface.co./spaces/bytedance-research/UNO-FLUX"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=demo&color=orange"></a>
</p>
><p align="center"> <span style="color:#137cf3; font-family: Gill Sans">Shaojin Wu,</span> <span style="color:#137cf3; font-family: Gill Sans">Mengqi Huang</span><sup>*</sup>, <span style="color:#137cf3; font-family: Gill Sans">Wenxu Wu,</span> <span style="color:#137cf3; font-family: Gill Sans">Yufeng Cheng,</span> <span style="color:#137cf3; font-family: Gill Sans">Fei Ding</span><sup>+</sup>, <span style="color:#137cf3; font-family: Gill Sans">Qian He</span> <br>
><span style="font-size: 16px">Intelligent Creation Team, ByteDance</span></p>
<p align="center">
<img src="./assets/teaser.jpg" width=95% height=95%
class="center">
</p>
## 🔥 News
- [04/2025] 🔥 Released fp8 mode as the primary low-VRAM option, a gift for consumer-grade GPU users. Peak VRAM usage is now ~16 GB. We may explore further inference optimizations later.
- [04/2025] 🔥 The [demo](https://huggingface.co./spaces/bytedance-research/UNO-FLUX) of UNO is released.
- [04/2025] 🔥 The [training code](https://github.com/bytedance/UNO), [inference code](https://github.com/bytedance/UNO), and [model](https://huggingface.co./bytedance-research/UNO) of UNO are released.
- [04/2025] 🔥 The [project page](https://bytedance.github.io/UNO) of UNO is created.
- [04/2025] 🔥 The arXiv [paper](https://arxiv.org/abs/2504.02160) of UNO is released.
## 📖 Introduction
In this study, we propose a highly consistent data synthesis pipeline to tackle the challenge of obtaining scalable, high-consistency training data for subject-driven generation. The pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers to produce high-consistency multi-subject paired data. On top of this, we introduce UNO, a multi-image-conditioned subject-to-image model iteratively trained from a text-to-image model, built on progressive cross-modal alignment and universal rotary position embedding. Extensive experiments show that our method achieves high consistency while maintaining controllability in both single-subject and multi-subject driven generation.
## ⚡️ Quick Start
### 🔧 Requirements and Installation
Install the requirements:
```bash
# create a virtual environment with Python >= 3.10 and <= 3.12, e.g.
# python -m venv uno_env
# source uno_env/bin/activate
# then install the requirements
pip install -r requirements.txt
```
Then download the checkpoints in one of the following ways:
1. Directly run the inference scripts; the checkpoints will be downloaded automatically by the `hf_hub_download` function in the code to your `$HF_HOME` (default: `~/.cache/huggingface`).
2. Use `huggingface-cli download <repo name>` to download `black-forest-labs/FLUX.1-dev`, `xlabs-ai/xflux_text_encoders`, `openai/clip-vit-large-patch14`, and `bytedance-research/UNO`, then run the inference scripts. To speed up setup and save disk space, you can download only the checkpoints you need; e.g. for `black-forest-labs/FLUX.1-dev`, run `huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors` and `huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors`, skipping the text encoders in the `black-forest-labs/FLUX.1-dev` repo (they are only there for `diffusers` usage). Altogether the checkpoints take about 37 GB of disk space.
3. Use `huggingface-cli download <repo name> --local-dir <LOCAL_DIR>` to download all the checkpoints mentioned in 2. to the directories you want. Then set the environment variables `AE`, `FLUX_DEV` (or `FLUX_DEV_FP8` if you use fp8 mode), `T5`, `CLIP`, and `LORA` to the corresponding paths (see the sketch after this list). Finally, run the inference scripts.
4. **If you already have some of the checkpoints**, set the environment variables `AE`, `FLUX_DEV`, `T5`, `CLIP`, and `LORA` to the corresponding paths, then run the inference scripts.
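As a rough illustration of option 3, the sketch below downloads each repo to a local directory and then exports the environment variables. The local paths are placeholders, and the exact file layout inside the `xlabs-ai`, `openai`, and `bytedance-research/UNO` repos is an assumption; check the downloaded directories and adjust the paths accordingly.

```bash
# Download the checkpoints to local directories (paths are placeholders).
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors ae.safetensors --local-dir ./ckpts/flux
huggingface-cli download xlabs-ai/xflux_text_encoders --local-dir ./ckpts/t5
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/clip
huggingface-cli download bytedance-research/UNO --local-dir ./ckpts/uno

# Point the inference scripts at the downloaded checkpoints.
export FLUX_DEV=./ckpts/flux/flux1-dev.safetensors    # or FLUX_DEV_FP8 for fp8 mode
export AE=./ckpts/flux/ae.safetensors
export T5=./ckpts/t5                                  # assumed: the downloaded repo directory
export CLIP=./ckpts/clip                              # assumed: the downloaded repo directory
export LORA=./ckpts/uno/<uno_lora_weights>.safetensors  # placeholder: use the UNO weight file you downloaded
```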
### 🌟 Gradio Demo
```bash
python app.py
```
**For low VRAM usage**, pass the `--offload` and `--name flux-dev-fp8` arguments. Peak VRAM usage will be about 16 GB. For reference, end-to-end inference takes roughly 40 s to 1 min on an RTX 3090 in fp8 + offload mode.
```bash
python app.py --offload --name flux-dev-fp8
```
### ✍️ Inference
Start from the examples below to explore and spark your creativity. ✨
```bash
python inference.py --prompt "A clock on the beach is under a red sun umbrella" --image_paths "assets/clock.png" --width 704 --height 704
python inference.py --prompt "The figurine is in the crystal ball" --image_paths "assets/figurine.png" "assets/crystal_ball.png" --width 704 --height 704
python inference.py --prompt "The logo is printed on the cup" --image_paths "assets/cat_cafe.png" "assets/cup.png" --width 704 --height 704
```
Optional preparation: if you want to run inference on DreamBench for the first time, clone the `dreambench` submodule to download the dataset.
```bash
git submodule update --init
```
Then run the following scripts:
```bash
# evaluate on DreamBench
## for single-subject
python inference.py --eval_json_path ./datasets/dreambench_singleip.json
## for multi-subject
python inference.py --eval_json_path ./datasets/dreambench_multiip.json
```
### 🚄 Training
```bash
accelerate launch train.py
```
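For multi-GPU or mixed-precision training, `accelerate` is usually configured once before launching. A minimal sketch, assuming `train.py` runs with its defaults and no extra flags:

```bash
# one-time interactive setup: number of GPUs, mixed precision, etc.
accelerate config
# launch training with the saved accelerate configuration
accelerate launch train.py
```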
### 📌 Tips and Notes
We integrate single-subject and multi-subject generation within a unified model. For single-subject scenarios, the longest side of the reference image is set to 512 by default; for multi-subject scenarios, it is set to 320. UNO demonstrates remarkable flexibility across various aspect ratios, thanks to its training on a multi-scale dataset. Although it was trained within 512-pixel buckets, it can also generate at resolutions such as 512, 568, and 704.
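For instance, using the same CLI as in the inference section above, you can request a non-square output; the prompt and size below are only illustrative:

```bash
# single-subject generation at a taller-than-wide resolution (values are illustrative)
python inference.py --prompt "A clock on the beach is under a red sun umbrella" \
    --image_paths "assets/clock.png" --width 512 --height 704
```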
UNO excels in subject-driven generation but still has room for improvement in generalization due to dataset constraints. We are actively developing an enhanced model; stay tuned for updates. Your feedback is valuable, so please feel free to share any suggestions.
## 🎨 Application Scenarios
<p align="center">
<img src="./assets/simplecase.jpg" width=95% height=95%
class="center">
</p>
## 📄 Disclaimer
<p>
We open-source this project for academic research. The vast majority of images
used in this project are either generated or licensed. If you have any concerns,
please contact us, and we will promptly remove any inappropriate content.
Our code is released under the Apache 2.0 License, while our models are under
the CC BY-NC 4.0 License. Any models related to <a href="https://huggingface.co./black-forest-labs/FLUX.1-dev" target="_blank">FLUX.1-dev</a>
base model must adhere to the original licensing terms.
<br><br>This research aims to advance the field of generative AI. Users are free to
create images using this tool, provided they comply with local laws and exercise
responsible usage. The developers are not liable for any misuse of the tool by users.</p>
## 🚀 Updates
To foster research and the open-source community, we plan to open-source the entire project, including training, inference, weights, and more. Thank you for your patience and support! 🌟
- [x] Release GitHub repo.
- [x] Release inference code.
- [x] Release training code.
- [x] Release model checkpoints.
- [x] Release arXiv paper.
- [x] Release Hugging Face Space demo.
- [ ] Release in-context data generation pipelines.
## Related resources
- [https://github.com/jax-explorer/ComfyUI-UNO](https://github.com/jax-explorer/ComfyUI-UNO) a ComfyUI node implementation of UNO by jax-explorer.
## Citation
If UNO is helpful, please help to ⭐ the repo.
If you find this project useful for your research, please consider citing our paper:
```bibtex
@article{wu2025less,
title={Less-to-More Generalization: Unlocking More Controllability by In-Context Generation},
author={Wu, Shaojin and Huang, Mengqi and Wu, Wenxu and Cheng, Yufeng and Ding, Fei and He, Qian},
journal={arXiv preprint arXiv:2504.02160},
year={2025}
}
```