controlnet-DharunSN/model_out
These are ControlNet weights trained on stabilityai/stable-diffusion-2-1-base with a new type of conditioning. You can find some example images below. NOTE: These weights were trained at reduced precision, so image quality and characteristics may lag behind a full-precision run.
prompt: a white hoodie shirt on a size four model in a beach setting
DensePose Condition:
Image Generated:
prompt: a green jumper shirt and white pants with a green overcoat on top
Image Generated:
Intended uses & limitations
How to use
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image
import torch

# Load the ControlNet weights and the Stable Diffusion base they were trained on.
# The pipeline loads its matching tokenizer and text encoder itself; SD 2.1 uses
# OpenCLIP, so passing openai/clip-vit-large-patch14 components would mismatch.
controlnet = ControlNetModel.from_pretrained(
    "DharunSN/model_out", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",  # base model these weights target
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Example: generate an image of a red jacket conditioned on a pose map
pose_image = Image.open("path/to/pose.png").convert("RGB").resize((512, 512))
prompt = "a red leather jacket with silver zippers, worn on a casual street-style model"
image = pipe(prompt, image=pose_image, num_inference_steps=30).images[0]
image.save("output.png")
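The snippet above assumes you already have a pose conditioning image. As a minimal sketch (the card does not specify which preprocessor was used, and a true DensePose map would need a separate detectron2-based pipeline), an OpenPose skeleton map can be extracted with the controlnet_aux annotators:

from controlnet_aux import OpenposeDetector
from PIL import Image

# Assumed preprocessor: the OpenPose keypoint annotator from controlnet_aux
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
reference = Image.open("path/to/reference_photo.jpg").convert("RGB")  # hypothetical path
pose_image = openpose(reference).resize((512, 512))
pose_image.save("path/to/pose.png")  # matches the path used in the example above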
Limitations and bias
Pose Alignment Errors: The model may fail to accurately align garments with extremely dynamic or occluded body poses.
Fabric Simulation: Lacks realistic physical behavior of fabrics like wrinkles, folds, or flowing movement.
Resolution Constraints: Default generation is 512×512. Upscaling may lose fidelity unless further post-processing is applied (see the sketch after this list).
Model Drift in Edge Cases: Struggles with rare combinations of garment types and unconventional descriptions.
Dataset Bias: DeepFashion and related datasets often overrepresent certain body types, genders, and skin tones, which can skew model generalization.
Style Bias: High fashion or Western clothing styles are more common in training data, leading to poorer performance for traditional or niche designs.
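As a hedged mitigation for the resolution constraint noted above, the 512×512 output can be passed through a latent upscaler; the choice of stabilityai/stable-diffusion-x4-upscaler here is an assumption, not something this card prescribes:

import torch
from diffusers import StableDiffusionUpscalePipeline

upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")
upscaler.enable_attention_slicing()  # x4 on a 512x512 input yields 2048x2048; memory-heavy
low_res = image  # the 512x512 output from the "How to use" snippet
upscaled = upscaler(prompt=prompt, image=low_res).images[0]
upscaled.save("output_upscaled.png")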
Recommendations:
Augment training data with underrepresented demographics and clothing styles.
Use reinforcement or adversarial training to improve physics realism and fairness.
Apply domain adaptation for traditional clothing categories.
Training details
Training Data:
DeepFashion: 400k images with clothing types, poses, and attributes (https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)
DensePose / OpenPose: For extracting skeletal keypoints and human pose conditioning maps
Text Descriptions: Generated or curated captions describing clothing, structure, and materials
Fashion-Design-10K, Fabric-Texture-2K, Fashion-Model-5K: Supporting datasets for garment diversity, material realism, and body types
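For concreteness, a sketch of the record layout such a training run consumes, pairing a garment photo with its pose conditioning map and caption; the column names follow the diffusers ControlNet training example and the paths are hypothetical:

from datasets import Dataset

records = [{
    "image": "deepfashion/img/00001.jpg",                # garment photo (hypothetical path)
    "conditioning_image": "deepfashion/pose/00001.png",  # DensePose/OpenPose map
    "text": "a white hoodie shirt on a size four model in a beach setting",
}]
ds = Dataset.from_list(records)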
Training Setup:
Hardware: NVIDIA L4 GPU (32GB VRAM)
Precision: Mixed precision (torch.float16)
Optimizer: 8-bit AdamW with learning rate 1e-5
Gradient Accumulation: 8–16 steps to support large-batch emulation
Epochs: 1–3 depending on overfitting trends
Batch Size: Effective size of 32 (with accumulation)
Training Duration: Approx. 12–15 hours for 1 full epoch on 400k images
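Putting the setup above together, a minimal sketch of the optimization loop, assuming a per-device batch of 2 with 16 accumulation steps to reach the effective batch size of 32 (train_dataloader and diffusion_loss are hypothetical stand-ins):

import bitsandbytes as bnb
from accelerate import Accelerator
from diffusers import ControlNetModel

controlnet = ControlNetModel.from_pretrained("DharunSN/model_out")
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=16)
optimizer = bnb.optim.AdamW8bit(controlnet.parameters(), lr=1e-5)  # 8-bit AdamW, lr 1e-5
controlnet, optimizer = accelerator.prepare(controlnet, optimizer)

for batch in train_dataloader:  # hypothetical dataloader of (image, pose, caption) batches
    with accelerator.accumulate(controlnet):  # 2 per device x 16 steps = 32 effective
        loss = diffusion_loss(controlnet, batch)  # hypothetical noise-prediction loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()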