Can we have a Llama-3.1-8B-Lexi-Uncensored-V2_fp8_scaled.safetensors?
https://huggingface.co./Orenguteng/Llama-3.1-8B-Lexi-Uncensored-V2/tree/main
Can we have an fp8_scaled.safetensors for this model?
There's no benefit from using it. The model itself is still censored, at least partially, depending on what seed you end up with (if random).
I don't know what's with this model; it behaves as if a fixed seed is not fixed, yet for the life of me I can't get a different person on a random seed. I let the model hallucinate at CFG 8 and it will still give me the same person on a random seed. I suppose that's great for character consistency, if one desires it, but if you want random people... then the prompt has to change, because some keywords completely lock in the character it creates.
I also tried the abliterated version of T5 along with that uncensored Llama, and it made no difference either. There were subtle differences in the subject/background detail, but not in the censoring of characters.
Could you please tell me how to combine the split files into one single safetensors file?
I asked DeepSeek to write this code to convert the model to FP8 as a single file. It works with ComfyUI.
import torch
from transformers import AutoModelForCausalLM
import logging
from tqdm import tqdm

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def quantize_tensor_to_fp8(tensor, fp8_dtype=torch.float8_e4m3fn):
    # Quantize one tensor to FP8 with a single per-tensor scale.
    if tensor.dtype in (torch.float16, torch.bfloat16, torch.float32):
        max_val = tensor.abs().max()
        # Guard against all-zero tensors to avoid dividing by zero.
        scale = max_val / torch.finfo(fp8_dtype).max if max_val != 0 else torch.tensor(1.0, device=tensor.device)
        tensor_fp8 = (tensor / scale).to(fp8_dtype)
        return tensor_fp8, scale
    return tensor, None

def convert_model_to_scaled_fp8(model, skip_layernorm=True, fp8_dtype=torch.float8_e4m3fn):
    logger.info(f"Converting model weights to scaled {fp8_dtype}...")
    for name, param in tqdm(list(model.named_parameters()), desc="Quantizing weights"):
        if param.is_floating_point():
            if skip_layernorm and "input_layernorm" in name:
                logger.debug(f"Skipping {name} (LayerNorm)")
                continue
            param_fp8, scale = quantize_tensor_to_fp8(param.data, fp8_dtype)
            if scale is not None:
                param.data = param_fp8
                # Buffer names may not contain '.', so flatten the parameter name.
                model.register_buffer(name.replace(".", "_") + "_scale", scale.to(param.device))
    logger.info("Model converted to scaled FP8 (excluding LayerNorm).")
    return model

def main():
    # Config
    model_name = "/root/ComfyUI/models/text_encoders/Llama3.1-8B-Chinese-Chat"  # Your model
    output_dir = "./llama3.1_8b_fp8_scaled"

    # Device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    fp8_dtype = torch.float8_e4m3fn

    # Load the model in BF16 (needs enough memory for the full-precision weights)
    logger.info(f"Loading {model_name}...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map=device,
    )

    # Convert to scaled FP8 (skipping LayerNorm weights)
    model = convert_model_to_scaled_fp8(model, skip_layernorm=True, fp8_dtype=fp8_dtype)

    # Save the model (with the scale buffers) as a single shard
    logger.info(f"Saving model to {output_dir}...")
    model.save_pretrained(output_dir, max_shard_size="1024GB")

if __name__ == "__main__":
    main()
Modify model_name and output_dir in main() to match your paths.
This code requires the transformers library, and you need enough VRAM (or system RAM) to load the full-precision weights.
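If you want to sanity-check the converted file before pointing ComfyUI at it, here is a minimal sketch (the output path is an assumption; adjust it to wherever your output_dir ended up) that counts the stored dtypes and scale entries:

from collections import Counter
from safetensors import safe_open

# Path is an assumption; point it at the file save_pretrained produced.
path = "./llama3.1_8b_fp8_scaled/model.safetensors"

dtypes = Counter()
scale_keys = 0
with safe_open(path, framework="pt", device="cpu") as f:
    for key in f.keys():
        dtypes[str(f.get_tensor(key).dtype)] += 1
        if key.endswith("_scale"):
            scale_keys += 1

# Most weights should report torch.float8_e4m3fn, with one scale per quantized tensor.
print(dtypes)
print(f"{scale_keys} scale entries found")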
Thank you.
Sorry for the delayed reply... that code above is way too long.
Run this in the model folder and you'll get a model.safetensors:
from safetensors.torch import load_file, save_file

# Load each shard of the split checkpoint.
load1 = load_file("model-00001-of-00004.safetensors")
load2 = load_file("model-00002-of-00004.safetensors")
load3 = load_file("model-00003-of-00004.safetensors")
load4 = load_file("model-00004-of-00004.safetensors")

# Merge the shard dicts into one state dict and write a single file.
merged_state_dict = {**load1, **load2, **load3, **load4}
save_file(merged_state_dict, "model.safetensors")
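If your checkpoint is split into a different number of shards, a slightly more general sketch (assuming the standard model.safetensors.index.json that save_pretrained writes next to the shards is present) could glob whatever is in the folder:

import json
from glob import glob
from safetensors.torch import load_file, save_file

# Merge however many shards exist in the current folder.
merged = {}
for shard in sorted(glob("model-*-of-*.safetensors")):
    merged.update(load_file(shard))

# Optional sanity check against the shard index written by save_pretrained.
with open("model.safetensors.index.json") as f:
    index = json.load(f)
assert set(merged) == set(index["weight_map"]), "missing or extra tensors"

save_file(merged, "model.safetensors")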
It's much easier to use the recently updated https://github.com/city96/ComfyUI-GGUF QuadrupleCLIPLoader (GGUF) node; in my limited testing it works with QuantFactory/DarkIdol-Llama-3.1-8B-Instruct-1.2-Uncensored-GGUF and likely other common GGUFs.
I tested it alongside a quantized t5-xxl-encoder GGUF from city96. Good luck!
Not seeing any difference here with the DarkIdol version of Llama. The output still appears semi-censored so I doubt the issue is with the text encoders.
Yeah, looking at some of the Python code, it seems to only be using the token embeddings from the llama-3.1-8b model and not doing any actual inference, huh?
There is a benefit to the uncensored Llama, though: we have a WIP finetune of HiDream which is uncensored now. If you use the default T5/CLIP you will still be hindered, but if you use an uncensored version of Llama 3.1 8B Instruct then you should get it working.
I suggest avoiding T5 and the other CLIP encoders for the time being unless you know what you are doing.
There is already a working finetune here:
https://huggingface.co./e-n-v-y/hidream-uncensored
All you need to do is replace the original with an uncensored version of Llama 3.1 8B Instruct in the CLIP loader and you should be good to go.
Thanks @mancub. It's also worth noting that @drguolai's approach was to make a single safetensors file that is FP8-quantized from the instruct model weights.
The main consideration is hardware limitations, specifically memory.
For people with great hardware or patience (using partial loading/fallback to system memory, or, if you want your storage device to degrade quicker, SWAP space), ~16GB of memory for a text encoder's weights is fine. Others, like me with an RTX 4070 (12GB VRAM), want as much of it in VRAM as possible. Granted, I have 64GB of system memory, which mitigates OOMs somewhat, but it still wouldn't be comfortable for me to use full precision (unquantized).
The "hidream_i1_full_fp8.safetensors" file is 17.1GB; that same model without quantization is ~26.1 GB of weights in memory. And the inference memory usage can be significantly more depending on the batch size and context/sequence length used.
Your approach is great for full precision; I just wanted to provide some context that these are distinct goals, for the benefit of people reading this thread who are new to this.
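For a rough sense of the numbers, here is a back-of-the-envelope sketch only (the ~8B parameter count is approximate, and activations, KV cache, and framework overhead are not included); the weight footprint just scales with bytes per parameter:

# Rough weight-memory estimate: parameters x bytes per parameter (decimal GB).
# The ~8B parameter count for Llama-3.1-8B is approximate; activations,
# KV cache, and framework overhead are not included.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

for dtype, nbytes in [("fp32", 4), ("bf16/fp16", 2), ("fp8", 1)]:
    print(f"Llama-3.1-8B in {dtype}: ~{weight_gb(8.0, nbytes):.0f} GB")
# bf16/fp16 lands around the ~16 GB mentioned above; fp8 roughly halves it.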
Side note: this is a general warning, but please don't use a large SWAP space for AI inference with these models. It degrades your drive quicker (and is snail slow); for SSDs it WILL actively kill them under heavy load, especially over long periods. I know because one of my friends did it to his relatively new 10TB NVMe SSD, poor bugger; it's not fun and can end in literal tears. (Of course, dipping into SWAP a bit is okay if it's a small amount, only a few GBs, and only occasional. I recommend disabling SWAP use for workloads like this where reads/writes are very frequent and cover large sectors.) RAM is one of the cheaper hardware components to buy; if you are in a position to do so, it's a good thing to get if you need it.
Back when I was building this workstation (years ago) I somehow ended up with 256GB of RAM... Yeah, it was total overkill at the time, but nowadays it comes in pretty handy, so there's not much swapping happening in the OS.