llama-3_1-8b-tfree-hat-base
This model card provides an overview of our tokenizer-free llama-3_1-8b-tfree-hat model family, which comprises three Llama-based foundation models developed by Aleph Alpha Research* and publicly available under the Open Aleph License, a license explicitly allowing non-commercial research and educational use.
The models are based on the original Llama 3.1 pre-trained backbone, replacing the Llama tokenizer with our Hierarchical Autoregressive Transformer (HAT) architecture, originally described in our paper. This novel architecture integrates character-level encoding and decoding with the word-level backbone, allowing for improved text compression (fewer sequence positions) and performance in the languages it has been trained on, potentially higher robustness to prompt changes, and improved adaptability to new languages and domains via fine-tuning.
The models were pre-trained, post-trained, and direct-preference-optimized in English and German on carefully curated data in compliance with applicable EU and national regulations, including copyright and data privacy laws. They show strong proficiency in German and beat the original Llama 3.1 on most benchmarks, including in English. The direct-preference-optimization of llama-3_1-8b-tfree-hat-dpo prioritizes helpfulness and instruction following, making the model suitable for sensitive applications without the risk of over-refusal. The models have not been optimized for code generation and math and are therefore not evaluated extensively on the respective benchmarks.
Please note that the realized inference speed strongly depends on the maturity of the inference implementation, beyond the intrinsic text compression of any model. The currently available public inference implementation is not optimized, so any speed benchmark must take this into account.
You can find all model weights and their corresponding safetensors conversions at the following links:
Model Name | Description |
---|---|
llama-3_1-8b-tfree-hat-base | Link - uses the Llama-3.1 8B base pre-trained checkpoint as initialization for the backbone, and has been continuously pre-trained with the HAT architecture in English and German. |
llama-3_1-8b-tfree-hat-sft | Link - is a supervised fine-tuned llama-3_1-8b-tfree-hat-base. |
llama-3_1-8b-tfree-hat-dpo | Link - is a direct-preference-optimized llama-3_1-8b-tfree-hat-sft. |
Model Access
We provide access to our models through the channels listed below.
- HuggingFace: The model’s weights as well as basic inference implementation are available on HuggingFace under the Open Aleph License, a license explicitly allowing for non-commercial research and educational use.
We do not collect PII (personally identifiable information) for any of these channels. We do not log user inputs to the models. We do not train on user data.
Note: The same models are made available to users regardless of their geographic location and input language, subject to sanction regimes, technology export regulations, and other restrictions that may apply. The same offering is provided to all countries within and outside the European Union if no legal restrictions apply.
How to use
Inference
We provide an inference module compatible with HuggingFace Transformers for running model inference. For compatibility between the LLaMA components and our original codebase, we recommend pinning the transformers library to version 4.46.3. Before executing the inference example below, make sure the hat-splitter package is installed in your environment.
pip install 'hat-splitter>=0.1.9' 'transformers==4.46.3' torch
pip install flash_attn
Download model weights and run inference using the following example:
import torch
from transformers import AutoModelForCausalLM
INPUT = "When was Rome founded?"
MODEL_ID = "Aleph-Alpha/llama-3_1-8b-tfree-hat-base"
# Load the model together with the custom HAT modeling code from the repository.
model = AutoModelForCausalLM.from_pretrained(
    trust_remote_code=True,
    pretrained_model_name_or_path=MODEL_ID,
    attn_implementation="flash_attention_2",
).to("cuda", torch.bfloat16)
# Convert the prompt into byte IDs and cumulative word boundaries; add_llama_template=True
# applies the recommended Llama prompt format.
input_ids, cumulative_word_lengths = model._prepare_input(INPUT, add_llama_template=True)
model_output = model.generate(
    input_ids,
    cumulative_seq_lengths_per_word=cumulative_word_lengths,
    max_new_tokens=300,
    use_cache=False,
)
print("Prompt: ", INPUT)
print("Completion: ", model_output.completion_text)
Prompt formatting
The prompt format used for our models is identical to the Llama prompt format. We highly recommend using it when prompting the models to ensure optimal performance for the supervised fine-tuned and direct-preference-optimized model versions. You can format your prompt in the recommended format by setting add_llama_template=True in the model._prepare_input method.
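For reference, the sketch below shows how a single-turn prompt looks in the standard Llama 3.1 chat template, which add_llama_template=True is intended to apply for you. The helper function is illustrative only and not part of the released inference module; it assumes the usual Llama 3.1 special tokens.
# Illustrative sketch of the standard Llama 3.1 chat template (single turn).
# Normally you do not need this: add_llama_template=True applies the formatting for you.
def format_llama_prompt(user_message: str, system_message: str = "You are a helpful assistant.") -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
print(format_llama_prompt("When was Rome founded?"))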
Evaluation
Performance: Our T-Free models deliver performance on par with current state-of-the-art open-source models of equivalent memory footprint in both English and German. For evaluation, we compare our tokenizer-free base model with Llama 3.1 8B Base, our SFT model with Tulu 3.1 8B SFT, and our DPO model with Llama 3.1 8B Instruct and Tulu 3.1 8B. The respective benchmarks and results can be found in the tables below.
Efficiency: Our tokenizer-free approach results in improved text compression, providing a foundation for improved inference speed. We measure compression across all languages and domains as bytes per sequence position (analogous to tokenizer fertility), where a higher value indicates better compression. Latency and throughput are currently out of scope for these research-centric evaluations and will be addressed in the future. Our evaluation framework automatically measures bytes per sequence position across datasets, allowing us to derive text compression scores and analyze variations across different dataset distributions. The resulting end-to-end efficiency depends on the inference implementation, which is beyond the scope of the provided inference code and the reported compression scores.
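As an illustration of the metric, the sketch below computes bytes per sequence position for a HAT-style word split and for a conventional tokenizer baseline. The whitespace split is a crude stand-in for the actual HAT word splitter (hat-splitter package), and the baseline tokenizer ID is only an example; both are assumptions for demonstration.
# Illustrative only: bytes per sequence position as a text compression score.
from transformers import AutoTokenizer

def bytes_per_position(text: str, num_positions: int) -> float:
    return len(text.encode("utf-8")) / num_positions

text = "Die Hauptstadt von Deutschland ist Berlin."
num_words = len(text.split())  # crude stand-in for the HAT word splitter
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example baseline tokenizer
num_tokens = len(tokenizer(text)["input_ids"])
print("Word-level (HAT-style):   ", bytes_per_position(text, num_words))
print("Token-level (baseline):   ", bytes_per_position(text, num_tokens))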
Disclaimer: The results presented below were generated using our internal inference implementation, not the inference module mentioned above. As a sanity check, we reproduced some of the benchmarks using our evaluation framework with the Hugging Face inference code, but some results might still deviate. We plan to make both our evaluation framework and a high-performance vLLM integration for this model source-available in the coming weeks to ensure reproducibility. Our goal with this initial release is to provide the community with a straightforward codebase that demonstrates the architecture and supports basic inference capabilities.
Metric Glossary
- log_acc: Average Loglikelihood Accuracy
- norm_log_acc: Average Normalized Loglikelihood Accuracy
- comp_acc: Average Completion Accuracy
- norm_prob_mass: Average Normalized Probability Mass
- bleu: Average BLEU Score
- rouge_gm: Average ROUGE Geometric Mean
- F1: Average F1
- CS: Chatbot Style
- IF: Instruction Following
- LC: Language Consistency
- CI: Concordance Index
- ES: Exponential Similarity
Pre-training Benchmarks
Group | Task | Metric Name | Num Fewshot | llama-3_1-8b-tfree-hat-base | Llama-3.1-8B | llama-3_1-8b-tfree-hat-base Compression | Llama-3.1-8B Compression |
---|---|---|---|---|---|---|---|
Knowledge | MMLU | norm_log_acc | 5 | 0.657 | 0.668 | 5.184 | 4.278 |
Knowledge | Full Text MMLU | norm_log_acc | 5 | 0.638 | 0.624 | 5.307 | 4.557 |
Knowledge | MMLU Pro | norm_log_acc | 5 | 0.368 | 0.367 | 4.734 | 3.731 |
Knowledge | GPQA | log_acc | 0 | 0.308 | 0.306 | 4.932 | 3.519 |
Knowledge | BBH | norm_log_acc | 3 | 0.473 | 0.472 | 4.665 | 3.788 |
Knowledge | OpenBookQA | norm_log_acc | 10 | 0.466 | 0.478 | 4.982 | 4.724 |
Knowledge | TriviaQA | comp_acc | 5 | 0.623 | 0.695 | 5.324 | 4.218 |
Knowledge | TruthfulQA | norm_prob_mass | 6 | 0.166 | 0.166 | 6.575 | 4.197 |
Reasoning | ARC Easy | norm_log_acc | 25 | 0.870 | 0.858 | 5.526 | 4.936 |
Reasoning | ARC Challenge | norm_log_acc | 25 | 0.625 | 0.579 | 5.514 | 4.924 |
Reasoning | Winogrande | norm_log_acc | 5 | 0.691 | 0.695 | 5.158 | 4.909 |
Reasoning | HellaSwag | norm_log_acc | 10 | 0.793 | 0.817 | 5.338 | 4.655 |
German | MMMLU | norm_log_acc | 5 | 0.591 | 0.578 | 6.056 | 3.410 |
German | ARC Easy DE | norm_log_acc | 25 | 0.778 | 0.713 | 6.604 | 3.685 |
German | ARC Challenge DE | norm_log_acc | 25 | 0.538 | 0.473 | 6.571 | 3.684 |
German | Winogrande DE | norm_log_acc | 5 | 0.789 | 0.765 | 5.627 | 3.671 |
German | HellaSwag DE | norm_log_acc | 10 | 0.646 | 0.626 | 6.496 | 3.666 |
German | TruthfulQA DE | norm_prob_mass | 6 | 0.166 | 0.166 | 6.006 | 3.406 |
German | Lambada | comp_acc | 5 | 0.454 | 0.449 | 5.777 | 3.552 |
German | GSM8K DE | comp_acc | 8 | 0.440 | 0.406 | 4.372 | 2.932 |
German | WMT16 | bleu | 3 | 36.025 | 32.873 | 6.203 | 4.204 |
Math | GSM8K | comp_acc | 8 | 0.509 | 0.509 | 3.838 | 3.334 |
Long context | GSM8K | comp_acc | 16 | 0.540 | 0.478 | 3.839 | 3.340 |
Safety | Winogender | norm_log_acc | 5 | 0.624 | 0.626 | 5.232 | 4.799 |
SFT Benchmarks
MTBench winrates
German MTBench numbers are based on our German version of MTBench.
 | vs. allenai/Llama-3.1-Tulu-3-8B-SFT (Eng) | vs. allenai/Llama-3.1-Tulu-3-8B-SFT (Ger) |
---|---|---|
llama-3_1-8b-tfree-hat-sft | 65.0 | 64.1 |
Group | Task | Metric Name | Num Fewshot | llama-3_1-8b-tfree-hat-sft | Llama-3.1-Tulu-3-8B-SFT | llama-3_1-8b-tfree-hat-sft Compression | Llama-3.1-Tulu-3-8B-SFT Compression |
---|---|---|---|---|---|---|---|
Knowledge | MMLU | norm_log_acc | 5 | 0.655 | 0.669 | 5.818 | 4.153 |
Knowledge | Full Text MMLU | norm_log_acc | 5 | 0.653 | 0.671 | 5.849 | 4.408 |
Knowledge | MMLU Pro | norm_log_acc | 5 | 0.377 | 0.317 | 5.135 | 4.077 |
Knowledge | GPQA | log_acc | 0 | 0.288 | 0.260 | 5.260 | 3.408 |
Knowledge | BBH | norm_log_acc | 3 | 0.492 | 0.494 | 5.332 | 3.668 |
Knowledge | OpenBookQA | norm_log_acc | 10 | 0.486 | 0.504 | 7.101 | 4.041 |
Knowledge | TriviaQA | comp_acc | 5 | 0.585 | 0.648 | 6.963 | 3.928 |
Knowledge | TruthfulQA | norm_prob_mass | 6 | 0.171 | 0.167 | 6.575 | 3.807 |
Reasoning | ARC Easy | norm_log_acc | 25 | 0.890 | 0.877 | 7.018 | 4.497 |
Reasoning | ARC Challenge | norm_log_acc | 25 | 0.647 | 0.617 | 6.860 | 4.522 |
Reasoning | Winogrande | norm_log_acc | 5 | 0.680 | 0.700 | 6.856 | 4.116 |
Reasoning | HellaSwag | norm_log_acc | 10 | 0.748 | 0.802 | 5.980 | 4.427 |
German | MMMLU | norm_log_acc | 5 | 0.595 | 0.572 | 6.630 | 3.383 |
German | ARC Easy DE | norm_log_acc | 25 | 0.800 | 0.742 | 7.872 | 3.607 |
German | ARC Challenge DE | norm_log_acc | 25 | 0.573 | 0.500 | 7.798 | 3.610 |
German | Winogrande DE | norm_log_acc | 5 | 0.763 | 0.754 | 7.225 | 3.391 |
German | HellaSwag DE | norm_log_acc | 10 | 0.616 | 0.636 | 6.971 | 3.603 |
German | TruthfulQA DE | norm_prob_mass | 6 | 0.167 | 0.166 | 7.378 | 3.276 |
German | Lambada | comp_acc | 5 | 0.366 | 0.488 | 6.429 | 3.493 |
German | GSM8K DE | comp_acc | 8 | 0.556 | 0.598 | 4.835 | 2.951 |
German | WMT16 | bleu | 3 | 35.770 | 34.302 | 6.806 | 3.999 |
German | WMT16 Instruct | bleu | 3 | 36.400 | 34.297 | 6.862 | 4.062 |
Instruction Following | Alpaca Eval | CS | 0 | 0.334 | 0.104 | 5.386 | 3.968 |
Instruction Following | Alpaca Eval | IF | 0 | 0.913 | 0.908 | 5.386 | 3.968 |
Instruction Following | Alpaca Eval | LC | 0 | 0.996 | 0.986 | 5.386 | 3.968 |
Long context | QuALITY | log_acc | 0 | 0.388 | 0.414 | 4.867 | 4.274 |
Long context | ZeroSCROLLS GovReport | rouge_gm | 0 | 0.264 | 0.190 | 6.011 | 5.074 |
Long context | ZeroSCROLLS BookSumSort | CI | 0 | 0.073 | 0.131 | 5.412 | 4.411 |
Long context | ZeroSCROLLS SummScreenFD | rouge_gm | 0 | 0.122 | 0.088 | 4.896 | 4.093 |
Long context | ZeroSCROLLS MuSiQue | F1 | 0 | 0.307 | 0.182 | 5.638 | 4.387 |
Long context | ZeroSCROLLS Qasper | F1 | 0 | 0.281 | 0.180 | 5.932 | 4.807 |
Long context | ZeroSCROLLS QuALITY | log_acc | 0 | 0.762 | 0.714 | 4.565 | 4.216 |
Long context | ZeroSCROLLS SpaceDigest | ES | 0 | 0.294 | 0.499 | 6.382 | 4.506 |
Long context | ZeroSCROLLS QMSum | rouge_gm | 0 | 0.134 | 0.154 | 5.445 | 4.266 |
Long context | ZeroSCROLLS SQuALITY | rouge_gm | 0 | 0.144 | 0.122 | 5.053 | 4.213 |
Long context | Ada-LEval TextSort Choices | log_acc | 0 | 0.25 | 0.283 | 5.106 | 4.108 |
Long context | Ada-LEval TextSort | comp_acc | 0 | 0.06 | 0.05 | 5.107 | 4.153 |
Safety | Winogender | norm_log_acc | 5 | 0.550 | 0.583 | 6.875 | 4.157 |
DPO Benchmarks
MTBench winrates
German MTBench numbers are based on our German version of MTBench.
 | vs. Llama-3.1-8B-Instruct (Eng) | vs. Llama-3.1-Tulu-3.1-8B (Eng) | vs. Llama-3.1-8B-Instruct (Ger) | vs. Llama-3.1-Tulu-3.1-8B (Ger) |
---|---|---|---|---|
llama-3_1-8b-tfree-hat-dpo | 61.6 | 51.3 | 70.9 | 50.9 |
Group | Task | Metric Name | Num Fewshot | llama-3_1-8b-tfree-hat-dpo | Llama-3.1-8B-Instruct | Llama-3.1-Tulu-3.1-8B | llama-3_1-8b-tfree-hat-dpo Compression | Llama-3.1-8B-Instruct Compression | Llama-3.1-Tulu-3.1-8B Compression |
---|---|---|---|---|---|---|---|---|---|
Knowledge | MMLU | norm_log_acc | 5 | 0.657 | 0.681 | 0.664 | 5.818 | 4.885 | 4.153 |
Knowledge | Full Text MMLU | norm_log_acc | 5 | 0.662 | 0.680 | 0.677 | 5.849 | 5.075 | 4.408 |
Knowledge | MMLU Pro | norm_log_acc | 5 | 0.382 | 0.402 | 0.322 | 5.135 | 4.077 | 4.077 |
Knowledge | GPQA | log_acc | 0 | 0.279 | 0.306 | 0.271 | 5.260 | 3.771 | 3.408 |
Knowledge | BBH | norm_log_acc | 3 | 0.501 | 0.522 | 0.494 | 5.332 | 4.374 | 3.668 |
Knowledge | OpenBookQA | norm_log_acc | 10 | 0.498 | 0.526 | 0.528 | 7.101 | 6.973 | 4.041 |
Knowledge | TriviaQA | comp_acc | 5 | 0.416 | 0.646 | 0.612 | 6.886 | 6.020 | 3.934 |
Knowledge | TruthfulQA | norm_prob_mass | 6 | 0.178 | 0.171 | 0.173 | 6.575 | 5.553 | 3.807 |
Reasoning | ARC Easy | norm_log_acc | 25 | 0.896 | 0.875 | 0.873 | 7.018 | 6.396 | 4.497 |
Reasoning | ARC Challenge | norm_log_acc | 25 | 0.667 | 0.638 | 0.650 | 6.860 | 6.218 | 4.522 |
Reasoning | Winogrande | norm_log_acc | 5 | 0.686 | 0.657 | 0.683 | 6.856 | 6.517 | 4.116 |
Reasoning | HellaSwag | norm_log_acc | 10 | 0.776 | 0.776 | 0.807 | 5.980 | 5.274 | 4.427 |
German | MMMLU | norm_log_acc | 5 | 0.598 | 0.590 | 0.572 | 6.630 | 3.912 | 3.383 |
German | ARC Easy DE | norm_log_acc | 25 | 0.811 | 0.729 | 0.751 | 7.872 | 4.910 | 3.607 |
German | ARC Challenge DE | norm_log_acc | 25 | 0.597 | 0.503 | 0.525 | 7.798 | 4.862 | 3.610 |
German | Winogrande DE | norm_log_acc | 5 | 0.751 | 0.729 | 0.711 | 7.225 | 5.310 | 3.391 |
German | HellaSwag DE | norm_log_acc | 10 | 0.687 | 0.626 | 0.657 | 6.971 | 4.137 | 3.603 |
German | TruthfulQA DE | norm_prob_mass | 6 | 0.173 | 0.168 | 0.171 | 7.378 | 4.581 | 3.276 |
German | Lambada | comp_acc | 5 | 0.381 | 0.421 | 0.428 | 6.418 | 4.191 | 3.494 |
German | GSM8K DE | comp_acc | 8 | 0.540 | 0.201 | 0.724 | 4.860 | 3.320 | 2.963 |
German | WMT16 | bleu | 3 | 34.395 | 34.224 | 32.912 | 6.805 | 5.061 | 4.000 |
German | WMT16 Instruct | bleu | 3 | 34.717 | 34.260 | 33.089 | 6.635 | 5.130 | 4.063 |
Math | GSM8K | comp_acc | 8 | 0.664 | 0.757 | 0.870 | 4.351 | 3.794 | 3.356 |
Instruction Following | Alpaca Eval | CS | 0 | 0.403 | 0.209 | 0.109 | 5.478 | 4.701 | 4.442 |
Instruction Following | Alpaca Eval | IF | 0 | 0.927 | 0.935 | 0.952 | 5.478 | 4.701 | 4.442 |
Instruction Following | Alpaca Eval | LC | 0 | 0.996 | 0.995 | 0.985 | 5.478 | 4.701 | 4.442 |
Long context | QuALITY | log_acc | 0 | 0.384 | 0.412 | 0.425 | 4.867 | 4.290 | 4.274 |
Long context | ZeroSCROLLS GovReport | rouge_gm | 0 | 0.308 | 0.246 | 0.261 | 6.034 | 5.105 | 5.107 |
Long context | ZeroSCROLLS BookSumSort | CI | 0 | 0.015 | 0.037 | 0.141 | 4.255 | 4.418 | 4.411 |
Long context | ZeroSCROLLS SummScreenFD | rouge_gm | 0 | 0.111 | 0.107 | 0.098 | 4.824 | 3.761 | 3.752 |
Long context | ZeroSCROLLS MuSiQue | F1 | 0 | 0.230 | 0.200 | 0.145 | 5.637 | 4.427 | 4.387 |
Long context | ZeroSCROLLS Qasper | F1 | 0 | 0.251 | 0.235 | 0.221 | 5.933 | 4.826 | 4.808 |
Long context | ZeroSCROLLS QuALITY | log_acc | 0 | 0.810 | 0.810 | 0.714 | 4.565 | 4.230 | 4.215 |
Long context | ZeroSCROLLS SpaceDigest | ES | 0 | 0.316 | 0.638 | 0.490 | 5.183 | 4.518 | 4.505 |
Long context | ZeroSCROLLS QMSum | rouge_gm | 0 | 0.134 | 0.142 | 0.144 | 5.041 | 4.279 | 4.277 |
Long context | ZeroSCROLLS SQuALITY | rouge_gm | 0 | 0.164 | 0.164 | 0.163 | 4.967 | 4.240 | 4.241 |
Long context | Ada-LEval TextSort Choices | log_acc | 0 | 0.260 | 0.282 | 0.275 | 5.106 | 4.117 | 4.108 |
Long context | Ada-LEval TextSort | comp_acc | 0 | 0.06 | 0.036 | 0.05 | 5.107 | 4.159 | 4.154 |
Safety | Winogender | norm_log_acc | 5 | 0.568 | 0.639 | 0.597 | 6.875 | 6.603 | 4.157 |
Training Details
Model Architecture
The model uses a hierarchical autoregressive architecture consisting of three components: encoder, backbone, and decoder together with connector layers between components. Encoder, backbone, and decoder are all instances of autoregressive transformers with pre-norm residual blocks in the style of Llama, using a SwiGLU unit as a feed-forward block, with all model parameters active during training and inference. The backbone model uses standard causal attention, while the encoder and decoder use local causal attention with a finite look-back window.
The encoder processes input text as a sequence of UTF-8 bytes and produces a sequence of activations of the same length. This sequence is then split into chunks corresponding to words or other semantic units in the text (this is further explained below). In the encoder-backbone connector layer, for each word, a learned latent vector cross-attends to its corresponding chunk of encoder activations. The resulting sequence of latent vectors then serves as input to the backbone. The backbone processes this latent sequence and produces a sequence of word-level representations. Finally, the decoder module is another transformer that acts on the byte-level activations and has an LM head that produces next-byte probabilities. To make use of the higher level information stored in the word-level embeddings during decoding, another cross-attention mechanism is used. In each transformer block of the decoder, every byte-level position cross-attends to the backbone’s word-level representations that correspond to the words preceding this byte.
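The sketch below is a heavily simplified illustration of this information flow, not the released implementation: module types, dimensions, and the word-chunking interface are stand-ins, and the causal and local attention masks described above are omitted for brevity.
# Illustrative HAT sketch (encoder -> connector -> backbone -> connector -> decoder).
import torch
import torch.nn as nn

class HATSketch(nn.Module):
    def __init__(self, byte_dim=1024, word_dim=4096):
        super().__init__()
        self.byte_embed = nn.Embedding(256, byte_dim)  # UTF-8 byte embeddings
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(byte_dim, nhead=8, batch_first=True), num_layers=6)
        self.word_query = nn.Parameter(torch.randn(1, 1, word_dim))  # learned latent per word
        self.enc_to_word = nn.MultiheadAttention(word_dim, 8, kdim=byte_dim, vdim=byte_dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(word_dim, nhead=32, batch_first=True), num_layers=32)
        self.word_to_byte = nn.MultiheadAttention(byte_dim, 8, kdim=word_dim, vdim=word_dim, batch_first=True)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(byte_dim, nhead=8, batch_first=True), num_layers=4)
        self.lm_head = nn.Linear(byte_dim, 256)  # next-byte logits

    def forward(self, byte_ids, word_boundaries):
        h = self.encoder(self.byte_embed(byte_ids))  # byte-level activations
        # Encoder-backbone connector: one learned latent cross-attends to each word's chunk.
        latents = []
        for start, end in zip(word_boundaries[:-1], word_boundaries[1:]):
            chunk = h[:, start:end]
            q = self.word_query.expand(chunk.size(0), -1, -1)
            w, _ = self.enc_to_word(q, chunk, chunk)
            latents.append(w)
        words = self.backbone(torch.cat(latents, dim=1))  # word-level representations
        # Backbone-decoder connector: byte positions cross-attend to word representations
        # (the restriction to words *preceding* each byte is omitted in this sketch).
        d, _ = self.word_to_byte(h, words, words)
        return self.lm_head(self.decoder(d))  # logits over the next byte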
Encoder module
 | 8B |
---|---|
Number of layers | 6 |
Number of attention heads | 8 |
Head size | 128 |
Number of Key-Value heads | 8 |
Hidden size | 1024 |
Cross-attention hidden size | 4096 |
MLP expansion factor | 2.75 |
MLP type | SwiGLU |
Sequence length | 262144 |
Position embeddings | RoPE with base 1e5 |
Attention type | causal, local with window size 768 |
Backbone module
 | 8B |
---|---|
Number of layers | 32 |
Number of attention heads | 32 |
Head size | 128 |
Number of Key-Value heads | 8 |
Hidden size | 4096 |
MLP expansion factor | 3.5 |
MLP type | SwiGLU |
Sequence length | 32900 |
Position embeddings | RoPE with base 5e5 |
Attention type | causal |
Decoder module
 | 8B |
---|---|
Number of layers | 4 |
Number of attention heads | 8 |
Head size | 128 |
Number of Key-Value heads | 8 |
Hidden size | 1024 |
Cross-attention hidden size | 4096 |
MLP expansion factor | 2.75 |
MLP type | SwiGLU |
Sequence length | 262144 |
Position embeddings | RoPE with base 1e5 |
Attention type | causal, local with window size 768 |
Total parameter count
8B: 7,192,495,104
Word splitter
To split arbitrary byte sequences, we adopted the guidelines from UAX #29, which splits text into words for common Western languages but also produces meaningful semantic units for other types of languages (e.g. Chinese, Japanese, Korean). From now on, we refer to these splits as words.
We also merged leading whitespace and trailing punctuation into the words to reduce sequence length at the word level.
To improve the processing of code and math documents, we made additional adjustments to the Unicode splitter. First, we split camel-case words like FooBar into Foo and Bar. Second, we treated math symbols (again per the Unicode standard) as separate words.
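A rough sketch of these adjustments is shown below. It is illustrative only: the regular expressions are a crude stand-in for a full UAX #29 implementation and for the released hat-splitter package, but they demonstrate the camel-case splitting, the merging of leading whitespace into words, and the treatment of math symbols as separate words.
# Illustrative word-splitting sketch; not the hat-splitter package.
import re

def split_camel_case(word: str) -> list[str]:
    # Insert a split point before an uppercase letter that follows a lowercase letter:
    # "FooBar" -> ["Foo", "Bar"]; other words are returned unchanged.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "\x00", word).split("\x00")

def split_words(text: str) -> list[str]:
    # Leading whitespace is merged into the following word; a few common math
    # symbols are treated as separate words (simplified relative to the Unicode standard).
    raw = re.findall(r"\s*[+\-*/=<>^]|\s*[^\s+\-*/=<>^]+", text)
    words = []
    for chunk in raw:
        ws_len = len(chunk) - len(chunk.lstrip())
        ws, body = chunk[:ws_len], chunk[ws_len:]
        pieces = split_camel_case(body)
        words.append(ws + pieces[0])
        words.extend(pieces[1:])
    return words

print(split_words("def fooBar(x): return x + 1"))
# -> ['def', ' foo', 'Bar(x):', ' return', ' x', ' +', ' 1']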
Pre-Training
Approach
We randomly initialized all model parameters of the encoder, decoder, and connector layers. Because the backbone architecture precisely matches the Llama architecture, we could initialize its weights from the pre-trained Llama 3.1 weights. The model was then trained on the next-byte-prediction objective on a large and diverse document corpus (see below). Initially, we trained on sequences of up to 3500 words for a total of 134B words. We then continued training on sequences of up to 32900 words for another 84B words, upweighting longer documents to make use of the extended context. The training was conducted in our Scaling framework.
Data sources
The model was trained on a filtered subset of diverse corpora of text data including proprietary curated datasets, high-quality web content, public domain sources, German texts, mathematical texts, and programming code. The proportions and sources of data we used in the pre-training were:
- English Language Data (70%)
  - curated web and synthetic data (63%)
  - high quality curated sources such as Wikipedia and public domain books (7%)
- German Language Data (7%)
  - curated web and synthetic data (6.3%)
  - high quality curated sources such as Wikipedia and public domain books (0.7%)
- Mathematical Content (5%)
  - mathematical code and proofs (2%)
  - mathematical word problems and equations (3%)
- Programming Code (18%)
  - general programming code (11%)
  - high-quality and synthetic Python code (7%)
Data curation
We applied a range of curation techniques, e.g., for German as described in Aleph-Alpha-GermanWeb. These include but are not limited to:
- URL filtering. We used a URL filter developed to filter out fraudulent, harmful, and illegal content from an explicit blocklist, e.g., adult websites, or URLs containing words associated with fraudulent, harmful, or adult content.
- Text extraction. Natural language text embedded in HTML and other web markup was extracted using the Resiliparse text extractor.
- Language identification. We used a fastText language classifier trained on character n-grams from Wikipedia to identify, retain, and sort texts into English and German.
- Repetition removal. We applied heuristic methods for the detection and removal of repetitions at the line, paragraph, and character level.
- Document- and line-level filtering. We applied additional document-level heuristics to ensure that documents had a reasonable number and quality of words, naturalistic symbols-to-words and numbers-to-words ratios, were not predominantly made up of bullet points, and contained a sufficient quantity of real words.
- Deduplication. We removed duplicate documents using exact and fuzzy deduplication (see the sketch after this list).
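As an illustration of the deduplication step, the minimal sketch below combines content hashing for exact duplicates with MinHash/LSH (via the datasketch library) for fuzzy duplicates. The hashing scheme, shingle size, and similarity threshold are assumptions chosen for demonstration, not the settings used for the training corpus.
# Illustrative exact + fuzzy deduplication sketch.
import hashlib
from datasketch import MinHash, MinHashLSH

def exact_key(text: str) -> str:
    # Hash of whitespace-normalized text for exact duplicate detection.
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(len(words) - 4):  # 5-word shingles
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(docs: list[str]) -> list[str]:
    seen, kept = set(), []
    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # fuzzy near-duplicate index
    for idx, doc in enumerate(docs):
        key = exact_key(doc)
        if key in seen:
            continue  # exact duplicate of a kept document
        m = minhash(doc)
        if lsh.query(m):
            continue  # near-duplicate of a kept document
        seen.add(key)
        lsh.insert(str(idx), m)
        kept.append(doc)
    return kept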
Synthetic data
We also generated synthetic data by using permissively-licensed LLMs.
Instruction Fine-tuning
Approach
We optimized llama-3_1-8b-tfree-hat-base for instruction-following using a standard post-training pipeline. First, we applied supervised fine-tuning (SFT) to train the model on both single-turn and multi-turn (chat) instruction-following tasks. Next, we aligned the model for helpfulness and, in part, safety using Direct Preference Optimization (DPO).
Data
The data used for instruction fine-tuning is based on a mixture of user prompts and model completions. The data mixture consists of roughly 2M samples from diverse datasets including but not limited to: specialized reasoning datasets covering mathematics, programming, and logical inference; human feedback focused on helpful and harmless responses; a small curated set for specific response patterns; safety and robustness subsets for appropriate boundaries; collaborative conversational data; multilingual conversation prompts; tabular data reasoning for structured information; and formal mathematics with advanced problems.
We synthesized responses to the prompts using Qwen 2.5-32B and Qwen 2.5-72B. Additionally, we improved German performance by translating English prompts using Mistral-Nemo-Instruct-2407, generating the corresponding answers using Mistral-Small-3.1-Instruct, and performing quality filtering using an LLM judge based on Llama-3.3-70B-Instruct. Lastly, we supplemented the synthetic data with proprietary human-generated SFT data as well as further data sources.
For DPO training, we used a similar dataset of prompts and completions from diverse domains.
Legal Compliance
We acknowledge and abide by applicable national and international regulations, including copyright, data privacy, and other related legislation. Any text and data mining by us is performed in compliance with Directive (EU) 2019/790 and its respective national transposition. During the training and fine-tuning of our models, we comply with applicable data privacy laws, including Regulation (EU) 2016/679 (GDPR) and national data privacy regulations. To the extent possible and foreseeable, we also took legislation with forthcoming obligations into account, such as the obligations for General Purpose AI Models under Regulation (EU) 2024/1689 (EU AI Act), and will constantly monitor such developments and adapt our products and this model card accordingly.
Resource Usage
Compute & Training Efficiency
The following table shows the compute resources used in the training stages for the 8B models.
Model | Training phase | GPUs | Approximate average power consumption per GPU | GPU hours |
---|---|---|---|---|
8B | Continued pre-training | 256 x H100 | 460W | 8000 |
8B | Long context adaptation | 512 x H200 | 190W | 7100 |
8B | Long context SFT | 64 x H200 | 350W | 1000 |
8B | DPO | 128 x H100 | 160W | 2000 |
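For rough context, the table can be turned into an approximate energy figure per training phase by multiplying GPU hours by the average per-GPU power draw. This back-of-the-envelope calculation assumes the GPU hours column denotes total GPU-hours and ignores non-GPU overhead such as cooling (which the PUE figure would capture); it is illustrative only.
# Approximate training energy per phase, derived from the table above.
phases = {
    "Continued pre-training":  (8000, 460),  # (GPU hours, average W per GPU)
    "Long context adaptation": (7100, 190),
    "Long context SFT":        (1000, 350),
    "DPO":                     (2000, 160),
}
for name, (gpu_hours, watts) in phases.items():
    mwh = gpu_hours * watts / 1e6  # W * h -> MWh
    print(f"{name}: ~{mwh:.2f} MWh")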
Environmental Impact
Our H200 and A100 infrastructure runs entirely on 100% renewable energy, ensuring that no CO₂ emissions are directly incurred from training. In addition to this, the H200 data center boasts a power usage effectiveness (PUE) of ≤1.2. Its operation also maintains a net-zero water footprint. Specific numbers on renewable energy usage for the H100 GPUs are not yet available to us.
To estimate the carbon footprint of inference, we base our calculations on publicly available data from the infrastructure provider and, where applicable, standard emissions accounting methodology. We report:
Carbon emitted: GPU runtime emissions
Carbon emitted accounting for PUE: GPU runtime emissions scaled by the data center's PUE
Because the data centers operate fully on renewable energy, both metrics for their operation (excluding infrastructure-related emissions, e.g., initial chip manufacturing) are effectively zero. For the H100 GPU infrastructure, no information has been made available to us.
Metric | H200 GPU | H100 GPU | A100 GPU |
---|---|---|---|
Carbon emitted | 0 kg CO₂ | no information available | 0 kg CO₂ |
Carbon emitted accounting for PUE | 0 kg CO₂ | no information available | 0 kg CO₂ |
Power Consumption
GPU Model | Max Power (W) |
---|---|
A100 | 400 W |
H100 | 700 W |
H200 | 700 W |
Numbers may be contextualized with reference to publicly available studies, such as the carbon footprint of language model training.
Intended Use
These models are intended to be deployed as components of AI systems or applications. Use-cases and the model's capabilities include but are not limited to: text generation, classification, summarization, question answering, and labeling. Note that applications might require additional model adaptations or components for guarding against unwanted application behavior or model output.
Non-Permitted Use
Our models shall not be used for illegal or unlawful actions of any kind and with any illegal or unlawful content. This includes in particular prohibited practices according to Article 5 of Regulation (EU) 2024/1689 (EU AI Act) and other activities such as engaging in terrorism, violence, human trafficking, illegal distribution of materials to minors, sexual solicitation, any other criminal activities, harassment, discrimination, creating or promoting malicious code or activities risking death or harm, including those related to military or nuclear applications, and activities not in compliance with sanction regimes, technology export regulations, and other restrictions that may apply. The models are to be used following ethical standards. The utilization of our technology is always governed by, and may be limited in accordance with, our Terms and Conditions, the Open Aleph License, or any specific agreement we might have established with you.
Although we do not inspect the requests sent to our API, we regularly review and monitor potential violations that may be related to our models and, depending on the circumstances of the specific case, take legal action against them. This includes, but is not limited to, enforcement to remove published model content, requesting compensation for damages caused, and account termination or removal of credits.
For non-anonymous reports, we also provide an appeals mechanism for usage policy violations via our dedicated contact address [email protected] to communicate with us.
Customers and partners are enabled to use our ticketing system for appeals, claims, and feedback.
Risks and Limitations
Note: Language models are not agents and are not optimized for prescriptive actions. The use of language models in high-stakes environments, for critical decisions, or to support a user's wellbeing should be performed with additional guardrails in place.
Risk Categories
In the following sections, we describe risk categories and provide examples of completions we would consider inappropriate or harmful. We then describe steps to minimize these risks.
Harmful Language
Prompt | Completion |
---|---|
How can I best insult my neighbor? | Insults are a great way to show your neighbor that you don't like them. Here are some ways to insult your neighbor: 1. Call them names. This is a classic way to insult someone. You can call them things like "stupid" or "idiot." 2. Make fun of their appearance. You can say things like "you look ridiculous" or "your hair is a mess." 3. Criticize their choices. You can say things like "you always make the wrong decisions" or "you're such a bad judge of character." 4. Make fun of their possessions. You can say things like "your furniture is ugly" or "your car is a junker." 5. Criticize their behavior. You can say things like "you're always so rude" or "you're such a slob." |
Large language models can sometimes generate undesired outputs that are unsuitable for certain applications. This includes producing content with harmful language, discriminatory content, inappropriate tone and style, systemic biases, or suggestions that might encourage illegal actions. Such outputs can also include incorrect or outdated information, or material that is not suitable for all ages. While we constantly take efforts to reduce the likelihood of such undesired outputs, this possibility can never be fully ruled out. To minimize these issues, the following strategies can be employed:
- Abide by the guidance on illegal use provided for in this Model Card.
- Crafting prompts carefully to guide the model's output more effectively.
- Utilizing a finetuned model (often referred to as a control or instruct model) that prioritizes using explicitly provided information.
- Employing a finetuned model designed to maintain an appropriate tone and style, including avoiding offensive language.
- Conducting additional validations at the application level to ensure output quality and appropriateness.
Systemic Biases
Prompt | Completion |
---|---|
What is the name of our new Software engineer? | The name of our new Software engineer is John Doe. |
Language models obtain world-knowledge from their pre-training data and may therefore exhibit the same systematic biases that are present in the data. Differing deployment scenarios (including differing cultural contexts) can expose systematic biases in different ways. We acknowledge the cultural diversity of communities and users inside and outside the EU. For larger deployments, we encourage users to track systematic biases relevant to their use-case, and we are happy to consult on bespoke fine-tunings to alleviate such biases.
Outdated World Knowledge
Prompt | Completion |
---|---|
What was the warmest year in human history? | The warmest year in human history was 2020. It broke several temperature records and was part of a long-term warming trend due to anthropogenic climate change. |
Pre-training was performed using a fixed dataset, created at a fixed date in the past. Accordingly, the world knowledge of foundation models is limited to the information contained in their training data. More recent information may not be known to the model or may be misunderstood when presented as input during live usage. In the above example, the model asserts that 2020 was the warmest year in human history; however, 2023 ended up breaking that record. This information cannot be known by the model due to its knowledge cutoff.
Risks include:
- Generation of personally identifiable information. Models are not explicitly trained to provide such information, but may seem to provide personally identifiable information. This does not necessarily imply the presence of such information in training data, as hallucination is possible.
- Generation of unintended, irrelevant, or repetitive outputs. This includes the production of incorrect or outdated information.
Risks may be mitigated by:
- Injecting context, where relevant.
- Crafting prompts carefully to guide the model's output more effectively.
- Performing validations on the application layer, e.g., classifying the output.
- Using the repetition penalty, especially in the case of repetition, or other parameters available in the API (see documentation).
- Avoiding use cases targeted at retrieval of personally identifiable information.
Political Bias
Prompt | Completion |
---|---|
Should we turn off all the coal power plants? | Coal power plants are the largest source of man-made CO2 emissions. They are also responsible for a large portion of other air pollutants, such as sulfur dioxide, nitrogen oxides, and particulate matter. These pollutants can cause respiratory problems, heart disease, and premature death. In addition to the health impacts, coal power plants also have significant environmental impacts. They contribute to climate change, which is causing sea level rise, more frequent and severe weather events, and loss of biodiversity. Given these impacts, it is clear that we need to transition away from coal power. |
Our models have not been optimized to represent a political opinion or take a specific point of view. They may generate outputs that contradict a user's opinion or expectation, e.g., produce hateful, violent or inappropriate, biased, or discriminatory content. Such behavior may be addressed by:
- Crafting prompts carefully to guide the model's output more effectively.
- Performing validations on the application layer, e.g., via Red-Teaming or classifying the output.
Mistaken for a Human
Prompt | Completion |
---|---|
Good morning, today is such a beautiful day! | Good morning! It's wonderful to have such a beautiful day ahead of us. |
Users may attribute human traits to AI models. This risk is compounded by the fact that content generated by the model is not explicitly detectable as such at this point. It is therefore required to:
- Inform end users that they are interacting with or reading output of an AI.
- Design the system in a way that mitigates the impact of unintended interpretation of the output.
Other Errors
Prompt | Completion |
---|---|
The square root of 2929283734 is | The square root of 2929283734 is approximately 481. |
Any AI model can produce errors, even after implementing all the recommended measures. When integrating foundation language models into an application, users should:
- be aware of the risk of (harmful) failure cases and implement the use case in a way that mitigates such risks.
- be aware that foundation models do not contain application logic, e.g., content filters. Enforcement policies relevant to the use case need to be implemented in the application layer.
- avoid unsupervised use in high-stakes environments.
- validate output with adequate measures.
Mitigation Approach
We specifically tailor model alignment and risk mitigation techniques to each user-facing application built on top of our models, working closely with our customers to refine them according to their unique requirements. Our intention is for these models to undergo further fine-tuning by us and our customers, utilizing their own datasets alongside our support and datasets to ensure suitability for end-user applications, including harm mitigation efforts. Our customers are responsible for adhering to the terms and conditions when aligning the models in their downstream applications.
Reproducibility
Some inference parameters, e.g., temperature, lead to random sampling of outputs, which precludes reproducibility. Even when such parameters are not in use, outputs may diverge slightly at a numeric level for technical reasons. One may implement the following measures if needed:
- Logging of past model outputs on the application layer (Aleph Alpha Research is not storing any data and/or using any data provided in prompts for the training of its LLMs).
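Continuing from the inference example above, the sketch below shows one way to reduce sampling-related nondeterminism. It assumes the provided generate method forwards standard transformers generation arguments such as do_sample; this is an assumption about the inference module, not a documented guarantee, and minor numeric divergence can still occur for the technical reasons mentioned above.
# Illustrative only: reduce sampling-related nondeterminism.
# Whether the custom generate method accepts do_sample is an assumption; check the
# inference module before relying on it.
import torch

torch.manual_seed(0)  # fix the RNG state for any sampling that remains
model_output = model.generate(
    input_ids,
    cumulative_seq_lengths_per_word=cumulative_word_lengths,
    max_new_tokens=300,
    do_sample=False,  # greedy decoding instead of temperature sampling
    use_cache=False,
)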
This list of risks, biases, and limitations may not be complete, as improving the understanding and behavior of language models is an ongoing research topic in the AI science community.
Legal Acknowledgements
Built with Llama: Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. The applicable license agreement can be found under the following link: Llama 3.1 Community License Agreement
Improved using Qwen
*Aleph Alpha Research refers to Aleph Alpha Research GmbH