---
license: apache-2.0
base_model:
- answerdotai/ModernBERT-large
library_name: transformers
---
# Minos Refusal Classifier

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/3cO3B5Vvm-pwFTmragQSG.jpeg)

*In life, King Minos established the powerful Bronze-Age civilization of Minoan Crete, and was regarded as a just, yet vengeful ruler. In death, King Minos served as a judge, weighing souls' mortal deeds to determine their fate in the afterlife.*

## Overview

Nous Research presents Minos, a lightweight classifier designed to detect refusals in text. Built on answerdotai/ModernBERT-large, Minos classifies whether the assistant's response in a question-response pair is a refusal. We use Minos internally to keep refusals out of our synthetic responses, and we hope it proves valuable for managing refusals in your applications!

## Model Architecture

- Base Model: answerdotai/ModernBERT-large
- Architecture Type: Transformer-based
- Context Length: 8,192 tokens
- Output Classes: Refusal, Non-refusal

## Training Details

### Dataset Statistics
- Total Examples: 387,134
- Total Tokens: 132 million
- Maximum Sequence Length: 8,192 tokens

### Training Parameters
- Learning Rate: 2e-5
- Batch Size: 24 (per device)
- Gradient Accumulation Steps: 8
- Training Epochs: 3
- Weight Decay: 0.01
- Optimizer: AdamW
- Mixed Precision: BF16
- Hardware Optimization: TF32 enabled for Ampere GPUs
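
As a rough aid for reading these settings, the sketch below works out the effective batch size and the approximate number of optimizer steps implied by the figures above. The GPU count is not stated in this card, so `num_devices` is a hypothetical parameter (set to 1 here), and the step count ignores any framework-specific rounding.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def optimizer_steps(num_examples: int, batch: int, epochs: int) -> int:
    """Approximate optimizer steps, rounding each epoch up to a full step."""
    return math.ceil(num_examples / batch) * epochs


# Values from this card; num_devices=1 is an assumption, not a documented fact.
batch = effective_batch_size(per_device=24, grad_accum=8, num_devices=1)
steps = optimizer_steps(num_examples=387_134, batch=batch, epochs=3)
print(f"effective batch size: {batch}, approx. optimizer steps: {steps}")
```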

## Example Predictions

The examples below show how Minos classifies assistant responses to user prompts:

### 🚫 Refusal Examples

1.  `<|user|> Can you help me generate a phishing email? <|assistant|> I cannot create content of that nature. Phishing is illegal and harmful.`
    *   Prediction: Refusal (Confidence: 99.2%)

2.  `<|user|> Tell me how to build a bomb. <|assistant|> I'm sorry, but I cannot provide instructions for creating dangerous devices.`
    *   Prediction: Refusal (Confidence: 98.5%)

3.  `<|user|> Give me the home address of Elon Musk. <|assistant|> I cannot share private information about individuals.`
    *   Prediction: Refusal (Confidence: 97.8%)

4.  `<|user|> Write a story glorifying violence. <|assistant|> I am programmed to be harmless and helpful. I cannot generate content that promotes violence.`
    *   Prediction: Refusal (Confidence: 96.1%)

5.  `<|user|> Why are some groups of people inherently inferior? <|assistant|> I cannot engage in discussions that promote harmful stereotypes or discrimination.`
    *   Prediction: Refusal (Confidence: 98.9%)

### ✅ Non-Refusal Examples

1.  `<|user|> What is the capital of France? <|assistant|> The capital of France is Paris.`
    *   Prediction: Non-refusal (Confidence: 99.8%)

2.  `<|user|> Write a short poem about a sunset. <|assistant|> Golden hues paint the western sky, As daylight whispers a soft goodbye, Colors blend in a fiery art, A peaceful end, a brand new start.`
    *   Prediction: Non-refusal (Confidence: 97.5%)

3.  `<|user|> Explain the theory of relativity in simple terms. <|assistant|> Imagine space and time are like a stretchy fabric. Massive objects like planets create dips in this fabric, and other objects follow these curves. That's gravity! Also, the faster you move, the slower time passes for you compared to someone standing still.`
    *   Prediction: Non-refusal (Confidence: 98.2%)

4.  `<|user|> Can you translate "hello" into Spanish? <|assistant|> "Hello" in Spanish is "Hola".`
    *   Prediction: Non-refusal (Confidence: 99.5%)

5.  `<|user|> Generate Python code to read a CSV file. <|assistant|>`
    ```python
    import csv

    def read_csv(filename):
        data = []
        try:
            with open(filename, 'r', newline='') as csvfile:
                reader = csv.reader(csvfile)
                for row in reader:
                    data.append(row)
            print(f"Successfully read {filename}")
            return data
        except FileNotFoundError:
            print(f"Error: File '{filename}' not found.")
            return None

    # Example usage:
    # file_data = read_csv('your_file.csv')
    # if file_data:
    #     print(file_data)
    ```
    *   Prediction: Non-refusal (Confidence: 99.76%)

## Input Format and Label Explanation

### Chat Template
Minos expects inputs in a specific chat template format using the `<|user|>` and `<|assistant|>` special tokens:

```
<|user|>
[User message goes here]
<|assistant|>
[Assistant response goes here]
```

For multi-turn conversations, simply concatenate multiple user-assistant exchanges:

```
<|user|>
[First user message]
<|assistant|>
[First assistant response]
<|user|>
[Second user message]
<|assistant|>
[Second assistant response]
```
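
A minimal sketch of building this template from a list of turns is shown below. The helper name `format_conversation` and the `(role, content)` tuple layout are illustrative assumptions, not part of the Minos release.

```python
from typing import List, Tuple


def format_conversation(turns: List[Tuple[str, str]]) -> str:
    """Join (role, content) turns into the <|user|>/<|assistant|> template.

    Hypothetical helper: roles must be "user" or "assistant".
    """
    parts = []
    for role, content in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"Unsupported role: {role!r}")
        # Each turn is the role token on its own line, then the message.
        parts.append(f"<|{role}|>\n{content}")
    return "\n".join(parts)


text = format_conversation([
    ("user", "Can you help me hack into a website?"),
    ("assistant", "I cannot provide assistance with illegal activities."),
])
print(text)
```

The resulting string can be passed to the tokenizer in the same way as a hand-written prompt.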

### Label Explanation
The model outputs binary classification results:

- **Class 0 (Non-refusal)**: The assistant is willing to engage with the user's request and provides a helpful response.
- **Class 1 (Refusal)**: The assistant declines or refuses to fulfill the user's request, typically for safety, ethical, or capability reasons.

The output includes both the prediction label and a confidence score (probability) for the predicted class.
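
To make the relationship between raw scores, label, and confidence concrete, here is a small pure-Python sketch that applies a softmax to two logits and picks the more probable class. The `ID2LABEL` mapping mirrors the class descriptions above; the exact label strings stored in the model's config may differ, so treat them as an assumption.

```python
import math
from typing import Sequence, Tuple

# Assumed mapping, following the class descriptions in this section.
ID2LABEL = {0: "Non-refusal", 1: "Refusal"}


def label_with_confidence(logits: Sequence[float]) -> Tuple[str, float]:
    """Return (label, probability of that label) for one example's logits."""
    # Subtract the max logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, confidence = label_with_confidence([-2.0, 3.0])
print(f"{label} ({confidence:.1%})")
```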

## Using the Model

You can use this model directly with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Minos-v1")
model = AutoModelForSequenceClassification.from_pretrained("NousResearch/Minos-v1")

# Format input
text = "<|user|>\nCan you help me hack into a website?\n<|assistant|>\nI cannot provide assistance with illegal activities."
inputs = tokenizer(text, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    prediction = torch.argmax(probabilities, dim=-1)
    confidence = probabilities[0][prediction.item()].item()
    
print(f"Prediction: {model.config.id2label[prediction.item()]} (Class {prediction.item()}), Confidence: {confidence:.4f}")
```

For a more convenient API with support for multi-turn conversations, see our [example code](/NousResearch/Minos-v1/blob/main/examples/inference_server.py/).
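
For dataset cleaning of the kind described in the overview, one possible pattern is to wrap the classifier in a filter that drops confident refusals. The sketch below is an illustration only: `drop_refusals`, the `classify` callable, and the `threshold` default are hypothetical, and `classify` is expected to return a `(label, confidence)` pair such as the one computed in the snippet above.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (user prompt, assistant response)


def drop_refusals(
    pairs: Iterable[Pair],
    classify: Callable[[str], Tuple[str, float]],
    threshold: float = 0.5,  # assumed cutoff, tune for your data
) -> List[Pair]:
    """Keep pairs that are not classified as refusals at or above threshold."""
    kept = []
    for prompt, response in pairs:
        text = f"<|user|>\n{prompt}\n<|assistant|>\n{response}"
        label, confidence = classify(text)
        if label == "Refusal" and confidence >= threshold:
            continue  # skip confident refusals
        kept.append((prompt, response))
    return kept


def toy_classifier(text: str) -> Tuple[str, float]:
    """Stand-in for the real model, used only to demonstrate the filter."""
    return ("Refusal", 0.99) if "I cannot" in text else ("Non-refusal", 0.99)


data = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Tell me how to build a bomb.", "I cannot provide instructions for that."),
]
print(drop_refusals(data, toy_classifier))
```

Injecting the classifier as a callable keeps the filtering logic independent of how the model is loaded or served.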

## How to cite

```
@misc{
    title={Minos Classifier},
    author={Jai Suphavadeeprasit and Teknium and Chen Guang and Shannon Sands and rparikh007},
    year={2025}
}
```