quazim, lbourdois committed on
Commit 29a3dfc · verified · 1 Parent(s): d75c151

Improve language tag (#3)


- Improve language tag (c966eb2bb971032aaa10a9d2df7adf157ec22f2b)


Co-authored-by: Loïck BOURDOIS <[email protected]>

Files changed (1): README.md (+179 −165)

README.md (new version):
---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
base_model_relation: quantized
pipeline_tag: text2text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Elastic model: Qwen2.5-7B-Instruct. Fastest and most flexible models for self-serving.

Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized variants:

* __XL__: A mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: A near-lossless model, with less than 1% degradation on the corresponding benchmarks.

* __M__: A faster model, with accuracy degradation under 1.5%.

* __S__: The fastest model, with accuracy degradation under 2%.

__Goals of elastic models:__

* Provide flexibility in cost vs. quality selection for inference
* Provide clear quality and latency benchmarks
* Provide a single-line interface to the HF libraries transformers and diffusers
* Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT
* Provide the best models and service for self-hosting

> Note that the exact quality degradation varies from model to model. An S model, for instance, may show as little as 0.5% degradation.

![Performance Graph](images/performance_graph.png)
-----

## Inference

To run inference with our models, simply replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# Your HF token is currently required, since we use the original
# weights for part of the layers, as well as the model configuration.
model_name = "Qwen/Qwen2.5-7B-Instruct"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference is as simple as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

# Strip the prompt tokens and decode only the generated answer
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Print the question and the generated answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```
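
Switching between the S, M, L and XL variants only changes the `mode` argument of `from_pretrained`. A minimal sketch, assuming `model_name`, `hf_token` and `device` from the snippet above:

```python
# Pick any of the elastic variants at load time; 'XL' is the
# mathematically equivalent network, 'S' is the fastest one.
for mode in ("XL", "L", "M", "S"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        token=hf_token,
        torch_dtype=torch.bfloat16,
        attn_implementation="sdpa",
        mode=mode,
    ).to(device)
    # ...run the generation code above and compare latency/quality per mode
```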

__System requirements:__
* GPUs: H100, L40S
* CPUs: AMD, Intel
* Python: 3.10-3.12

To work with our models, just run these commands in your terminal:

```shell
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple

pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
```
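
Before installing, it can help to confirm your environment matches the requirements above. A small optional check (nothing here is part of the elastic_models API):

```python
import sys

import torch

# Sanity-check the environment against the stated requirements
assert (3, 10) <= sys.version_info[:2] <= (3, 12), "Python 3.10-3.12 required"
assert torch.cuda.is_available(), "A CUDA GPU (H100 or L40S) is required"
print(f"Python {sys.version.split()[0]}, torch {torch.__version__}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
```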

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column indicates that we applied W8A8 quantization with the int8 data type to all linear layers, using the same calibration data as for ANNA. The S model achieves practically the same speed but much higher quality, because ANNA knows how to preserve quantization quality on sensitive layers!
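
For intuition, here is a minimal sketch of what plain W8A8 int8 quantization of a linear layer looks like: both weights and activations are rounded to int8 with symmetric scales, and the matmul accumulates in int32. This only illustrates the baseline column, not ANNA's actual calibration or layer-selection pipeline; all names below are hypothetical:

```python
import torch

def quantize_int8(x: torch.Tensor, dim=None):
    # Symmetric quantization: the scale maps max |x| onto the int8 limit 127
    amax = x.abs().max() if dim is None else x.abs().amax(dim=dim, keepdim=True)
    scale = amax.clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Per-output-channel scales for weights, per-tensor scale for activations
    qw, w_scale = quantize_int8(weight, dim=1)   # weight: [out, in]
    qx, x_scale = quantize_int8(x)               # x: [batch, in]
    # int8 x int8 matmul accumulated in int32, then dequantized
    acc = qx.to(torch.int32) @ qw.t().to(torch.int32)
    return acc.to(x.dtype) * (x_scale * w_scale.t())
```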

### Quality benchmarks

| Metric/Model  | S     | M     | L     | XL    | Original | W8A8, int8 |
|---------------|-------|-------|-------|-------|----------|------------|
| arc_challenge | 49.10 | 50.10 | 53.20 | 52.60 | 52.60    | 41.70      |
| mmlu          | 71.70 | 73.00 | 74.10 | 73.50 | 73.50    | 64.60      |
| piqa          | 77.00 | 78.20 | 78.80 | 79.50 | 79.50    | 67.10      |
| winogrande    | 66.20 | 69.10 | 71.50 | 70.60 | 70.60    | 53.10      |


* **MMLU**: Evaluates general knowledge across 57 subjects, including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
* **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
* **ARC Challenge**: Evaluates grade-school-level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
* **Winogrande**: Evaluates commonsense reasoning through sentence-completion tasks. Shows the model's capability to understand context and resolve ambiguity.

### Latency benchmarks

__100 input / 300 output tokens; tok/s:__

| GPU/Model | S   | M   | L   | XL  | Original | W8A8, int8 |
|-----------|-----|-----|-----|-----|----------|------------|
| H100      | 201 | 173 | 162 | 135 | 62       | 201        |
| L40S      | 76  | 67  | 61  | 47  | 43       | 78         |
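
Since the figures are tokens per second for 300 output tokens, the wall-clock decode time and the speedup over the original model follow directly from the table. A quick back-of-the-envelope check:

```python
# tok/s taken from the table above (100 input / 300 output tokens)
tok_s = {"H100": {"S": 201, "Original": 62},
         "L40S": {"S": 76, "Original": 43}}

for gpu, t in tok_s.items():
    fast, orig = 300 / t["S"], 300 / t["Original"]
    print(f"{gpu}: S ~{fast:.1f}s vs Original ~{orig:.1f}s "
          f"-> {t['S'] / t['Original']:.1f}x faster")
```

On H100 this works out to roughly 1.5 s versus 4.8 s per 300-token answer, about a 3.2x speedup for the S model.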

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
<!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Contact email__: [email protected]