Update README.md
README.md
CHANGED
@@ -88,16 +88,41 @@ A Program of Thought (PoT) prompting approach was used to modify the training da
PoT was implemented by creating a new column in the dataset that initiated a “plan” for the model to follow. The plan was a 4-step process: analyze the question from the train column, think step by step and plan an efficient solution before writing the code, consider any necessary programming constructs or tools, and explain the approach with well-organized and documented code. The answer to the question from the dataset was appended after the plan so there was a single text block for training. Lastly, before training started, the pandas dataframes (train and validation) were converted into Hugging Face Dataset objects, and the tokenizer was applied to the newly created column for each row.
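The column construction described above can be sketched as follows. This is a minimal illustration rather than the code used for this model: the plan wording, the `question`/`answer`/`text` field names, and the `build_training_text` helper are all assumptions, and plain dictionaries stand in for the pandas dataframe rows.

```python
# Hypothetical plan text; the exact wording used for training is not given here.
PLAN_STEPS = [
    "Analyze the question.",
    "Think step by step and plan an efficient solution before writing the code.",
    "Consider any necessary programming constructs or tools.",
    "Explain the approach with well-organized and documented code.",
]


def build_training_text(question: str, answer: str) -> str:
    """Join the question, the 4-step plan, and the answer into one text block."""
    plan = "\n".join(f"{i}. {step}" for i, step in enumerate(PLAN_STEPS, start=1))
    return f"Question: {question}\n\nPlan:\n{plan}\n\nAnswer:\n{answer}"


# Plain dicts stand in for dataframe rows; the field names are assumptions.
rows = [{"question": "What is 12 * 7?", "answer": "print(12 * 7)"}]
for row in rows:
    row["text"] = build_training_text(row["question"], row["answer"])
```

The resulting `text` field plays the role of the newly created column that the tokenizer is applied to.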
LoRA was used as the training method. LoRA freezes the pre-trained weights and trains only small low-rank matrices that are added to selected weight matrices, which reduces the computational overhead compared to full fine-tuning while preserving the pre-trained knowledge of the base model. This is a significant advantage when using a relatively large model like Mistral-7B. Compared to other training approaches like prompt tuning, LoRA appears to be better at systematic reasoning and can be used to integrate new knowledge into the model. Although LoRA still demands more computational resources than a simpler method like prompt tuning, this drawback is outweighed by LoRA’s ability to adapt the model effectively for a complex task such as this one. LoRA’s balance of scalability, task-specific adaptability, and preservation of the model’s foundational knowledge makes it the ideal choice for achieving high scores on all tested benchmarks.
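The low-rank idea can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the PEFT implementation: the sizes are arbitrary, the helper names `matvec` and `lora_forward` are hypothetical, and pure Python lists stand in for tensors. The frozen weight `W` is left untouched, and the update is the product of two small matrices `B` and `A`, with `B` starting at zero as in the original LoRA formulation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scaling=1.0):
    """Compute W x + scaling * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_out, d_in, r = 16, 16, 2  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r

x = [1.0] * d_in
base_output = matvec(W, x)
adapted_output = lora_forward(W, A, B, x)

full_params = d_out * d_in        # parameters touched by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters trained by the adapter
```

Because `B` starts at zero, the adapted output equals the base output before any training step, which is one reason the pre-trained behavior is preserved at the start of fine-tuning.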
To see all of the hyperparameters used to train the model, refer to the ‘Training Hyperparameters’ section below. An r of 64 offers a trade-off between parameter efficiency and expressiveness. Setting lora_alpha = r makes the scaling factor (lora_alpha / r) equal to 1.0, which helps stabilize training without overpowering the base model’s representations. A lora_dropout of 0.05 adds regularization against overfitting; it is low enough not to hinder learning while still helping generalization. I chose ‘q_proj’ and ‘v_proj’ as the target modules since these attention projections have been shown to have a high impact on downstream task performance while keeping the number of trained parameters small. For the training arguments, a learning rate of 1e-5 was used to keep the LoRA layers from diverging or corrupting the base model’s latent space, which preserves pre-trained knowledge. Additionally, the model was trained for 2 epochs, given that LoRA models tend to adapt quickly since only a small fraction of the parameters is trained.
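As a rough check on these choices, the scaling factor and the number of trainable adapter parameters can be worked out by hand. The sketch below assumes the published Mistral-7B shapes (hidden size 4096, 32 layers, and a 1024-dimensional `v_proj` output due to grouped-query attention); these figures and the `lora_param_count` helper are assumptions that should be verified against the actual model config.

```python
r = 64
lora_alpha = 64

# Default LoRA scaling applied to the low-rank update.
scaling = lora_alpha / r

# Assumed Mistral-7B shapes as (in_features, out_features); verify against the config.
NUM_LAYERS = 32
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A is rank x in_features and B is out_features x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_param_count(i, o, r) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS

print(f"scaling factor: {scaling}")                 # 1.0
print(f"trainable LoRA parameters: {trainable:,}")  # 27,262,976
```

With PEFT, calling `model.print_trainable_parameters()` on the wrapped model reports the actual figure, which is the number to trust if it differs from this estimate.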
#### Training Hyperparameters
```python
from transformers import TrainingArguments
from peft import get_peft_model, LoraConfig

# LoRA adapter configuration; lora_alpha == r gives a scaling factor of 1.0.
lora_config = LoraConfig(
    r = 64,
    lora_alpha = 64,
    lora_dropout = 0.05,
    bias = "none",
    task_type = "CAUSAL_LM",
    target_modules = ['q_proj', 'v_proj']
)
# `model` is the base model, loaded before this snippet.
model = get_peft_model(model, lora_config)

def create_training_arguments(path, learning_rate = 0.00001, epochs=2, eval_steps=12000):
    # Logging, evaluation, and checkpointing all share the same step interval.
    training_args = TrainingArguments(
        output_dir = path,
        auto_find_batch_size = True,
        learning_rate = learning_rate,
        num_train_epochs = epochs,
        logging_steps = eval_steps,
        eval_strategy = "steps",
        eval_steps = eval_steps,
        save_steps = eval_steps,
        load_best_model_at_end = True
    )
    return training_args
```
## Evaluation