aswain4 committed on
Commit
a4a7f76
·
verified ·
1 Parent(s): d183a1d

Update README.md

  1. README.md +31 -6
README.md CHANGED
@@ -88,16 +88,41 @@ A Program of Thought (PoT) prompting approach was used to modify the training da
  PoT was implemented by creating a new column in the dataset that initiated a “plan” for the model to follow. The plan was a 4-step process: analyze the question from the train column, think step by step and plan an efficient solution before writing the code, consider any necessary programming constructs or tools, and explain the approach with well-organized and documented code. The answer to the question from the dataset was appended after the plan so there was a single text block for training. Lastly, before training started, the pandas dataframes (train and validation) were converted into Hugging Face Dataset objects, and the tokenizer was applied to the newly created column for each row.

- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]

  ## Evaluation
 
  PoT was implemented by creating a new column in the dataset that initiated a “plan” for the model to follow. The plan was a 4-step process: analyze the question from the train column, think step by step and plan an efficient solution before writing the code, consider any necessary programming constructs or tools, and explain the approach with well-organized and documented code. The answer to the question from the dataset was appended after the plan so there was a single text block for training. Lastly, before training started, the pandas dataframes (train and validation) were converted into Hugging Face Dataset objects, and the tokenizer was applied to the newly created column for each row.
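
The plan-plus-answer text block described above could be assembled roughly as follows. This is a minimal sketch, not the exact prompt used for training: the step wording paraphrases the paragraph, and the `question`, `answer`, and `text` field names are hypothetical. In the described pipeline the same function would presumably be applied to each row of the pandas dataframes before converting them to Hugging Face Dataset objects and tokenizing the new column.

```python
# Hypothetical wording: these four steps paraphrase the README,
# they are not the exact plan text used to train the model.
PLAN_STEPS = (
    "Analyze the question.",
    "Think step by step and plan an efficient solution before writing the code.",
    "Consider any necessary programming constructs or tools.",
    "Explain the approach with well-organized and documented code.",
)


def build_training_text(question: str, answer: str) -> str:
    """Join the question, the numbered plan, and the answer into one string."""
    plan = "\n".join(f"{i}. {step}" for i, step in enumerate(PLAN_STEPS, start=1))
    return f"Question: {question}\n\nPlan:\n{plan}\n\nAnswer:\n{answer}"


# Stand-in for dataframe rows; field names are assumptions for illustration.
rows = [
    {"question": "Sum the integers from 1 to 10.", "answer": "print(sum(range(1, 11)))"},
]
for row in rows:
    row["text"] = build_training_text(row["question"], row["answer"])
```

Keeping the plan ahead of the answer in a single string means the tokenizer only has to be applied to one column per row.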
 
+ LoRA was used as the training method. LoRA freezes the base model’s weights and trains small low-rank matrices that are added to selected weight matrices, which reduces the computational overhead compared to full fine-tuning while preserving the pre-trained knowledge of the base model. This is a significant advantage with a relatively large model like Mistral-7B. Compared to other training approaches such as prompt tuning, LoRA appears to be better suited to systematic reasoning and can be used to integrate new knowledge into the model. LoRA still demands more computational resources than a simpler method like prompt tuning, but this drawback is outweighed by its ability to adapt the model effectively for a complex task such as this one. LoRA’s balance of scalability, task-specific adaptability, and preservation of the model’s foundational knowledge makes it the ideal choice for achieving high scores on all tested benchmarks.
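
To make the low-rank idea concrete, here is a toy sketch in plain Python of how a LoRA update combines with a frozen weight matrix: the effective weight is W + (alpha / r) * B A, where only A and B would be trained. The 3x3 matrix, the rank of 1, and the helper names are illustrative assumptions, not the model's real shapes or the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_b, lora_a, alpha, r):
    """Return W + (alpha / r) * (B @ A) without modifying the frozen W."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy frozen weight (3x3 identity) with a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[0.5, -0.5, 1.0]]            # shape r x d_in  = 1 x 3
lora_b_init = [[0.0], [0.0], [0.0]]    # shape d_out x r = 3 x 1, zero-initialized

# With B at zero the adapter contributes nothing, so training starts from the base model.
unchanged = lora_effective_weight(w, lora_b_init, lora_a, alpha=1, r=1)

# After training, B is non-zero and the effective weight shifts by a rank-1 update.
lora_b_trained = [[2.0], [0.0], [-1.0]]
adapted = lora_effective_weight(w, lora_b_trained, lora_a, alpha=1, r=1)
```

The adapter here stores 6 numbers (3 in A, 3 in B) against 9 in the full matrix; the saving grows quickly as the matrix dimensions increase relative to the rank.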
 
+ The full set of hyperparameters used to train the model is listed in the ‘Training Hyperparameters’ section below. An r of 64 offers a trade-off between parameter efficiency and expressiveness. Setting lora_alpha = r gives a scaling factor (lora_alpha / r) of 1.0, which helps stabilize training without overpowering the base model’s representations. A lora_dropout of 0.05 adds light regularization against overfitting while typically being low enough not to harm learning. I chose ‘q_proj’ and ‘v_proj’ as the target modules since adapting these attention projections has been shown to be an efficient, high-impact choice for downstream task performance. For the training arguments, a learning rate of 1e-5 was used to keep the LoRA layers from diverging too fast or corrupting the base model’s latent space, which preserves pre-trained knowledge. Additionally, the model was trained for 2 epochs, given that LoRA models tend to adapt quickly since only a small fraction of the parameters is trained.
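
As a rough check on these choices, the scaling factor and the number of trainable adapter parameters can be worked out by hand. The sketch below assumes the commonly published Mistral-7B dimensions (hidden size 4096, a 1024-wide `v_proj` output due to grouped-query attention, 32 layers); these values are assumptions that should be verified against the model's own config before relying on the result.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter: A is r x d_in and B is d_out x r."""
    return r * (d_in + d_out)


r = 64
lora_alpha = 64
scaling = lora_alpha / r  # 1.0 when lora_alpha equals r

# Assumed Mistral-7B shapes (not stated in the README; check the model config).
HIDDEN_SIZE = 4096   # q_proj maps 4096 -> 4096
KV_DIM = 1024        # v_proj maps 4096 -> 1024
NUM_LAYERS = 32

per_layer = (
    lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, r)   # q_proj adapter
    + lora_param_count(HIDDEN_SIZE, KV_DIM, r)      # v_proj adapter
)
total_trainable = per_layer * NUM_LAYERS
```

Under these assumptions `total_trainable` comes to 27,262,976, which is well under one percent of a model with roughly seven billion parameters.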
 
 
+ #### Training Hyperparameters

+ ```python
+ from transformers import TrainingArguments
+ from peft import get_peft_model, LoraConfig
+
+ # LoRA adapter configuration
+ lora_config = LoraConfig(
+     r = 64,
+     lora_alpha = 64,
+     lora_dropout = 0.05,
+     bias = "none",
+     task_type = "CAUSAL_LM",
+     target_modules = ['q_proj', 'v_proj']
+ )
+ # `model` must already hold the loaded base model (loading is not shown here)
+ model = get_peft_model(model, lora_config)
+
+ def create_training_arguments(path, learning_rate = 0.00001, epochs=2, eval_steps=12000):
+     # Logging, evaluation, and checkpointing all share the same step interval
+     training_args = TrainingArguments(
+         output_dir = path,
+         auto_find_batch_size = True,
+         learning_rate = learning_rate,
+         num_train_epochs = epochs,
+         logging_steps = eval_steps,
+         eval_strategy = "steps",
+         eval_steps = eval_steps,
+         save_steps = eval_steps,
+         load_best_model_at_end = True
+     )
+     return training_args
+ ```
 
 
  ## Evaluation