Model Card for rtt4fb/LlamaCode-Codeforces-v1

Developed by Taylor Tucker at the University of Virginia School of Data Science

Introduction

The use of large language models (LLMs) to generate practice questions could go a long way toward improving educational opportunities by allowing flexible, individualized generation of practice problems, which can greatly increase student success [6]. It would also ease the workload of computer science educators by reducing the time-consuming process of creating material for students. The problem with current LLM capability in this regard is that the problems the models generate are barebones, boilerplate, and boring. I set out to train an LLM to generate interesting and applicable problems with which students may practice their skills without being dissuaded by the drabness of the material. This model may also be used to match student learning rates, allowing for more individualized instruction.

Training Data

To train the model, I utilized a large repository of data from Codeforces, a Kyrgyzstan-based online programming competition, compiled by MatrixStudio [4]. The dataset consisted of 690,396 entries containing competition problems and the topic tags describing each problem. After removing duplicate questions to avoid leakage between training and testing, I was left with 16,533 examples, which were split into an 80% / 20% train-test split. The testing split was held out and never used for training.
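
A minimal sketch of this de-duplication and split is shown below. It assumes the Hugging Face datasets library; the column name used for de-duplication is a hypothetical placeholder, not the dataset's actual schema.

from datasets import Dataset, load_dataset

# Load the raw Codeforces Python submissions and keep one row per unique problem.
raw = load_dataset("MatrixStudio/Codeforces-Python-Submissions", split="train")
df = raw.to_pandas().drop_duplicates(subset="problem_statement")  # column name is an assumption

# Rebuild a Dataset and create the 80% / 20% train-test split.
deduped = Dataset.from_pandas(df, preserve_index=False)
splits = deduped.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]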

The problems in the Codeforces dataset are interesting, unique, and span a wide range of subject matter. The original dataset was transformed into prompt-response pairs: each prompt consists of a baseline instruction (i.e. "Please generate 3 Python programming question(s) involving the following subject areas:") followed by the problem's associated tags, and the response is the question itself. I utilized three-shot prompting, including three example questions with the same tags in each prompt, to maximize the model's learning.
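
A minimal sketch of how each training example might be assembled is shown below; the function names and the selection of the three in-context examples are illustrative assumptions, not the exact preprocessing code used for training.

# Illustrative sketch of the three-shot prompt construction described above.
def build_prompt(tags, shot_problems):
    """Assemble the three-shot instruction prompt for a set of topic tags."""
    header = (
        "Please generate 3 Python programming question(s) involving the "
        f"following subject areas: {', '.join(tags)}. For example:\n\n"
    )
    shots = "\n\n".join(f"{i + 1}) {p}" for i, p in enumerate(shot_problems))
    return header + shots

def build_example(target_problem, tags, shot_problems):
    """Pair the prompt with the target problem statement as the response."""
    return {"prompt": build_prompt(tags, shot_problems), "response": target_problem}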

Training Method

Based on the complex nature of the problems in the Codeforces repository, and without the computational capacity to perform a full fine-tune, I opted for a parameter-efficient fine-tuning strategy using LoRA due to its ease of implementation, computational efficiency, and proven track record [8]. LoRA is a widely used, parameter-efficient adapter method which has been shown to work well for fine-tuning across use cases. LoRA works by introducing low-rank adapter matrices into the layers of the LLM and training only these adapter parameters while keeping the pre-trained weights of the base model frozen. The LoRA hyperparameters used in this project are listed as follows:

  • LORA_R = 64
  • LORA_ALPHA = 64
  • LORA_DROPOUT = 0.05

The base model in this experiment was Meta’s Llama 3.2 1-billion-parameter Instruct model and its respective tokenizer [15]. The adapted model was trained on the training split of the Codeforces dataset over multiple days using two Nvidia A6000 GPUs.
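
A minimal sketch of how the base model could be wrapped with these LoRA adapters using the Hugging Face PEFT library is shown below. Only r, lora_alpha, and lora_dropout reflect the hyperparameters listed above; the target modules and other details are assumptions rather than the exact training script.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load the Llama 3.2 1B Instruct base model and its tokenizer.
base_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA configuration matching the hyperparameters listed in this card.
lora_config = LoraConfig(
    r=64,               # LORA_R
    lora_alpha=64,      # LORA_ALPHA
    lora_dropout=0.05,  # LORA_DROPOUT
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

# Attach the trainable low-rank adapters; base weights remain frozen.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()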

Evaluation

The model was evaluated on three benchmarks, BIG-Bench Hard (BBH) [3], CodeXGLUE [5], and MMLU [10], as well as the testing split of the custom dataset derived from Codeforces [4]. BBH was chosen to ensure that the model maintained general reasoning capabilities. CodeXGLUE was used to verify that the model maintained programming understanding, since the fine-tuning targeted problem generation rather than code understanding. MMLU was chosen to benchmark the model's ability to retain general knowledge post-training and to indicate potential catastrophic forgetting if it occurs.

The results of LlamaCode-Codeforces-v1 were compared against Microsoft's Phi-4 Mini Instruct model and the baseline Llama 3.2 1B Instruct model [11], [15]. These models were chosen because their parameter counts (3.8B and 1B, respectively) are similar to that of LlamaCode-Codeforces-v1, and both are used for reasoning tasks.

Model                   | BBH (Exact Match) | CodeXGLUE (Smoothed BLEU) | MMLU (Accuracy) | Codeforces Test (Mean BERT F1)
LlamaCode-Codeforces-v1 | 0.0000            | 1.0346                    | 0.3498          | 0.8019
Llama 3.2 1B Instruct   | 0.0000            | 1.0209                    | 0.4666          | 0.8010
Phi-4 Mini Instruct     | 0.0000            | 1.0506                    | 0.6846          | 0.6753

As can be seen in the table above, the fine-tuned LlamaCode-Codeforces-v1 model outperforms both baseline models on the custom Codeforces testing benchmark, with a mean BERT F1 score of 0.8019. The fine-tuned model also outperforms the base Llama 3.2 1B Instruct on the CodeXGLUE programming understanding benchmark, but falls short of Phi-4 Mini in this regard. The fine-tuned model performs worse on the MMLU benchmark in terms of accuracy, but still maintains some general-knowledge capabilities; both Llama 3.2 1B and the fine-tuned model fall short of Phi-4 Mini on MMLU by a large margin.
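
For reference, a minimal sketch of how the mean BERT F1 score on the Codeforces test split could be computed with the bert-score package (pip install bert-score) is shown below; the two lists are toy placeholders standing in for the model's generations and the held-out reference questions.

from bert_score import score

# Model-generated questions and the corresponding reference questions
# from the held-out Codeforces test split (toy placeholders shown here).
generations = ["Write a greedy scheduling problem about ordering concert songs."]
references = ["Devu is a renowned classical singer. He is invited to many big functions."]

# BERTScore returns precision, recall, and F1 tensors; the mean F1 is reported above.
precision, recall, f1 = score(generations, references, lang="en")
print(f"Mean BERT F1: {f1.mean().item():.4f}")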

Usage & Intended Uses

Usage

Please ensure you have the PyTorch, HuggingFace Hub, and Transformers libraries installed:

pip install torch huggingface_hub transformers

To load the model as is, run:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Tokenizer from the Llama 3.2 1B Instruct base model; weights from the fine-tuned repository.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "rtt4fb/LlamaCode-Codeforces-v1", device_map="auto", torch_dtype=torch.bfloat16
)

To use the model in a pipeline paradigm, run:

import torch
from transformers import pipeline

model_id = "rtt4fb/LlamaCode-Codeforces-v1"

# Build a text-generation pipeline around the fine-tuned model.
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

prompt = "Write two programming problems about: Strings."

outputs = pipe(
    prompt,
    max_new_tokens=512
)

print(outputs)
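
The text-generation pipeline returns a list of dictionaries; the generated problems themselves can be extracted as follows:

# Print only the generated text rather than the full output structure.
print(outputs[0]["generated_text"])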

Intended Uses

The intended use case for this model is to experimentally generate programming practice problems. It is not intended at this time for deployment in an educational setting; further research is needed to determine the effectiveness and safety of this model for problem generation.

This model is also intended for computer science educators to use as a brainstorming aid for problem generation, whether for classroom instruction, assignments, or assessments. It can be used in conjunction with textbooks, online resources, or other educational materials to enhance the quality of the programming problems provided to students. Instructors must perform quality control over the model's outputs to ensure their appropriateness, safety, quality, and applicability.

Prompt Format

An example training prompt is below, derived from the training data:

Please generate 3 Python programming question(s) involving the following subject areas: greedy, implementation. For example:

1) Ivan has got an array of n non-negative integers a_1, a_2, ..., a_n. Ivan knows that the array is sorted in non-decreasing order.

Ivan wrote out the integers 2^a_1, 2^a_2, ..., 2^a_n on a piece of paper. Now he wonders what minimum number of integers of the form 2^b (b ≥ 0) need to be added to the piece of paper so that the sum of all integers written on the paper equals 2^v - 1 for some integer v (v ≥ 0).

Help Ivan find the required quantity of numbers.

2) Permutation p is an ordered set of integers p_1, p_2, ..., p_n, consisting of n distinct positive integers, each of which does not exceed n. We'll denote the i-th element of permutation p as p_i. We'll call the number n the size or the length of permutation p_1, p_2, ..., p_n.

The decreasing coefficient of permutation p_1, p_2, ..., p_n is the number of indices i (1 ≤ i < n) such that p_i > p_(i+1).

You have numbers n and k. Your task is to print the permutation of length n with decreasing coefficient k.

3) In Chelyabinsk lives a much respected businessman Nikita with a strange nickname "Boss". Once Nikita decided to go with his friend Alex to the Summer Biathlon World Cup. Nikita, as a very important person, received a token which allows to place bets on each section no more than on one competitor.

To begin with friends learned the rules: in the race there are *n* sections of equal length and *m* participants. The participants numbered from 1 to *m*. About each participant the following is known:
- l_i — the number of the starting section,
- r_i — the number of the finishing section (l_i ≤ r_i),
- t_i — the time a biathlete needs to complete one section of the path,
- c_i — the profit in roubles. If the i-th sportsman wins on one of the sections, the profit will be given to the man who had placed a bet on that sportsman.

The i-th biathlete passes the sections from l_i to r_i inclusive. The competitor runs the whole way in (r_i - l_i + 1)·t_i time units. It takes him exactly t_i time units to pass each section. In case of the athlete's victory on k sections, the man who has bet on him receives k·c_i roubles.

In each section the winner is determined independently as follows: if there is at least one biathlete running in this section, then among all of them the winner is the one who has run this section in minimum time (spent minimum time passing this section). In case of equality of times, the athlete with the smaller index number wins. If there are no participants in this section, then the winner in this section is not determined. We have to say that in the summer biathlon all the participants are moving at a constant speed.

We should also add that Nikita can bet on each section and on any contestant running in this section.

Help the friends find the maximum possible profit.

However, a more concise zero-shot prompt, with no in-context examples, may also be used:

Please generate 3 Python programming question(s) involving the following subject areas: greedy, implementation.

Expected Output

For the given example training data prompt above, the expected response is a Python programming question written in English, as seen below:

Devu is a renowned classical singer. He is invited to many big functions/festivals. Recently he was invited to "All World Classical Singing Festival". Other than Devu, comedian Churu was also invited.

Devu has provided organizers a list of the songs and the time required to sing them. He will sing n songs; the i-th song will take t_i minutes exactly.

The Comedian, Churu will crack jokes. All his jokes are of 5 minutes exactly.

People have mainly come to listen Devu. But you know that he needs rest of 10 minutes after each song. On the other hand, Churu being a very active person, doesn't need any rest.

You, as one of the organizers, should make an optimal schedule for the event. For some reasons you must follow these conditions:
- The duration of the event must be no more than d minutes;
- Devu must complete all his songs;
- Subject to the two previous conditions, the number of jokes cracked by Churu should be as large as possible.

If it is not possible to find a way to conduct all of Devu's songs, output -1. Otherwise, find the maximum number of jokes that Churu can crack in the grand event.

Limitations

One limitation of this model is its size. As a 1-billion-parameter model, it lacks the full capabilities of a larger language model. In particular, given the drop in MMLU accuracy after fine-tuning, this model may perform worse on general-knowledge prompts than other models. Further, due to the uncommon subject matter of the fine-tuning data, the model may produce inconsistent outputs (e.g. swapping pieces of different questions). This may lead the model to produce programming questions which make little sense, hence the encouragement for educators and researchers to manually examine model responses for clarity.

Due to the nature of large language models, this model may output inappropriate and dangerous responses. The testing of this model cannot cover all of the potential risks of this technology. We strongly recommend that model outputs are carefully vetted for appropriateness and safety.

References

[1] R. Xie, C. Huang, J. Wang, and B. Dhingra, “Adversarial Math Word Problem Generation,” Jun. 15, 2024, arXiv: arXiv:2402.17916. doi: 10.48550/arXiv.2402.17916.
[2] C. Si, D. Yang, and T. Hashimoto, “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers,” Sep. 06, 2024, arXiv: arXiv:2409.04109. doi: 10.48550/arXiv.2409.04109.
[3] M. Suzgun et al., “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them,” Oct. 17, 2022, arXiv: arXiv:2210.09261. doi: 10.48550/arXiv.2210.09261.
[4] MatrixStudio, “Codeforces Python Submissions.” HuggingFace. [Online]. Available: https://huggingface.co./datasets/MatrixStudio/Codeforces-Python-Submissions
[5] S. Lu et al., “CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation,” Mar. 16, 2021, arXiv: arXiv:2102.04664. doi: 10.48550/arXiv.2102.04664.
[6] F. Schargel and J. Smink, Helping Students Graduate: A Strategic Approach to Dropout Prevention, 0 ed. Routledge, 2013. doi: 10.4324/9781315854816.
[7] gzipChrist, “Leetcode Problem Dataset.” Kaggle. Accessed: Jan. 29, 2025. [Online]. Available: https://www.kaggle.com/datasets/gzipchrist/leetcode-problem-dataset
[8] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Oct. 16, 2021, arXiv: arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685.
[9] D. Hendrycks, S. Basart, S. Kadavath, and M. Mazeika, “Measuring Coding Challenge Competence with APPS.” 2021. [Online]. Available: https://huggingface.co./datasets/codeparrot/apps
[10] D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” Jan. 12, 2021, arXiv: arXiv:2009.03300. doi: 10.48550/arXiv.2009.03300.
[11] M. Abdin et al., “Phi-4 Technical Report,” Dec. 12, 2024, arXiv: arXiv:2412.08905. doi: 10.48550/arXiv.2412.08905.
[12] J. Austin et al., “Program Synthesis with Large Language Models,” Aug. 16, 2021, arXiv: arXiv:2108.07732. doi: 10.48550/arXiv.2108.07732.
[13] P. Denny et al., “Prompt Problems: A New Programming Exercise for the Generative AI Era,” in Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, Portland OR USA: ACM, Mar. 2024, pp. 296–302. doi: 10.1145/3626252.3630909.
[14] X. Kang, Z. Wang, X. Jin, W. Wang, K. Huang, and Q. Wang, “Template-Driven LLM-Paraphrased Framework for Tabular Math Word Problem Generation,” Dec. 20, 2024, arXiv: arXiv:2412.15594. doi: 10.48550/arXiv.2412.15594.
[15] A. Grattafiori et al., “The Llama 3 Herd of Models,” 2024, arXiv. doi: 10.48550/ARXIV.2407.21783.
