DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML
This repository contains a CoreML conversion of the DeepSeek-R1-Distill-Llama-8B model, optimized for Apple Silicon devices and featuring stateful key-value (KV) caching for efficient text generation.
Model Description
DeepSeek-R1-Distill-Llama-8B is a distilled 8-billion-parameter language model from the DeepSeek-AI team. Built on the Llama architecture, it was distilled to maintain performance while reducing the parameter count.
This CoreML conversion provides:
- Full compatibility with Apple Silicon devices (M1, M2, M3 series)
- Stateful inference with KV-caching for efficient text generation
- Optimized performance for on-device deployment
Technical Specifications
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- Parameters: 8 billion
- Context Length: Configurable (default: 64, expandable based on memory constraints)
- Quantization: FP16
- File Format: .mlpackage
- Deployment Target: macOS 15+
- Architecture: Stateful LLM with key-value caching
- Input Features: Flexible input size with dynamic shape handling (see the interface check below)
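As a quick check of these specifications, the package's interface can be inspected with coremltools (the filename is an assumption; use the actual path from this repository):

import coremltools as ct

# Load only the model spec; this avoids compiling the full 8B model.
spec = ct.utils.load_spec("DeepSeek-R1-Distill-Llama-8B.mlpackage")
for inp in spec.description.input:
    print("input:", inp.name)
for out in spec.description.output:
    print("output:", out.name)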
Key Features
- Stateful Inference: The model implements a custom SliceUpdateKeyValueCache to maintain the KV cache between inference calls, significantly improving generation speed (see the sketch after this list).
- Dynamic Input Shapes: Supports variable input lengths through coremltools' RangeDim specification.
- Optimized Memory Usage: Efficiently manages the key-value cache to minimize the memory footprint.
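The slice-update pattern behind the stateful cache looks roughly like the following. This is a minimal sketch, assuming a cache laid out as [layers, batch, kv_heads, context, head_dim]; the integration with the attention modules is omitted.

import torch

class SliceUpdateKeyValueCache:
    # Minimal sketch of an in-place (slice-update) KV cache.
    # Layout assumption: [layers, batch, kv_heads, context, head_dim].
    def __init__(self, shape, dtype=torch.float16):
        self.key_cache = torch.zeros(shape, dtype=dtype)
        self.value_cache = torch.zeros(shape, dtype=dtype)

    def update(self, k_state, v_state, layer_idx, begin, end):
        # Write the new keys/values in place along the sequence axis,
        # then return the slice covering every token seen so far.
        self.key_cache[layer_idx, :, :, begin:end, :] = k_state
        self.value_cache[layer_idx, :, :, begin:end, :] = v_state
        return (self.key_cache[layer_idx, :, :, :end, :],
                self.value_cache[layer_idx, :, :, :end, :])

Because each decoding step writes only the new token's slice instead of rebuilding the cache, per-token cost stays flat as the sequence grows.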
Implementation Details
This conversion utilizes:
- A custom KvCacheStateLlamaForCausalLM wrapper around the Hugging Face Transformers implementation
- CoreML's state management capabilities for maintaining KV caches between inference calls
- Proper buffer registration so the caches persist as CoreML state (see the sketch after this list)
- Dynamic tensor shapes to accommodate various input and context lengths
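In sketch form, the wrapper and its buffer registration might look like the following (shapes are derived from the model config; the forward pass that routes the buffers through the slice-update cache is abbreviated):

import torch
from transformers import LlamaForCausalLM

class KvCacheStateLlamaForCausalLM(torch.nn.Module):
    # Sketch only; the real wrapper also plugs the buffers into the
    # attention layers via the slice-update cache.
    def __init__(self, model_path: str, batch_size: int = 1, context_size: int = 64):
        super().__init__()
        self.model = LlamaForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.float16)
        cfg = self.model.config
        head_dim = cfg.hidden_size // cfg.num_attention_heads
        shape = (cfg.num_hidden_layers, batch_size,
                 cfg.num_key_value_heads, context_size, head_dim)
        # Registering the cache tensors as buffers is what allows
        # coremltools to expose them as persistent CoreML state.
        self.register_buffer("keyCache", torch.zeros(shape, dtype=torch.float16))
        self.register_buffer("valueCache", torch.zeros(shape, dtype=torch.float16))

    def forward(self, input_ids, causal_mask):
        # Abbreviated: the real forward reads and updates keyCache/valueCache.
        out = self.model(input_ids=input_ids, attention_mask=causal_mask)
        return out.logits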
Usage
The model can be loaded and used from Swift via CoreML or, as shown below, from Python via coremltools:
import coremltools as ct

# Load the model (running stateful models requires macOS 15+)
model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")

# Create a fresh state object to hold the KV cache
state = model.make_state()

# Prepare inputs for inference
# ...

# Run inference, passing the state so the KV cache persists across calls
output = model.predict({
    "inputIds": input_ids,
    "causalMask": causal_mask
}, state)
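For multi-token generation, keep reusing the same state object so the KV cache persists across calls. Below is a hedged sketch of a greedy decoding loop; the "logits" output name, mask layout, and tensor shapes are assumptions and should be matched to the actual package interface:

import numpy as np

state = model.make_state()              # fresh KV cache for this sequence
tokens = [1, 2, 3]                      # illustrative prompt token ids

# Prefill: run the whole prompt once to populate the cache.
n = len(tokens)
mask = np.triu(np.full((1, 1, n, n), -np.inf, dtype=np.float16), k=1)
logits = model.predict({"inputIds": np.array([tokens], dtype=np.int32),
                        "causalMask": mask}, state)["logits"]
tokens.append(int(np.argmax(logits[0, -1])))

# Decode: feed one token at a time; the state supplies the history.
for _ in range(32):
    step_mask = np.zeros((1, 1, 1, len(tokens)), dtype=np.float16)
    logits = model.predict({"inputIds": np.array([tokens[-1:]], dtype=np.int32),
                            "causalMask": step_mask}, state)["logits"]
    tokens.append(int(np.argmax(logits[0, -1])))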
Conversion Process
The model was converted using coremltools with the following steps (a sketch of the tracing and conversion appears after this list):
- Loading the original model from Hugging Face
- Wrapping it with custom state management
- Tracing with PyTorch's JIT
- Converting to CoreML format with state specifications
- Saving in the .mlpackage format
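Under coremltools 8 or later, the tracing and conversion steps look roughly like this, reusing the wrapper sketched earlier. Treat it as a sketch under assumptions (the cache shape values, input names, and 64-token context mirror the specifications above), not the exact conversion script.

import numpy as np
import torch
import coremltools as ct

wrapped = KvCacheStateLlamaForCausalLM(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B").eval()

# Trace with example inputs so the graph, including the in-place
# cache updates, is captured by TorchScript.
example_ids = torch.zeros((1, 2), dtype=torch.int32)
example_mask = torch.zeros((1, 1, 2, 2), dtype=torch.float16)
traced = torch.jit.trace(wrapped, (example_ids, example_mask))

# Assumed Llama-8B cache layout: [layers, batch, kv_heads, context, head_dim]
cache_shape = (32, 1, 8, 64, 128)

query_len = ct.RangeDim(lower_bound=1, upper_bound=64, default=1)
mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="inputIds", shape=(1, query_len), dtype=np.int32),
        ct.TensorType(name="causalMask",
                      shape=(1, 1, query_len, ct.RangeDim(1, 64)),
                      dtype=np.float16),
    ],
    # The registered buffers are declared as persistent state here.
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=cache_shape),
                     name="keyCache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=cache_shape),
                     name="valueCache"),
    ],
    outputs=[ct.TensorType(name="logits", dtype=np.float16)],
    minimum_deployment_target=ct.target.macOS15,
)
mlmodel.save("DeepSeek-R1-Distill-Llama-8B.mlpackage")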
Requirements
To use this model:
- Apple Silicon Mac (M1/M2/M3 series)
- macOS 15 or later
- At least 16GB of RAM recommended
Limitations
- The model requires significant memory for inference, especially with longer contexts
- Performance is highly dependent on the device's Neural Engine capabilities
- The default configuration supports a context length of 64 tokens; this can be adjusted by re-converting the model with a larger context
License
This model conversion inherits the license of the original DeepSeek-R1-Distill-Llama-8B model.
Acknowledgments
- DeepSeek-AI for creating and releasing the original model
- Hugging Face for hosting the model and providing the Transformers library
- Apple for developing the CoreML framework
Citation
If you use this model in your research, please cite both the original DeepSeek model and this conversion.