DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML
This repository contains a CoreML conversion of the DeepSeek-R1-Distill-Llama-8B model, optimized for Apple Silicon devices and featuring stateful key-value (KV) caching for efficient text generation.
Model Description
DeepSeek-R1-Distill-Llama-8B is a distilled 8-billion-parameter language model from the DeepSeek-AI team. Built on the Llama architecture, it was distilled to maintain performance while reducing the parameter count.
This CoreML conversion provides:
- Full compatibility with Apple Silicon devices (M1, M2, M3 series)
- Stateful inference with KV-caching for efficient text generation
- Optimized performance for on-device deployment
Technical Specifications
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- Parameters: 8 billion
- Context Length: Configurable (default: 64, expandable based on memory constraints)
- Quantization: FP16
- File Format: .mlpackage
- Deployment Target: macOS 15+
- Architecture: Stateful LLM with key-value caching
- Input Features: Flexible input size with dynamic shape handling (see the interface check below)
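As a quick check of these specifications, the package's interface can be inspected with coremltools (the filename is an assumption; use the actual path from this repository):

import coremltools as ct

# Load only the model spec; this avoids compiling the full 8B model.
spec = ct.utils.load_spec("DeepSeek-R1-Distill-Llama-8B.mlpackage")
for inp in spec.description.input:
    print("input:", inp.name)
for out in spec.description.output:
    print("output:", out.name)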
Key Features
- Stateful Inference: The model implements a custom SliceUpdateKeyValueCache to maintain the KV cache between inference calls, significantly improving generation speed (see the sketch after this list).
- Dynamic Input Shapes: Supports variable input lengths through coremltools' RangeDim specification.
- Optimized Memory Usage: Efficiently manages the key-value cache to minimize the memory footprint.
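The slice-update pattern behind the stateful cache looks roughly like the following. This is a minimal sketch, assuming a cache laid out as [layers, batch, kv_heads, context, head_dim]; the integration with the attention modules is omitted.

import torch

class SliceUpdateKeyValueCache:
    # Minimal sketch of an in-place (slice-update) KV cache.
    # Layout assumption: [layers, batch, kv_heads, context, head_dim].
    def __init__(self, shape, dtype=torch.float16):
        self.key_cache = torch.zeros(shape, dtype=dtype)
        self.value_cache = torch.zeros(shape, dtype=dtype)

    def update(self, k_state, v_state, layer_idx, begin, end):
        # Write the new keys/values in place along the sequence axis,
        # then return the slice covering every token seen so far.
        self.key_cache[layer_idx, :, :, begin:end, :] = k_state
        self.value_cache[layer_idx, :, :, begin:end, :] = v_state
        return (self.key_cache[layer_idx, :, :, :end, :],
                self.value_cache[layer_idx, :, :, :end, :])

Because each decoding step writes only the new token's slice instead of rebuilding the cache, per-token cost stays flat as the sequence grows.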
Implementation Details
This conversion utilizes:
- A custom KvCacheStateLlamaForCausalLM wrapper around the Hugging Face Transformers implementation
- CoreML's state management capabilities for maintaining KV caches between inference calls
- Proper buffer registration so the caches persist as CoreML state (see the sketch after this list)
- Dynamic tensor shapes to accommodate various input and context lengths
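In sketch form, the wrapper and its buffer registration might look like the following (shapes are derived from the model config; the forward pass that routes the buffers through the slice-update cache is abbreviated):

import torch
from transformers import LlamaForCausalLM

class KvCacheStateLlamaForCausalLM(torch.nn.Module):
    # Sketch only; the real wrapper also plugs the buffers into the
    # attention layers via the slice-update cache.
    def __init__(self, model_path: str, batch_size: int = 1, context_size: int = 64):
        super().__init__()
        self.model = LlamaForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.float16)
        cfg = self.model.config
        head_dim = cfg.hidden_size // cfg.num_attention_heads
        shape = (cfg.num_hidden_layers, batch_size,
                 cfg.num_key_value_heads, context_size, head_dim)
        # Registering the cache tensors as buffers is what allows
        # coremltools to expose them as persistent CoreML state.
        self.register_buffer("keyCache", torch.zeros(shape, dtype=torch.float16))
        self.register_buffer("valueCache", torch.zeros(shape, dtype=torch.float16))

    def forward(self, input_ids, causal_mask):
        # Abbreviated: the real forward reads and updates keyCache/valueCache.
        out = self.model(input_ids=input_ids, attention_mask=causal_mask)
        return out.logits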
Usage
The model can be loaded and used from Swift via CoreML or, as shown below, from Python via coremltools:
import coremltools as ct

# Load the model (running stateful models requires macOS 15+)
model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")

# Create a fresh state object to hold the KV cache
state = model.make_state()

# Prepare inputs for inference
# ...

# Run inference, passing the state so the KV cache persists across calls
output = model.predict({
    "inputIds": input_ids,
    "causalMask": causal_mask
}, state)
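For multi-token generation, keep reusing the same state object so the KV cache persists across calls. Below is a hedged sketch of a greedy decoding loop; the "logits" output name, mask layout, and tensor shapes are assumptions and should be matched to the actual package interface:

import numpy as np

state = model.make_state()              # fresh KV cache for this sequence
tokens = [1, 2, 3]                      # illustrative prompt token ids

# Prefill: run the whole prompt once to populate the cache.
n = len(tokens)
mask = np.triu(np.full((1, 1, n, n), -np.inf, dtype=np.float16), k=1)
logits = model.predict({"inputIds": np.array([tokens], dtype=np.int32),
                        "causalMask": mask}, state)["logits"]
tokens.append(int(np.argmax(logits[0, -1])))

# Decode: feed one token at a time; the state supplies the history.
for _ in range(32):
    step_mask = np.zeros((1, 1, 1, len(tokens)), dtype=np.float16)
    logits = model.predict({"inputIds": np.array([tokens[-1:]], dtype=np.int32),
                            "causalMask": step_mask}, state)["logits"]
    tokens.append(int(np.argmax(logits[0, -1])))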
Conversion Process
The model was converted using coremltools with the following steps (a sketch of the tracing and conversion appears after this list):
- Loading the original model from Hugging Face
- Wrapping it with custom state management
- Tracing with PyTorch's JIT
- Converting to CoreML format with state specifications
- Saving in the .mlpackage format
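Under coremltools 8 or later, the tracing and conversion steps look roughly like this, reusing the wrapper sketched earlier. Treat it as a sketch under assumptions (the cache shape values, input names, and 64-token context mirror the specifications above), not the exact conversion script.

import numpy as np
import torch
import coremltools as ct

wrapped = KvCacheStateLlamaForCausalLM(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B").eval()

# Trace with example inputs so the graph, including the in-place
# cache updates, is captured by TorchScript.
example_ids = torch.zeros((1, 2), dtype=torch.int32)
example_mask = torch.zeros((1, 1, 2, 2), dtype=torch.float16)
traced = torch.jit.trace(wrapped, (example_ids, example_mask))

# Assumed Llama-8B cache layout: [layers, batch, kv_heads, context, head_dim]
cache_shape = (32, 1, 8, 64, 128)

query_len = ct.RangeDim(lower_bound=1, upper_bound=64, default=1)
mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(name="inputIds", shape=(1, query_len), dtype=np.int32),
        ct.TensorType(name="causalMask",
                      shape=(1, 1, query_len, ct.RangeDim(1, 64)),
                      dtype=np.float16),
    ],
    # The registered buffers are declared as persistent state here.
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=cache_shape),
                     name="keyCache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=cache_shape),
                     name="valueCache"),
    ],
    outputs=[ct.TensorType(name="logits", dtype=np.float16)],
    minimum_deployment_target=ct.target.macOS15,
)
mlmodel.save("DeepSeek-R1-Distill-Llama-8B.mlpackage")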
Requirements
To use this model:
- Apple Silicon Mac (M1/M2/M3 series)
- macOS 15 or later
- At least 16GB of RAM recommended
Limitations
- The model requires significant memory for inference, especially with longer contexts
- Performance is highly dependent on the device's Neural Engine capabilities
- The default configuration supports a context length of 64 tokens; this can be adjusted by re-converting the model with a larger context
License
This model conversion inherits the license of the original DeepSeek-R1-Distill-Llama-8B model.
Acknowledgments
- DeepSeek-AI for creating and releasing the original model
- Hugging Face for hosting the model and providing the Transformers library
- Apple for developing the CoreML framework
Citation
If you use this model in your research, please cite both the original DeepSeek model and this conversion.