---
title: README
emoji: 🐢
colorFrom: purple
colorTo: purple
sdk: static
pinned: false
---
Text Generation Inference (TGI) is a solution built for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM and Grammarly, and by the Open-Assistant initiative. TGI implements optimizations for all supported model architectures, including:
- Tensor Parallelism and custom CUDA kernels
- Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes or GPTQ
- Continuous batching of incoming requests for increased total throughput
- Accelerated weight loading (start-up time) with safetensors
- Logits warpers (temperature scaling, top-k, repetition penalty, ...)
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Stop sequences and log probabilities
- Token streaming using Server-Sent Events (SSE); a minimal client sketch follows this list
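
As a minimal illustration of the features above, the following sketch uses `huggingface_hub.InferenceClient` to query a TGI server. It assumes a server is already running and reachable at `http://localhost:8080` (the URL, prompt, and parameter values are placeholder assumptions, not part of this README's deployment instructions).

```python
# Minimal sketch, assuming a TGI server is already running at
# http://localhost:8080 (placeholder URL; adjust to your deployment).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Logits warpers (temperature, top-k, repetition penalty) and stop
# sequences are passed as generation parameters.
text = client.text_generation(
    "Deep learning is",
    max_new_tokens=64,
    temperature=0.7,
    top_k=50,
    repetition_penalty=1.1,
    stop_sequences=["\n\n"],
)
print(text)

# Token streaming over Server-Sent Events: stream=True yields tokens
# as they are generated instead of waiting for the full response.
for token in client.text_generation("Deep learning is", max_new_tokens=64, stream=True):
    print(token, end="", flush=True)
```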
<img width="300px" src="https://huggingface.co./spaces/text-generation-inference/README/resolve/main/architecture.jpg" />
## Currently optimized architectures
- [BLOOM](https://huggingface.co./bigscience/bloom)
- [FLAN-T5](https://huggingface.co./google/flan-t5-xxl)
- [Galactica](https://huggingface.co./facebook/galactica-120b)
- [GPT-Neox](https://huggingface.co./EleutherAI/gpt-neox-20b)
- [Llama](https://github.com/facebookresearch/llama)
- [OPT](https://huggingface.co./facebook/opt-66b)
- [SantaCoder](https://huggingface.co./bigcode/santacoder)
- [StarCoder](https://huggingface.co./bigcode/starcoder)
- [Falcon 7B](https://huggingface.co./tiiuae/falcon-7b)
- [Falcon 40B](https://huggingface.co./tiiuae/falcon-40b)
## Check out the source code 👉
- the server backend: https://github.com/huggingface/text-generation-inference
- the Chat UI: https://huggingface.co./spaces/text-generation-inference/chat-ui
## Check out examples
- [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co./blog/sagemaker-huggingface-llm)
- [Deploy LLMs with Hugging Face Inference Endpoints](https://huggingface.co./blog/inference-endpoints-llm)
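
The posts above walk through managed deployments; at the HTTP level, querying any TGI deployment amounts to a POST to the server's `/generate` route. The sketch below shows this with `requests`; the base URL is a placeholder assumption, to be replaced with your own endpoint.

```python
# Sketch of a raw HTTP call to a TGI server's /generate route.
# The base URL is a placeholder; substitute your own endpoint
# (for example an Inference Endpoints or SageMaker URL).
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {"max_new_tokens": 20, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```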