However, the recipe behind OpenAI’s reasoning models has been a well-kept secret. That is, until last week, when DeepSeek released their DeepSeek-R1 model and promptly broke the internet (and the stock market!).
DeepSeek AI open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen architectures. You can find them all in the DeepSeek R1 collection.
Let’s review how you can deploy and fine-tune DeepSeek R1 models with Hugging Face on AWS.
Hugging Face Inference Endpoints offers an easy and secure way to deploy Machine Learning models on dedicated compute for use in production on AWS. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.
With Inference Endpoints, you can deploy any of the 6 distilled models from DeepSeek-R1, as well as a quantized version of DeepSeek R1 made by Unsloth: https://huggingface.co/unsloth/DeepSeek-R1-GGUF. On the model page, click on Deploy, then on HF Inference Endpoints. You will be redirected to the Inference Endpoint page, where we have pre-selected an optimized inference container and the recommended hardware to run the model. Once you have created your endpoint, you can send your queries to DeepSeek R1 for $8.3 per hour on AWS 🤯.
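Once the endpoint is running, you can query it from Python, for example with the `huggingface_hub` client. This is a minimal sketch: the endpoint URL and token below are placeholders you would replace with your own.

```python
from huggingface_hub import InferenceClient

# Placeholder endpoint URL and token; replace with the values from your Inference Endpoint page
client = InferenceClient(
    "https://your-endpoint-name.endpoints.huggingface.cloud",
    token="hf_...",
)

# Chat-style request; DeepSeek R1 models emit their reasoning before the final answer
response = client.chat_completion(
    messages=[{"role": "user", "content": "How many r's are in the word 'strawberry'?"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```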
| Note: The team is working on enabling DeepSeek models deployment on Inferentia instances. Stay tuned!
| Note: The team is working on enabling DeepSeek-R1 deployment on Amazon SageMaker AI with the Hugging Face LLM DLCs on GPU. Stay tuned!
First, let’s go over a few prerequisites. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you should raise the default quota for ml.g6.48xlarge for endpoint usage to 1.
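For the GPU path, a deployment with the Hugging Face LLM (TGI) DLC typically looks like the sketch below. The container version is left to the SDK default, and the environment values are illustrative assumptions, not exact settings from this post.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

sess = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hugging Face LLM (TGI) container for GPU instances
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "SM_NUM_GPUS": "8",           # ml.g6.48xlarge exposes 8 GPUs
        "MAX_INPUT_TOKENS": "8192",   # illustrative values
        "MAX_TOTAL_TOKENS": "16384",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.48xlarge",
    container_startup_health_check_timeout=1800,  # give the 70B weights time to load
)
```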
Let’s walk through the deployment of DeepSeek-R1-Distill-Llama-70B on a Neuron instance, such as AWS Trainium 2 or AWS Inferentia 2.
The prerequisites to deploy to a Neuron instance are the same. Make sure you have a SageMaker Domain configured, sufficient quota in SageMaker, and a JupyterLab space. For DeepSeek-R1-Distill-Llama-70B, you should raise the default quota for ml.inf2.48xlarge for endpoint usage to 1.
Then, instantiate a sagemaker_session, which is used to determine the current region and execution role.
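As a sketch, assuming you are running inside the SageMaker JupyterLab space with the `sagemaker` SDK installed:

```python
import sagemaker

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name      # current AWS region
role = sagemaker.get_execution_role()            # execution role attached to the notebook
print(f"Region: {region}, role: {role}")
```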
Create the SageMaker Model object with the Python SDK:
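A minimal sketch of the Model object for the Neuron (Inferentia 2) deployment. The TGI Neuron container and the environment values below (core count, sequence lengths, batch size) are assumptions chosen to fit an ml.inf2.48xlarge, not exact values from this post.

```python
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# TGI container built for AWS Neuron (Inferentia / Trainium)
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    sagemaker_session=sagemaker_session,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "HF_NUM_CORES": "24",          # an inf2.48xlarge exposes 24 Neuron cores
        "HF_AUTO_CAST_TYPE": "bf16",
        "MAX_BATCH_SIZE": "4",
        "MAX_INPUT_TOKENS": "3686",
        "MAX_TOTAL_TOKENS": "4096",
    },
)
```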
Deploy the model to a SageMaker endpoint and test it:
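Again as a sketch, with the instance type from the prerequisites above and an illustrative timeout and prompt:

```python
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=3600,  # model download + load can take a while
    volume_size=512,
)

# Send a simple prompt to the endpoint
output = predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 128, "temperature": 0.6},
    }
)
print(output)
```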
That’s it: you’ve deployed a Llama 70B reasoning model on a Neuron instance! Under the hood, it downloaded a pre-compiled model from Hugging Face to speed up the endpoint start time.
First, let’s go over a few prerequisites. Make sure you have subscribed to the Hugging Face Neuron Deep Learning AMI on the AWS Marketplace. It provides you with all the necessary dependencies to train and deploy Hugging Face models on Trainium & Inferentia. Then, launch an inf2.48xlarge instance in EC2 with the AMI and connect through SSH. You can check our step-by-step guide if you have never done it before.
Once connected to the instance, you can deploy the model on an endpoint.
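Assuming you serve the model with a TGI-style server listening locally on port 8080 (an assumption about the serving stack, not a detail from the text above), a quick sanity check from Python could look like this:

```python
import requests

# Assumes a TGI-style server running locally on port 8080 (hypothetical setup)
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is 1 + 1?",
        "parameters": {"max_new_tokens": 64},
    },
)
print(response.json())
```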