Background
Current demand around DeepSeek falls into two categories:
- Because the official app/web services frequently fail to return results, cloud vendors and hardware/software companies now offer full-parameter or distilled versions of the model as API plus computing-power services, and numerous local deployment solutions built on open-source technology and consumer-grade compute and storage have emerged, all helping to relieve the load on DeepSeek's official servers.
- Industries of all kinds have begun calling DeepSeek APIs to build large model applications for internal and external use, with a focus on the efficiency and stability of application construction.
We have previously published a number of cloud-based and local deployment solutions addressing the first demand; this article discusses engineering solutions for traffic management around the second.
DeepSeek Deployment
Since DeepSeek has open-sourced the complete DeepSeek-R1 model weights, enterprises can deploy the model within their networks, thus keeping the entire AI application data flow under their control.
- Model Weights Download: Available through the ModelScope community (https://modelscope.cn/).
Since the full DeepSeek-R1 model has 671 billion parameters, running it requires substantial GPU resources; quantization methods such as int8/int4 can be considered for inference. DeepSeek has also released distilled models in several sizes that can be deployed on lower-spec machines.
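As a minimal sketch of the download step, the weights can be pulled with the ModelScope Python SDK; the distilled 32B model id below is used as an assumed example, and any of the published variants can be substituted.

# Sketch: pull DeepSeek weights from the ModelScope hub (pip install modelscope).
# The model id is an assumed example; swap in the variant you intend to run.
from modelscope import snapshot_download

model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-32B')
print(f'weights downloaded to: {model_dir}')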
Deployment Solutions
Alibaba Cloud officially provides multiple deployment options, including PAI, Bailian, GPU + ACK, ModelScope + FC, and Spring AI Alibaba + Ollama. Details can be found via the links below:
- PAI: https://mp.weixin.qq.com/s/Ly9bseQxhmunlbePphRsnA
- Bailian: https://mp.weixin.qq.com/s/UgB90HfKlMDfarMugc5F5w
- Container (GPU + ACK): https://mp.weixin.qq.com/s/SSGD5G7KL8iYLy2jxh9FOg
- Serverless (ModelScope + FC): https://mp.weixin.qq.com/s/yk5t0oIv7XQR0ky6phiq6g
- Local deployment (Spring AI Alibaba + Ollama + Higress): https://mp.weixin.qq.com/s/-8z9OFHvn0A1ga2rFsmeww
Common Requirements and Engineering Challenges During Large Model Application Implementation
As with web applications, deploying large model applications brings challenges such as sudden traffic surges and overload, network fluctuation and latency, security and compliance, invocation quotas and cost control, and online faults introduced by releases. Because the architecture of a large model application differs from that of a web application, however, the corresponding solutions differ as well.
The importance of traffic management in engineering large model applications was covered in "A Comprehensive View of Large Model Inference": AI gateways have become standard infrastructure for large model applications. Registering deployed models as services behind an AI gateway and exposing APIs to callers enables capabilities such as rate limiting, authentication, and usage statistics.
Higress is a high-performance gateway open-sourced by Alibaba Cloud, designed for deploying both web applications and large model applications; its commercial offering is the Alibaba Cloud Native API Gateway, whose console is used for the demonstrations in this article.
Specific Needs and Solutions
Fallback Strategies for Self-built DeepSeek Services: Utilizing smaller-parameter models such as DeepSeek-R1-Distill-Qwen-32B as a fallback.
Given DeepSeek-R1's 671 billion parameters, deploying the full model is costly, so it is advisable to also deploy some of the distilled R1-series models as a fallback. DeepSeek-R1-Distill-Qwen-32B, distilled onto the Qwen base model, is an excellent alternative.
The AI Gateway within Alibaba Cloud Native API Gateway supports configuring multiple backend model services and can reschedule failed requests as a fallback. If a call to a self-deployed DeepSeek-R1 fails, the request can be routed to a smaller-parameter model, or to an online API service such as DeepSeek-V3 or Qwen-max, to keep the service available end to end.
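The fallback order itself is configured on the gateway console, but the behavior is easy to picture with a client-side sketch. The endpoint URL, API key, and model names below are hypothetical placeholders mirroring the chain described above.

# Sketch of the fallback logic the gateway applies on a caller's behalf.
# The endpoint, key, and model names are hypothetical placeholders.
from openai import OpenAI, APIError

client = OpenAI(
    base_url='https://your-gateway.example.com/v1',  # hypothetical gateway endpoint
    api_key='YOUR_API_KEY',
)

# Try the full model first, then progressively cheaper alternatives.
FALLBACK_CHAIN = ['deepseek-r1', 'deepseek-r1-distill-qwen-32b', 'qwen-max']

def chat_with_fallback(prompt: str) -> str:
    last_err = None
    for model in FALLBACK_CHAIN:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{'role': 'user', 'content': prompt}],
            )
            return resp.choices[0].message.content
        except APIError as err:  # this model failed; try the next one
            last_err = err
    raise RuntimeError('all models in the fallback chain failed') from last_err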
Content Security Assurance for Self-built DeepSeek Services: Ensuring real-time processing and blocking of sensitive content using Alibaba Cloud Content Security.
The output style of DeepSeek's open-source R1-series models is relatively unconstrained, so using them for external-facing services raises content security concerns: if the model answers sensitive questions, the enterprise may be left liable for explaining the response.
Alibaba Cloud Native API Gateway integrates with Alibaba Cloud Content Security, offering real-time processing and content blocking for large model requests/responses. Alibaba Cloud Content Security has received certification from the China Academy of Information and Communications Technology (CAICT), providing robust AI-based content security assurance.
Once content security is enabled, a user who sends non-compliant content receives a response indicating that the content violates policy, so potential breaches are intercepted in real time and the service is protected from producing inappropriate or harmful output. This helps enterprises hold their AI applications to content-safety and compliance standards and minimizes the risk of improper responses from large model deployments. A blocked request returns a response like the following (the assistant message "我不能处理隐私信息" means "I cannot process private information"):
{
  "id": "chatcmpl-E45zRLc5hUCxhsda4ODEhjvkEycC9",
  "object": "chat.completion",
  "model": "from-security-guard",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "我不能处理隐私信息"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}
Additionally, the content security console provides audit logs for every request, so all traffic through the system can be monitored and traced, any potentially inappropriate or sensitive content can be audited, and enterprises gain insight into the kinds of requests being made and the responses being generated, which supports content security management and regulatory compliance.
Authorization and Quota Control for API Users: Issuing API keys to control permissions and quota usage.
Built on the consumer authentication capability of Alibaba Cloud Native API Gateway, model services can be operated multi-tenant: you issue your own API keys on the gateway, just as a model service provider would, and control each consumer's invocation permissions and quotas. Combined with the observability features, this also allows each consumer's token usage to be monitored and aggregated.
For online model services, this masks the original provider's API key: consumers only ever see keys issued by the gateway, each with its own access rights and usage limits. The provider key is never exposed, and every tenant gets controlled, secure access to the model services, which suits enterprises offering customized access to AI models while keeping usage and costs under control.
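From a consumer's point of view, the gateway then behaves like any OpenAI-compatible endpoint, except that the API key is one issued on the gateway rather than by the model provider. A minimal sketch, with a placeholder endpoint and key:

# Sketch: calling the gateway with a consumer-scoped API key.
# The gateway authenticates the consumer, enforces its quota, and supplies
# the real provider key upstream; endpoint and key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url='https://your-gateway.example.com/v1',  # hypothetical gateway endpoint
    api_key='CONSUMER_A_KEY',  # issued on the gateway, not by the model provider
)

resp = client.chat.completions.create(
    model='deepseek-r1',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
print(resp.choices[0].message.content)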
Traffic Distribution Between Different LLMs: Supporting gradual traffic switching for model migration.
Alibaba Cloud Native API Gateway supports model traffic shifting, which makes transitions between models smoother. For example, 90% of request traffic can be routed to OpenAI and 10% to DeepSeek; subsequent adjustments to the gray-release (canary) ratio require only configuration changes and a redeploy, with no code-level modifications.
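Conceptually, the gateway's traffic shifting amounts to weighted random routing across upstream model services. A toy sketch of the 90/10 split follows; the model names and weights are illustrative, and the real gateway does this internally.

# Toy sketch of weighted model routing, mirroring a 90/10 gray release.
import random

ROUTES = [('openai', 90), ('deepseek', 10)]  # illustrative upstreams and weights

def pick_model() -> str:
    models, weights = zip(*ROUTES)
    return random.choices(models, weights=weights, k=1)[0]

# Shifting more traffic to DeepSeek is purely a weight change, e.g.
# ROUTES = [('openai', 50), ('deepseek', 50)]; callers never change code.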
Cost Reduction Through Caching Common Requests: Reducing backend model load by caching frequent requests.
Alibaba Cloud Native API Gateway supports caching of large language model (LLM) responses. Once caching is enabled, common requests, such as greetings or questions about product capabilities, are answered directly from the cache without touching the backend model, saving valuable inference resources for queries that need them.
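Reduced to a sketch, the idea is to answer repeated prompts from a cache and forward only misses to the model; the real gateway feature is configurable on the console, while this toy version is exact-match only.

# Toy sketch of exact-match response caching in front of a model.
# `call_model` stands in for the upstream LLM call; a production gateway
# would use a shared cache (e.g. Redis) with expiry rather than a dict.
cache: dict[str, str] = {}

def cached_chat(prompt: str, call_model) -> str:
    if prompt in cache:          # hit: no inference resources consumed
        return cache[prompt]
    answer = call_model(prompt)  # miss: forward to the backend model
    cache[prompt] = answer
    return answer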
These capabilities are complemented by rich observability, including monitoring of content security, rate limiting, caching, and more. In addition, Alibaba Cloud Native API Gateway, together with SLS (Simple Log Service), provides advanced functions such as semantic vector indexing and semantic enrichment over large model dialogues, enabling topic clustering, intent recognition, sentiment analysis, quality assessment, and more, to help users progressively improve the effectiveness of their model applications.