
Use aiohttp inside proxy server && add --disable-cache-status argument #3020

Merged: 4 commits merged into InternLM:main on Feb 17, 2025

Conversation

AllentDan (Collaborator)

No description provided.
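
Since the PR carries no description, the title is the best summary of the change: the proxy server's request forwarding is switched to aiohttp (asynchronous, non-blocking I/O), and a --disable-cache-status argument is added to the CLI. As a rough illustration only, asynchronous forwarding with aiohttp could look like the sketch below; the function name forward_request, the FastAPI StreamingResponse wrapper, and the per-request session are assumptions made for this example and are not taken from the PR's actual changes to lmdeploy/serve/proxy/proxy.py.

```python
# Illustrative sketch only (not the PR's actual code): forward a request to one
# backend api_server node with aiohttp and stream its reply back to the client.
# `forward_request` and `node_url` are hypothetical names for this example.
import aiohttp
from fastapi.responses import StreamingResponse


async def forward_request(node_url: str, endpoint: str, payload: dict) -> StreamingResponse:
    """Proxy `payload` to a single api_server node and stream the reply back."""

    async def stream():
        # A production proxy would likely reuse a long-lived ClientSession
        # instead of opening one per request; this keeps the sketch self-contained.
        async with aiohttp.ClientSession() as session:
            async with session.post(node_url + endpoint, json=payload) as resp:
                async for chunk in resp.content.iter_any():
                    yield chunk

    return StreamingResponse(stream(), media_type="text/event-stream")
```

Presumably, --disable-cache-status turns off the proxy's use of per-node cache status; the exact semantics live in the files touched by this PR (lmdeploy/cli/serve.py and lmdeploy/serve/proxy/proxy.py, the same files listed as conflicting below).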

AllentDan (Collaborator, Author) commented on Jan 13, 2025:

I tested internlm2-chat-7b on seven nodes. The performance with this PR combined with #2961 is:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     10000     
Benchmark duration (s):                  88.44     
Total input tokens:                      2317235   
Total generated tokens:                  2007343   
Total generated tokens (retokenized):    2004019   
Request throughput (req/s):              113.07    
Input token throughput (tok/s):          26201.58  
Output token throughput (tok/s):         22697.55  
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   47196.62  
Median E2E Latency (ms):                 47467.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          38858.11  
Median TTFT (ms):                        39344.03  
P99 TTFT (ms):                           63995.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.60     
Median TPOT (ms):                        45.05     
P99 TPOT (ms):                           162.91    
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.18     
Median ITL (ms):                         0.01      
P99 ITL (ms):                            740.29    
==================================================

For comparison, the performance of a single api_server:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     3000      
Benchmark duration (s):                  132.74    
Total input tokens:                      683944    
Total generated tokens:                  597386    
Total generated tokens (retokenized):    596120    
Request throughput (req/s):              22.60     
Input token throughput (tok/s):          5152.61   
Output token throughput (tok/s):         4500.51   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61212.61  
Median E2E Latency (ms):                 60387.67  
---------------Time to First Token----------------
Mean TTFT (ms):                          52197.99  
Median TTFT (ms):                        50757.01  
P99 TTFT (ms):                           107224.86 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.10     
Median TPOT (ms):                        47.26     
P99 TPOT (ms):                           195.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.13     
Median ITL (ms):                         44.34     
P99 ITL (ms):                            325.68    
==================================================

lvhan028 self-requested a review on Feb 6, 2025.
lvhan028 (Collaborator) commented on Feb 6, 2025:

Please resolve the conflicts:

Conflicts:
	lmdeploy/cli/serve.py
	lmdeploy/serve/proxy/proxy.py
lvhan028 merged commit aa03abc into InternLM:main on Feb 17, 2025. 5 checks passed.
tastelikefeet added a commit to tastelikefeet/lmdeploy that referenced this pull request on Feb 25, 2025:
…oad_state_dict

* commit 'f6f7a5d707e3ccbc69af10babf1c9afcaf72a402':
  fix deepseekv2 has no attribute use_mla error (InternLM#3188)
  fix blocked fp8 moe (InternLM#3181)
  [Feature] support deepseek-vl2 for pytorch engine (InternLM#3149)
  make turbomind support gpu embedding inputs (InternLM#3177)
  fix temperature=0 (InternLM#3176)
  Update qwen2.py (InternLM#3174)
  Fix tool call prompt for InternLM and Qwen (InternLM#3156)
  Use pad_token_id as image_token_id for vl models (InternLM#3158)
  fix default temperature value (InternLM#3166)
  fix min length penalty (InternLM#3150)
  update cuda runtime package dependencies (InternLM#3142)
  fix typing (InternLM#3153)
  support deepseekv2 for maca backend. (InternLM#2918)
  fix the issue that stop_token may be less than defined in model.py (InternLM#3148)
  [fix] fix vl gradio, use pipeline api and remove interactive chat (InternLM#3136)
  [feature] add dlinfer w8a8 support. (InternLM#2988)
  Use aiohttp inside proxy server && add --disable-cache-status argument (InternLM#3020)
  support eos_token list in turbomind (InternLM#3044)