
Use aiohttp inside proxy server && add --disable-cache-status argument #3020

Merged: 4 commits merged into InternLM:main on Feb 17, 2025

Conversation

AllentDan (Collaborator)

No description provided.
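
Since the PR carries no description, the title is the best summary of the change: the proxy server's request forwarding is switched to aiohttp (asynchronous, non-blocking I/O), and a --disable-cache-status argument is added to the CLI. As a rough illustration only, asynchronous forwarding with aiohttp could look like the sketch below; the function name forward_request, the FastAPI StreamingResponse wrapper, and the per-request session are assumptions made for this example and are not taken from the PR's actual changes to lmdeploy/serve/proxy/proxy.py.

```python
# Illustrative sketch only (not the PR's actual code): forward a request to one
# backend api_server node with aiohttp and stream its reply back to the client.
# `forward_request` and `node_url` are hypothetical names for this example.
import aiohttp
from fastapi.responses import StreamingResponse


async def forward_request(node_url: str, endpoint: str, payload: dict) -> StreamingResponse:
    """Proxy `payload` to a single api_server node and stream the reply back."""

    async def stream():
        # A production proxy would likely reuse a long-lived ClientSession
        # instead of opening one per request; this keeps the sketch self-contained.
        async with aiohttp.ClientSession() as session:
            async with session.post(node_url + endpoint, json=payload) as resp:
                async for chunk in resp.content.iter_any():
                    yield chunk

    return StreamingResponse(stream(), media_type="text/event-stream")
```

Presumably, --disable-cache-status turns off the proxy's use of per-node cache status; the exact semantics live in the files touched by this PR (lmdeploy/cli/serve.py and lmdeploy/serve/proxy/proxy.py, the same files listed as conflicting below).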

AllentDan (Collaborator, Author) commented on Jan 13, 2025:

I tested internlm2-chat-7b on seven nodes. The performance with this PR combined with #2961 is:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     10000     
Benchmark duration (s):                  88.44     
Total input tokens:                      2317235   
Total generated tokens:                  2007343   
Total generated tokens (retokenized):    2004019   
Request throughput (req/s):              113.07    
Input token throughput (tok/s):          26201.58  
Output token throughput (tok/s):         22697.55  
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   47196.62  
Median E2E Latency (ms):                 47467.02  
---------------Time to First Token----------------
Mean TTFT (ms):                          38858.11  
Median TTFT (ms):                        39344.03  
P99 TTFT (ms):                           63995.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.60     
Median TPOT (ms):                        45.05     
P99 TPOT (ms):                           162.91    
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.18     
Median ITL (ms):                         0.01      
P99 ITL (ms):                            740.29    
==================================================

For comparison, the performance of a single api_server:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     3000      
Benchmark duration (s):                  132.74    
Total input tokens:                      683944    
Total generated tokens:                  597386    
Total generated tokens (retokenized):    596120    
Request throughput (req/s):              22.60     
Input token throughput (tok/s):          5152.61   
Output token throughput (tok/s):         4500.51   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61212.61  
Median E2E Latency (ms):                 60387.67  
---------------Time to First Token----------------
Mean TTFT (ms):                          52197.99  
Median TTFT (ms):                        50757.01  
P99 TTFT (ms):                           107224.86 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.10     
Median TPOT (ms):                        47.26     
P99 TPOT (ms):                           195.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.13     
Median ITL (ms):                         44.34     
P99 ITL (ms):                            325.68    
==================================================

lvhan028 self-requested a review on Feb 6, 2025.
lvhan028 (Collaborator) commented on Feb 6, 2025:

Please resolve the conflicts:

Conflicts:
	lmdeploy/cli/serve.py
	lmdeploy/serve/proxy/proxy.py
lvhan028 merged commit aa03abc into InternLM:main on Feb 17, 2025. 5 checks passed.
tastelikefeet added a commit to tastelikefeet/lmdeploy that referenced this pull request on Feb 25, 2025:
…oad_state_dict

* commit 'f6f7a5d707e3ccbc69af10babf1c9afcaf72a402':
  fix deepseekv2 has no attribute use_mla error (InternLM#3188)
  fix blocked fp8 moe (InternLM#3181)
  [Feature] support deepseek-vl2 for pytorch engine (InternLM#3149)
  make turbomind support gpu embedding inputs (InternLM#3177)
  fix temperature=0 (InternLM#3176)
  Update qwen2.py (InternLM#3174)
  Fix tool call prompt for InternLM and Qwen (InternLM#3156)
  Use pad_token_id as image_token_id for vl models (InternLM#3158)
  fix default temperature value (InternLM#3166)
  fix min length penalty (InternLM#3150)
  update cuda runtime package dependencies (InternLM#3142)
  fix typing (InternLM#3153)
  support deepseekv2 for maca backend. (InternLM#2918)
  fix the issue that stop_token may be less than defined in model.py (InternLM#3148)
  [fix] fix vl gradio, use pipeline api and remove interactive chat (InternLM#3136)
  [feature] add dlinfer w8a8 support. (InternLM#2988)
  Use aiohttp inside proxy server && add --disable-cache-status argument (InternLM#3020)
  support eos_token list in turbomind (InternLM#3044)