Make turbomind support embedding inputs on GPU #3177

chengyuma · 2025-02-24T02:51:52Z

Motivation

The current implementation of turbomind only supports embedding inputs on the CPU. When embedding inputs are too large, the CPU can become a bottleneck. This PR aims to enable support for embedding inputs on the GPU, thereby alleviating the CPU bottleneck.

Modification

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, such as docstrings or example tutorials.

irexyc · 2025-02-24T08:11:19Z

@chengyuma

Are you using the pipeline API, or the TurboMind and TurboMindInstance APIs directly?

chengyuma · 2025-02-24T09:12:43Z

> @chengyuma
>
> Are you using the pipeline API, or the TurboMind and TurboMindInstance APIs directly?

I use the TurboMind and TurboMindInstance APIs directly to develop a multimodal service.

make turbomind support gpu embedding inputs

4d9b76f

lvhan028 requested review from irexyc and lzhangzz February 24, 2025 04:21

lvhan028 added the improvement label Feb 24, 2025

lvhan028 approved these changes Feb 24, 2025

irexyc approved these changes Feb 24, 2025

lvhan028 merged commit b472d35 into InternLM:main Feb 24, 2025
5 checks passed
