05-28 10:18 阿里巴巴_算法工程师

关注

LLM 大模型学习必知必会系列(三)：LLM和多模态大模型

LLM 大模型学习必知必会系列(三)：LLM和多模态模型高效推理实践

1.多模态大模型推理

LLM 的推理流程：

多模态的 LLM 的原理：

代码演示：使用 ModelScope NoteBook 完成语言大模型，视觉大模型，音频大模型的推理

环境配置与安装

以下主要演示的模型推理代码可在魔搭社区免费实例 PAI-DSW 的配置下运行（显存 24G）：

点击模型右侧 Notebook 快速开发按钮，选择 GPU 环境：
打开 Python 3 (ipykernel)：

示例代码语言大模型推理示例代码

#通义千问1_8B LLM大模型的推理代码示例
#通义千问1_8B:https://modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary
from modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

#Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", revision='master', trust_remote_code=True)

#use bf16
#model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
#use fp16
#model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
#use cpu only
#model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", device_map="cpu", trust_remote_code=True).eval()
#use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", revision='master', device_map="auto", trust_remote_code=True).eval()

#Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
#model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True) # 可指定不同的生成长度、top_p等相关超参

#第一轮对话 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# 你好！很高兴为你提供帮助。

#第二轮对话 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
#这是一个关于一个年轻人奋斗创业最终取得成功的故事。
#故事的主人公叫李明，他来自一个普通的家庭，父母都是普通的工人。从小，李明就立下了一个目标：要成为一名成功的企业家。
#为了实现这个目标，李明勤奋学习，考上了大学。在大学期间，他积极参加各种创业比赛，获得了不少奖项。他还利用课余时间去实习，积累了宝贵的经验。
#毕业后，李明决定开始自己的创业之路。他开始寻找投资机会，但多次都被拒绝了。然而，他并没有放弃。他继续努力，不断改进自己的创业计划，并寻找新的投资机会。
#最终，李明成功地获得了一笔投资，开始了自己的创业之路。他成立了一家科技公司，专注于开发新型软件。在他的领导下，公司迅速发展起来，成为了一家成功的科技企业。
#李明的成功并不是偶然的。他勤奋、坚韧、勇于冒险，不断学习和改进自己。他的成功也证明了，只要努力奋斗，任何人都有可能取得成功。

#第三轮对话 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)
#《奋斗创业：一个年轻人的成功之路》

#Qwen-1.8B-Chat现在可以通过调整系统指令（System Prompt），实现角色扮演，语言风格迁移，任务设定，行为设定等能力。
#Qwen-1.8B-Chat can realize roly playing, language style transfer, task setting, and behavior setting by system prompt.
response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")
print(response)
#你好啊！我是一只可爱的二次元猫咪哦，不知道你有什么问题需要我帮忙解答吗？

response, _ = model.chat(tokenizer, "My colleague works diligently", history=None, system="You will write beautiful compliments according to needs")
print(response)
#Your colleague is an outstanding worker! Their dedication and hard work are truly inspiring. They always go above and beyond to ensure that 
#their tasks are completed on time and to the highest standard. I am lucky to have them as a colleague, and I know I can count on them to handle any challenge that comes their way.

输出结果：

视觉大模型推理示例代码

 #Qwen-VL 是阿里云研发的大规模视觉语言模型（Large Vision Language Model, LVLM）。Qwen-VL 可以以图像、文本、检测框作为输入，并以文本和检测框作为输出。Qwen-VL 系列模型性能强大，具备多语言对话、多图交错对话等能力，并支持中文开放域定位和细粒度图像识别与理解。
from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
from auto_gptq import AutoGPTQForCausalLM

model_dir = snapshot_download("qwen/Qwen-VL-Chat-Int4", revision='v1.0.0')

import torch
torch.manual_seed(1234)

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# use cuda device
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cuda", trust_remote_code=True,use_safetensors=True).eval()

# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': '这是什么'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 图中是一名年轻女子在沙滩上和她的狗玩耍，狗的品种可能是拉布拉多。她们坐在沙滩上，狗的前腿抬起来，似乎在和人类击掌。两人之间充满了信任和爱。

# 2nd dialogue turn
response, history = model.chat(tokenizer, '输出"狗"的检测框', history=history)
print(response)

image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
  image.save('1.jpg')
else:
  print("no box")

输出结果：

音频大模型推理示例代码

from modelscope import (
    snapshot_download, AutoModelForCausalLM, AutoTokenizer, GenerationConfig
)
import torch
model_id = 'qwen/Qwen-Audio-Chat'
revision = 'master'

model_dir = snapshot_download(model_id, revision=revision)
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
if not hasattr(tokenizer, 'model_dir'):
    tokenizer.model_dir = model_dir

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cuda", trust_remote_code=True).eval()


# 1st dialogue turn
query = tokenizer.from_list_format([
    {'audio': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac'}, # Either a local path or an url
    {'text': 'what does the person say?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".

# 2nd dialogue turn
response, history = model.chat(tokenizer, 'Find the start time and end time of the word "middle classes"', history=history)
print(response)
# The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.

输出结果：

2. vLLM+FastChat 高效推理实战

FastChat 是一个开放平台，用于训练、服务和评估基于 LLM 的 ChatBot。

FastChat 的核心功能包括： ●优秀的大语言模型训练和评估代码。 ●具有 Web UI 和 OpenAI 兼容的 RESTful API 的分布式多模型服务系统。

vLLM 是一个由加州伯克利分校、斯坦福大学和加州大学圣迭戈分校的研究人员基于操作系统中经典的虚拟缓存和分页技术开发的 LLM 服务系统。他实现了几乎零浪费的 KV 缓存，并且可以在请求内部和请求之间灵活共享 KV 高速缓存，从而减少内存使用量。

FastChat 开源链接：https://github.com/lm-sys/FastChat
vLLM 开源链接：https://github.com/vllm-project/vllm

实战演示：

安装 FastChat 最新包1

git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install .

环境变量设置

在 vLLM 和 FastChat 上使用魔搭的模型需要设置两个环境变量：1

export VLLM_USE_MODELSCOPE=True
export FASTCHAT_USE_MODELSCOPE=True

2.1 使用 FastChat 和 vLLM 实现发布 model worker(s)

可以结合 FastChat 和 vLLM 搭建一个网页 Demo 或者类 OpenAI API 服务器，

首先启动一个 controller：

python -m fastchat.serve.controller

然后启动 vllm_worker 发布模型。如下给出单卡推理的示例，运行如下命令：千问模型示例：

#以qwen-1.8B为例，在A10运行

python -m fastchat.serve.vllm_worker --model-path qwen/Qwen-1_8B-Chat --trust-remote-code --dtype bfloat16

启动 vLLM 优化 worker 后，本次实践启动页面端 demo 展示：1

python -m fastchat.serve.gradio_web_server --host 0.0.0.0 --port 8000

2.2 LLM 的应用场景：RAG

LLM 会产生误导性的 “幻觉”，依赖的信息可能过时，处理特定知识时效率不高，缺乏专业领域的深度洞察，同时在推理能力上也有所欠缺。正是在这样的背景下，检索增强生成技术（Retrieval-Augmented Generation，RAG）应时而生，成为 AI 时代的一大趋势。

RAG 通过在语言模型生成答案之前，先从广泛的文档数据库中检索相关信息，然后利用这些信息来引导生成过程，极大地提升了内容的准确性和相关性。RAG 有效地缓解了幻觉问题，提高了知识更新的速度，并增强了内容生成的可追溯性，使得大型语言模型在实际应用中变得更加实用和可信。

一个典型的 RAG 的例子：

这里面主要包括包括三个基本步骤：

索引 — 将文档库分割成较短的 Chunk，并通过编码器构建向量索引。
检索 — 根据问题和 chunks 的相似度检索相关文档片段。
生成 — 以检索到的上下文为条件，生成问题的回答。

RAG（开卷考试）VS. Finetune（专业课程学习）

示例代码：https://github.com/modelscope/modelscope/blob/master/examples/pytorch/application/qwen_doc_search_QA_based_on_langchain_llamaindex.ipynb

#大模型##LLM##多模态大模型#

AI前沿技术文章被收录于专栏

AI前沿技术

全部评论

推荐最新楼层

10-24 19:01

阿里巴巴_开发工程师

达摩院日常实习来了！！！

视觉实验室组内直招，有意向的同学可以发邮件或者私信我岗位一：达摩院多模态数据平台后端研发实习生职位描述：1、参与AIGC领域多模态数据平台后端能力开发，支持采集、处理和管理系统功能迭代。2、参与视频、图像、点云、文本、语音等多模态数据的数据处理。3、跟踪AIGC和多模态数据处理和分析的前沿技术，积极探索和实验新技术。4、实习城市：杭州/北京职位要求：1、计算机、软件工程、人工智能或相关领域本科及以上学历。2、熟悉Python和Java编程语言，了解MySQL、PostgreSQL、MongoDB等数据库的使用和管理。3、熟悉常见的后端开发框架（如Spring Boot、Django、Flask...

投递阿里巴巴等公司10个岗位 >

点赞评论收藏

10-24 16:24

门头沟学院算法工程师

日常实习offer比较

投票

本人博二多模态方向，日常实习接近尾声，希望大家给点意见。我比较希望实习的时候发论文，或者提高多模态方向的代码能力。

0offer要鼠啦：选美团吧，百度这边多模态不太行，学不到啥

点赞评论收藏

11-09 09:29

中国科学院大学算法工程师

【快手】大模型小型化算法实习生招聘

工作岗位：快手-AI平台-大模型小型化算法实习生工作地点：北京市海淀区西二旗中路29号元中心岗位要求：● 数理要求: 熟练掌握线性代数，概率论，信息论，凸优化等基础知识；了解矩阵论, 随机过程等● 框架要求: 精通PyTorch，熟练大模型并行框架的应用，包括 DeepSpeed，Megatron-LM● 代码要求: 精通Python● 对模型加速的算法研究有浓厚的兴趣，特别是针对LLM和SoRA等前沿模型的推理加速探索● 有论文发表经验优先，提供算力支持对创新idea的研究和应用岗位职责：● 负责快手内部AIGC大模型的推理部署效率优化需求，包括但不限于Diffusion采样时间优化，Dif...

投递快手等公司10个岗位 >

点赞评论收藏

昨天 16:22

门头沟学院 Java

淘天开发面经

#软件开发笔面经#**项目：**你的项目里redis都用了那些数据结构  set zset hash表，少说了个bitsmap。zset底层原理，解释下跳表。redis中set，zset，hash表 的区别。消息队列的作用，除了异步还有什么，解耦削峰填谷。加入消息队列会增大系统的负载，当时没有想其他的方案来替换消息队列吗？没，很多地方要用到，接着问还有什么地方用到消息队列？订单微服务支付成功给课程微服务加入课表，那么我如果强烈要求实时，不用消息队列怎么实现？项目中还用到了spring cloud的feign实现不同微服务调用，还可以通过rpc框架。说一下消息队列的原理消息如何实现有序？  这个只答了使用消息序号，还有使用单一消费者、分区队列、消息序号、延迟消费和事务消息Arraylist扩容为什么是1.5倍，答了可能跟负载因子有关，答错了，应该是，减少数组复制的开销，性能和内存利用率的一个折中。学校：你的研究方向在你研究方向取得的一个最重大的突破或者成果你用技术解决过生活中的一个问题**八股：**接口，抽象方法，内部类有什么区别  这个只有点忘记了，面试官提示在什么场景下会用到抽象类在后端返回给前端数据的时候，如何选择arraylist，linkedlist，set这种java内存模型，java对象的生命周期 这个忘了： Java对象的生命周期包括创建、应用、不可见、不可达、收集、终结和空间重分配等阶段。线程池解释，线程池你常用的阻塞队列是什么？为什么不用无界的，无界的阻塞队列会有什么问题？看过线程池底层源码吗？底层源码没看过。堆内存的分布，垃圾回收机制以及区别cpu高负载如何解决问题？没答全，top→进程pid→top -hp→ 线程pid→jstackcpu高负载可能是哪些原因造成的？这个答错了，答的死锁，应该有死循环，频繁的GC操作，上下文切换过于频繁等。

查看23道真题和解析软件开发笔面经

点赞评论收藏

2023-09-20 16:35

已编辑

北京理工大学算法工程师

查看5道真题和解析

点赞评论收藏

2 11 评论

招聘动态

杉川机器人

2025校园招聘

字节跳动

2025校园招聘

字节跳动Data

2025校园招聘

快手Star

2025届招聘

快手

销售类投递专区

库洛游戏

全站热榜

正在热议

# 选完offer后，你后悔学本专业吗 #

# 没有合适的工作，你会先找个干着，还是考公考研 #

35435次浏览 393人参与

# 秋招OC许愿 #

224362次浏览 1855人参与

# 如果能重来，就业or读研你选哪个？ #