{"id":707,"date":"2026-04-06T21:34:29","date_gmt":"2026-04-06T13:34:29","guid":{"rendered":"https:\/\/www.liaoxinghui.com\/?p=707"},"modified":"2026-04-06T21:34:29","modified_gmt":"2026-04-06T13:34:29","slug":"vllm-oom-kv-cache-max-length-batch-size-delivery","status":"publish","type":"post","link":"https:\/\/www.liaoxinghui.com\/?p=707","title":{"rendered":"vLLM Inference OOM Postmortem: Not a VRAM Shortage, but a max_length and batch_size Trap"},"content":{"rendered":"<h2>The Problem: a 70B Model on 80G Cards, OOM the Moment a Long Prompt Arrives<\/h2>\n<p>Last Wednesday at 10 p.m. the on-call phone rang: the 70B model service had crashed again.<\/p>\n<p>I glanced at nvidia-smi: memory usage at 98%, essentially full. The normal first reaction: not enough VRAM, add more cards.<\/p>\n<p>Except we already had. We were up to four 80G cards, and it still blew up.<\/p>\n<p>Which made no sense.<\/p>\n<p>I logged in, ran a round of profiling, and found it was not a capacity problem at all: <strong>the KV cache in the prefill stage was eating all the memory<\/strong>.<\/p>\n<blockquote>\n<p><strong>One-sentence answer<\/strong>: the root cause of the OOM is not that the model is too big, nor that batch_size is too big; it is that <strong>max_length is set far beyond what your requests actually need<\/strong>, so the KV cache memory budgeted per request far exceeds what is ever used.<\/p>\n<\/blockquote>\n<hr \/>\n<h2>The Setup<\/h2>\n<p>First, the context this happened in; otherwise you will read this as a universal fix and find it does not apply to your own workload at all.<\/p>\n<p>The workload:<\/p>\n<ul>\n<li><strong>Model<\/strong>: Llama-2-70B-chat, fp16<\/li>\n<li><strong>Deployment<\/strong>: vLLM 0.2.6, 4x A100 80G, tensor_parallel_size=4<\/li>\n<li><strong>Request profile<\/strong>: customer-service chat, prompts averaging 800-1200 tokens, max_tokens 512<\/li>\n<li><strong>Concurrency<\/strong>: around 50 QPS, which in practice worked out to roughly 10-15 concurrent requests<\/li>\n<\/ul>\n<p>The catch: at deployment time the ops engineer set max_model_len to 8192, to be on the safe side.<\/p>\n<p>It is customer-service chat; what if a user pastes a wall of text? Leave some headroom.<\/p>\n<p>That headroom ended up eating more than 60% of the memory.<\/p>\n<hr \/>\n<h2>Breaking It Down: How the KV Cache Eats Your Memory<\/h2>\n<p>Principles first; if you do not understand the mechanism, tuning these parameters is just guessing.<\/p>\n<p>vLLM inference has two stages:<\/p>\n<ol>\n<li><strong>Prefill<\/strong>: processes the input prompt and produces the first token. This stage writes the KV for the entire prompt into GPU memory.<\/li>\n<li><strong>Decode<\/strong>: generates one token at a time; the KV 
cache is only appended at the end.<\/li>\n<\/ol>\n<p>The problem is in the prefill stage. vLLM plans its KV cache capacity around max_model_len, the worst case it has promised to handle for every request, rather than around your actual input lengths.<\/p>\n<p>The back-of-the-envelope formula:<\/p>\n<pre><code>kv_cache memory &asymp; batch_size &times; num_layers &times; 2 &times; hidden_size &times; max_length &times; dtype_bytes<\/code><\/pre>\n<p>For a 70B model (num_layers=80, hidden_size=8192), run the numbers in Python:<\/p>\n<pre><code class=\"lang-python language-python python\"># 70B model, fp16, batch_size=8, max_length=8192\nbatch_size = 8\nnum_layers = 80\nhidden_size = 8192\nmax_length = 8192\ndtype_bytes = 2  # fp16\n\nkv_per_layer = 2 * hidden_size * max_length * dtype_bytes  # k + v\nkv_per_request = num_layers * kv_per_layer\nkv_total = batch_size * kv_per_request\n\nprint(f&quot;Per-request kv cache: {kv_per_request \/ (1024**3):.0f} GB&quot;)\nprint(f&quot;Batch-of-8 kv cache: {kv_total \/ (1024**3):.0f} GB&quot;)\n# Output:\n# Per-request kv cache: 20 GB\n# Batch-of-8 kv cache: 160 GB<\/code><\/pre>\n<p><strong>A single worst-case request books 20GB of KV cache<\/strong>; at batch_size=8 that is 160GB, gone before you even count the weights. (Strictly speaking, Llama-2-70B uses grouped-query attention with 8 KV heads, which shrinks these numbers by 8x; the point stands, because the cost still scales linearly with max_length.)<\/p>\n<p>Meanwhile your actual prompt might be 1500 tokens, nowhere near 8192.<\/p>\n<hr \/>\n<h2>The Data: Actual Memory Measurements<\/h2>\n<p>A formula alone proves nothing; here are real numbers, in case you think I made this up.<\/p>\n<h3>Profiling script<\/h3>\n<pre><code class=\"lang-python language-python python\">import torch\nfrom vllm import LLM, SamplingParams\n\nllm = LLM(\n    model=&quot;meta-llama\/Llama-2-70b-chat-hf&quot;,\n    
tensor_parallel_size=4,\n    max_model_len=8192,  # current config\n)\n\n# Baseline\ntorch.cuda.reset_peak_memory_stats()\nbaseline = torch.cuda.max_memory_allocated() \/ (1024**3)\nprint(f&quot;Baseline memory: {baseline:.2f} GB&quot;)\n\n# Short prompt\ntorch.cuda.reset_peak_memory_stats()\noutputs = llm.generate([&quot;Hello&quot;], SamplingParams(max_tokens=10))\npeak = torch.cuda.max_memory_allocated() \/ (1024**3)\nprint(f&quot;Short-prompt peak: {peak:.2f} GB, delta: {peak-baseline:.2f} GB&quot;)\n\n# Long prompt\ntorch.cuda.reset_peak_memory_stats()\nlong_text = &quot;describe: &quot; + &quot;x &quot; * 4000\noutputs = llm.generate([long_text], SamplingParams(max_tokens=10))\npeak = torch.cuda.max_memory_allocated() \/ (1024**3)\nprint(f&quot;Long-prompt peak: {peak:.2f} GB, delta: {peak-baseline:.2f} GB&quot;)<\/code><\/pre>\n<p>My run:<\/p>\n<pre><code>Baseline memory: 68.42 GB\nShort-prompt peak: 69.58 GB, delta: 1.16 GB\nLong-prompt peak: 75.23 GB, delta: 6.81 GB<\/code><\/pre>\n<p>Notice how the long prompt's delta <strong>dwarfs the short prompt's<\/strong>: prefill cost grows with the context you feed in, and max_model_len is the ceiling vLLM must budget every request against. Set it to 8192 and you pay for 8192.<\/p>\n<h3>Memory comparison<\/h3>\n<p>I later dropped max_model_len from 8192 to 2048, reran the profiling, and compared:<\/p>\n<table>\n<thead>\n<tr>\n<th>Config<\/th>\n<th>max_length<\/th>\n<th>Concurrency<\/th>\n<th>Peak memory<\/th>\n<th>Throughput<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Before<\/td>\n<td>8192<\/td>\n<td>1<\/td>\n<td>72GB<\/td>\n<td>15 
tok\/s<\/td>\n<\/tr>\n<tr>\n<td>After<\/td>\n<td>2048<\/td>\n<td>4<\/td>\n<td>62GB<\/td>\n<td>58 tok\/s<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><strong>Memory down 10GB, concurrency up 4x, throughput up nearly 4x<\/strong>.<\/p>\n<p>Beats buying more cards, doesn't it?<\/p>\n<hr \/>\n<h2>How to Call It: Launching the vLLM Service<\/h2>\n<p>Since this doubles as a delivery handbook, the invocation comes first.<\/p>\n<h3>Start the inference server<\/h3>\n<pre><code class=\"lang-bash language-bash bash\">python -m vllm.entrypoints.openai.api_server \\\n    --model meta-llama\/Llama-2-70b-chat-hf \\\n    --gpu-memory-utilization 0.9 \\\n    --max-model-len 2048 \\\n    --tensor-parallel-size 4 \\\n    --port 8000<\/code><\/pre>\n<h3>Send a request<\/h3>\n<pre><code class=\"lang-bash language-bash bash\">curl http:\/\/localhost:8000\/v1\/chat\/completions \\\n  -H &quot;Content-Type: application\/json&quot; \\\n  -d &#039;{\n    &quot;model&quot;: &quot;meta-llama\/Llama-2-70b-chat-hf&quot;,\n    &quot;messages&quot;: [\n      {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Write me a quicksort implementation&quot;}\n    ],\n    &quot;max_tokens&quot;: 512,\n    &quot;temperature&quot;: 0.7\n  }&#039;<\/code><\/pre>\n<p>Or with the Python SDK:<\/p>\n<pre><code class=\"lang-python language-python python\">from openai import OpenAI\n\nclient = OpenAI(base_url=&quot;http:\/\/localhost:8000\/v1&quot;, api_key=&quot;EMPTY&quot;)\n\nresponse = client.chat.completions.create(\n    model=&quot;meta-llama\/Llama-2-70b-chat-hf&quot;,\n    messages=[\n        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;You are a helpful assistant&quot;},\n        {&quot;role&quot;: &quot;user&quot;, 
&quot;content&quot;: &quot;Explain what the kv cache is&quot;}\n    ],\n    max_tokens=512,\n    temperature=0.7\n)\n\nprint(response.choices[0].message.content)<\/code><\/pre>\n<hr \/>\n<h2>Parameter Guide: How to Actually Set These Values<\/h2>\n<p>This is where most people trip, so let me go through them one by one.<\/p>\n<table>\n<thead>\n<tr>\n<th>Parameter<\/th>\n<th>Meaning<\/th>\n<th>Common mistake<\/th>\n<th>Recommendation<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>max_model_len<\/td>\n<td>Maximum context length the server accepts<\/td>\n<td>Set too high, wasting KV cache budget<\/td>\n<td>P99 prompt length + max_tokens<\/td>\n<\/tr>\n<tr>\n<td>gpu-memory-utilization<\/td>\n<td>Fraction of GPU memory vLLM may use (weights, activations, and KV cache)<\/td>\n<td>Too high risks OOM; too low starves throughput<\/td>\n<td>0.85-0.9; when memory is tight, shrink max_model_len first<\/td>\n<\/tr>\n<tr>\n<td>tensor-parallel-size<\/td>\n<td>Tensor parallel degree<\/td>\n<td>Wrong card count makes model loading fail<\/td>\n<td>Must match the number of GPUs you shard across<\/td>\n<\/tr>\n<tr>\n<td>block_size<\/td>\n<td>KV cache block size (tokens per block)<\/td>\n<td>Too small adds scheduling overhead; too large fragments<\/td>\n<td>The default 16 is fine<\/td>\n<\/tr>\n<tr>\n<td>max_num_seqs<\/td>\n<td>Max sequences per batch<\/td>\n<td>Too large blows up memory<\/td>\n<td>Derive from available memory and max_model_len<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>How to pick max_model_len<\/h3>\n<p><strong>Do not guess; measure your actual traffic first<\/strong>.<\/p>\n<pre><code class=\"lang-python 
language-python python\">import json\n\n# Assume your request logs live in requests.jsonl\nlengths = []\nwith open(&#039;requests.jsonl&#039;) as f:\n    for line in f:\n        req = json.loads(line)\n        # Rough word-count estimate; use the model&#039;s tokenizer for real numbers\n        lengths.append(len(req.get(&#039;prompt&#039;, &#039;&#039;).split()))\n\nlengths.sort()\np50 = lengths[int(len(lengths) * 0.5)]\np95 = lengths[int(len(lengths) * 0.95)]\np99 = lengths[int(len(lengths) * 0.99)]\n\nprint(f&quot;P50: {p50}, P95: {p95}, P99: {p99}&quot;)\nprint(f&quot;Suggested max_model_len = {p99 + 512}&quot;)  # +512 reserves room for the output<\/code><\/pre>\n<hr \/>\n<h2>The Fix: Step by Step<\/h2>\n<h3>Step 1: find out what is actually eating the memory<\/h3>\n<pre><code class=\"lang-bash language-bash bash\"># Watch memory in real time\nwatch -n 1 nvidia-smi\n\n# Watch the kv cache stats in the vLLM startup logs\npython -m vllm.entrypoints.openai.api_server \\\n    --model meta-llama\/Llama-2-70b-chat-hf \\\n    --max-model-len 8192 \\\n    --tensor-parallel-size 4 2&gt;&amp;1 | grep -i &quot;kv cache&quot;<\/code><\/pre>\n<p>A healthy log line looks something like:<\/p>\n<pre><code>KV cache size: 51380224 bytes, allocated: 45749248 bytes (89.0%)<\/code><\/pre>\n<p>If allocated approaches 100% and requests start queuing, that is the KV 
cache pool running out.<\/p>\n<h3>Step 2: run the profiling script and quantify it<\/h3>\n<p>The script is above; run it and you can see exactly which stage eats the memory.<\/p>\n<h3>Step 3: tune, then verify<\/h3>\n<p>Shrink max_model_len, rerun the load test, and watch memory and throughput.<\/p>\n<hr \/>\n<h2>Verification and Evaluation<\/h2>\n<p>Do not call it done at launch; keep watching these metrics:<\/p>\n<h3>Post-launch metrics<\/h3>\n<pre><code class=\"lang-bash language-bash bash\"># 1. GPU memory utilization - steady below 85% is healthy\nnvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5\n\n# 2. P99 request latency - investigate anything over 2 seconds\ncurl -s http:\/\/localhost:8000\/v1\/metrics | grep &quot;request_latency_seconds&quot;\n\n# 3. KV cache hit rate - below 90% means the cache is undersized\ncurl -s http:\/\/localhost:8000\/v1\/metrics | grep &quot;kv_cache_hit&quot;\n\n# 4. 
Queue length - a sustained backlog means concurrency is maxed out\ncurl -s http:\/\/localhost:8000\/v1\/metrics | grep &quot;num_requests_waiting&quot;<\/code><\/pre>\n<p>If the KV cache hit rate is low, nudge gpu-memory-utilization up a little; if latency is high, first check whether max_model_len is now too small and truncating requests.<\/p>\n<h3>Capacity envelope<\/h3>\n<p>From my measurements, the envelope for a 70B model on 4x A100 is roughly:<\/p>\n<pre><code>max_model_len=2048: ~4 concurrent, max_tokens &le; 512\nmax_model_len=4096: ~2 concurrent, max_tokens &le; 1024\nmax_model_len=8192: ~1 concurrent, max_tokens &le; 2048<\/code><\/pre>\n<p>Cross that envelope and you OOM; not because the cards are too small, but because the KV cache arithmetic no longer fits.<\/p>\n<h3>Regression checks<\/h3>\n<p>After retuning you must regression-test; I usually run:<\/p>\n<ol>\n<li><strong>Functional<\/strong>: confirm long prompts are no longer truncated and outputs are complete<\/li>\n<li><strong>Load test<\/strong>: 150% of target QPS sustained for 5 minutes, watching for OOM<\/li>\n<li><strong>Latency comparison<\/strong>: no visible regression at P50\/P95\/P99<\/li>\n<\/ol>\n<pre><code class=\"lang-bash language-bash bash\"># Load-test example (wrk)\nwrk -t4 -c100 -d300s -s post.lua http:\/\/localhost:8000\/v1\/chat\/completions<\/code><\/pre>\n<hr 
\/>\n<h2>Common Pitfalls: Configs Not to Mess With<\/h2>\n<h3>Pitfall 1: raising block_size<\/h3>\n<p>I once raised block_size from 16 to 64 to cut scheduling overhead.<\/p>\n<p>Fragmentation exploded on long inputs and peak memory actually went up.<\/p>\n<p><strong>Why<\/strong>: a large block_size causes internal fragmentation. A 1000-token request at block_size=64 needs 16 blocks (1000 \/ 64 = 15.625), and the last block holds only 40 of its 64 token slots: 37.5% of that block is wasted, on every sequence.<\/p>\n<h3>Pitfall 2: max_tokens set too high<\/h3>\n<p>Some people set max_tokens=4096 for generous output room, but in this customer-service scenario 99% of answers fit within 512 tokens.<\/p>\n<p>max_tokens counts toward the KV cache budget too; the surplus is pure memory waste.<\/p>\n<h3>Pitfall 3: context accumulation in multi-turn chat<\/h3>\n<p>If you use the chat endpoint, <strong>the multi-turn history keeps piling up in the context<\/strong>.<\/p>\n<pre><code class=\"lang-python language-python python\"># Turn 1\nmessages = [{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Write me a quicksort&quot;}]  # 100 tokens\n\n# Turn 5\nmessages = [\n    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Write me a quicksort&quot;},      # 100 tokens\n    {&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: &quot;...500 tokens...&quot;},\n    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Add a benchmark&quot;},     # 50 tokens\n    # ... 
keeps accumulating; by now possibly 3000 tokens\n]<\/code><\/pre>\n<p><strong>Fix<\/strong>: cap the history at the application layer (for example, keep only the last 3 turns), or lower max_tokens.<\/p>\n<h3>Pitfall 4: PyTorch cache not released<\/h3>\n<p>If profiling shows normal kv cache usage but you still OOM, only then start suspecting a memory leak.<\/p>\n<p>I hit this once; this is what pinned it down:<\/p>\n<pre><code class=\"lang-python language-python python\">import torch\n\n# Manually release cached blocks after each inference run\ntorch.cuda.empty_cache()\n\n# Inspect memory fragmentation\nprint(torch.cuda.memory_summary())<\/code><\/pre>\n<hr \/>\n<h2>Delivery Checklist: Verify Before Going Live<\/h2>\n<ol>\n<li><strong>Measure the prompt-length distribution<\/strong> of historical requests; use P99 + max_tokens as max_model_len<\/li>\n<li><strong>Run the profiling script<\/strong> and confirm peak memory stays under 90% of total GPU memory<\/li>\n<li><strong>Load-test to 150% of target QPS<\/strong> and watch for OOM<\/li>\n<li><strong>Check for multi-turn scenarios<\/strong>; if present, add a history cap<\/li>\n<li><strong>Wire up alerts<\/strong>: memory &gt;90%, P99 latency &gt;2s, queue backlog &gt;10<\/li>\n<\/ol>\n<hr 
\/>\n<h2>Wrapping Up<\/h2>\n<p>Honestly, this whole incident was avoidable before launch; we simply never profiled.<\/p>\n<p>Everyone rushes to ship, and configs get set to roughly-good-enough. Then production blows up, you dig back in, and the extra GPU spend would have bought several servers outright.<\/p>\n<p>Remember these:<\/p>\n<ol>\n<li><strong>OOM does not necessarily mean you are short on VRAM<\/strong>; profile the kv cache usage first<\/li>\n<li><strong>max_length is the biggest memory consumer<\/strong>; keep it as small as you can, and measure real lengths before choosing it<\/li>\n<li><strong>Load-test before launch<\/strong>; never trust it-should-be-fine<\/li>\n<li><strong>Watch context accumulation in multi-turn chat<\/strong>; this pit is easy to miss<\/li>\n<\/ol>\n<p>Tooling note: vLLM's memory accounting beats raw transformers by a mile; at least it tells you who ate your memory. With plain PyTorch this hunt could have dragged on forever.<\/p>","protected":false},"excerpt":{"rendered":"<p>A 70B model on four 80G cards blew up the moment a long prompt arrived. The investigation found it was not a capacity problem: max_length was set too high, and the kv cache was being budgeted for the worst case. No profiling before launch, and we nearly spent another 200,000 RMB on cards.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[401],"tags":[527,526,525,324,489,293],"class_list":["post-707","post","type-post","status-publish","format-standard","hentry","category-ai","tag-kv-cache","tag-llm","tag-oom","tag-vllm","tag-489","tag-293"],"views":4,"_links":{"self":[{"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=\/wp\/v2\/posts\/707","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=707"}],"version-history":[{"count":1,"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=\/wp\/v2\/posts\/707\/revisions"}],"predecessor-version":[{"id":721,"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=\/wp\/v2\/posts\/707\/revisions\/721"}],"wp:attachment":[{"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=707"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=707"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.liaoxinghui.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=707"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}