这篇文章不错，转发给_老虎社区_美港股上老虎 - 老虎社区

这篇文章不错，转发给大家看看

英伟达下场，首次优化DeepSeek-R1！B200性能狂飙25倍，碾压H100

免责声明：上述内容仅代表发帖人个人观点，不构成本平台的任何投资建议。

精彩评论

我们需要你的真知灼见来填补这片空白

发表看法

{"i18n":{"language":"zh_CN"},"detailType":1,"isChannel":false,"data":{"magic":2,"id":407838939677272,"tweetId":"407838939677272","gmtCreate":1740584989247,"gmtModify":1740584991071,"author":{"id":3494285125388429,"idStr":"3494285125388429","authorId":3494285125388429,"authorIdStr":"3494285125388429","name":"Saraat","avatar":"https://static.tigerbbs.com/7fd697144f3aab0355b28cc6f7d75e8e","vip":1,"userType":1,"introduction":"","boolIsFan":false,"boolIsHead":false,"crmLevel":2,"crmLevelSwitch":1,"currentWearingBadge":{"badgeId":"e50ce593bb40487ebfb542ca54f6a561-4","templateUuid":"e50ce593bb40487ebfb542ca54f6a561","name":"明星虎友","description":"加入老虎社区2000天","bigImgUrl":"https://static.tigerbbs.com/dddf24b906c7011de2617d4fb3f76987","smallImgUrl":"https://static.tigerbbs.com/53d58ad32c97254c6f74db8b97e6ec49","grayImgUrl":"https://static.tigerbbs.com/6304700d92ad91c7a33e2e92ec32ecc1","redirectLinkEnabled":0,"hasAllocated":1,"isWearing":1,"stampPosition":0,"hasStamp":0,"allocationCount":1,"allocatedDate":"2024.01.06","individualDisplayEnabled":0},"individualDisplayBadges":[],"fanSize":358,"starInvestorFlag":false},"themes":[],"images":[],"coverImages":[],"html":"<html><head></head><body>这篇文章不错，转发给大家看看</body></html>","htmlText":"<html><head></head><body>这篇文章不错，转发给大家看看</body></html>","text":"这篇文章不错，转发给大家看看","highlighted":1,"essential":1,"paper":1,"likeSize":0,"commentSize":0,"repostSize":0,"favoriteSize":0,"link":"https://laohu8.com/post/407838939677272","repostId":2514856774,"repostType":2,"repost":{"id":"2514856774","kind":"news","pubTimestamp":1740581127,"share":"https://www.laohu8.com/m/news/2514856774?lang=zh_CN&edition=full","pubTime":"2025-02-26 22:45","market":"hk","language":"zh","title":"英伟达下场，首次优化DeepSeek-R1！B200性能狂飙25倍，碾压H100","url":"https://stock-news.laohu8.com/highlight/detail?id=2514856774","media":"华尔街见闻","summary":"最近，英伟达开源了首个在Blackwell架构上优化的DeepSeek-R1，实现了推理速度提升25倍，和每token成本降低20倍的惊人成果。同时，DeepSeek连续开源多个英伟达GPU优化项目，共同探索模型性能极限。","content":"<html><body><p>当FP4的魔法与Blackwell的强大算力相遇，会碰撞出怎样的火花？</p>\n<p>答案是：推理性能暴涨25倍，成本狂降20倍！</p>\n<p>随着DeepSeek-R1本地化部署的爆火，<a href=\"https://laohu8.com/S/NVDA\">英伟达</a>也亲自下场，开源了首个基于Blackwell架构的优化方案——DeepSeek-R1-FP4。</p>\n<p><img src=\"https://wpimg-wscn.awtmt.com/7c25e800-ff9f-402b-add1-482a3f8284b0.png\"/></p>\n<p>在新模型的加持下，B200实现了高达21,088 token每秒的的推理吞吐量，相比于H100的844 token每秒，提升了25倍。</p>\n<p>与此同时，每token的成本也实现了20倍的降低。</p>\n<p>通过在Blackwell架构上应用TensorRT DeepSeek优化，英伟达让具有FP4生产级精度的模型，在MMLU通用智能基准测试中达到了FP8模型性能的99.8%。</p>\n<p><img src=\"https://wpimg-wscn.awtmt.com/12326fc1-a988-45c8-adc9-d49e3040e03e.png\"/></p>\n<h2>DeepSeek-R1首次基于Blackwell GPU优化</h2>\n<p>目前，英伟达基于FP4优化的DeepSeek-R1检查点现已在Hugging Face上开源。</p>\n<p><img src=\"https://wpimg-wscn.awtmt.com/239a6414-e0cc-4868-95a9-0b4320845173.png\"/></p>\n<p>模型地址：https://huggingface.co/nvidia/DeepSeek-R1-FP4</p>\n<p><strong>后训练量化</strong></p>\n<p>模型将Transformer模块内的线性算子的权重和激活量化到了FP4，适用于TensorRT-LLM推理。</p>\n<p>这种优化将每个参数从8位减少到4位，从而让磁盘空间和GPU显存的需求减少了约1.6倍。</p>\n<p><strong>使用TensorRT-LLM部署</strong></p>\n<p>要使用TensorRT-LLM LLM API部署量化后的FP4权重文件，并为给定的提示生成文本响应，请参照以下示例代码：</p>\n<p>硬件要求：需要支持TensorRT-LLM的英伟达GPU（如B200），并且需要8个GPU来实现tensor_parallel_size=8的张量并行。</p>\n<p>性能优化：代码利用FP4量化、TensorRT引擎和并行计算，旨在实现高效、低成本的推理，适合生产环境或高吞吐量应用。</p>\n<p><img src=\"https://wpimg-wscn.awtmt.com/4cdd0507-f77c-4ea8-8dc9-783c7c5867fb.png\"/></p>\n<p>对于此次优化的成果，网友表示惊叹。</p>\n<p>“FP4魔法让AI未来依然敏锐！”网友Isha评论道。</p>\n<p><img src=\"https://wpimg-wscn.awtmt.com/d15ae77b-c345-40aa-9094-37391041223d.png\"/></p>\n<p>网友algorusty则声称，有了这次的优化后，美国供应商能够以每百万token 0.25美元的价格提供R1。</p>\n<p>“还会有利润。”</p>\n<p><img src=\"https://wpimg-wscn.awtmt.com/6500994e-9e87-47ab-af97-f2e833a55764.png\"/></p>\n<p>网友Phil则将这次的优化与DeepSeek本周的开源5连发结合了起来。</p>\n<p>“这展示了硬件和开源模型结合的可能性。”他表示。</p>\n<p><img src=\"https://wpimg-wscn.awtmt.com/6642e288-815f-422d-84e0-89f7de163010.png\"/></p>\n<h2>DeepSeek全面开源</h2>\n<p>如今DeepSeek持续5天的“开源周”已经进行到了第3天。</p>\n<p>周一，他们开源了FlashMLA。这是DeepSeek专为英伟达Hopper GPU打造的高效MLA解码内核，特别针对变长序列进行了优化，目前已正式投产使用。</p>\n<p>周二开源了DeepEP，这是一个专为混合专家系统（MoE）和专家并行（EP）设计的通信库。</p>\n<p>周三开源的是DeepGEMM。这是一个支持稠密和MoE模型的FP8 GEMM（通用矩阵乘法）计算库，可为V3/R1的训练和推理提供强大支持。</p>\n<p>总的来说，不管是英伟达开源的DeepSeek-R1-FP4，还是DeepSeek开源的三个仓库，都是通过对英伟达GPU和集群的优化，来推动AI模型的高效计算和部署。</p>\n<p><span>本文来源：新智元，原文标题：《英伟达下场，首次优化DeepSeek-R1！B200性能狂飙25倍，碾压H100》</span></p><div>风险提示及免责条款</div>\n<div>\n            市场有风险，投资需谨慎。本文不构成个人投资建议，也未考虑到个别用户特殊的投资目标、财务状况或需要。用户应考虑本文中的任何意见、观点或结论是否符合其特定状况。据此投资，责任自负。\n          </div>\n</body></html>","source":"wallstreetcn_hot_news","collect":0,"html":"<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n<meta name=\"viewport\" content=\"width=device-width,initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no\"/>\n<meta name=\"format-detection\" content=\"telephone=no,email=no,address=no\" />\n<title>英伟达下场，首次优化DeepSeek-R1！B200性能狂飙25倍，碾压H100</title>\n<style type=\"text/css\">\na,abbr,acronym,address,applet,article,aside,audio,b,big,blockquote,body,canvas,caption,center,cite,code,dd,del,details,dfn,div,dl,dt,\nem,embed,fieldset,figcaption,figure,footer,form,h1,h2,h3,h4,h5,h6,header,hgroup,html,i,iframe,img,ins,kbd,label,legend,li,mark,menu,nav,\nobject,ol,output,p,pre,q,ruby,s,samp,section,small,span,strike,strong,sub,summary,sup,table,tbody,td,tfoot,th,thead,time,tr,tt,u,ul,var,video{ font:inherit;margin:0;padding:0;vertical-align:baseline;border:0 }\nbody{ font-size:16px; line-height:1.5; color:#999; background:transparent; }\n.wrapper{ overflow:hidden;word-break:break-all;padding:10px; }\nh1,h2{ font-weight:normal; line-height:1.35; margin-bottom:.6em; }\nh3,h4,h5,h6{ line-height:1.35; margin-bottom:1em; }\nh1{ font-size:24px; }\nh2{ font-size:20px; }\nh3{ font-size:18px; }\nh4{ font-size:16px; }\nh5{ font-size:14px; }\nh6{ font-size:12px; }\np,ul,ol,blockquote,dl,table{ margin:1.2em 0; }\nul,ol{ margin-left:2em; }\nul{ list-style:disc; }\nol{ list-style:decimal; }\nli,li p{ margin:10px 0;}\nimg{ max-width:100%;display:block;margin:0 auto 1em; }\nblockquote{ color:#B5B2B1; border-left:3px solid #aaa; padding:1em; }\nstrong,b{font-weight:bold;}\nem,i{font-style:italic;}\ntable{ width:100%;border-collapse:collapse;border-spacing:1px;margin:1em 0;font-size:.9em; }\nth,td{ padding:5px;text-align:left;border:1px solid #aaa; }\nth{ font-weight:bold;background:#5d5d5d; }\n.symbol-link{font-weight:bold;}\n/* header{ border-bottom:1px solid #494756; } */\n.title{ margin:0 0 8px;line-height:1.3;color:#ddd; }\n.meta {color:#5e5c6d;font-size:13px;margin:0 0 .5em; }\na{text-decoration:none; color:#2a4b87;}\n.meta .head { display: inline-block; overflow: hidden}\n.head .h-thumb { width: 30px; height: 30px; margin: 0; padding: 0; border-radius: 50%; float: left;}\n.head .h-content { margin: 0; padding: 0 0 0 9px; float: left;}\n.head .h-name {font-size: 13px; color: #eee; margin: 0;}\n.head .h-time {font-size: 11px; color: #7E829C; margin: 0;line-height: 11px;}\n.small {font-size: 12.5px; display: inline-block; transform: scale(0.9); -webkit-transform: scale(0.9); transform-origin: left; -webkit-transform-origin: left;}\n.smaller {font-size: 12.5px; display: inline-block; transform: scale(0.8); -webkit-transform: scale(0.8); transform-origin: left; -webkit-transform-origin: left;}\n.bt-text {font-size: 12px;margin: 1.5em 0 0 0}\n.bt-text p {margin: 0}\n</style>\n</head>\n<body>\n<div class=\"wrapper\">\n<header>\n<h2 class=\"title\">\n英伟达下场，首次优化DeepSeek-R1！B200性能狂飙25倍，碾压H100\n</h2>\n\n<h4 class=\"meta\">\n\n\n2025-02-26 22:45 北京时间&nbsp;&nbsp;&nbsp;<a href=https://wallstreetcn.com/articles/3741953><strong>华尔街见闻</strong></a>\n\n\n</h4>\n\n</header>\n<article>\n<div>\n<p>当FP4的魔法与Blackwell的强大算力相遇，会碰撞出怎样的火花？\n答案是：推理性能暴涨25倍，成本狂降20倍！\n随着DeepSeek-R1本地化部署的爆火，英伟达也亲自下场，开源了首个基于Blackwell架构的优化方案——DeepSeek-R1-FP4。\n\n在新模型的加持下，B200实现了高达21,088 token每秒的的推理吞吐量，相比于H100的844 token每秒，提升了25倍。...</p>\n\n<a href=\"https://wallstreetcn.com/articles/3741953\">网页链接</a>\n\n</div>\n\n\n</article>\n</div>\n</body>\n</html>\n","type":0,"thumbnail":"https://wpimg-wscn.awtmt.com/158f0d7c-3002-4a68-9a03-92623961aa6d.jpeg","relate_stocks":{"NVDA":"英伟达"},"source_url":"https://wallstreetcn.com/articles/3741953","is_english":false,"share_image_url":"https://static.laohu8.com/e9f99090a1c2ed51c021029395664489","article_id":"2514856774","content_text":"当FP4的魔法与Blackwell的强大算力相遇，会碰撞出怎样的火花？\n答案是：推理性能暴涨25倍，成本狂降20倍！\n随着DeepSeek-R1本地化部署的爆火，英伟达也亲自下场，开源了首个基于Blackwell架构的优化方案——DeepSeek-R1-FP4。\n\n在新模型的加持下，B200实现了高达21,088 token每秒的的推理吞吐量，相比于H100的844 token每秒，提升了25倍。\n与此同时，每token的成本也实现了20倍的降低。\n通过在Blackwell架构上应用TensorRT DeepSeek优化，英伟达让具有FP4生产级精度的模型，在MMLU通用智能基准测试中达到了FP8模型性能的99.8%。\n\nDeepSeek-R1首次基于Blackwell GPU优化\n目前，英伟达基于FP4优化的DeepSeek-R1检查点现已在Hugging Face上开源。\n\n模型地址：https://huggingface.co/nvidia/DeepSeek-R1-FP4\n后训练量化\n模型将Transformer模块内的线性算子的权重和激活量化到了FP4，适用于TensorRT-LLM推理。\n这种优化将每个参数从8位减少到4位，从而让磁盘空间和GPU显存的需求减少了约1.6倍。\n使用TensorRT-LLM部署\n要使用TensorRT-LLM LLM API部署量化后的FP4权重文件，并为给定的提示生成文本响应，请参照以下示例代码：\n硬件要求：需要支持TensorRT-LLM的英伟达GPU（如B200），并且需要8个GPU来实现tensor_parallel_size=8的张量并行。\n性能优化：代码利用FP4量化、TensorRT引擎和并行计算，旨在实现高效、低成本的推理，适合生产环境或高吞吐量应用。\n\n对于此次优化的成果，网友表示惊叹。\n“FP4魔法让AI未来依然敏锐！”网友Isha评论道。\n\n网友algorusty则声称，有了这次的优化后，美国供应商能够以每百万token 0.25美元的价格提供R1。\n“还会有利润。”\n\n网友Phil则将这次的优化与DeepSeek本周的开源5连发结合了起来。\n“这展示了硬件和开源模型结合的可能性。”他表示。\n\nDeepSeek全面开源\n如今DeepSeek持续5天的“开源周”已经进行到了第3天。\n周一，他们开源了FlashMLA。这是DeepSeek专为英伟达Hopper GPU打造的高效MLA解码内核，特别针对变长序列进行了优化，目前已正式投产使用。\n周二开源了DeepEP，这是一个专为混合专家系统（MoE）和专家并行（EP）设计的通信库。\n周三开源的是DeepGEMM。这是一个支持稠密和MoE模型的FP8 GEMM（通用矩阵乘法）计算库，可为V3/R1的训练和推理提供强大支持。\n总的来说，不管是英伟达开源的DeepSeek-R1-FP4，还是DeepSeek开源的三个仓库，都是通过对英伟达GPU和集群的优化，来推动AI模型的高效计算和部署。\n本文来源：新智元，原文标题：《英伟达下场，首次优化DeepSeek-R1！B200性能狂飙25倍，碾压H100》风险提示及免责条款\n\n            市场有风险，投资需谨慎。本文不构成个人投资建议，也未考虑到个别用户特殊的投资目标、财务状况或需要。用户应考虑本文中的任何意见、观点或结论是否符合其特定状况。据此投资，责任自负。","news_type":1,"symbols_score_info":{"NVDA":1}},"isVote":1,"tweetType":1,"viewCount":1643,"commentLimit":10,"likeStatus":false,"favoriteStatus":false,"reportStatus":false,"symbols":[],"verified":2,"subType":0,"readableState":1,"langContent":"CN","currentLanguage":"CN","warmUpFlag":false,"orderFlag":false,"shareable":true,"causeOfNotShareable":"","featuresForAnalytics":[],"commentAndTweetFlag":false,"andRepostAutoSelectedFlag":false,"upFlag":false,"length":27,"optionInvolvedFlag":false,"xxTargetLangEnum":"ZH_CN"},"commentList":[],"isCommentEnd":true,"isTiger":false,"isWeiXinMini":false,"url":"/m/post/407838939677272"}