Use AutoGen for Local LLMs
TL;DR: We demonstrate how to use AutoGen for local LLM applications. As an example, we will launch an endpoint using FastChat and perform inference on ChatGLM2-6B.
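In sketch form, the workflow is: FastChat serves an OpenAI-compatible REST API for the local model, and AutoGen talks to it as if it were the OpenAI endpoint. The model path, port, and config field names below are assumptions (for instance, older autogen/openai versions use `api_base` instead of `base_url`), so adjust them to your setup:

```python
# Step 1 (run in separate terminals): launch FastChat's
# OpenAI-compatible endpoint for a local ChatGLM2-6B model.
#
#   python -m fastchat.serve.controller
#   python -m fastchat.serve.model_worker --model-path THUDM/chatglm2-6b
#   python -m fastchat.serve.openai_api_server --host localhost --port 8000
#
# Step 2: point AutoGen at the local endpoint instead of api.openai.com.
from autogen import oai

response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "base_url": "http://localhost:8000/v1",  # the local FastChat server
            "api_key": "NULL",  # placeholder; FastChat does not validate keys
        }
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)
```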
GPT-4 is a big upgrade in foundation model capability, e.g., in code and math, but it comes with a much higher price per token, more than 10x that of GPT-3.5-Turbo. On HumanEval, a code completion benchmark developed by OpenAI, GPT-4 successfully solves 68% of the tasks while GPT-3.5-Turbo solves 46%. GPT-4's success rate can be pushed higher by generating multiple responses or making multiple calls, but that further increases a cost that is already nearly 20 times that of GPT-3.5-Turbo, under a more restrictive API rate limit. Can we achieve more with less?
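To make the "multiple responses" idea concrete, here is a hedged sketch using the pre-1.0 OpenAI Python client: request several candidates in one call and keep the first that passes a check. The prompt, `n=3`, and the toy `def`-based filter are illustrative assumptions, not a method from this post:

```python
import openai  # pre-1.0 style API; newer clients use openai.OpenAI().chat.completions

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
    n=3,              # generate three candidate responses in one request
    temperature=1.0,  # nonzero temperature so the candidates differ
)

# Keep the first candidate that looks like code; any domain-specific
# verifier (unit tests, a grader, etc.) could be substituted here.
best = next(
    (c["message"]["content"] for c in response["choices"]
     if "def" in c["message"]["content"]),
    response["choices"][0]["message"]["content"],
)
print(best)
```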
In this blog post, we will explore a creative, adaptive way of using GPT models that leads to a big leap forward.
TL;DR:
* Just by tuning inference parameters such as the model, the number of responses, and the temperature, without changing any model weights or prompts, the baseline accuracy of untuned gpt-4 on high school math competition problems can be improved by 20%.
* For easy problems, the tuned gpt-3.5-turbo model vastly outperformed untuned gpt-4 in both accuracy (e.g., 90% vs. 70%) and cost efficiency. For hard problems, the tuned gpt-4 is much more accurate (e.g., 35% vs. 20%) and less expensive than untuned gpt-4.
* AutoGen can help with model selection, parameter tuning, and cost saving in LLM applications; a sketch of such a tuning run follows the list.
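As a rough illustration of that last point, here is a minimal sketch of a tuning run using autogen's `ChatCompletion.tune` utility. The toy dataset, the success metric, and the budget values are all illustrative assumptions, and the exact API surface has shifted across autogen versions, so treat this as a sketch rather than the experiment behind the numbers above:

```python
import autogen

# Toy tuning data: each instance has a problem and its reference answer.
# A real experiment would use a benchmark such as the MATH dataset.
tune_data = [
    {"problem": "What is 2 + 2?", "solution": "4"},
    {"problem": "What is 3 * 5?", "solution": "15"},
]

def eval_math_responses(responses, solution, **kwargs):
    # Placeholder metric: count the instance as a success if the reference
    # answer appears in any response. A real grader would parse and
    # compare the final answers.
    return {"success": any(solution in r for r in responses)}

config, analysis = autogen.ChatCompletion.tune(
    data=tune_data,                 # instances to optimize over
    metric="success",               # metric returned by eval_func
    mode="max",                     # maximize the metric
    eval_func=eval_math_responses,  # how to score a set of responses
    inference_budget=0.02,          # max average $ per instance at inference
    optimization_budget=1,          # max total $ to spend on tuning
    num_samples=20,                 # number of configurations to try
    prompt="{problem} Solve the problem and give only the final answer.",
)
print(config)  # the best inference configuration found
```

The search space covers choices such as the model, the number of responses, and the temperature, which is how a tuned configuration can beat an untuned one without touching model weights.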