My Cool AI Agent


Model Context Protocol (MCP) Details

MCP

  • The Model Context Protocol (MCP) is an open-standard, open-source framework introduced by Anthropic in November 2024 to standardize how AI models, particularly large language models (LLMs), integrate with external tools, data sources, and systems. Often compared to a “USB-C for AI applications,” MCP provides a universal, model-agnostic interface to connect AI agents with diverse resources, reducing the need for custom integrations and addressing the “N×M problem” (where N AI applications would otherwise each need custom connectors to M data sources, yielding N×M bespoke integrations).


  • Below is a detailed description of the MCP framework, its components, workflow, and a textual representation of its architecture diagram, followed by key benefits and use cases.

Detailed Description of the MCP Framework

MCP is designed to enable seamless, secure, and scalable interactions between AI models and external systems, such as databases, APIs, file systems, or business tools (e.g., Google Drive, Slack, GitHub). It standardizes the exchange of context, tools, and prompts, allowing AI agents to dynamically discover, inspect, and invoke resources without bespoke code for each integration. The protocol is built on JSON-RPC 2.0, drawing inspiration from the Language Server Protocol (LSP), and supports multiple transport mechanisms for flexibility.

Core Components of MCP


  1. MCP Host:
    • The host is the primary application that interacts with the user and coordinates the AI workflow. Examples include AI-powered chat applications (e.g., Claude Desktop), integrated development environments (IDEs) like Cursor or Zed, or custom AI agents.
    • The host manages LLM interactions and integrates the MCP client to communicate with MCP servers.


  2. MCP Client:
    • Embedded within the host application, the client handles communication with MCP servers. It translates the host’s requirements into MCP-compliant requests and processes responses.
    • Clients are responsible for initiating connections, performing capability discovery, and invoking tools or accessing resources provided by servers.
    • Clients are built using SDKs in languages like Python, TypeScript, Java, C#, or Kotlin.


  3. MCP Server:
    • Servers are lightweight, independent programs that expose specific capabilities, such as tools, resources, or prompts, to MCP clients.
    • Each server focuses on a single integration point, e.g., a GitHub server for repository access, a PostgreSQL server for database queries, or a Slack server for messaging.
    • Servers define available functions, their schemas (via JSON), and metadata, enabling AI models to decide when and how to use them.


  4. Transport Layer:
    • The transport layer facilitates communication between clients and servers. MCP supports two primary transport mechanisms:
      • STDIO (Standard Input/Output): Used for local integrations, where the server runs in the same environment as the client. It’s simple and ideal for command-line tools or local development.
      • HTTP + SSE (Server-Sent Events): Used for remote connections, with HTTP POST for client-to-server requests and SSE for server-to-client streaming.
    • All communication uses JSON-RPC 2.0 for structured message exchange, ensuring consistency and interoperability.


  5. Primitives:
    • MCP defines three core primitives that servers expose to clients (a minimal server sketch exposing each primitive appears after this component list):
      • Tools (Model-Controlled): Executable functions that the AI model can invoke, such as API calls, database queries, or file operations. Tools include metadata (e.g., descriptions, parameters) to guide LLM usage.
      • Resources (Application-Controlled): Data sources like files, database records, or API responses that provide context without side effects (similar to GET endpoints in REST APIs).
      • Prompts (User-Controlled): Predefined templates that guide how tools or resources are used, allowing users to customize interactions.


  6. Protocol Layer:
    • The protocol layer manages message framing, request-response linking, and high-level communication patterns.
    • It supports dynamic context windows that grow with interactions, storing user preferences (e.g., language, tone) and session-specific data.
    • Error handling is standardized with JSON-RPC error codes (e.g., ParseError = -32700, InvalidRequest = -32600).


  7. Security and Authentication:
    • MCP emphasizes security through:
      • OAuth 2.1: Supports secure authentication for remote servers, with dynamic client registration and automatic endpoint discovery.
      • Local-First Design: Prioritizes local or self-hosted servers to minimize data exposure.
      • Permission Controls: Enforces explicit user authorization and sandboxing for file access.
      • Tool Safety: Metadata annotations indicate tool behavior (e.g., read-only vs. destructive) to prevent unintended actions.
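
To make these components concrete, here is a minimal server sketch using the MCP Python SDK's FastMCP helper. It exposes one example of each primitive (a tool, a resource, and a prompt) over the STDIO transport. The server name, the example functions, and the resource URI are illustrative assumptions, and decorator details may vary across SDK versions.

# Minimal MCP server sketch (Python SDK, FastMCP). The tool, resource, and
# prompt below are hypothetical examples, not part of the MCP specification.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """A model-controlled Tool: an executable function the LLM can invoke."""
    return a + b

@mcp.resource("config://app-settings")
def app_settings() -> str:
    """An application-controlled Resource: read-only context, no side effects."""
    return '{"language": "en", "tone": "concise"}'

@mcp.prompt()
def summarize(text: str) -> str:
    """A user-controlled Prompt template."""
    return f"Summarize the following text in three bullet points:\n\n{text}"

if __name__ == "__main__":
    mcp.run()  # defaults to the STDIO transport for local integrations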

MCP Workflow

The MCP workflow follows a client-server architecture with a clear sequence of steps:


  1. Initialization:
    • The host application starts and creates one or more MCP clients.
    • Clients perform a handshake with MCP servers to negotiate capabilities and protocol versions.


  2. Discovery:
    • The client queries the server for available tools, resources, and prompts.
    • The server responds with a list of capabilities, including JSON schemas describing tool functions, resource endpoints, and prompt templates.


  3. Context Provision:
    • The host application makes resources and prompts available to the user or formats tools into an LLM-compatible structure (e.g., function-calling schemas).
    • The LLM decides which tools to invoke based on the user’s query and available context.


  4. Tool Invocation:
    • The client sends a request to the server to execute a tool or retrieve a resource (an illustrative JSON-RPC exchange appears after this workflow list).
    • The server processes the request (e.g., queries a database, calls an API) and returns the result via JSON-RPC.


  5. Response Integration:
    • The client passes the server’s response to the host, which integrates it into the LLM’s context.
    • The LLM generates a response for the user, enriched with real-time data or tool outputs.


  6. Feedback Loop:
    • User interactions or new data can be fed back into the system, updating the context window or triggering additional tool calls.
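
The Discovery and Tool Invocation steps above boil down to a small set of JSON-RPC 2.0 messages. The sketch below shows an illustrative exchange written as Python dictionaries; the tools/list and tools/call method names follow the MCP specification, but the search_issues tool and its fields are hypothetical and the payloads are simplified relative to the full schema.

# Illustrative (simplified) JSON-RPC 2.0 messages for discovery and invocation.
import json

discovery_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

discovery_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [{
            "name": "search_issues",                  # hypothetical tool
            "description": "Search issues in a GitHub repository",
            "inputSchema": {                          # JSON Schema for the arguments
                "type": "object",
                "properties": {"repo": {"type": "string"},
                               "query": {"type": "string"}},
                "required": ["repo", "query"],
            },
        }]
    },
}

invocation_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "search_issues",
               "arguments": {"repo": "octocat/hello-world", "query": "login bug"}},
}

print(json.dumps(invocation_request, indent=2))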


Key Features and Benefits

  • Standardization: MCP eliminates the need for custom connectors by providing a single protocol for all integrations, reducing development overhead.
  • Interoperability: Model-agnostic design allows switching between LLM providers (e.g., Claude, ChatGPT, LLaMA) without reengineering integrations.
  • Scalability: Modular server architecture supports adding or removing integrations without altering the core application.
  • Security: OAuth 2.1, local-first design, and permission controls ensure secure data access and compliance with standards like GDPR and SOC2.
  • Ecosystem Growth: Supported by major platforms (e.g., LangChain, OpenAI Agent SDK, Google Agent Developer Kit) and pre-built servers for tools like GitHub, Slack, and PostgreSQL.
  • Flexibility: Supports both local (STDIO) and remote (HTTP+SSE) integrations, with SDKs in multiple languages (Python, TypeScript, Java, C#, Kotlin).


Use Cases

  • Software Development: IDEs like Zed or Sourcegraph use MCP to provide coding assistants with real-time access to codebases, Git repositories, or documentation.
  • Enterprise Automation: Companies like Block use MCP to connect internal assistants to CRMs, knowledge bases, or proprietary databases for context-aware responses.
  • Natural Language Data Access: Applications like AI2SQL leverage MCP to enable LLMs to query SQL databases using plain language.
  • Personal Productivity: Assistants like Claude Desktop use MCP to interact with Google Drive, Slack, or Notion for task automation and data retrieval.


Limitations and Challenges

  • Security Concerns: Early 2025 analyses highlighted risks like prompt injection, tool permission vulnerabilities, and potential for lookalike tools to replace trusted ones.
  • Adoption Curve: While gaining traction, MCP requires developer familiarity and ecosystem maturity for widespread use.
  • Complexity: Running and securing many small servers, clients, and transports adds operational overhead compared with a single bespoke integration.

MCP Architecture Diagram (Textual Representation)

+-------------------------+
|       MCP Host          |
|  (e.g., Claude Desktop, |
|   IDE, Chat App)        |
|  +-------------------+  |
|  |   MCP Client      |  |
|  | (Handles Protocol)|  |
|  +-------------------+  |
|       | JSON-RPC 2.0    |
|       | (STDIO or HTTP+SSE) |
+-------------------------+
           |
           | Connects to Multiple Servers
           |
 +-------------------+   +-------------------+   +-------------------+
 |   MCP Server 1    |   |   MCP Server 2    |   |   MCP Server 3    |
 | (e.g., GitHub)    |   | (e.g., PostgreSQL)|   | (e.g., Slack)     |
 | - Tools: API Calls|   | - Tools: Queries  |   | - Tools: Messages |
 | - Resources: Repos|   | - Resources: Data |   | - Resources: Chats|
 | - Prompts: Templ. |   | - Prompts: SQL    |   | - Prompts: Msg    |
 +-------------------+   +-------------------+   +-------------------+
           |                    |                    |
 +-------------------+   +-------------------+   +-------------------+
 | External System   |   | External System   |   | External System   |
 | (GitHub API)      |   | (Database)        |   | (Slack API)       |
 +-------------------+   +-------------------+   +-------------------+

Explanation:

  • The MCP Host (e.g., an IDE or chatbot) contains the MCP Client, which communicates with multiple MCP Servers using JSON-RPC 2.0 over STDIO or HTTP+SSE.
  • Each MCP Server connects to a specific External System (e.g., GitHub, PostgreSQL, Slack) and exposes Tools (actions), Resources (data), and Prompts (templates).
  • The client discovers server capabilities, invokes tools, or retrieves resources to provide context to the LLM, which generates user responses.


Integration with Vector Databases in AI Workflows

Here is how MCP integrates with vector databases in an AI workflow:


  • Embedding Storage: A vector database (e.g., Pinecone, Weaviate) can be exposed as an MCP server, providing access to vector embeddings as Resources. The server defines APIs for querying embeddings or performing similarity searches.
  • Retrieval-Augmented Generation (RAG): The MCP client retrieves relevant embeddings from the vector database server to enrich the LLM’s context, enabling more accurate responses.
  • Tool Invocation: The LLM can use Tools exposed by the vector database server (e.g., nearest-neighbor search) to dynamically fetch similar documents or data points.
  • Scalability: MCP’s modular architecture allows the vector database server to be swapped or scaled independently, supporting large-scale AI applications.


Example:

  • A user asks, “Find documents similar to this query.” The MCP client converts the query into a vector embedding, sends it to an MCP server connected to a Pinecone database, retrieves the top-k similar embeddings, and passes the results to the LLM for response generation.
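
As a rough illustration of the example above, here is a sketch of an MCP server exposing a similarity-search tool. A tiny in-memory index and a placeholder embed() function stand in for a real vector database (e.g., Pinecone or Weaviate) and a real embedding model; both are assumptions, as is the FastMCP interface used.

# Sketch of a vector-search MCP server; the index and embeddings are toys.
import numpy as np
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("vector-search")

rng = np.random.default_rng(0)
DOCS = {f"doc-{i}": rng.normal(size=384) for i in range(100)}  # toy "vector DB"

def embed(text: str) -> np.ndarray:
    """Placeholder: a real server would call an embedding model here."""
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=384)

@mcp.tool()
def similarity_search(query: str, top_k: int = 5) -> list[str]:
    """Return the ids of the top_k documents most similar to the query."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = {doc_id: float(vec @ q / np.linalg.norm(vec)) for doc_id, vec in DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

if __name__ == "__main__":
    mcp.run()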


Sources

  • Anthropic’s official MCP announcement and documentation.
  • The New Stack: Model Context Protocol primer for developers.
  • Boomi: How to use MCP the right way.
  • Microsoft .NET Blog: Building an MCP server in C#.
  • Wikipedia: Model Context Protocol overview.
  • Medium: Comprehensive guides and tutorials on MCP.
  • Composio: MCP architecture and use cases.
  • Posts on X discussing MCP’s role in AI integration.


MCP is a transformative protocol that standardizes AI integration with external systems, making it easier to build context-aware, scalable, and secure AI applications. Its client-server architecture, support for multiple transports, and focus on primitives like tools, resources, and prompts make it a versatile framework for modern AI workflows, including those leveraging vector databases.

LangChain

  • LangChain and Model Context Protocol (MCP) are two distinct technologies that serve different purposes in the realm of AI and data integration. Here's a comparison to help understand their differences and use cases.


  • LangChain is a framework designed to facilitate the creation of applications that integrate with large language models (LLMs). It provides tools and libraries to build, manage, and deploy AI-powered applications. LangChain focuses on creating chains of operations that can process and transform data using LLMs. It supports various integrations and allows developers to build complex workflows involving multiple AI models and data sources. In short, LangChain orchestrates application logic around LLMs, while MCP standardizes how those applications connect to external tools and data; the two are complementary, and a LangChain agent can consume tools exposed by MCP servers. A minimal chain is sketched below.
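
As a minimal sketch of the "chains of operations" idea, the snippet below pipes a prompt template into a chat model and an output parser using LangChain's expression (pipe) syntax. Import paths and the pipe syntax assume a recent LangChain release and may differ across versions; the model name is a placeholder.

# A single LangChain chain: prompt -> LLM -> string parser (LCEL pipe syntax).
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Summarize the following support ticket in one sentence:\n\n{ticket}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # placeholder model name
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"ticket": "Login fails with a 500 error after the latest deploy."}))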


Reinforcement Learning from Human Feedback (RLHF)


Illustrating Reinforcement Learning from Human Feedback (RLHF)


Language models have shown impressive capabilities in the past few years by generating diverse and compelling text from human input prompts. However, what makes a "good" text is inherently hard to define as it is subjective and context dependent. There are many applications such as writing stories where you want creativity, pieces of informative text which should be truthful, or code snippets that we want to be executable.

Writing a loss function to capture these attributes seems intractable, and most language models are still trained with a simple next-token prediction loss (e.g. cross entropy). To compensate for the shortcomings of the loss itself, people define metrics that are designed to better capture human preferences, such as BLEU or ROUGE. While better suited than the loss function itself at measuring performance, these metrics simply compare generated text to references with simple rules and are thus also limited. Wouldn't it be great if we used human feedback on generated text as a measure of performance, or went one step further and used that feedback as a loss to optimize the model? That's the idea of Reinforcement Learning from Human Feedback (RLHF): use methods from reinforcement learning to directly optimize a language model with human feedback. RLHF has enabled language models to begin aligning a model trained on a general corpus of text data with complex human values.

RLHF's most recent success was its use in ChatGPT. Given ChatGPT's impressive abilities, we asked it to explain RLHF for us.

It does surprisingly well, but doesn't quite cover everything. We'll fill in those gaps!

RLHF: Let’s take it step by step

Reinforcement learning from Human Feedback (also referenced as RL from human preferences) is a challenging concept because it involves a multiple-model training process and different stages of deployment. In this blog post, we’ll break down the training process into three core steps:

  1. Pretraining a language model (LM),
  2. gathering data and training a reward model, and
  3. fine-tuning the LM with reinforcement learning.

To start, we'll look at how language models are pretrained.

Pretraining language models

As a starting point, RLHF uses a language model that has already been pretrained with the classical pretraining objectives (see this blog post for more details). OpenAI used a smaller version of GPT-3 for its first popular RLHF model, InstructGPT. In their shared papers, Anthropic used transformer models from 10 million to 52 billion parameters trained for this task. DeepMind has documented using up to their 280 billion parameter model Gopher. It is likely that all these companies use much larger models in their RLHF-powered products.

This initial model can also be fine-tuned on additional text or conditions, but does not necessarily need to be. For example, OpenAI fine-tuned on human-generated text that was “preferable” and Anthropic generated their initial LM for RLHF by distilling an original LM on context clues for their “helpful, honest, and harmless” criteria. These are both sources of what we refer to as expensive, augmented data, but it is not a required technique to understand RLHF. Core to starting the RLHF process is having a model that responds well to diverse instructions.

In general, there is not a clear answer on “which model” is the best for the starting point of RLHF. This will be a common theme in this blog – the design space of options in RLHF training are not thoroughly explored.

Next, with a language model, one needs to generate data to train a reward model, which is how human preferences are integrated into the system.

Reward model training

Generating a reward model (RM, also referred to as a preference model) calibrated with human preferences is where the relatively new research in RLHF begins. The underlying goal is to get a model or system that takes in a sequence of text, and returns a scalar reward which should numerically represent the human preference. The system can be an end-to-end LM, or a modular system outputting a reward (e.g. a model ranks outputs, and the ranking is converted to reward). The output being a scalar reward is crucial for existing RL algorithms being integrated seamlessly later in the RLHF process.

These LMs for reward modeling can be both another fine-tuned LM or a LM trained from scratch on the preference data. For example, Anthropic has used a specialized method of fine-tuning to initialize these models after pretraining (preference model pretraining, PMP) because they found it to be more sample efficient than fine-tuning, but no one base model is considered the clear best choice for reward models.

The training dataset of prompt-generation pairs for the RM is generated by sampling a set of prompts from a predefined dataset (Anthropic’s data generated primarily with a chat tool on Amazon Mechanical Turk is available on the Hub, and OpenAI used prompts submitted by users to the GPT API). The prompts are passed through the initial language model to generate new text.

Human annotators are used to rank the generated text outputs from the LM. One may initially think that humans should apply a scalar score directly to each piece of text in order to generate a reward model, but this is difficult to do in practice. The differing values of humans cause these scores to be uncalibrated and noisy. Instead, rankings are used to compare the outputs of multiple models and create a much better regularized dataset.

There are multiple methods for ranking the text. One method that has been successful is to have users compare generated text from two language models conditioned on the same prompt. By comparing model outputs in head-to-head matchups, an Elo system can be used to generate a ranking of the models and outputs relative to each other. These different methods of ranking are normalized into a scalar reward signal for training.
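
To make the Elo idea concrete, here is a small sketch of the standard Elo update applied to a single head-to-head comparison between two model outputs; the K-factor and starting ratings are arbitrary illustrative choices rather than values from any of the papers discussed here.

# Standard Elo update after one comparison: the winner's rating rises and the
# loser's falls, by an amount that depends on the expected outcome.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Output A (rated 1000) beats output B (rated 1000): ratings become 1016 / 984.
print(elo_update(1000.0, 1000.0, a_wins=True))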

An interesting artifact of this process is that the successful RLHF systems to date have used reward language models with varying sizes relative to the text generation (e.g. OpenAI 175B LM, 6B reward model, Anthropic used LM and reward models from 10B to 52B, DeepMind uses 70B Chinchilla models for both LM and reward). An intuition would be that these preference models need to have similar capacity to understand the text given to them as a model would need in order to generate said text.

At this point in the RLHF system, we have an initial language model that can be used to generate text and a preference model that takes in any text and assigns it a score of how well humans perceive it. Next, we use reinforcement learning (RL) to optimize the original language model with respect to the reward model.
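
As a concrete illustration of how rankings become a training signal, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry-style) loss commonly used for reward models: for each prompt, the reward of the human-preferred completion is pushed above the reward of the rejected one. The toy tensors stand in for scalar outputs of a reward model; none of this is specific to any one paper's implementation.

# Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected), averaged.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 4 (chosen, rejected) scalar rewards.
r_chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
r_rejected = torch.tensor([0.1, 0.5, -0.4, 1.0])
print(pairwise_reward_loss(r_chosen, r_rejected))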

Fine-tuning with RL

Training a language model with reinforcement learning was, for a long time, something that people would have thought as impossible both for engineering and algorithmic reasons. What multiple organizations seem to have gotten to work is fine-tuning some or all of the parameters of a copy of the initial LM with a policy-gradient RL algorithm, Proximal Policy Optimization (PPO). Some parameters of the LM are frozen because fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive (for more, see Low-Rank Adaptation (LoRA) for LMs or the Sparrow LM from DeepMind) -- depending on the scale of the model and infrastructure being used. The exact dynamics of how many parameters to freeze, or not, is considered an open research problem. PPO has been around for a relatively long time – there are tons of guides on how it works. The relative maturity of this method made it a favorable choice for scaling up to the new application of distributed training for RLHF. It turns out that many of the core RL advancements to do RLHF have been figuring out how to update such a large model with a familiar algorithm (more on that later).

Let's first formulate this fine-tuning task as a RL problem. First, the policy is a language model that takes in a prompt and returns a sequence of text (or just probability distributions over text). The action space of this policy is all the tokens corresponding to the vocabulary of the language model (often on the order of 50k tokens) and the observation space is the distribution of possible input token sequences, which is also quite large given previous uses of RL (the dimension is approximately the size of vocabulary ^ length of the input token sequence). The reward function is a combination of the preference model and a constraint on policy shift.

The reward function is where the system combines all of the models we have discussed into one RLHF process. Given a prompt, x, from the dataset, the text y is generated by the current iteration of the fine-tuned policy. Concatenated with the original prompt, that text is passed to the preference model, which returns a scalar notion of “preferability”, r_θ. In addition, per-token probability distributions from the RL policy are compared to the ones from the initial model to compute a penalty on the difference between them. In multiple papers from OpenAI, Anthropic, and DeepMind, this penalty has been designed as a scaled version of the Kullback–Leibler (KL) divergence between these sequences of distributions over tokens, r_KL. The KL divergence term penalizes the RL policy from moving substantially away from the initial pretrained model with each training batch, which can be useful to make sure the model outputs reasonably coherent text snippets. Without this penalty the optimization can start to generate text that is gibberish but fools the reward model to give a high reward. In practice, the KL divergence is approximated via sampling from both distributions (explained by John Schulman here). The final reward sent to the RL update rule is r = r_θ − λ·r_KL.
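
A minimal sketch of that combined reward, r = r_θ − λ·r_KL, is shown below; the per-token log-probabilities are placeholders for two forward passes (RL policy and frozen initial model) over the same generated tokens, and the simple log-prob difference is one common sampling-based approximation of the KL term.

# Combine the preference score with a KL penalty on policy shift.
import torch

def rlhf_reward(preference_score: torch.Tensor,
                policy_logprobs: torch.Tensor,
                ref_logprobs: torch.Tensor,
                kl_coef: float = 0.1) -> torch.Tensor:
    """preference_score: scalar r_theta; *_logprobs: per-token log-probs, shape [T]."""
    kl_per_token = policy_logprobs - ref_logprobs   # sample-based KL estimate
    return preference_score - kl_coef * kl_per_token.sum()

# Toy 5-token generation.
r_theta = torch.tensor(2.3)
policy_lp = torch.tensor([-1.0, -0.8, -1.2, -0.5, -0.9])
ref_lp = torch.tensor([-1.1, -1.0, -1.0, -0.7, -0.9])
print(rlhf_reward(r_theta, policy_lp, ref_lp))   # -> tensor(2.2700)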

Some RLHF systems have added additional terms to the reward function. For example, OpenAI experimented successfully on InstructGPT by mixing in additional pre-training gradients (from the human annotation set) into the update rule for PPO. It is likely as RLHF is further investigated, the formulation of this reward function will continue to evolve.

Finally, the update rule is the parameter update from PPO that maximizes the reward metrics in the current batch of data (PPO is on-policy, which means the parameters are only updated with the current batch of prompt-generation pairs). PPO is a trust region optimization algorithm that uses constraints on the gradient to ensure the update step does not destabilize the learning process. DeepMind used a similar reward setup for Gopher but used synchronous advantage actor-critic (A2C) to optimize the gradients, which is notably different but has not been reproduced externally.

Technical detail note: The above diagram makes it look like both models generate different responses for the same prompt, but what really happens is that the RL policy generates text, and that text is fed into the initial model to produce its relative probabilities for the KL penalty. This initial model is untouched by gradient updates during training.

Optionally, RLHF can continue from this point by iteratively updating the reward model and the policy together. As the RL policy updates, users can continue ranking these outputs versus the model's earlier versions. Most papers have yet to discuss implementing this operation, as the deployment mode needed to collect this type of data only works for dialogue agents with access to an engaged user base. Anthropic discusses this option as Iterated Online RLHF (see the original paper), where iterations of the policy are included in the ELO ranking system across models. This introduces complex dynamics of the policy and reward model evolving, which represents a complex and open research question.

Open-source tools for RLHF

The first code released to perform RLHF on LMs was from OpenAI in TensorFlow in 2019.

Today, there are already a few active repositories for RLHF in PyTorch that grew out of this. The primary repositories are Transformers Reinforcement Learning (TRL), TRLX which originated as a fork of TRL, and Reinforcement Learning for Language models (RL4LMs).

TRL is designed to fine-tune pretrained LMs in the Hugging Face ecosystem with PPO. TRLX is an expanded fork of TRL built by CarperAI to handle larger models for online and offline training. At the moment, TRLX has an API capable of production-ready RLHF with PPO and Implicit Language Q-Learning (ILQL) at the scales required for LLM deployment (e.g. 33 billion parameters). Future versions of TRLX will allow for language models up to 200B parameters. As such, interfacing with TRLX is optimized for machine learning engineers with experience at this scale.
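
For a feel of what these libraries look like in practice, here is a heavily abridged single PPO step in the style of TRL's quickstart. Exact class names, arguments, and defaults vary across TRL versions, so treat the specific calls as assumptions; the constant reward stands in for a score from a trained reward model.

# One (toy) PPO step with TRL; signatures follow the TRL quickstart pattern.
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=1, mini_batch_size=1)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query_tensor = tokenizer.encode("Explain RLHF in one sentence:", return_tensors="pt")
response_tensors = ppo_trainer.generate(
    [q for q in query_tensor], return_prompt=False,
    max_new_tokens=32, pad_token_id=tokenizer.eos_token_id,
)

# In a real pipeline this scalar comes from the trained reward model.
rewards = [torch.tensor(1.0)]
stats = ppo_trainer.step([q for q in query_tensor], response_tensors, rewards)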

RL4LMs offers building blocks for fine-tuning and evaluating LLMs with a wide variety of RL algorithms (PPO, NLPO, A2C and TRPO), reward functions and metrics. Moreover, the library is easily customizable, which allows training of any encoder-decoder or encoder transformer-based LM on any arbitrary user-specified reward function. Notably, it is well-tested and benchmarked on a broad range of tasks in recent work amounting up to 2000 experiments highlighting several practical insights on data budget comparison (expert demonstrations vs. reward modeling), handling reward hacking and training instabilities, etc. RL4LMs current plans include distributed training of larger models and new RL algorithms.

Both TRLX and RL4LMs are under heavy further development, so expect more features beyond these soon.

There is a large dataset created by Anthropic available on the Hub.

What’s next for RLHF?

While these techniques are extremely promising and impactful and have caught the attention of the biggest research labs in AI, there are still clear limitations. The models, while better, can still output harmful or factually inaccurate text without any uncertainty. This imperfection represents a long-term challenge and motivation for RLHF – operating in an inherently human problem domain means there will never be a clear final line to cross for the model to be labeled as complete.

When deploying a system using RLHF, gathering the human preference data is quite expensive due to the direct integration of other human workers outside the training loop. RLHF performance is only as good as the quality of its human annotations, which takes on two varieties: human-generated text, such as fine-tuning the initial LM in InstructGPT, and labels of human preferences between model outputs.

Generating well-written human text answering specific prompts is very costly, as it often requires hiring part-time staff (rather than being able to rely on product users or crowdsourcing). Thankfully, the scale of data used in training the reward model for most applications of RLHF (~50k labeled preference samples) is not as expensive. However, it is still a higher cost than academic labs would likely be able to afford. Currently, there only exists one large-scale dataset for RLHF on a general language model (from Anthropic) and a couple of smaller-scale task-specific datasets (such as summarization data from OpenAI). The second challenge of data for RLHF is that human annotators can often disagree, adding a substantial potential variance to the training data without ground truth.

With these limitations, huge swaths of unexplored design options could still enable RLHF to take substantial strides. Many of these fall within the domain of improving the RL optimizer. PPO is a relatively old algorithm, but there are no structural reasons that other algorithms could not offer benefits and permutations on the existing RLHF workflow. One large cost of the feedback portion of fine-tuning the LM policy is that every generated piece of text from the policy needs to be evaluated on the reward model (as it acts like part of the environment in the standard RL framework). To avoid these costly forward passes of a large model, offline RL could be used as a policy optimizer. Recently, new algorithms have emerged, such as implicit language Q-learning (ILQL) [Talk on ILQL at CarperAI], that fit particularly well with this type of optimization. Other core trade-offs in the RL process, like exploration-exploitation balance, have also not been documented. Exploring these directions would at least develop a substantial understanding of how RLHF functions and, if not, provide improved performance.

We hosted a lecture on Tuesday 13 December 2022 that expanded on this post; you can watch it here!

Further reading

Here is a list of the most prevalent papers on RLHF to date. The field was recently popularized with the emergence of DeepRL (around 2017) and has grown into a broader study of the applications of LLMs from many large technology companies. Here are some papers on RLHF that pre-date the LM focus:

  • TAMER: Training an Agent Manually via Evaluative Reinforcement (Knox and Stone 2008): Proposed a learned agent where humans provided scores on the actions taken iteratively to learn a reward model.
  • Interactive Learning from Policy-Dependent Human Feedback (MacGlashan et al. 2017): Proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function.
  • Deep Reinforcement Learning from Human Preferences (Christiano et al. 2017): RLHF applied on preferences between Atari trajectories.
  • Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces (Warnell et al. 2018): Extends the TAMER framework where a deep neural network is used to model the reward prediction.
  • A Survey of Preference-based Reinforcement Learning Methods (Wirth et al. 2017): Summarizes efforts above with many, many more references.

And here is a snapshot of the growing set of "key" papers that show RLHF's performance for LMs:

  • Fine-Tuning Language Models from Human Preferences (Ziegler et al. 2019): An early paper that studies the impact of reward learning on four specific tasks.
  • Learning to summarize with human feedback (Stiennon et al., 2020): RLHF applied to the task of summarizing text. Also, Recursively Summarizing Books with Human Feedback (OpenAI Alignment Team 2021), follow on work summarizing books.
  • WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 2021): Using RLHF to train an agent to navigate the web.
  • InstructGPT: Training language models to follow instructions with human feedback (OpenAI Alignment Team 2022): RLHF applied to a general language model [Blog post on InstructGPT].
  • GopherCite: Teaching language models to support answers with verified quotes (Menick et al. 2022): Train a LM with RLHF to return answers with specific citations.
  • Sparrow: Improving alignment of dialogue agents via targeted human judgements (Glaese et al. 2022): Fine-tuning a dialogue agent with RLHF
  • ChatGPT: Optimizing Language Models for Dialogue (OpenAI 2022): Training a LM with RLHF for suitable use as an all-purpose chat bot.
  • Scaling Laws for Reward Model Overoptimization (Gao et al. 2022): studies the scaling properties of the learned preference model in RLHF.
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic, 2022): A detailed documentation of training a LM assistant with RLHF.
  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al. 2022): A detailed documentation of efforts to “discover, measure, and attempt to reduce [language models] potentially harmful outputs.”
  • Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning (Cohen et al. 2022): Using RL to enhance the conversational skill of an open-ended dialogue agent.
  • Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization (Ramamurthy and Ammanabrolu et al. 2022): Discusses the design space of open-source tools in RLHF and proposes a new algorithm NLPO (Natural Language Policy Optimization) as an alternative to PPO.
  • Llama 2 (Touvron et al. 2023): Impactful open-access model with substantial RLHF details.

The field is the convergence of multiple fields, so you can also find resources in other areas:

  • Continual learning of instructions (Kojima et al. 2021, Suhr and Artzi 2022) or bandit learning from user feedback (Sokolov et al. 2016, Gao et al. 2022)
  • Earlier history on using other RL algorithms for text generation (not all with human preferences), such as with recurrent neural networks (Ranzato et al. 2015), an actor-critic algorithm for text prediction (Bahdanau et al. 2016), or an early work adding human preferences to this framework (Nguyen et al. 2017).

Citation: If you found this useful for your academic work, please consider citing our work, in text:

Lambert, et al., "Illustrating Reinforcement Learning from Human Feedback (RLHF)", Hugging Face Blog, 2022.

BibTeX citation:

@article{lambert2022illustrating,
 author = {Lambert, Nathan and Castricato, Louis and von Werra, Leandro and Havrilla, Alex},
 title = {Illustrating Reinforcement Learning from Human Feedback (RLHF)},
 journal = {Hugging Face Blog},
 year = {2022},
 note = {https://huggingface.co/blog/rlhf},
}

Thanks to Robert Kirk for fixing some factual errors regarding specific implementations of RLHF. Thanks to Stas Bekman for fixing some typos and confusing phrases. Thanks to Peter Stone, Khanh X. Nguyen and Yoav Artzi for helping expand the related works further into history. Thanks to Igor Kotenkov for pointing out a technical error in the KL-penalty term of the RLHF procedure, its diagram, and textual description.
