
LLM layer for a Rails application

19 min read · #ruby #architecture #ai

Like it or not, a lot of applications are adding AI-native features: anything related to automated answers, object classification, knowledge base search, or text summarization can already be handed off to an LLM with pretty good results. If you are doing this as a Rails engineer, this post should be useful.

In this post I will describe my approach to LLM integration for Rails applications. We will discuss some common problems, explore related gems, build our own architecture layer for LLM integration, cover it with specs, and discuss ways to prepare the context.

Why we need a layer

Integrating an LLM into a Rails app at the early stages usually does not differ much from connecting any other API: we make a call with some parameters and get the response back, which is then used in the business layer. Of course, we should not forget to handle errors, move the interaction itself to the background, and so on. Nothing unusual so far.

Soon it turns out that things are not that simple: even though it’s nominally the same call, the parameters differ a lot from case to case, and preparing them requires separate work. One of the most important parameters is the prompt: we need to explain to the LLM what we actually want from it. For simple, short prompts, string interpolation is enough, but you’ll quickly outgrow it.

Error handling also becomes complicated and verbose: on top of network and server errors, an incorrect response (from the business logic standpoint) can come back, so we need to add validations.

At this point, an experienced engineer starts looking for libraries that would take at least some of these routine tasks off their hands. After some time working with the raw OpenAI adapter, I ended up with the following list of goals:

  • make it easier to support and replace models/providers;
  • separate the LLM interaction code from the business logic;
  • get rid of all the boilerplate (storing schemas/instructions, preparing templates);
  • have centralized logging in place, since you often want to inspect the “raw” response from the model when behavior is unexpected.

Library choice

Finding something isn’t hard: ruby_llm, activeagent, and a number of smaller solutions offer different levels of abstraction. In this post I will tell you which option I ended up with: ruby_llm as a transport plus my own layer on top.

While I was working on the post, I found ruby_llm-agents, which is pretty similar to what I came up with. How could I miss it? It was released in January, and I was working on this in October.

Moreover, I discovered that ruby_llm has shipped a similar DSL too. Fortunately my approach is a bit different, otherwise you would not be reading this post!

A quick tour of ruby_llm

ruby_llm is a library for working with different LLM providers (OpenAI, Anthropic, Google, and others). The interface for each model is similar, and some common tasks (e.g., error handling) are implemented right inside the library.

The main abstraction is a chat, which represents a single conversation with the LLM (and can include more than one message from us). The chat is configured through a chain of calls. For instance, with_instructions sets the system prompt, and with_schema enables structured output: the model is required to return JSON strictly following the specified JSON Schema.

class TicketSchema < RubyLLM::Schema
  string :category
  string :summary
end

chat = RubyLLM.chat(model: "gpt-4o")
chat.with_schema(TicketSchema)
chat.with_instructions("Classify the ticket.")

response = chat.ask(ticket.text)
response.content # => {"category" => "billing", "summary" => "..."}

Note that response.content is already parsed according to the schema!

Yes, this also means JSON Schema validation comes for free—one less thing to write yourself.

The next useful feature is persistence. We can save all our chats to the database. To do that, we generate the tables and models using ruby_llm’s generator and slightly adjust our code: instead of RubyLLM.chat we use Chat.create!, and everything just works.

Chat.create!(model: "gpt-4o")
    .with_instructions(instructions)
    .with_schema(output_schema)
    .ask(prompt)

If you decide to use persistence, think about two things:

  1. Chat and Message often carry some kind of business context, so you might want to rename the models and/or move them to a namespace—better do it right away;
  2. these tables are going to be big. Really. Think about partitioning or storing them somewhere outside the main DB.

Don’t say I didn’t warn you when the messages table hits 10M rows.

Designing the base class

Each LLM call should be wrapped in a separate class that inherits from a base class (let’s call it BaseLLMRequest). All the boilerplate and instrumentation lives in the base class, while subclasses only configure the request parameters. The base class can be implemented like this:

require "erb"

class BaseLLMRequest
  # Assuming dry-initializer (for `option`) and dry-monads (for Success/Failure)
  extend Dry::Initializer
  include Dry::Monads[:result]

  # Lets callers write SomeLLMRequest.call(ticket:)
  def self.call(...) = new(...).call

  def call
    chat = build_chat
    # Runner: a separate class so we can isolate the transport layer.
    # It returns a Result, so the response is unwrapped only on success.
    Runner.new(chat:).run_with(prompt).bind do |response|
      transform_response(response.content)
    end
  end

  private

  def build_chat
    chat = Chat.create!(model:)
    chat = chat.with_instructions(instructions) if instructions
    chat = chat.with_schema(output_schema) if output_schema
    # You can add any chat settings here; some steps may be optional.
    chat
  end

  # Configuration: overridden in subclasses
  def model = nil
  def prompt = ""
  def instructions = nil
  def output_schema = nil

  # here we will transform the raw response into something convenient
  # for the business layer
  def transform_response(raw) = Success(raw)

  # ERB helper: covered below
  def eval_erb_template(path, variables)
    template = File.read(path)
    ERB.new(template, trim_mode: "-").result_with_hash(variables)
  end
end

Now let’s look at a subclass:

class TicketClassificationLLMRequest < BaseLLMRequest
  # example list; use the categories of your own domain
  VALID_CATEGORIES = %w[billing support].freeze

  option :ticket

  def model = "gpt-4o-mini"

  # you can try to be clever and make the default implementation
  # use these paths and names
  def template_path(name) = File.expand_path("ticket_classification_llm_request/#{name}", __dir__)
  def instructions_path = template_path("instructions.text.erb")
  def output_schema_path = template_path("output_schema.json")
  def prompt_path = template_path("prompt.text.erb")

  # we will talk about ERB below
  def instructions = eval_erb_template(instructions_path, {})
  def output_schema = JSON.parse(File.read(output_schema_path))
  def prompt = eval_erb_template(prompt_path, { body: ticket.body })

  # check that the category is valid and add more data to the response
  def transform_response(raw)
    category = raw["category"]
    return Failure() unless VALID_CATEGORIES.include?(category)

    Success(category:, summary: raw["summary"])
  end
end

Why plain methods instead of a DSL (like in the gem mentioned above)? You can always migrate later, and this approach gives more flexibility. For instance, if you want to run an A/B test on the model, it is enough to add this logic to the implementation:

def model
  if user.ab_tests["ab_ticket_classification_model"] == "segment_gpt_5"
    "gpt-5"
  else
    "gpt-4o"
  end
end

Prompts as ERB templates

I keep prompt and schema parts in files, or inline as strings when that’s simpler. The file layout typically follows this pattern:

llm_requests/
  ticket_classification_llm_request.rb
  ticket_classification_llm_request/
    instructions.text.erb
    output_schema.json
    prompt.text.erb
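With a layout like this, the default paths can be derived from the class name, which is the "clever" option hinted at in the subclass comment. A minimal sketch, assuming a hypothetical ConventionalPaths mixin (not part of ruby_llm) and a root directory passed in explicitly; in a Rails app, name.underscore would replace the hand-rolled conversion:

```ruby
# Hypothetical mixin: derives template paths from the class name, following
# the llm_requests/<underscored_class_name>/ layout.
module ConventionalPaths
  # Plain-Ruby stand-in for ActiveSupport's String#underscore.
  def self.underscore(name)
    name.gsub(/([A-Z\d]+)([A-Z][a-z])/, '\1_\2')
        .gsub(/([a-z\d])([A-Z])/, '\1_\2')
        .downcase
  end

  def template_dir(root)
    File.join(root, ConventionalPaths.underscore(self.class.name))
  end

  def instructions_path(root) = File.join(template_dir(root), "instructions.text.erb")
  def output_schema_path(root) = File.join(template_dir(root), "output_schema.json")
  def prompt_path(root) = File.join(template_dir(root), "prompt.text.erb")
end

class TicketClassificationLLMRequest
  include ConventionalPaths
end

request = TicketClassificationLLMRequest.new
puts request.prompt_path("app/llm_requests")
# prints app/llm_requests/ticket_classification_llm_request/prompt.text.erb
```

Subclasses can still override any of these methods when a request does not follow the convention.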

ERB works just like in views: eval_erb_template substitutes variables into the template.
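To make that concrete, here is a small sketch of a template and its rendering. The template text is hypothetical (the real one is not shown in this post) and so is the customer_name variable. result_with_hash exposes each hash key as a local variable, and trim_mode: "-" lets control lines written with <%- and -%> disappear from the output:

```ruby
require "erb"

# Hypothetical contents of prompt.text.erb, inlined for the example.
template = <<~TEMPLATE
  Classify the following support ticket.
  <%- if customer_name -%>
  Customer: <%= customer_name %>
  <%- end -%>
  Ticket:
  <%= body %>
TEMPLATE

def render(template, variables)
  ERB.new(template, trim_mode: "-").result_with_hash(variables)
end

with_name = render(template, body: "I was charged twice", customer_name: "Ann")
without_name = render(template, body: "I was charged twice", customer_name: nil)
puts with_name
puts without_name

# A variable that is not passed at all fails loudly instead of rendering blank.
begin
  render(template, body: "I was charged twice")
rescue NameError => e
  puts "template error: #{e.message}"
end
```

The NameError on a missing variable is handy: a renamed or forgotten variable shows up as a failing spec rather than as a silently broken prompt.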

It is worth highlighting the difference between prompt and instructions. The prompt is sent as a user-role message, while instructions are sent as the system prompt (the developer role, in OpenAI’s newer terminology). Providers generally give system or developer messages higher priority than user messages. In addition, instructions usually contain rules and the response format, while the prompt contains the specific data to process.

Handling invalid responses

Even with a JSON Schema, the LLM can return something that’s useless from a business standpoint: an unknown category, a summary that contradicts the body, a number outside the expected range. That’s what transform_response is for—validate against your domain rules and return Failure() when something looks wrong (we already did exactly this in TicketClassificationLLMRequest).

There are two patterns I reach for depending on the request:

  • fail fast—return Failure() and let the caller decide. Best for non-critical flows (enrichment, suggestions) where skipping is cheaper than retrying;
  • retry once with the validation error—feed the error back into the prompt ("the value 'nonsense' is not a valid category, pick one of ...") and re-run. Costs more, but salvages flaky responses.
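The second pattern can be sketched without any LLM library. In the sketch below, ask is a stand-in for the real request (any callable that takes a prompt and returns a parsed hash), the category list is an example, and a plain hash replaces Success/Failure to keep the example dependency-free. All of these names are assumptions made for illustration:

```ruby
VALID_CATEGORIES = %w[billing support].freeze

# `ask` is any callable that takes a prompt string and returns a Hash;
# in the real layer it would wrap the LLM request.
def classify_with_retry(prompt, ask:, max_attempts: 2)
  current_prompt = prompt
  max_attempts.times do
    raw = ask.call(current_prompt)
    category = raw["category"]
    return { ok: true, category: category } if VALID_CATEGORIES.include?(category)

    # Feed the validation error back so the model can correct itself.
    current_prompt = <<~TEXT
      #{prompt}

      Your previous answer was rejected: the value '#{category}' is not a valid category.
      Pick one of: #{VALID_CATEGORIES.join(", ")}.
    TEXT
  end
  { ok: false }
end

# Stubbed "model": wrong on the first call, right on the second.
answers = [{ "category" => "nonsense" }, { "category" => "billing" }]
prompts_seen = []
fake_ask = lambda do |prompt|
  prompts_seen << prompt
  answers.shift
end

result = classify_with_retry("I was charged twice", ask: fake_ask)
p result
```

Capping max_attempts at 2 keeps the cost bounded: one extra call at most, after which the caller gets a failure and decides what to do.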

Runner

Almost right away I decided to extract a separate class for instrumenting request execution. It’s called Runner and looks something like this:

require "benchmark"

class Runner
  include Dry::Monads[:result] # Success()/Failure(), assuming dry-monads

  def initialize(chat:)
    @chat = chat
  end

  def run_with(message)
    @response = nil
    response_time = Benchmark.realtime do
      @response = perform_ask(message)
    end
    save_response_time(response_time)
    @response
  end

  private

  def save_response_time(response_time)
    # Since Chat is just a model, we can add the columns we need and fill them in
    @chat.messages.last.update!(response_time:)
  end

  def perform_ask(message)
    Success(@chat.ask(message))
  rescue RubyLLM::ServerError, RubyLLM::ServiceUnavailableError, RubyLLM::OverloadedError => e
    NewRelic::Agent.record_custom_event("LLM_server_error", kind: e.class.name)
    Failure()
  rescue RubyLLM::Error => e
    # @response is still nil when ask raises, so the raw body is taken from the
    # error itself (assumes RubyLLM::Error#response holds the HTTP response)
    ErrorTracker.capture_exception(e, extra: { raw_response: e.response&.body })
    Failure()
  end
end

Not all errors should be sent to the error tracker. ServerError, OverloadedError, and ServiceUnavailableError are temporary problems on the provider side; it is better to send them to monitoring and set up an alert.

The same trick—adding columns to Message and filling them in the Runner—works for input_tokens, output_tokens, and the computed cost. Combined with response_time and the error events, this gives you the raw material for a few useful dashboards: schema-mismatch rate per request class, p95 response time, token spend per feature, and the share of failures that are provider outages versus genuinely bad output.
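The computed cost is simple arithmetic over the token counts. A minimal sketch, assuming a hypothetical price table (the model name and the numbers are made up; look up real prices for your provider) and Rational arithmetic to avoid float rounding:

```ruby
# Hypothetical per-million-token prices in USD; not real provider prices.
PRICES_PER_MILLION = {
  "example-model" => { input: "0.50".to_r, output: "1.50".to_r }
}.freeze

# Returns the cost of one call in USD as a Rational.
def llm_cost(model:, input_tokens:, output_tokens:)
  price = PRICES_PER_MILLION.fetch(model)
  (input_tokens * price[:input] + output_tokens * price[:output]) / 1_000_000
end

cost = llm_cost(model: "example-model", input_tokens: 1_200, output_tokens: 300)
puts format("%.5f", cost) # prints 0.00105
```

Using fetch means an unknown model raises a KeyError instead of silently recording a zero cost.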

Running in the background

LLM calls are slow—a few hundred milliseconds at best, several seconds at worst—so you don’t want them in the request/response cycle. Since *.call is just a regular method, wrapping a request in a background job is trivial:

class ClassifyTicketJob < ApplicationJob
  def perform(ticket)
    TicketClassificationLLMRequest.call(ticket:)
  end
end

Each call creates its own persisted chat, so retrying a failed job is safe—and you can still inspect the original chat row to see what came back the first time.

Writing tests

Earlier we wrote LLM requests. Now it’s time to test them—but how? LLM responses are non-deterministic and don’t always match the schema; real requests cost money and take a lot of time. Because of this non-determinism we cannot rely on the usual approach with webmock, but we also do not want to make real requests. In my projects I settled on two layers: unit tests for the request logic, and prompt tests for the responses.

Level 1: unit tests for the LLM request

The goal is to test the code without making real requests to the model. As in any other integration, we mock the HTTP connection, plug in a prepared JSON response, and check two things:

  1. What payload was sent to the LLM—parameters, system prompt, user message, schema.
  2. How the response was processed—parsing, edge cases, errors.

For this it is convenient to have a helper module that mocks the connection and provides utilities to build the expected payload. It can be included in the relevant specs with config.include LLMRequestHelpers, type: :llm_request. The helper uses included do, so it needs extend ActiveSupport::Concern; otherwise you’ll get a NoMethodError on the first run:

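module LLMRequestHelpers
  extend ActiveSupport::Concern

  included do
    let(:ruby_llm_connection) { instance_double RubyLLM::Connection }
    let(:faraday_response) { instance_double Faraday::Response }
    # Wraps the spec's `assistant_response` in an OpenAI-style chat completion
    # body (assumed shape; adjust it to the provider you use)
    let(:llm_response_body) do
      {
        "choices" => [
          { "message" => { "role" => "assistant", "content" => assistant_response } }
        ],
        "usage" => { "prompt_tokens" => 10, "completion_tokens" => 5 }
      }
    end

    before do
      allow(RubyLLM::Connection).to receive(:new)
        .and_return(ruby_llm_connection)
      allow(ruby_llm_connection).to receive(:post)
        .and_return(faraday_response)
      allow(faraday_response).to receive(:body)
        .and_return(llm_response_body)
    end
  end
end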

The test itself looks something like this:

describe TicketClassificationLLMRequest, type: :llm_request do
  subject(:call) { described_class.call(ticket:) }

  let(:ticket) { create :ticket }
  let(:assistant_response) {
    { "category" => "billing", "summary" => "Payment issue" }.to_json
  }

  # happy path: parsing + correct payload
  specify do
    expect(call).to eq(Success(category: "billing", summary: "Payment issue"))

    expect(ruby_llm_connection).to have_received(:post) do |_, payload|
      expect(payload[:model]).to eq("gpt-4o-mini")
      expect(payload[:messages]).to include(
        hash_including(role: "user", content: ticket.body)
      )
    end
  end

  # invalid category
  context "when category is unknown" do
    let(:assistant_response) {
      { "category" => "nonsense", "summary" => "..." }.to_json
    }

    specify { expect(call).to eq(Failure()) }
  end
end

What does this test cover?

  1. Building the prompt—in case someone breaks the instructions template or passes the wrong variables.
  2. Parsing the response: the LLM does not always answer strictly according to the schema, and there can be extra keys or unexpected nesting.

These tests are fast, make no network requests, and run on every commit. It is also worth pointing out that the spec itself barely depends on the transport layer: we are essentially testing the parameters sent to the LLM API, so when replacing ruby_llm with something else only the helper that mocks the connection has to change.

Level 2: prompt tests

Prompt tests check that executing the request returns a more or less expected response. Why “more or less”? Because proving that the response is always absolutely correct is impossible, but we can at least check some cases and do basic assertions (for example, that a boolean field gets a boolean, and that a text summary is shorter than the original text).

To do that, we make real requests to the LLM and compare the result against the expected one. The simplest option is text files with test cases:

# billing_tickets.txt
I was charged twice for my subscription
My invoice shows the wrong amount

# support_tickets.txt
How do I reset my password?
Where can I find my API key?

And a spec that runs every case:

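describe "ticket classification prompt" do
  # One test case per line; blank lines and "#" comments are skipped.
  # The fixture directory is an example.
  def self.read_lines(file)
    File.readlines(File.join("spec/prompts/ticket_classification", file), chomp: true)
        .reject { |line| line.empty? || line.start_with?("#") }
  end

  # Real LLM call; value! unwraps the Success and raises on Failure
  let(:result) { TicketClassificationLLMRequest.call(ticket:).value! }

  shared_examples "checks ticket list" do |file:, expected_category:|
    read_lines(file).each do |line|
      context "when text is '#{line}'" do
        let(:ticket) { create :ticket, body: line }

        it { expect(result[:category]).to eq(expected_category) }
      end
    end
  end

  include_examples "checks ticket list",
    file: "billing_tickets.txt", expected_category: "billing"
  include_examples "checks ticket list",
    file: "support_tickets.txt", expected_category: "support"
end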

For more complex scenarios, use YAML, where each case describes the input data and the expected result over several fields:

# cases.yml
- description: "Billing complaint with refund request"
  ticket_body: "I was charged twice for my subscription, I want a refund"
  expected_category: "billing"
  expected_priority: "high"

- description: "General how-to question"
  ticket_body: "How do I reset my password?"
  expected_category: "support"
  expected_priority: "low"

The spec then iterates over the cases:

test_cases = YAML.load_file("spec/prompts/ticket_classification/cases.yml")

test_cases.each do |test_case|
  context test_case["description"] do
    let(:ticket) { create :ticket, body: test_case["ticket_body"] }

    specify do
      expect(result[:category]).to eq(test_case["expected_category"])
      expect(result[:priority]).to eq(test_case["expected_priority"])
    end
  end
end

Prompt tests should not be run on every commit: running them on a schedule (and whenever the prompts themselves change) is enough.
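The basic assertions mentioned earlier (a boolean field holds a boolean, a summary is shorter than the original text) do not depend on any particular expected answer, so they can live in a plain helper and be reused across cases. A minimal sketch, assuming a hypothetical response with urgent and summary fields:

```ruby
# Returns a list of human-readable problems; an empty list means the response passed.
def response_problems(response, source_text:)
  problems = []

  unless [true, false].include?(response["urgent"])
    problems << "urgent must be a boolean"
  end

  summary = response["summary"]
  if !summary.is_a?(String) || summary.empty?
    problems << "summary must be a non-empty string"
  elsif summary.length >= source_text.length
    problems << "summary must be shorter than the source text"
  end

  problems
end

source = "I was charged twice for my subscription, I want a refund"
good = { "urgent" => true, "summary" => "Double charge, wants refund" }
bad = { "urgent" => "yes", "summary" => source * 2 }

p response_problems(good, source_text: source) # prints []
p response_problems(bad, source_text: source)
```

In a spec this becomes a single expectation that the list is empty, and the failure message then names every violated rule at once.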

Providing LLMs access to our data

LLMs work well when they have enough context. There are three ways to give them that context, and each one has its own trade-offs.

Rich context

Everything that might be needed is passed directly into the prompt or instructions. In practice, for most tasks in a Rails application rich context is enough. If you know exactly what the LLM needs for a specific task, it’s easier to pass it explicitly. Technically it’s just a longer prompt, so our current setup works out of the box—we just pass more variables to ERB:

def prompt
  previous_tickets = ticket.customer.tickets.recent
    .limit(5).pluck(:body, :category)

  eval_erb_template(prompt_path, {
    body: ticket.body,
    previous_tickets:
  })
end
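On the template side, the extra variable is just a loop. A sketch of what the matching prompt.text.erb could look like, with the template text being an assumption; pluck(:body, :category) returns an array of [body, category] pairs, which the block arguments destructure:

```ruby
require "erb"

# Hypothetical prompt.text.erb for the rich-context version of the classifier.
template = <<~TEMPLATE
  Ticket:
  <%= body %>
  <%- if previous_tickets.any? -%>

  Previous tickets from this customer:
  <%- previous_tickets.each do |previous_body, category| -%>
  - [<%= category %>] <%= previous_body %>
  <%- end -%>
  <%- end -%>
TEMPLATE

def render_prompt(template, body:, previous_tickets:)
  ERB.new(template, trim_mode: "-").result_with_hash(body:, previous_tickets:)
end

# Same shape as the result of pluck(:body, :category)
history = [
  ["I was charged twice", "billing"],
  ["How do I reset my password?", "support"]
]

with_history = render_prompt(template, body: "My invoice shows the wrong amount", previous_tickets: history)
without_history = render_prompt(template, body: "My invoice shows the wrong amount", previous_tickets: [])
puts with_history
```

When the customer has no history, the whole section is omitted, so the model does not see an empty heading.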

Tool use

The LLM itself decides which data to request via function calling (ruby_llm has tool support). By default it’s the model’s call whether to actually use a tool, so the response might not include all the context you expected. You can force a specific tool with choice: :required (or a tool name), but at that point you’ve removed the dynamic part—you might as well pass the data as rich context. This makes tool use a good fit when the data is optional or branching, but a risky default when the model really does need a specific piece of context to answer correctly.

For our ticket classifier, a tool might look like this:

class FetchCustomerHistory < RubyLLM::Tool
  description "Fetch the last 5 tickets submitted by the customer"
  param :customer_id, desc: "ID of the customer"

  def execute(customer_id:)
    Customer.find(customer_id).tickets.recent.limit(5).pluck(:body, :category)
  end
end

Attaching it to the chat is one line—you’d add a tools method to the subclass and extend build_chat to wire it up:

chat = chat.with_tools(*tools) if tools.any?

In my projects I reach for tools mostly when the data is genuinely optional—otherwise I’d rather pay the prompt-size tax and pass it as rich context.

RAG

RAG comes into play when you have too much data to provide. In this case, you index your data as embeddings in a vector store and either wrap that store with tools or retrieve a relevant subset before passing it as rich context. How you actually do that retrieval (pure vector search, hybrid with BM25, reranking, and so on) is a whole topic of its own and deserves a separate post.
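The "retrieve a relevant subset" step can be illustrated with pure vector search in plain Ruby. This is only a toy sketch: the embeddings are hard-coded three-dimensional vectors, whereas in a real application they would come from an embedding model and live in a vector store, which would also do the search:

```ruby
# Cosine similarity between two equally sized numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

# Returns the k documents most similar to the query, best match first.
def top_k(query_embedding, documents, k:)
  documents.max_by(k) { |doc| cosine_similarity(query_embedding, doc[:embedding]) }
end

# Toy knowledge base with made-up embeddings.
documents = [
  { title: "refund policy",  embedding: [0.9, 0.1, 0.0] },
  { title: "password reset", embedding: [0.0, 0.2, 0.9] },
  { title: "invoice format", embedding: [0.7, 0.3, 0.1] }
]

query_embedding = [1.0, 0.0, 0.0] # pretend this encodes "I was charged twice"
relevant = top_k(query_embedding, documents, k: 2)
p relevant.map { |doc| doc[:title] } # prints ["refund policy", "invoice format"]
```

The selected documents would then be passed to the prompt template as one more variable, exactly like previous_tickets in the rich context example.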


That’s all for today. Here is a quick recap of the post:

  1. the foundation of our LLM layer is a base class with configuration via methods;
  2. familiar ERB works well for prompts—keep them in files, not in string interpolation;
  3. validate the response in transform_response and pick between failing fast and retrying with the error;
  4. wrap every call in a Runner so response time, tokens, cost, and errors land somewhere queryable;
  5. test the request logic with mocks on every commit, and run prompt tests on a schedule;
  6. rich context beats flexibility—predictability is more important.

In the meantime, take a look at the LLM code in your own app. Are the prompts living in a file or buried in a string? Can you tell how much each call costs, or how often the schema falls over? Maybe you’ll find that the patterns above slot in neatly. Maybe you’ll come up with something better. Either way, I’d love to hear about it!

