Like it or not, a lot of applications are adding AI-native features: anything related to automated answers, object classification, knowledge base search, or text summarization can already be handed off to an LLM with pretty good results. If you happen to do this as a Rails engineer, this post will definitely be useful.
In this post I will describe my approach to LLM integration for Rails applications. We will discuss some common problems, explore related gems, build our own architecture layer for LLM integration, cover it with specs, and discuss ways to prepare the context.
Why we need a layer
Integrating an LLM into a Rails app at the early stages usually does not differ much from connecting any other API: we make a call with some parameters and get the response back, which is then used in the business layer. Of course, we should not forget to handle errors, move the interaction itself to the background, and so on. Nothing unusual so far.
Soon it turns out that things are not that simple: even though it’s nominally the same call, the parameters differ a lot from case to case, and preparing them requires separate work. One of the most important parameters is the prompt: we need to explain to the LLM what we actually want from it. For simple, short prompts, string interpolation is enough, but you’ll quickly outgrow it.
Error handling also becomes complicated and verbose: on top of network and server errors, an incorrect response (from the business logic standpoint) can come back, so we need to add validations.
At this point, an experienced engineer starts looking for libraries that would take at least some of these routine tasks off their hands. After some time working with the raw OpenAI adapter, I ended up with the following list of goals:
- make it easier to support and replace models/providers;
- separate the LLM interaction code from the business logic;
- get rid of all the boilerplate (storing schemas/instructions, preparing templates);
- have centralized logging in place, since you often want to inspect the “raw” response from the model when behavior is unexpected.
Library choice
Finding something isn’t hard: ruby_llm, activeagent, and a number of smaller solutions offer different levels of abstraction. In this post I will tell you which option I ended up with: ruby_llm as a transport plus my own layer on top.
While I was working on the post, I found ruby_llm-agents, which is pretty similar to what I came up with. How could I miss it? It was released in January, and I was working on this in October.
Moreover, I discovered that ruby_llm has shipped a similar DSL too. Fortunately my approach is a bit different, otherwise you would not be reading this post!
A quick tour of ruby_llm
ruby_llm is a library for working with different LLM providers (OpenAI, Anthropic, Google, and others). The interface for each model is similar, and some common tasks (e.g., error handling) are implemented right inside the library.
The main abstraction is a chat, which represents a single conversation with the LLM (and can include more than one message from us). The chat is configured through a chain of calls. For instance, with_instructions sets the system prompt, and with_schema enables structured output: the model is required to return JSON strictly following the specified JSON Schema.
class TicketSchema < RubyLLM::Schema
  string :category
  string :summary
end

chat = RubyLLM.chat(model: "gpt-4o")
chat.with_schema(TicketSchema)
chat.with_instructions("Classify the ticket.")
response = chat.ask(ticket.text)
response.content # => {"category" => "billing", "summary" => "..."}
Note that response.content is already parsed according to the schema!
Yes, this also means JSON Schema validation comes for free—one less thing to write yourself.
The next useful feature is persistence. We can save all our chats to the database. To do that, we generate the tables and models using ruby_llm’s generator and slightly adjust our code: instead of RubyLLM.chat we use Chat.create!, and everything just works.
Chat.create!(model: "gpt-4o")
  .with_instructions(instructions)
  .with_schema(output_schema)
  .ask(prompt)
If you decide to use persistence, think about two things:
- Chat and Message often carry some kind of business context, so you might want to rename the models and/or move them to a namespace—better do it right away;
- these tables are going to be big. Really. Think about partitioning or storing them somewhere outside the main DB.
Don’t say I didn’t warn you when the messages table hits 10M rows.
Designing the base class
Each LLM call should be wrapped in a separate class that inherits from a base class (let’s call it BaseLLMRequest). All the boilerplate and instrumentation lives in the base class, while subclasses only configure the request parameters. The base class can be implemented like this:
# Assumes the dry-initializer and dry-monads gems: they provide the `option`
# macro and the Success()/Failure() helpers used by subclasses.
class BaseLLMRequest
  extend Dry::Initializer
  include Dry::Monads[:result]

  def self.call(...) = new(...).call

  def call
    chat = build_chat
    # Runner: a separate class so we can isolate the transport layer.
    # It returns Success(response) or Failure(), so we unwrap it with bind.
    Runner.new(chat:).run_with(prompt).bind do |response|
      transform_response(response.content)
    end
  end

  private

  def build_chat
    chat = Chat.create!(model:)
    chat = chat.with_instructions(instructions) if instructions
    chat = chat.with_schema(output_schema) if output_schema
    # You can add any chat settings here; some steps may be optional.
    chat
  end

  # Configuration: overridden in subclasses
  def model = nil
  def prompt = ""
  def instructions = nil
  def output_schema = nil

  # here we will transform the raw response into something convenient
  # for the business layer
  def transform_response(raw) = Success(raw)

  # ERB helper: covered below
  def eval_erb_template(path, variables)
    template = File.read(path)
    ERB.new(template, trim_mode: "-").result_with_hash(variables)
  end
end
Now let’s look at a subclass:
class TicketClassificationLLMRequest < BaseLLMRequest
  option :ticket

  def model = "gpt-4o-mini"

  # you can try to be clever and make the default implementation
  # use these paths and names
  def instructions_path = File.expand_path("./instructions.text.erb", __dir__)
  def output_schema_path = File.expand_path("./output_schema.json", __dir__)
  def prompt_path = File.expand_path("./prompt.text.erb", __dir__)

  # we will talk about ERB below
  def prompt = eval_erb_template(prompt_path, { body: ticket.body })

  # check that the category is valid and add more data to the response
  def transform_response(raw)
    category = raw["category"]
    VALID_CATEGORIES.include?(category) ?
      Success(category:, summary: raw["summary"]) : Failure()
  end
end
Why plain methods instead of a DSL (like in the gem mentioned above)? You can always migrate later, and this approach gives more flexibility. For instance, if you want to run an A/B test on the model, it is enough to add this logic to the implementation:
def model
  if user.ab_tests["ab_ticket_classification_model"] == "segment_gpt_5"
    "gpt-5"
  else
    "gpt-4o"
  end
end
Prompts as ERB templates
I keep prompt and schema parts in files, or inline as strings when that’s simpler. The file layout typically follows this pattern:
llm_requests/
  ticket_classification_llm_request.rb
  ticket_classification_llm_request/
    instructions.text.erb
    output_schema.json
ERB works just like in views: eval_erb_template substitutes variables into the template.
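As a minimal sketch of what such a template and its rendering could look like: the template text, the `categories` variable, and the sample ticket below are made up for illustration, and the template is inlined here only to keep the example self-contained.

```ruby
require "erb"

# Hypothetical prompt template; in the app this would live in a .erb file.
PROMPT_TEMPLATE = <<~TEMPLATE
  Classify the following support ticket.
  Allowed categories:
  <%- categories.each do |category| -%>
  - <%= category %>
  <%- end -%>

  Ticket:
  <%= body %>
TEMPLATE

# Same rendering call as eval_erb_template, minus the File.read step.
def render_prompt(template, variables)
  ERB.new(template, trim_mode: "-").result_with_hash(variables)
end

puts render_prompt(
  PROMPT_TEMPLATE,
  { categories: %w[billing support], body: "I was charged twice" }
)
```

With trim_mode "-", the `<%- ... -%>` tags leave no blank lines behind, so loops and conditionals do not pollute the prompt with stray whitespace.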
It is worth highlighting the difference between prompt and instructions. The prompt is sent as a user-role message, while instructions are sent as the system prompt (the developer role). Providers generally give the developer role higher priority. In addition, instructions usually contain rules and the response format, while the prompt contains the specific data to process.
Handling invalid responses
Even with a JSON Schema, the LLM can return something that’s useless from a business standpoint: an unknown category, a summary that contradicts the body, a number outside the expected range. That’s what transform_response is for—validate against your domain rules and return Failure() when something looks wrong (we already did exactly this in TicketClassificationLLMRequest).
There are two patterns I reach for depending on the request:
- fail fast: return Failure() and let the caller decide. Best for non-critical flows (enrichment, suggestions) where skipping is cheaper than retrying;
- retry once with the validation error: feed the error back into the prompt ("the value 'nonsense' is not a valid category, pick one of ...") and re-run. Costs more, but salvages flaky responses.
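The second pattern can be sketched in plain Ruby. This is only an illustration: `ask` is a hypothetical callable standing in for the real LLM call, the category list is sample data, and the method names are not part of the layer described in this post.

```ruby
VALID_CATEGORIES = %w[billing support].freeze

# Returns an error message for the model, or nil when the response is usable.
def validation_error(raw)
  category = raw["category"]
  return nil if VALID_CATEGORIES.include?(category)

  "the value '#{category}' is not a valid category, " \
    "pick one of #{VALID_CATEGORIES.join(', ')}"
end

# `ask` is any callable that takes a prompt and returns a parsed Hash;
# it stands in for the real LLM call here.
def classify_with_one_retry(prompt, ask)
  raw = ask.call(prompt)
  error = validation_error(raw)
  return raw if error.nil?

  # Second and last attempt: append the validation error to the prompt.
  retried = ask.call("#{prompt}\n\nYour previous answer was rejected: #{error}")
  validation_error(retried).nil? ? retried : nil
end

# Fake model: answers nonsense first, then a valid category.
answers = [{ "category" => "nonsense" }, { "category" => "billing" }]
fake_ask = ->(_prompt) { answers.shift }

p classify_with_one_retry("I was charged twice", fake_ask)
```

Capping the loop at a single retry keeps the worst-case cost at two calls per request; returning nil here plays the role of Failure() in the real request class.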
Runner
Almost right away I decided to extract a separate class for instrumenting request execution. It’s called Runner and looks something like this:
class Runner
  # Success()/Failure() are assumed to come from dry-monads
  include Dry::Monads[:result]

  def initialize(chat:)
    @chat = chat
  end

  def run_with(message)
    @response = nil
    response_time = Benchmark.realtime do
      @response = perform_ask(message)
    end
    save_response_time(response_time)
    @response
  end

  private

  def save_response_time(response_time)
    # Since Chat is just a model, we can add the columns we need and fill them in
    @chat.messages.last.update!(response_time:)
  end

  def perform_ask(message)
    Success(@chat.ask(message))
  rescue RubyLLM::ServerError, RubyLLM::ServiceUnavailableError, RubyLLM::OverloadedError => e
    NewRelic::Agent.record_custom_event("LLM_server_error", kind: e.class.name)
    Failure()
  rescue RubyLLM::Error => e
    ErrorTracker.capture_exception(e, extra: { raw_response: @response&.content })
    Failure()
  end
end
Not all errors should be sent to the error tracker. ServerError, OverloadedError, and ServiceUnavailableError are temporary problems on the provider side; it is better to send them to monitoring and set up an alert.
The same trick—adding columns to Message and filling them in the Runner—works for input_tokens, output_tokens, and the computed cost. Combined with response_time and the error events, this gives you the raw material for a few useful dashboards: schema-mismatch rate per request class, p95 response time, token spend per feature, and the share of failures that are provider outages versus genuinely bad output.
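A rough sketch of the cost calculation, assuming per-million-token pricing. The model name and the prices below are placeholders and not real provider pricing; `llm_call_cost` is a hypothetical helper.

```ruby
# Placeholder prices in USD per 1M tokens; NOT real provider pricing.
PRICES_PER_MILLION = {
  "example-small" => { input: Rational("0.50"), output: Rational("2.00") }
}.freeze

# Returns the cost of a single call as an exact Rational number of dollars.
def llm_call_cost(model:, input_tokens:, output_tokens:)
  price = PRICES_PER_MILLION.fetch(model)
  (input_tokens * price[:input] + output_tokens * price[:output]) / 1_000_000
end

cost = llm_call_cost(model: "example-small", input_tokens: 1_200, output_tokens: 300)
puts format("%.6f USD", cost)
```

Rational avoids floating-point drift when many small per-call amounts are summed for a dashboard; a decimal column on the messages table would serve the same purpose on the database side.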
Running in the background
LLM calls are slow—a few hundred milliseconds at best, several seconds at worst—so you don’t want them in the request/response cycle. Since *.call is just a regular method, wrapping a request in a background job is trivial:
class ClassifyTicketJob < ApplicationJob
  def perform(ticket)
    TicketClassificationLLMRequest.call(ticket:)
  end
end
Each call creates its own persisted chat, so retrying a failed job is safe—and you can still inspect the original chat row to see what came back the first time.
Writing tests
Earlier we wrote LLM requests. Now it’s time to test them—but how? LLM responses are non-deterministic and don’t always match the schema; real requests cost money and take a lot of time. Because of this non-determinism we cannot rely on the usual approach with webmock, but we also do not want to make real requests. In my projects I settled on two layers: unit tests for the request logic, and prompt tests for the responses.
Level 1: unit tests for the LLM request
The goal is to test the code without making real requests to the model. As in any other integration, we mock the HTTP connection, plug in a prepared JSON response, and check two things:
- What payload was sent to the LLM—parameters, system prompt, user message, schema.
- How the response was processed—parsing, edge cases, errors.
For this it is convenient to have a helper module that mocks the connection and provides utilities to build the expected payload:
The helper uses included do, so it needs extend ActiveSupport::Concern; otherwise you’ll get a NoMethodError on the first run.
module LLMRequestHelpers
  extend ActiveSupport::Concern

  included do
    let(:ruby_llm_connection) { instance_double RubyLLM::Connection }
    let(:faraday_response) { instance_double Faraday::Response }

    before do
      allow(RubyLLM::Connection).to receive(:new)
        .and_return(ruby_llm_connection)
      allow(ruby_llm_connection).to receive(:post)
        .and_return(faraday_response)
      allow(faraday_response).to receive(:body)
        .and_return(llm_response_body)
    end
  end
end
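The helper refers to llm_response_body, while the specs only define assistant_response. One possible way to build the former from the latter, assuming an OpenAI-style chat completion response (the function name, the id, and the token counts are placeholders):

```ruby
require "json"

# Wraps the assistant's message in an OpenAI-style chat completion body.
# The id and token counts are placeholder values.
def build_llm_response_body(assistant_response, model: "gpt-4o-mini")
  {
    "id" => "chatcmpl-test",
    "object" => "chat.completion",
    "model" => model,
    "choices" => [
      {
        "index" => 0,
        "message" => { "role" => "assistant", "content" => assistant_response },
        "finish_reason" => "stop"
      }
    ],
    "usage" => { "prompt_tokens" => 10, "completion_tokens" => 5, "total_tokens" => 15 }
  }
end

assistant_response = { "category" => "billing", "summary" => "Payment issue" }.to_json
puts JSON.pretty_generate(build_llm_response_body(assistant_response))
```

In the helper this could be exposed as a let that calls the function with assistant_response. Whether the mocked body should be a parsed Hash or a JSON string depends on how much of the connection stack is replaced, so treat the shape above as a starting point.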
The test itself looks something like this:
describe TicketClassificationLLMRequest, type: :llm_request do
  subject(:call) { described_class.call(ticket:) }

  let(:ticket) { create :ticket }
  let(:assistant_response) {
    { "category" => "billing", "summary" => "Payment issue" }.to_json
  }

  # happy path: parsing + correct payload
  specify do
    expect(call).to eq(Success(category: "billing", summary: "Payment issue"))
    expect(ruby_llm_connection).to have_received(:post) do |_, payload|
      expect(payload[:model]).to eq("gpt-4o-mini")
      expect(payload[:messages]).to include(
        hash_including(role: "user", content: ticket.body)
      )
    end
  end

  # invalid category
  context "when category is unknown" do
    let(:assistant_response) {
      { "category" => "nonsense", "summary" => "..." }.to_json
    }

    specify { expect(call).to eq(Failure()) }
  end
end
What does this test cover?
- Building the prompt—in case someone breaks the instructions template or passes the wrong variables.
- Parsing the response: the LLM does not always answer strictly according to the schema, and there can be extra keys or unexpected nesting.
These tests are fast, make no network requests, and run on every commit. It is also worth pointing out that the specs themselves are largely independent of the transport layer: the assertions target the parameters sent to the LLM API, so when replacing ruby_llm with something else, only the connection mock in the helper should need to change, not the specs.
Level 2: prompt tests
Prompt tests check that executing the request returns a more or less expected response. Why “more or less”? Because proving that the response is always absolutely correct is impossible, but we can at least check some cases and do basic assertions (for example, that a boolean field gets a boolean, and that a text summary is shorter than the original text).
These tests make real requests to the LLM and compare the result against the expected one. The simplest option is text files with test cases:
# billing_tickets.txt
I was charged twice for my subscription
My invoice shows the wrong amount
# support_tickets.txt
How do I reset my password?
Where can I find my API key?
And a spec that runs every case:
describe "ticket classification prompt" do
  shared_examples "checks ticket list" do |file:, expected_category:|
    read_lines(file).each do |line|
      context "when text is '#{line}'" do
        let(:ticket) { create :ticket, body: line }

        it { expect(result[:category]).to eq(expected_category) }
      end
    end
  end

  include_examples "checks ticket list",
    file: "billing_tickets.txt", expected_category: "billing"
  include_examples "checks ticket list",
    file: "support_tickets.txt", expected_category: "support"
end
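The spec above calls read_lines without showing it. A possible implementation, assuming one test case per line and treating lines that start with "#" as comments (both conventions are assumptions, as is the helper itself):

```ruby
require "tempfile"

# Hypothetical helper: one test case per line, skipping blanks and "#" comments.
def read_lines(path)
  File.readlines(path, chomp: true)
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#") }
end

# Demo with a temporary file standing in for billing_tickets.txt.
Tempfile.create("billing_tickets") do |file|
  file.write("# billing_tickets.txt\nI was charged twice for my subscription\n\n")
  file.write("My invoice shows the wrong amount\n")
  file.flush
  p read_lines(file.path)
end
```

Because read_lines runs while the example group is being defined, it has to be callable at that level (a top-level method, as here, or a module extended into the group) and not only from inside examples.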
For more complex scenarios, use YAML, where each case describes the input data and the expected result over several fields:
# cases.yml
- description: "Billing complaint with refund request"
  ticket_body: "I was charged twice for my subscription, I want a refund"
  expected_category: "billing"
  expected_priority: "high"
- description: "General how-to question"
  ticket_body: "How do I reset my password?"
  expected_category: "support"
  expected_priority: "low"

test_cases = YAML.load_file("spec/prompts/ticket_classification/cases.yml")
test_cases.each do |test_case|
  context test_case["description"] do
    let(:ticket) { create :ticket, body: test_case["ticket_body"] }

    specify do
      expect(result[:category]).to eq(test_case["expected_category"])
      expect(result[:priority]).to eq(test_case["expected_priority"])
    end
  end
end
Prompt tests should not be run on every commit: running them on a schedule (and whenever the prompts themselves change) is enough.
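The looser checks mentioned at the start of this section (a boolean field holds a boolean, a summary is shorter than the original text) do not need exact expected values. A small sketch, where the :urgent and :summary keys are illustrative field names and the function is hypothetical:

```ruby
# Returns a list of human-readable problems; an empty list means the result passed.
def basic_result_problems(result, original_text:)
  problems = []
  unless [true, false].include?(result[:urgent])
    problems << "urgent must be true or false"
  end

  summary = result[:summary]
  if !summary.is_a?(String) || summary.empty?
    problems << "summary must be a non-empty string"
  elsif summary.length >= original_text.length
    problems << "summary must be shorter than the original text"
  end
  problems
end

original = "I was charged twice for my subscription, I want a refund"
p basic_result_problems({ urgent: true, summary: "Double charge" }, original_text: original)
p basic_result_problems({ urgent: "yes", summary: original }, original_text: original)
```

In a prompt spec, the returned list can be asserted to be empty, which gives a readable failure message when the model drifts.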
Providing LLMs access to our data
LLMs work well when they have enough context. There are three ways to give them that context, and each one has its own trade-offs.
Rich context
Everything that might be needed is passed directly into the prompt or instructions. In practice, for most tasks in a Rails application rich context is enough. If you know exactly what the LLM needs for a specific task, it’s easier to pass it explicitly. Technically it’s just a longer prompt, so our current setup works out of the box—we just pass more variables to ERB:
def prompt
  previous_tickets = ticket.customer.tickets.recent
    .limit(5).pluck(:body, :category)
  eval_erb_template(prompt_path, {
    body: ticket.body,
    previous_tickets:
  })
end
Tool use
The LLM itself decides which data to request via function calling (ruby_llm has tool support). By default it’s the model’s call whether to actually use a tool, so the response might not include all the context you expected. You can force a specific tool with choice: :required (or a tool name), but at that point you’ve removed the dynamic part—you might as well pass the data as rich context. This makes tool use a good fit when the data is optional or branching, but a risky default when the model really does need a specific piece of context to answer correctly.
For our ticket classifier, a tool might look like this:
class FetchCustomerHistory < RubyLLM::Tool
  description "Fetch the last 5 tickets submitted by the customer"
  param :customer_id, desc: "ID of the customer"

  def execute(customer_id:)
    Customer.find(customer_id).tickets.recent.limit(5).pluck(:body, :category)
  end
end
Attaching it to the chat is one line—you’d add a tools method to the subclass and extend build_chat to wire it up:
chat = chat.with_tools(*tools) if tools.any?
In my projects I reach for tools mostly when the data is genuinely optional—otherwise I’d rather pay the prompt-size tax and pass it as rich context.
RAG
RAG comes into play when you have too much data to provide. In this case, you index your data as embeddings in a vector store and either wrap that store with tools or retrieve a relevant subset before passing it as rich context. How you actually do that retrieval (pure vector search, hybrid with BM25, reranking, and so on) is a whole topic of its own and deserves a separate post.
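To make the "retrieve a relevant subset" step a bit more concrete, here is a toy version of pure vector search in plain Ruby. The three-dimensional embeddings are hand-written for illustration; in a real setup they would come from an embedding model and live in a vector store, and the function names are hypothetical.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Returns the texts of the k documents closest to the query embedding.
def top_k(query_embedding, documents, k: 2)
  documents
    .max_by(k) { |doc| cosine_similarity(query_embedding, doc[:embedding]) }
    .map { |doc| doc[:text] }
end

# Hand-written toy embeddings; not produced by a real model.
DOCUMENTS = [
  { text: "Refund policy", embedding: [0.9, 0.1, 0.0] },
  { text: "Password reset guide", embedding: [0.0, 0.2, 0.9] },
  { text: "Invoice FAQ", embedding: [0.8, 0.3, 0.1] }
].freeze

p top_k([1.0, 0.0, 0.0], DOCUMENTS)
```

The selected texts would then be passed to the ERB template as one more variable, exactly like the rich-context example.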
That’s all for today. Here is a quick recap of the post:
- the foundation of our LLM layer is a base class with configuration via methods;
- familiar ERB works well for prompts—keep them in files, not in string interpolation;
- validate the response in transform_response and pick between failing fast and retrying with the error;
- wrap every call in a Runner so response time, tokens, cost, and errors land somewhere queryable;
- test the request logic with mocks on every commit, and run prompt tests on a schedule;
- rich context beats flexibility—predictability is more important.
In the meantime, take a look at the LLM code in your own app. Are the prompts living in a file or buried in a string? Can you tell how much each call costs, or how often the schema falls over? Maybe you’ll find that the patterns above slot in neatly. Maybe you’ll come up with something better. Either way, I’d love to hear about it!