Spring AI: Build a RAG Application
Large language models know a lot — but not about your data. RAG (Retrieval-Augmented Generation) solves this: find the relevant context from your documents, inject it into the prompt, and let the model answer grounded in your data. This article builds a complete RAG API with Spring AI 2.0.
What You’ll Build
A Q&A API over your product documentation:
User: "What's the return policy for electronics?"
→ Search vector store for relevant docs
→ Inject matching paragraphs into prompt
→ Claude/GPT answers based on your actual docs
Without RAG: the LLM guesses or hallucinates your policy. With RAG: the LLM reads your actual policy and answers accurately.
Setup
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-bom</artifactId>
<version>2.0.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<!-- Anthropic Claude (or OpenAI — swap one for the other) -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-anthropic</artifactId>
</dependency>
<!-- Embeddings — use OpenAI's text-embedding-3-small -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
</dependency>
<!-- PostgreSQL vector store (pgvector) -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>
<!-- PDF/text document readers -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>
</dependencies>
Configuration
spring:
ai:
anthropic:
api-key: ${ANTHROPIC_API_KEY}
chat:
options:
model: claude-opus-4-7
max-tokens: 1024
temperature: 0.1 # low temperature for factual answers
openai:
api-key: ${OPENAI_API_KEY}
embedding:
options:
model: text-embedding-3-small
dimensions: 1536
vectorstore:
pgvector:
index-type: HNSW
distance-type: COSINE_DISTANCE
dimensions: 1536
schema-validation: true
initialize-schema: true
datasource:
url: jdbc:postgresql://localhost:5432/rag_db
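With both the Anthropic and OpenAI starters on the classpath there are two auto-configured ChatModel beans, so Spring AI won't pick a ChatClient for you. One way to wire it (a minimal sketch, assuming the auto-configured `AnthropicChatModel` bean and a hypothetical `ChatClientConfig` class name) is to pin the ChatClient to the chat model explicitly:

```java
import org.springframework.ai.anthropic.AnthropicChatModel;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChatClientConfig {

    // Pin the ChatClient to the Anthropic model: with two ChatModel beans on
    // the classpath (Anthropic for chat, OpenAI for embeddings), injecting a
    // ChatClient.Builder directly would be ambiguous.
    @Bean
    public ChatClient chatClient(AnthropicChatModel chatModel) {
        return ChatClient.builder(chatModel).build();
    }
}
```

The OpenAI starter still contributes the EmbeddingModel that the vector store uses; only chat traffic goes through Claude.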
Step 1: Ingest Documents
Load your documents, split them into chunks, embed them, and store in the vector database:
@Service
@RequiredArgsConstructor
@Slf4j
public class DocumentIngestionService {
private final VectorStore vectorStore;
private final TokenTextSplitter textSplitter;
public void ingestPdf(Resource pdfResource, String documentId) {
log.info("Ingesting document: {}", pdfResource.getFilename());
// Read PDF
TikaDocumentReader reader = new TikaDocumentReader(pdfResource);
List<Document> documents = reader.get();
// Split into token-sized chunks
List<Document> chunks = textSplitter.apply(documents);
// Add metadata for filtering and attribution
chunks.forEach(chunk -> {
chunk.getMetadata().put("documentId", documentId);
chunk.getMetadata().put("source", pdfResource.getFilename());
chunk.getMetadata().put("ingestedAt", Instant.now().toString());
});
// Embed and store (Spring AI handles embedding API calls)
vectorStore.add(chunks);
log.info("Ingested {} chunks from {}", chunks.size(), pdfResource.getFilename());
}
public void ingestText(String content, String title, String documentId) {
Document doc = new Document(content, Map.of(
"documentId", documentId,
"title", title,
"source", "manual-input"
));
List<Document> chunks = textSplitter.apply(List.of(doc));
vectorStore.add(chunks);
}
}
@Bean
public TokenTextSplitter textSplitter() {
// Note: TokenTextSplitter has no overlap parameter — its arguments are
// chunk size, min chunk size, min length to embed, max chunks, keep separator
return new TokenTextSplitter(
512, // target chunk size (tokens)
128, // minimum chunk size (characters) before forcing a split
5, // minimum chunk length (tokens) worth embedding
10000, // maximum number of chunks per document
true // keep separators (newlines) in chunk text
);
}
Step 2: Ingest on Startup
@Component
@RequiredArgsConstructor
@Slf4j
public class DocumentLoader implements ApplicationRunner {
private final DocumentIngestionService ingestionService;
private final VectorStore vectorStore;
@Override
public void run(ApplicationArguments args) throws Exception {
// Only ingest if vector store is empty
SearchRequest probe = SearchRequest.builder().query("test").topK(1).build();
if (!vectorStore.similaritySearch(probe).isEmpty()) {
log.info("Vector store already populated, skipping ingestion");
return;
}
log.info("Loading product documentation into vector store");
// Load from classpath
Resource[] docs = new PathMatchingResourcePatternResolver()
.getResources("classpath:docs/*.pdf");
for (Resource doc : docs) {
String docId = doc.getFilename().replace(".pdf", "");
ingestionService.ingestPdf(doc, docId);
}
log.info("Document ingestion complete");
}
}
Step 3: RAG Query Pipeline
@Service
@RequiredArgsConstructor
@Slf4j
public class RagService {
private final VectorStore vectorStore;
private final ChatClient chatClient;
public RagResponse query(String userQuestion) {
return query(userQuestion, null);
}
public RagResponse query(String userQuestion, @Nullable String documentId) {
// 1. Retrieve relevant chunks
SearchRequest.Builder searchBuilder = SearchRequest.builder()
.query(userQuestion)
.topK(5)
.similarityThreshold(0.7);
// Filter by document if specified
if (documentId != null) {
searchBuilder.filterExpression("documentId == '" + documentId + "'");
}
SearchRequest searchRequest = searchBuilder.build();
List<Document> relevantChunks = vectorStore.similaritySearch(searchRequest);
if (relevantChunks.isEmpty()) {
return RagResponse.noContext(userQuestion);
}
// 2. Build context from retrieved chunks
String context = relevantChunks.stream()
.map(Document::getText)
.collect(Collectors.joining("\n\n---\n\n"));
// 3. Build source attribution
List<SourceReference> sources = relevantChunks.stream()
.map(doc -> new SourceReference(
(String) doc.getMetadata().get("source"),
(String) doc.getMetadata().get("title")))
.distinct()
.toList();
// 4. Call the LLM with context
String answer = chatClient.prompt()
.system("""
You are a helpful assistant that answers questions based on the provided context.
Only answer based on the context below. If the answer is not in the context,
say "I don't have information about that in my knowledge base."
Do not make up information.
""")
.user(u -> u
.text("""
Context:
{context}
Question: {question}
Answer:
""")
.param("context", context)
.param("question", userQuestion))
.call()
.content();
log.info("RAG query: question='{}', chunks={}, tokens≈{}",
userQuestion, relevantChunks.size(), answer.length() / 4);
return new RagResponse(userQuestion, answer, sources);
}
}
public record RagResponse(
String question,
String answer,
List<SourceReference> sources
) {
static RagResponse noContext(String question) {
return new RagResponse(question,
"I don't have relevant information to answer this question.",
List.of());
}
}
public record SourceReference(String source, @Nullable String title) {}
Step 4: REST API
@RestController
@RequestMapping("/api/rag")
@RequiredArgsConstructor
public class RagController {
private final RagService ragService;
private final DocumentIngestionService ingestionService;
@PostMapping("/query")
public RagResponse query(@RequestBody @Valid QueryRequest request) {
return ragService.query(request.question(), request.documentId());
}
@PostMapping("/documents")
@PreAuthorize("hasRole('ADMIN')")
public ResponseEntity<Void> ingestDocument(
@RequestParam("file") MultipartFile file,
@RequestParam String documentId) throws IOException {
Resource resource = file.getResource();
ingestionService.ingestPdf(resource, documentId);
return ResponseEntity.accepted().build();
}
@PostMapping("/documents/text")
@PreAuthorize("hasRole('ADMIN')")
public ResponseEntity<Void> ingestText(@RequestBody @Valid TextIngestionRequest request) {
ingestionService.ingestText(request.content(), request.title(), request.documentId());
return ResponseEntity.accepted().build();
}
}
public record QueryRequest(
@NotBlank String question,
@Nullable String documentId
) {}
public record TextIngestionRequest(
@NotBlank String documentId,
@NotBlank String title,
@NotBlank String content
) {}
Step 5: Streaming Responses
For long answers, stream the response token-by-token:
@GetMapping(value = "/query/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> queryStream(@RequestParam String question) {
// 1. Retrieve context (synchronous)
List<Document> chunks = vectorStore.similaritySearch(
SearchRequest.builder().query(question).topK(5).build());
String context = chunks.stream()
.map(Document::getText)
.collect(Collectors.joining("\n\n"));
// 2. Stream the LLM response
return chatClient.prompt()
.system("Answer based on the context provided.")
.user(u -> u
.text("Context: {context}\n\nQuestion: {question}")
.param("context", context)
.param("question", question))
.stream()
.content();
}
Clients receive SSE tokens as the LLM generates them — perceived latency drops dramatically.
Evaluation: Is RAG Working?
@SpringBootTest
class RagEvaluationTest {
@Autowired RagService ragService;
@Autowired ChatClient evaluationClient;
@Test
void answerIsGroundedInContext() {
// Known question with known answer in the docs
RagResponse response = ragService.query("What is the return window for electronics?");
assertThat(response.answer()).isNotBlank();
assertThat(response.sources()).isNotEmpty();
// LLM-as-judge: is the answer factually grounded?
String judgment = evaluationClient.prompt()
.user(u -> u.text("""
Question: {question}
Answer: {answer}
Is this answer accurate and directly supported by factual content?
Reply only YES or NO.
""")
.param("question", response.question())
.param("answer", response.answer()))
.call()
.content();
assertThat(judgment.trim().toUpperCase()).startsWith("YES");
}
}
Production Considerations
Chunking strategy matters: too small → not enough context; too large → irrelevant text dilutes the signal. Start with ~512-token chunks and tune based on your document structure. Note that TokenTextSplitter does not produce overlapping chunks; if you want overlapping windows for context continuity, use a splitter that supports them.
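To make the size/overlap tradeoff concrete, here is a minimal sliding-window chunker over whitespace tokens (a sketch for illustration only — real splitters count model tokens, not words, and the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkDemo {

    // Split text into windows of `size` tokens, each window starting
    // `size - overlap` tokens after the previous one, so consecutive
    // chunks share `overlap` tokens of context.
    static List<String> chunk(String text, int size, int overlap) {
        String[] tokens = text.split("\\s+");
        List<String> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + size, tokens.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            if (end == tokens.length) break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 10 tokens, window of 4, overlap of 2:
        List<String> chunks = chunk("a b c d e f g h i j", 4, 2);
        System.out.println(chunks);
        // → [a b c d, c d e f, e f g h, g h i j]
    }
}
```

Larger `size` with the same `overlap` means fewer, broader chunks; larger `overlap` means more redundancy but less risk of splitting an answer across chunk boundaries.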
Embedding model consistency: Use the same model for ingestion and query. Never change the embedding model without re-ingesting all documents.
Similarity threshold: 0.7 is a good starting point. Lower = more (possibly irrelevant) results. Higher = fewer (but more precise) results.
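The threshold operates on cosine similarity between the query embedding and each chunk embedding. A toy illustration with 3-dimensional vectors (real embeddings here have 1536 dimensions; the class name is made up for the demo):

```java
public class CosineDemo {

    // Cosine similarity: dot product divided by the product of magnitudes.
    // 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] query    = {1.0, 0.2, 0.0};
        double[] onTopic  = {0.9, 0.3, 0.1}; // similar direction -> high score
        double[] offTopic = {0.0, 0.1, 1.0}; // near-orthogonal  -> low score
        System.out.printf("on-topic:  %.3f%n", cosine(query, onTopic));
        System.out.printf("off-topic: %.3f%n", cosine(query, offTopic));
        // Only the on-topic chunk clears a 0.7 threshold.
    }
}
```

Lowering the threshold admits more chunks like `offTopic` into the prompt; raising it risks returning nothing for loosely-worded questions.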
Cost control: Each query makes 2 API calls: embedding the question (cheap) and calling the LLM (expensive). Cache frequent queries in Redis with short TTL.
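The caching idea in plain Java — a sketch of a TTL cache keyed by the normalized question (class and method names are illustrative; in a real service you'd use Spring's `@Cacheable` backed by Redis rather than an in-memory map):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class QueryCache {

    private record Entry(String answer, Instant expiresAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;

    public QueryCache(Duration ttl) { this.ttl = ttl; }

    // Return a cached answer if present and fresh; otherwise run the
    // expensive computation (embedding + LLM call in the real service)
    // and cache the result with a TTL.
    public String getOrCompute(String question, Function<String, String> compute) {
        String key = question.trim().toLowerCase();
        Entry e = cache.get(key);
        if (e != null && e.expiresAt().isAfter(Instant.now())) {
            return e.answer();
        }
        String answer = compute.apply(question);
        cache.put(key, new Entry(answer, Instant.now().plus(ttl)));
        return answer;
    }
}
```

Keep the TTL short: a stale cached answer after a document re-ingestion defeats the point of RAG.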
Re-ingestion: When documents change, delete old vectors by documentId and re-ingest:
FilterExpressionBuilder b = new FilterExpressionBuilder();
vectorStore.delete(b.eq("documentId", docId).build());
ingestionService.ingestPdf(updatedDoc, docId);
What You’ve Learned
- RAG = retrieve relevant context from a vector store, inject it into the LLM prompt → grounded answers
- Spring AI 2.0 provides VectorStore, ChatClient, and EmbeddingModel with pluggable backends
- Ingestion pipeline: read → chunk (TokenTextSplitter) → embed → store in pgvector
- Query pipeline: embed question → similarity search → build context → call LLM → return answer with sources
- Stream LLM responses with chatClient.stream() and SSE for responsive UX
- Evaluate RAG quality with LLM-as-judge in integration tests
Series Complete
This is the final article of the Spring Boot Tutorial series — 59 articles from “Hello World” to production-ready RAG AI applications.
Here’s what you’ve covered:
| Part | Focus | Articles |
|---|---|---|
| 1 | Getting Started | 1–7 |
| 2 | REST APIs | 8–14 |
| 3 | Spring Data JPA | 15–22 |
| 4 | Spring Security | 23–28 |
| 5 | Testing | 29–33 |
| 6 | Production-Ready Features | 34–37 |
| 7 | Performance | 38–41 |
| 8 | Messaging | 42–45 |
| 9 | Microservices | 46–51 |
| 10 | Containers & Cloud | 52–54 |
| 11 | Spring Boot 4 & Modern Java | 55–59 |
Return to the Spring Boot Tutorial index → for the complete list.