Spring AI: Build a RAG Application

Large language models know a lot — but not about your data. RAG (Retrieval-Augmented Generation) solves this: find the relevant context from your documents, inject it into the prompt, and let the model answer grounded in your data. This article builds a complete RAG API with Spring AI 2.0.

What You’ll Build

A Q&A API over your product documentation:

User: "What's the return policy for electronics?"
→ Search vector store for relevant docs
→ Inject matching paragraphs into prompt
→ Claude/GPT answers based on your actual docs

Without RAG: the LLM guesses or hallucinates your policy. With RAG: the LLM reads your actual policy and answers accurately.

Setup

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>2.0.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- Anthropic Claude (or OpenAI — swap one for the other) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-anthropic</artifactId>
    </dependency>

    <!-- Embeddings — use OpenAI's text-embedding-3-small -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-openai</artifactId>
    </dependency>

    <!-- PostgreSQL vector store (pgvector) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
    </dependency>

    <!-- PDF/text document readers -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>
</dependencies>

Configuration

spring:
  ai:
    anthropic:
      api-key: ${ANTHROPIC_API_KEY}
      chat:
        options:
          model: claude-opus-4-1
          max-tokens: 1024
          temperature: 0.1       # low temperature for factual answers

    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small
          dimensions: 1536

    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536
        schema-validation: true
        initialize-schema: true

  datasource:
    url: jdbc:postgresql://localhost:5432/rag_db

Step 1: Ingest Documents

Load your documents, split them into chunks, embed them, and store in the vector database:

@Service
@RequiredArgsConstructor
@Slf4j
public class DocumentIngestionService {

    private final VectorStore vectorStore;
    private final TokenTextSplitter textSplitter;

    public void ingestPdf(Resource pdfResource, String documentId) {
        log.info("Ingesting document: {}", pdfResource.getFilename());

        // Read PDF
        TikaDocumentReader reader = new TikaDocumentReader(pdfResource);
        List<Document> documents = reader.get();

        // Split into token-based chunks
        List<Document> chunks = textSplitter.apply(documents);

        // Add metadata for filtering and attribution
        chunks.forEach(chunk -> {
            chunk.getMetadata().put("documentId", documentId);
            chunk.getMetadata().put("source", pdfResource.getFilename());
            chunk.getMetadata().put("ingestedAt", Instant.now().toString());
        });

        // Embed and store (Spring AI handles embedding API calls)
        vectorStore.add(chunks);

        log.info("Ingested {} chunks from {}", chunks.size(), pdfResource.getFilename());
    }

    public void ingestText(String content, String title, String documentId) {
        Document doc = new Document(content, Map.of(
            "documentId", documentId,
            "title", title,
            "source", "manual-input"
        ));

        List<Document> chunks = textSplitter.apply(List.of(doc));
        vectorStore.add(chunks);
    }
}

The splitter is a bean; define it in any @Configuration class. Note the constructor arguments: TokenTextSplitter has no overlap parameter.

@Bean
public TokenTextSplitter textSplitter() {
    return new TokenTextSplitter(
        512,   // target chunk size (tokens)
        128,   // minimum chunk size (characters)
        5,     // minimum chunk length to embed
        10000, // maximum number of chunks per document
        true   // keep separators
    );
}

Step 2: Ingest on Startup

@Component
@RequiredArgsConstructor
@Slf4j
public class DocumentLoader implements ApplicationRunner {

    private final DocumentIngestionService ingestionService;
    private final VectorStore vectorStore;

    @Override
    public void run(ApplicationArguments args) throws Exception {
        // Only ingest if vector store is empty
        SearchRequest probe = SearchRequest.builder().query("test").topK(1).build();
        if (!vectorStore.similaritySearch(probe).isEmpty()) {
            log.info("Vector store already populated, skipping ingestion");
            return;
        }

        log.info("Loading product documentation into vector store");

        // Load from classpath
        Resource[] docs = new PathMatchingResourcePatternResolver()
            .getResources("classpath:docs/*.pdf");

        for (Resource doc : docs) {
            String docId = doc.getFilename().replace(".pdf", "");
            ingestionService.ingestPdf(doc, docId);
        }

        log.info("Document ingestion complete");
    }
}
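
The probe query above costs one embedding API call just to decide whether to skip ingestion. If you want to avoid that, count rows in the pgvector table directly; a sketch, assuming the default vector_store table name:

@Component
@RequiredArgsConstructor
public class VectorStoreStats {

    private final JdbcTemplate jdbcTemplate;

    // Checks for stored chunks without calling the embedding API
    public boolean isEmpty() {
        Long count = jdbcTemplate.queryForObject(
            "SELECT count(*) FROM vector_store", Long.class);
        return count == null || count == 0;
    }
}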

Step 3: RAG Query Pipeline

@Service
@RequiredArgsConstructor
@Slf4j
public class RagService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public RagResponse query(String userQuestion) {
        return query(userQuestion, null);
    }

    public RagResponse query(String userQuestion, @Nullable String documentId) {
        // 1. Retrieve relevant chunks
        SearchRequest.Builder searchBuilder = SearchRequest.builder()
            .query(userQuestion)
            .topK(5)
            .similarityThreshold(0.7);

        // Restrict the search to a single document if specified
        if (documentId != null) {
            searchBuilder.filterExpression(
                new FilterExpressionBuilder().eq("documentId", documentId).build());
        }

        List<Document> relevantChunks = vectorStore.similaritySearch(searchBuilder.build());

        if (relevantChunks.isEmpty()) {
            return RagResponse.noContext(userQuestion);
        }

        // 2. Build context from retrieved chunks
        String context = relevantChunks.stream()
            .map(Document::getText)
            .collect(Collectors.joining("\n\n---\n\n"));

        // 3. Build source attribution
        List<SourceReference> sources = relevantChunks.stream()
            .map(doc -> new SourceReference(
                (String) doc.getMetadata().get("source"),
                (String) doc.getMetadata().get("title")))
            .distinct()
            .toList();

        // 4. Call the LLM with context
        String answer = chatClient.prompt()
            .system("""
                You are a helpful assistant that answers questions based on the provided context.
                Only answer based on the context below. If the answer is not in the context,
                say "I don't have information about that in my knowledge base."
                Do not make up information.
                """)
            .user(u -> u
                .text("""
                    Context:
                    {context}

                    Question: {question}

                    Answer:
                    """)
                .param("context", context)
                .param("question", userQuestion))
            .call()
            .content();

        log.info("RAG query: question='{}', chunks={}, tokens≈{}",
            userQuestion, relevantChunks.size(), answer.length() / 4);

        return new RagResponse(userQuestion, answer, sources);
    }
}

public record RagResponse(
    String question,
    String answer,
    List<SourceReference> sources
) {
    static RagResponse noContext(String question) {
        return new RagResponse(question,
            "I don't have relevant information to answer this question.",
            List.of());
    }
}

public record SourceReference(String source, @Nullable String title) {}

Step 4: REST API

@RestController
@RequestMapping("/api/rag")
@RequiredArgsConstructor
public class RagController {

    private final RagService ragService;
    private final DocumentIngestionService ingestionService;

    @PostMapping("/query")
    public RagResponse query(@RequestBody @Valid QueryRequest request) {
        return ragService.query(request.question(), request.documentId());
    }

    @PostMapping("/documents")
    @PreAuthorize("hasRole('ADMIN')")
    public ResponseEntity<Void> ingestDocument(
            @RequestParam("file") MultipartFile file,
            @RequestParam String documentId) throws IOException {

        Resource resource = file.getResource();
        ingestionService.ingestPdf(resource, documentId);
        return ResponseEntity.accepted().build();
    }

    @PostMapping("/documents/text")
    @PreAuthorize("hasRole('ADMIN')")
    public ResponseEntity<Void> ingestText(@RequestBody @Valid TextIngestionRequest request) {
        ingestionService.ingestText(request.content(), request.title(), request.documentId());
        return ResponseEntity.accepted().build();
    }
}

public record QueryRequest(
    @NotBlank String question,
    @Nullable String documentId
) {}

public record TextIngestionRequest(
    @NotBlank String documentId,
    @NotBlank String title,
    @NotBlank String content
) {}
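
A quick smoke test of the query endpoint from Java using Spring's RestClient (the URL and question are illustrative):

RestClient client = RestClient.create("http://localhost:8080");

RagResponse response = client.post()
    .uri("/api/rag/query")
    .contentType(MediaType.APPLICATION_JSON)
    .body(new QueryRequest("What's the return policy for electronics?", null))
    .retrieve()
    .body(RagResponse.class);

// Print which documents the answer came from
response.sources().forEach(source -> System.out.println(source.source()));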

Step 5: Streaming Responses

For long answers, stream the response token-by-token (the endpoint below also needs the VectorStore and ChatClient injected into the controller):

@GetMapping(value = "/query/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> queryStream(@RequestParam String question) {
    // 1. Retrieve context (synchronous)
    List<Document> chunks = vectorStore.similaritySearch(
        SearchRequest.builder().query(question).topK(5).build());

    String context = chunks.stream()
        .map(Document::getText)
        .collect(Collectors.joining("\n\n"));

    // 2. Stream the LLM response
    return chatClient.prompt()
        .system("Answer based on the context provided.")
        .user(u -> u
            .text("Context: {context}\n\nQuestion: {question}")
            .param("context", context)
            .param("question", question))
        .stream()
        .content();
}

Clients receive SSE tokens as the LLM generates them — perceived latency drops dramatically.
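
Consuming the stream from a Java client with WebClient (a sketch; the question is illustrative):

WebClient client = WebClient.create("http://localhost:8080");

Flux<String> tokens = client.get()
    .uri(uriBuilder -> uriBuilder
        .path("/api/rag/query/stream")
        .queryParam("question", "What is the return window for electronics?")
        .build())
    .accept(MediaType.TEXT_EVENT_STREAM)
    .retrieve()
    .bodyToFlux(String.class);

// Print tokens as they arrive (blocking only for this demo)
tokens.doOnNext(System.out::print).blockLast();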

Evaluation: Is RAG Working?

@SpringBootTest
class RagEvaluationTest {

    @Autowired RagService ragService;
    @Autowired ChatClient evaluationClient;

    @Test
    void answerIsGroundedInContext() {
        // Known question with known answer in the docs
        RagResponse response = ragService.query("What is the return window for electronics?");

        assertThat(response.answer()).isNotBlank();
        assertThat(response.sources()).isNotEmpty();

        // LLM-as-judge: is the answer factually grounded?
        String judgment = evaluationClient.prompt()
            .user(u -> u.text("""
                Question: {question}
                Answer: {answer}

                Is this answer accurate and directly supported by factual content?
                Reply only YES or NO.
                """)
                .param("question", response.question())
                .param("answer", response.answer()))
            .call()
            .content();

        assertThat(judgment.trim().toUpperCase()).startsWith("YES");
    }
}
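
A cheaper complementary check in the same test class is a plain assertion against a fact you know appears in the documentation (the expected phrase below is a stand-in for whatever your policy document actually says):

@Test
void answerContainsKnownFact() {
    RagResponse response = ragService.query("What is the return window for electronics?");

    // Replace "30 days" with the value stated in your own docs
    assertThat(response.answer()).containsIgnoringCase("30 days");
}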

Production Considerations

Chunking strategy matters: Too small → not enough context. Too large → irrelevant text dilutes the signal. Start with roughly 512-token chunks and tune based on your document structure.

Embedding model consistency: Use the same model for ingestion and query. Never change the embedding model without re-ingesting all documents.

Similarity threshold: 0.7 is a good starting point. Lower = more (possibly irrelevant) results. Higher = fewer (but more precise) results.

Cost control: Each query makes two API calls: embedding the question (cheap) and calling the LLM (expensive). Cache frequent queries in Redis with a short TTL.
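
A minimal caching sketch using Spring's cache abstraction backed by Redis (the cache name and keying are illustrative; it assumes @EnableCaching is active and RagResponse serializes cleanly):

@Service
@RequiredArgsConstructor
public class CachedRagService {

    private final RagService ragService;

    // Identical questions are answered from Redis instead of
    // paying for another embedding + LLM round trip
    @Cacheable(cacheNames = "rag-answers", key = "#question")
    public RagResponse query(String question) {
        return ragService.query(question);
    }
}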

Re-ingestion: When documents change, delete old vectors by documentId and re-ingest:

Filter.Expression byDocument = new FilterExpressionBuilder()
    .eq("documentId", docId)
    .build();

vectorStore.delete(byDocument);
ingestionService.ingestPdf(updatedDoc, docId);

What You’ve Learned

  • RAG = retrieve relevant context from a vector store, inject into the LLM prompt → grounded answers
  • Spring AI 2.0 provides VectorStore, ChatClient, EmbeddingModel with pluggable backends
  • Ingestion pipeline: read → chunk (TokenTextSplitter) → embed → store in pgvector
  • Query pipeline: embed question → similarity search → build context → call LLM → return answer with sources
  • Stream LLM responses with chatClient.stream() and SSE for responsive UX
  • Evaluate RAG quality with LLM-as-judge in integration tests

Series Complete

This is the final article of the Spring Boot Tutorial series — 59 articles from “Hello World” to production-ready RAG AI applications.

Here’s what you’ve covered:

Part  Focus                          Articles
1     Getting Started                1–7
2     REST APIs                      8–14
3     Spring Data JPA                15–22
4     Spring Security                23–28
5     Testing                        29–33
6     Production-Ready Features      34–37
7     Performance                    38–41
8     Messaging                      42–45
9     Microservices                  46–51
10    Containers & Cloud             52–54
11    Spring Boot 4 & Modern Java    55–59

Return to the Spring Boot Tutorial index for the complete list.