Machine Learning Book Recommender (Portfolio)
An end‑to‑end project exploring data cleaning, recommendation modeling, and deployment.
Project Note: This app uses the Book‑Crossing dataset as a starting point.
The raw data contained significant noise and inconsistencies. I implemented extensive cleaning and
enrichment (e.g., standardized metadata, subject mapping), but a small number of errors may still remain.
The app exists to showcase the ML pipeline and engineering, not perfect catalog data.
Chat with the Librarian (AI Assistant)
The chatbot is a multi-turn virtual librarian powered by LangGraph,
able to chat about books, recommend titles, answer questions about the site,
and search the web for new or trending reads.
What it can do
- Perform catalog-grounded semantic search using enriched subject and description vectors.
- Generate personalized recommendations via internal tools (ALS, FAISS, Bayesian popularity, subject embeddings); a sketch of the popularity scorer follows this list.
- Explain why each book fits your interests based on metadata and curation reasoning.
- Answer help and onboarding questions using internal documentation.
- Search the web to suggest new or trending books outside the catalog.
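These tools are internal to the app, but to give a flavor of one of them, the sketch below shows a generic Bayesian popularity scorer that shrinks each book's average rating toward the catalog-wide mean so that sparsely rated titles don't dominate. The column names and prior strength are illustrative assumptions, not the app's actual implementation.

```python
import pandas as pd

def bayesian_popularity(ratings: pd.DataFrame, prior_strength: float = 50.0) -> pd.DataFrame:
    """Score books by a Bayesian (shrunken) average rating.

    ratings: DataFrame with columns ['book_id', 'rating'] (assumed schema).
    prior_strength: pseudo-count of "virtual" ratings at the global mean;
                    higher values pull sparsely rated books harder toward it.
    """
    global_mean = ratings["rating"].mean()
    per_book = ratings.groupby("book_id")["rating"].agg(["count", "mean"])

    # Weighted blend of each book's mean with the global mean:
    # score = (n * book_mean + C * global_mean) / (n + C)
    per_book["score"] = (
        per_book["count"] * per_book["mean"] + prior_strength * global_mean
    ) / (per_book["count"] + prior_strength)

    return per_book.sort_values("score", ascending=False)

# Example effect: a book with two perfect ratings ranks below a book with
# hundreds of ratings averaging slightly lower, which is usually what you want.
```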
How it works
- The system runs on LangGraph, orchestrating multiple agents through a structured reasoning flow (a minimal routing sketch follows this list).
- A central Router Agent interprets the user’s intent and directs the request to the most appropriate branch.
- Each branch operates autonomously—handling conversation, documentation retrieval, web search, or recommendation—before returning results to the router for response synthesis.
- The recommendation branch is itself composed of two agents:
one for candidate generation using internal models and semantic retrieval,
and another for curation and explanation of the selected results.
- All branches share a multi-turn memory that preserves context across interactions,
allowing the assistant to adapt to the ongoing conversation.
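The sketch below is a stripped-down version of that routing flow, using LangGraph's StateGraph with a router node and conditional edges into two branches. The state schema, intent detection, and branch bodies are placeholders rather than the production agents.

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict):
    messages: List[str]   # simplified; the real app keeps richer multi-turn memory
    intent: str

def router(state: ChatState) -> ChatState:
    # Placeholder intent detection; the real Router Agent uses an LLM.
    last = state["messages"][-1].lower()
    intent = "recommend" if "recommend" in last else "chat"
    return {**state, "intent": intent}

def recommend_branch(state: ChatState) -> ChatState:
    return {**state, "messages": state["messages"] + ["<candidate generation + curation>"]}

def chat_branch(state: ChatState) -> ChatState:
    return {**state, "messages": state["messages"] + ["<conversational reply>"]}

graph = StateGraph(ChatState)
graph.add_node("router", router)
graph.add_node("recommend", recommend_branch)
graph.add_node("chat", chat_branch)

graph.add_edge(START, "router")
# Conditional edges send the request to whichever branch the router chose.
graph.add_conditional_edges("router", lambda s: s["intent"],
                            {"recommend": "recommend", "chat": "chat"})
graph.add_edge("recommend", END)
graph.add_edge("chat", END)

app = graph.compile()
result = app.invoke({"messages": ["Can you recommend a cozy mystery?"], "intent": ""})
```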
Roadmap
Planned improvements focus on deeper reasoning and finer modularity within the recommendation system.
- Introduce a Planner Agent to guide multi-step reasoning before routing or tool use.
- Expand the Recsys branch into four specialized stages:
planner → candidate generation → curation → explanation.
- Add a Dialogue Manager for better intent tracking and long-session coordination.
- Explore improved context summarization for extended multi-turn memory.
Information Enrichment
Before books are embedded for semantic search, a dedicated Enrichment Agent runs as an offline job
that refines and filters catalog data. Its goal is to transform raw metadata into clean, expressive text
that better represents each book’s content and mood.
Each record is reformatted into a compact structure optimized for LLM embeddings—combining
title, author, subjects, tone, genre, and vibe into a single enriched description.
This helps downstream models capture nuance and thematic similarity more effectively.
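As a rough illustration of that compact structure, the enriched text could be assembled along the lines below; the field names mirror the description above, but the exact template and schema are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EnrichedBook:
    title: str
    author: str
    subjects: list[str]
    tone: str
    genre: str
    vibe: str

def to_embedding_text(book: EnrichedBook) -> str:
    """Flatten an enriched record into one compact passage for embedding."""
    return (
        f"{book.title} by {book.author}. "
        f"Genre: {book.genre}. Subjects: {', '.join(book.subjects)}. "
        f"Tone: {book.tone}. Vibe: {book.vibe}."
    )
```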
The enrichment pipeline runs on a Kafka-based workflow with tiered data quality handling,
ensuring that books with varying metadata completeness receive appropriate processing depth.
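A minimal sketch of that tiering step, assuming kafka-python and hypothetical topic names, might look like this; the real pipeline's tier criteria and topics may differ.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "books.raw",                                  # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def tier_for(record: dict) -> str:
    """Send records with complete metadata to light processing, sparse ones to deep enrichment."""
    present = sum(bool(record.get(k)) for k in ("description", "subjects", "author"))
    return "books.enrich.light" if present == 3 else "books.enrich.deep"

for message in consumer:
    book = message.value
    producer.send(tier_for(book), value=book)     # per-tier enrichment workers consume downstream
```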
Two Spark jobs handle the output: one ingests enriched data into SQL for quick retrieval,
while another archives raw enrichment objects in a data lake for versioned storage.
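In PySpark, the two jobs could look roughly like the sketch below; the paths, table names, and JDBC settings are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("enrichment-output").getOrCreate()

# Both jobs read the enriched records produced by the Kafka workflow.
enriched = spark.read.json("s3://example-bucket/enriched/")        # placeholder path

# Job 1: ingest a query-friendly projection into SQL for quick retrieval.
# (Requires the appropriate JDBC driver on the Spark classpath.)
(enriched
    .select("book_id", "title", "author", "enriched_description")
    .write
    .jdbc(url="jdbc:postgresql://localhost:5432/books",            # placeholder connection
          table="enriched_books",
          mode="append",
          properties={"user": "app", "password": "secret"}))

# Job 2: archive the full raw enrichment objects to the data lake,
# partitioned by date so each enrichment run stays versioned.
(enriched
    .withColumn("ingest_date", F.current_date())
    .write
    .mode("append")
    .partitionBy("ingest_date")
    .parquet("s3://example-bucket/lake/enrichment_raw/"))
```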
Finally, an incremental embedding job encodes new or updated books as they’re enriched,
so the chatbot always works with fresh semantic vectors without needing to rebuild the entire index.
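A simplified version of that incremental job is sketched below, assuming sentence-transformers for encoding and a FAISS IndexIDMap keyed by book id; the model name, paths, and record schema are placeholders.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

INDEX_PATH = "book_index.faiss"                    # placeholder path
model = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder embedding model

def update_index(new_books: list[dict]) -> None:
    """Encode newly enriched books and append them to the existing index."""
    # Assumes the index was created as faiss.IndexIDMap(faiss.IndexFlatIP(dim)).
    index = faiss.read_index(INDEX_PATH)

    texts = [b["enriched_description"] for b in new_books]
    ids = np.array([b["book_id"] for b in new_books], dtype="int64")

    vectors = model.encode(texts, normalize_embeddings=True)
    vectors = np.asarray(vectors, dtype="float32")

    # For updated books, drop any stale vectors before re-adding them.
    index.remove_ids(ids)
    index.add_with_ids(vectors, ids)

    faiss.write_index(index, INDEX_PATH)
```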