Machine Learning Book Recommender (Portfolio)
An end‑to‑end project exploring data cleaning, recommendation modeling, and deployment.
Project Note: This app uses the Book‑Crossing dataset as a starting point.
The raw data contained significant noise and inconsistencies. I implemented extensive cleaning and
enrichment (e.g., standardized metadata, subject mapping), but a small number of errors may still remain.
The app exists to showcase the ML pipeline and engineering, not perfect catalog data.
Chat with the Librarian (AI Assistant)
The chatbot is a multi-turn virtual librarian powered by LangGraph,
able to chat about books, recommend titles, answer questions about the site,
and search the web for new or trending reads.
What it can do
- Perform catalog-grounded semantic search using enriched subject and description vectors.
- Generate personalized recommendations via internal tools (ALS, FAISS, Bayesian popularity, subject embeddings); a sketch of the popularity scorer follows this list.
- Explain why each book fits your interests based on metadata and curation reasoning.
- Answer help and onboarding questions using internal documentation.
- Search the web to suggest new or trending books outside the catalog.
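These tools are internal to the app, but to give a flavor of one of them, the sketch below shows a generic Bayesian popularity scorer that shrinks each book's average rating toward the catalog-wide mean so that sparsely rated titles don't dominate. The column names and prior strength are illustrative assumptions, not the app's actual implementation.

```python
import pandas as pd

def bayesian_popularity(ratings: pd.DataFrame, prior_strength: float = 50.0) -> pd.DataFrame:
    """Score books by a Bayesian (shrunken) average rating.

    ratings: DataFrame with columns ['book_id', 'rating'] (assumed schema).
    prior_strength: pseudo-count of "virtual" ratings at the global mean;
                    higher values pull sparsely rated books harder toward it.
    """
    global_mean = ratings["rating"].mean()
    per_book = ratings.groupby("book_id")["rating"].agg(["count", "mean"])

    # Weighted blend of each book's mean with the global mean:
    # score = (n * book_mean + C * global_mean) / (n + C)
    per_book["score"] = (
        per_book["count"] * per_book["mean"] + prior_strength * global_mean
    ) / (per_book["count"] + prior_strength)

    return per_book.sort_values("score", ascending=False)

# Example effect: a book with two perfect ratings ranks below a book with
# hundreds of ratings averaging slightly lower, which is usually what you want.
```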
How it works
- The system runs on LangGraph, orchestrating multiple agents through a structured reasoning flow (a minimal routing sketch follows this list).
- A central Router Agent interprets the user’s intent and directs the request to the most appropriate branch.
- Each branch operates autonomously—handling conversation, documentation retrieval, web search, or recommendation—before returning results to the router for response synthesis.
- The recommendation branch is itself composed of two agents:
one for candidate generation using internal models and semantic retrieval,
and another for curation and explanation of the selected results.
- All branches share a multi-turn memory that preserves context across interactions,
allowing the assistant to adapt to the ongoing conversation.
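The sketch below is a stripped-down version of that routing flow, using LangGraph's StateGraph with a router node and conditional edges into two branches. The state schema, intent detection, and branch bodies are placeholders rather than the production agents.

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict):
    messages: List[str]   # simplified; the real app keeps richer multi-turn memory
    intent: str

def router(state: ChatState) -> ChatState:
    # Placeholder intent detection; the real Router Agent uses an LLM.
    last = state["messages"][-1].lower()
    intent = "recommend" if "recommend" in last else "chat"
    return {**state, "intent": intent}

def recommend_branch(state: ChatState) -> ChatState:
    return {**state, "messages": state["messages"] + ["<candidate generation + curation>"]}

def chat_branch(state: ChatState) -> ChatState:
    return {**state, "messages": state["messages"] + ["<conversational reply>"]}

graph = StateGraph(ChatState)
graph.add_node("router", router)
graph.add_node("recommend", recommend_branch)
graph.add_node("chat", chat_branch)

graph.add_edge(START, "router")
# Conditional edges send the request to whichever branch the router chose.
graph.add_conditional_edges("router", lambda s: s["intent"],
                            {"recommend": "recommend", "chat": "chat"})
graph.add_edge("recommend", END)
graph.add_edge("chat", END)

app = graph.compile()
result = app.invoke({"messages": ["Can you recommend a cozy mystery?"], "intent": ""})
```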
Roadmap
Planned improvements focus on deeper reasoning and finer modularity within the recommendation system.
- Introduce a Planner Agent to guide multi-step reasoning before routing or tool use.
- Expand the Recsys branch into four specialized stages:
planner → candidate generation → curation → explanation.
- Add a Dialogue Manager for better intent tracking and long-session coordination.
- Explore improved context summarization for extended multi-turn memory.
Information Enrichment
Before books are embedded for semantic search, a dedicated Enrichment Agent runs as an offline job
that refines and filters catalog data. Its goal is to transform raw metadata into clean, expressive text
that better represents each book’s content and mood.
Each record is reformatted into a compact structure optimized for LLM embeddings—combining
title, author, subjects, tone, genre, and vibe into a single enriched description.
This helps downstream models capture nuance and thematic similarity more effectively.
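As a rough illustration of that compact structure, the enriched text could be assembled along the lines below; the field names mirror the description above, but the exact template and schema are assumptions.

```python
from dataclasses import dataclass

@dataclass
class EnrichedBook:
    title: str
    author: str
    subjects: list[str]
    tone: str
    genre: str
    vibe: str

def to_embedding_text(book: EnrichedBook) -> str:
    """Flatten an enriched record into one compact passage for embedding."""
    return (
        f"{book.title} by {book.author}. "
        f"Genre: {book.genre}. Subjects: {', '.join(book.subjects)}. "
        f"Tone: {book.tone}. Vibe: {book.vibe}."
    )
```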
The enrichment pipeline runs on a Kafka-based workflow with tiered data quality handling,
ensuring that books with varying metadata completeness receive appropriate processing depth.
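A minimal sketch of that tiering step, assuming kafka-python and hypothetical topic names, might look like this; the real pipeline's tier criteria and topics may differ.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "books.raw",                                  # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def tier_for(record: dict) -> str:
    """Send records with complete metadata to light processing, sparse ones to deep enrichment."""
    present = sum(bool(record.get(k)) for k in ("description", "subjects", "author"))
    return "books.enrich.light" if present == 3 else "books.enrich.deep"

for message in consumer:
    book = message.value
    producer.send(tier_for(book), value=book)     # per-tier enrichment workers consume downstream
```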
Two Spark jobs handle the output: one ingests enriched data into SQL for quick retrieval,
while another archives raw enrichment objects in a data lake for versioned storage.
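In PySpark, the two jobs could look roughly like the sketch below; the paths, table names, and JDBC settings are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("enrichment-output").getOrCreate()

# Both jobs read the enriched records produced by the Kafka workflow.
enriched = spark.read.json("s3://example-bucket/enriched/")        # placeholder path

# Job 1: ingest a query-friendly projection into SQL for quick retrieval.
# (Requires the appropriate JDBC driver on the Spark classpath.)
(enriched
    .select("book_id", "title", "author", "enriched_description")
    .write
    .jdbc(url="jdbc:postgresql://localhost:5432/books",            # placeholder connection
          table="enriched_books",
          mode="append",
          properties={"user": "app", "password": "secret"}))

# Job 2: archive the full raw enrichment objects to the data lake,
# partitioned by date so each enrichment run stays versioned.
(enriched
    .withColumn("ingest_date", F.current_date())
    .write
    .mode("append")
    .partitionBy("ingest_date")
    .parquet("s3://example-bucket/lake/enrichment_raw/"))
```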
Finally, an incremental embedding job encodes new or updated books as they’re enriched,
so the chatbot always works with fresh semantic vectors without needing to rebuild the entire index.
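A simplified version of that incremental job is sketched below, assuming sentence-transformers for encoding and a FAISS IndexIDMap keyed by book id; the model name, paths, and record schema are placeholders.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

INDEX_PATH = "book_index.faiss"                    # placeholder path
model = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder embedding model

def update_index(new_books: list[dict]) -> None:
    """Encode newly enriched books and append them to the existing index."""
    # Assumes the index was created as faiss.IndexIDMap(faiss.IndexFlatIP(dim)).
    index = faiss.read_index(INDEX_PATH)

    texts = [b["enriched_description"] for b in new_books]
    ids = np.array([b["book_id"] for b in new_books], dtype="int64")

    vectors = model.encode(texts, normalize_embeddings=True)
    vectors = np.asarray(vectors, dtype="float32")

    # For updated books, drop any stale vectors before re-adding them.
    index.remove_ids(ids)
    index.add_with_ids(vectors, ids)

    faiss.write_index(index, INDEX_PATH)
```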