Maven Bookshelf challenge

An app based on Goodreads data (books and reviews), designed to help users search for titles and discover new reads.

About the Project

Sources:

  • Demo: https://py.cafe/app/marie-anne/maven-bookshelf
    (Limited dataset covering 1980–1990 due to restricted online resources; approx. 400 books and their reviews. Please allow a few seconds for the app to load. Popular authors from the 1980s include Terry Pratchett and Sue Grafton.)
  • GitHub: https://github.com/tigi/maven-bookshelf
    (The code is compatible with the full dataset, provided the two initial .csv files are placed in the root directory.)

Assignment

Design a tool that helps users build their ideal summer reading list based on Goodreads data.

Concept

A single-page web application where users enter one or more keywords to receive the best matching book recommendations.
“Best matching” in this context has two meanings:

  • Exact matches, such as author names or book titles.
  • Relevant suggestions based on one or more keywords, using AI-powered natural language processing (NLP) to understand context and meaning.

In addition, simply browsing a book database has limited value if users can’t do anything with the results. That’s why the option to download selected titles was included.

Use of NLP

Natural language processing (NLP) makes it possible to build a model that returns the most relevant results for any given keyword input. I used AI to help identify the most suitable approach for this use case, which resulted in a method that balances performance and recommendation quality.

The suggested options:

Option A – TF-IDF (fast and simple)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(books_df["full_text"])
```
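
The snippet above only builds the matrix; a minimal sketch of the query step might look like this (the keyword string and the top-20 cut-off are my own illustration, not the app's exact code):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Vectorize the user's keywords with the fitted vocabulary, then rank
# all books by cosine similarity to the query.
query_vec = vectorizer.transform(["summer thriller beach"])
scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
top_20 = books_df.iloc[scores.argsort()[::-1][:20]]
```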

Option B – Sentence Embeddings (better semantics)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(books_df["full_text"], show_progress_bar=True)
```
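
For completeness, querying the embeddings works along the same lines. This is a sketch using the library's built-in semantic search helper, not the code from my test application:

```python
from sentence_transformers import util

# Embed the keywords, then retrieve the most similar books by cosine
# similarity over the precomputed embeddings.
query_embedding = model.encode("venice love story")
hits = util.semantic_search(query_embedding, embeddings, top_k=20)[0]
results = books_df.iloc[[hit["corpus_id"] for hit in hits]]
```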

Approach

I built a basic test application to evaluate the options. Option A proved to be fast and effective, while Option B was too resource-intensive for my laptop, especially with a full dataset that included over 300MB of review text. Additionally, the search results from Option A were more relevant. Based on these factors, I decided to proceed with Option A.

The following fields were used, both in the early test application and in the final version, to construct the relevance matrix:

```python
# Weighted concatenation: genres count twice and author names four times,
# so matches on those fields score higher in the TF-IDF matrix.
books_df["text"] = (
    (books_df["original_title_lower"] + " ")
    + (books_df["genres_lower"] + " ") * 2
    + books_df["description_lower"] + " "
    + (books_df["author_lower"] + " ") * 4
    + books_df["review_text_lower"]
)
```

The review_text_lower column contains a concatenated string of all user reviews for each book. Initially, certain input fields were weighted more heavily to influence relevance scoring, but this priority system is now largely redundant, since I later decided that direct searches by author or title should take precedence over general keyword-based queries.

Layout and Design

The app follows a common design pattern often used in similar applications — nothing fancy, just functional. It is responsive, though small screens could still benefit from some refinement (e.g., hiding certain elements using display: none). The app title on py.cafe was slightly modified to reflect the limited dataset, so it no longer fits the layout perfectly.

Styling is based on Dash Bootstrap Components and the Vizro theme, with minimal custom CSS. Users can toggle between light and dark modes.
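
For reference, the basic setup might look like the sketch below. It assumes a Vizro version that exposes its Bootstrap stylesheet as `vizro.bootstrap`, and it uses the standard Bootstrap `data-bs-theme` switch pattern from the Dash Bootstrap Components documentation; the actual app may wire this up differently:

```python
import dash_bootstrap_components as dbc
import vizro
from dash import Dash, Input, Output, clientside_callback

app = Dash(__name__, external_stylesheets=[vizro.bootstrap])

color_mode_switch = dbc.Switch(id="color-mode-switch", label="Dark mode", value=False)

# Toggle Bootstrap's data-bs-theme attribute client-side, so the whole
# page switches between light and dark without a server round-trip.
clientside_callback(
    """
    (dark) => {
        document.documentElement.setAttribute("data-bs-theme", dark ? "dark" : "light");
        return window.dash_clientside.no_update;
    }
    """,
    Output("color-mode-switch", "id"),
    Input("color-mode-switch", "value"),
)
```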

Bug: On mobile devices, the keyboard does not automatically disappear after pressing Enter. Tapping above the keyboard dismisses it, but this behavior isn’t ideal and should be addressed. I may fix this in an upcoming update, as it’s not expected behavior for this type of app.

Why keep the genre filter?

During testing, it became clear that selecting a genre often led to surprising and inspiring results. While the filter could technically be disabled on smaller screens (in which case the app defaults to the “all genres” option), it adds enough value to justify keeping it in the interface.

Why offer a CSV download?

While it’s technically easy to integrate links to online book retailers, I encourage users to support their local bookstores or libraries. The downloadable .csv file allows users to mark favorites and save a personalized reading list.
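
In Dash, such a download only takes a small callback. Here is a hypothetical sketch in which `favourites_df` stands in for whatever structure holds the user's marked favorites:

```python
from dash import Input, Output, dcc, html

app.layout = html.Div([
    html.Button("Download reading list", id="download-btn"),
    dcc.Download(id="download-csv"),
])

@app.callback(
    Output("download-csv", "data"),
    Input("download-btn", "n_clicks"),
    prevent_initial_call=True,
)
def download_reading_list(n_clicks):
    # Serialize the user's selection to CSV and trigger a browser download.
    return dcc.send_data_frame(favourites_df.to_csv, "reading_list.csv", index=False)
```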

I briefly considered adding an email option, but given the current hosting environment for the demo, it’s likely that such emails would be blocked by spam filters.

Considerations: Speed, Resources, Quality, and NLP

Applying Natural Language Processing (NLP) significantly slows down the search process, and keyword-based relevance matrices don’t always return the most intuitive results. For example, “Lee Child” — a well-known thriller author — didn’t appear in the top 20 results, which led to the decision to prioritize direct searches on author and title (based on column values) over NLP-driven results.

Title-based searches are a bit more nuanced. For instance, an NLP search for “Venice” ranks Tod in Venedig by Thomas Mann highly, due to the German word Venedig. When prioritizing a direct title match, that result is excluded. These kinds of trade-offs have been incorporated into the current search strategy.
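
Put together, the current strategy can be sketched roughly as follows. The function and column names mirror the snippets above, but this is an illustration under my own assumptions rather than the app's exact code:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def search(query, books_df, vectorizer, tfidf_matrix, top_n=20):
    q = query.lower().strip()

    # 1. Direct substring matches on author or title take precedence.
    direct = books_df[
        books_df["author_lower"].str.contains(q, regex=False, na=False)
        | books_df["original_title_lower"].str.contains(q, regex=False, na=False)
    ]

    # 2. TF-IDF relevance fills the remaining slots, skipping duplicates.
    scores = cosine_similarity(vectorizer.transform([q]), tfidf_matrix).flatten()
    ranked = books_df.iloc[scores.argsort()[::-1]].drop(index=direct.index)

    return pd.concat([direct, ranked]).head(top_n)
```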

On Reviews

While review texts are included in the TF-IDF model, they are not prominently displayed in the interface. Personally, I tend to read reviews only out of curiosity, when a book I either loved or disliked receives an unexpectedly low or high rating. For most users, an average rating offers sufficient insight. The same applies to algorithmic recommendations (“you read/watched this, so you might like…”): these are often inaccurate or commercially driven, particularly when based on recent releases.

Development Process

ChatGPT helped me get started, while Claude Opus 4 (via Openrouter.ai) handled much of the time-consuming work. Total cost: approximately $8.

I worked iteratively: I would clearly describe the functionality I needed, apply the changes (after saving the previous version), and test thoroughly. When making layout changes, I occasionally shared the full code again. I only requested a complete regenerated codebase twice — the most resource-intensive part of the process.

Total time spent from initial idea to final version: approximately 16 hours.

Tools Used

  • Dash Bootstrap Components with the Vizro theme
  • ChatGPT (free version)
  • Claude Opus 4 (paid via Openrouter.ai)
  • Spyder
  • Logo icon sourced from The Noun Project (with customized colors)
