Audible: HTML refresh, multi-narrator & works dedup

Switch nightly discovery refresh to scrape Audible's curated HTML storefronts (popular, new releases, category pages) while keeping real-time user paths on the JSON catalog API. Add robust HTML resilience knobs (increased retries, capped jittered backoff, AdaptivePacer changes and per-batch cooldowns) to avoid failing nightly jobs during 503 storms. Implement multi-narrator capture via a new extractAllNarrators helper and update parsers to preserve all narrator anchors. Introduce two-pass dedup: in-memory deduplicateAndCollectGroups + collapseByExistingWorks that consults the works table, export metadataScore for consistent representative selection, and persist dedup groups (fire-and-forget). Wire collapseByExistingWorks into search/author/series routes and make defensive dedup in the refresh processor. Add HTML parsing helpers, runtime/lang-aware parsing, jitteredBackoff cap, and tests for the new behaviors.
This commit is contained in:
kikootwo
2026-05-14 15:23:15 -04:00
parent 5f0855b2f8
commit fcae3bcf09
17 changed files with 1241 additions and 214 deletions
+2
View File
@@ -45,6 +45,8 @@
- **Web scraping (popular, new releases)** → [integrations/audible.md](integrations/audible.md)
- **Database caching, real-time matching** → [integrations/audible.md](integrations/audible.md)
- **Book covers API for login page** → [frontend/pages/login.md](frontend/pages/login.md)
- **Dedup & works table (cross-ASIN identity)** → [integrations/audible.md](integrations/audible.md#dedup--works-table)
- **Multi-narrator capture in HTML scrapers** → [integrations/audible.md](integrations/audible.md#narrator-capture-in-html-scrapers)
## E-book Support (First-Class)
- **First-class ebook requests, separate tracking** → [integrations/ebook-sidecar.md](integrations/ebook-sidecar.md)
+83 -21
View File
@@ -1,29 +1,40 @@
# Audible Integration
**Status:** Implemented | Unauthenticated Audible JSON catalog API (primary) + Audnexus API (per-ASIN details)
**Status:** Implemented | Hybrid — curated HTML for discovery refresh + Audible JSON catalog API for user-facing real-time + Audnexus for per-ASIN details
## Overview
Audiobook metadata for discovery, search, and detail pages. All catalog operations (search, popular, new releases, categories, category books, author books, single-product details) now call Audible's unauthenticated public JSON catalog API (`api.audible.<tld>/1.0/catalog/*`). Per-ASIN detail lookups prefer Audnexus; the catalog API is used as fallback.
Audiobook metadata for discovery, search, and detail pages. Split by access pattern:
- **Nightly discovery refresh** (popular / new releases / category lists) — scraped from Audible's **curated HTML storefronts** (`www.audible.<tld>/adblbestsellers`, `/newreleases`, `/search?node=<id>`). The HTML pages reflect Audible's own editorial picks.
- **User-facing real-time** (search, author books, categories listing, per-ASIN details) — Audible's unauthenticated public **JSON catalog API** (`api.audible.<tld>/1.0/catalog/*`).
- **Per-ASIN detail lookups** — Audnexus (`api.audnex.us/books/{asin}`) primary; catalog API used as fallback when Audnexus returns 404.
## Architecture
- **Primary data source:** Audible JSON catalog API, same endpoint used by the official Audible mobile apps. No authentication, no API key, no user credentials, no special headers.
- **Per-ASIN details:** Audnexus (`api.audnex.us/books/{asin}`) remains primary; catalog API (`/1.0/catalog/products/{asin}`) is the fallback when Audnexus returns 404.
- **HTML scraping:** Removed from `audible.service.ts`. The only remaining HTML path is `audible-series.ts` (series-page scraping, out of scope).
- **`www.audible.<tld>`:** Still used by `audible-series.ts` and by `getBaseUrl()` for "View on Audible" link generation. Not used for any catalog operation.
- **Curated HTML (refresh job only):** the three methods called solely by `audible-refresh.processor.ts` (`getPopularAudiobooks`, `getNewReleases`, `getCategoryBooks`) scrape Audible's storefront HTML to inherit editorial curation. Beefed-up retry/backoff knobs (12 retries, 3-min jittered cap) handle 503 storms patiently on the nightly job without slowing healthy users.
- **JSON catalog API (real-time):** `search`, `searchByAuthorAsin`, `getCategories` (categories listing), and `fetchAudibleDetailsFromApi` (per-ASIN fallback). Same endpoint used by the official Audible mobile apps. No authentication, no API key, no user credentials, no special headers.
- **Audnexus (per-ASIN):** `getAudiobookDetails` and `getRuntime` prefer Audnexus, with catalog API fallback for `getAudiobookDetails`.
- **`www.audible.<tld>`:** Used by HTML refresh scraping, by `audible-series.ts`, and by `getBaseUrl()` for "View on Audible" link generation.
## Data Sources
All catalog operations are HTTP GET against `{apiBaseUrl}` (region-dependent, e.g. `https://api.audible.com`):
### Nightly refresh (HTML — `htmlClient`, baseURL `www.audible.<tld>`)
| Operation | Endpoint | Key params |
|---|---|---|
| Popular | `/adblbestsellers` | `pageSize=50`, `page=<n>` (omitted on first page) |
| New releases | `/newreleases` | `pageSize=50`, `page=<n>` (omitted on first page) |
| Category books | `/search` | `node=<categoryId>&pageSize=50&sort=popularity-rank&page=<n>` |
Parsed via cheerio. Selectors: `.productListItem` (popular/new releases), `.s-result-item, .productListItem` (categories).
### Real-time (JSON catalog API — `apiClient`, baseURL `api.audible.<tld>`)
| Operation | Endpoint | Key params |
|---|---|---|
| Search | `/1.0/catalog/products` | `keywords=<q>` |
| Author books | `/1.0/catalog/products` | `author=<name>` (name, NOT ASIN) |
| Popular | `/1.0/catalog/products` | `products_sort_by=BestSellers` |
| New releases | `/1.0/catalog/products` | `products_sort_by=-ReleaseDate` |
| Category books | `/1.0/catalog/products` | `category_id=<id>&products_sort_by=BestSellers` |
| Categories listing | `/1.0/catalog/categories` | (none) |
| Single product | `/1.0/catalog/products/{asin}` | — |
| Audnexus (per-ASIN) | `https://api.audnex.us/books/{asin}` | `region={audnexusParam}` |
@@ -48,20 +59,20 @@ Populates every `AudibleAudiobook` field. Covered:
## Gotchas
- **Catalog API cannot filter preorders or surface curated bestsellers.** The API's `BestSellers` sort is a right-now velocity rank that spikes on launch-day promos and preorder windows; the `-ReleaseDate` sort returns 100% future preorders. There is no server-side `release_time`, `released-only`, `customer_rights`, or alternate sort (`Reviewed`, `MostListened`, etc.) — every plausible variant was tested and silently ignored. This is why the nightly refresh job uses the curated HTML storefront pages instead.
- **`author=` takes a name, not an ASIN.** The catalog API has no ASIN-based author param. `searchByAuthorAsin()` queries by name, then filters client-side: keeps only products where `products[].authors[].asin === authorAsin`. Preserves ASIN-authoritative author identity. Also filters by `product.language` via `isAcceptedLanguage()` for the configured region.
- **Invalid ASIN returns HTTP 200 with stub body.** `/1.0/catalog/products/{asin}` responds 200 with `{product: {asin: INPUT}}` and no other fields. `fetchAudibleDetailsFromApi()` detects this via missing `product.title` and returns `null`.
- **`publisher_summary` is HTML.** Service strips tags via inline `stripHtml()` helper (regex-based, no cheerio) before populating `description`. Falls back to `merchandising_summary` (plain text) if `publisher_summary` missing.
- **Series is an array.** `products[].series[]` — a book may belong to multiple series. Service picks the first entry with non-empty `sequence`, else the first entry. `sequence` is cleaned by extracting first `/\d+(?:\.\d+)?/` match for numeric ordering.
- **Stub `product_images`:** cover URL reads from `product_images['500']`; missing keys fall back to `undefined`.
- **`page` is 0-indexed.** Despite the default value appearing to be 1, the API returns items `(page * num_results)` through `((page + 1) * num_results - 1)`. So `page=1` fetches items 51100, not 150. All service methods accept a 1-indexed `page` and subtract 1 at the axios call. The symptom of getting this wrong is silent: queries whose `total_results ≤ num_results` return an empty `products` array while `total_results` is populated (e.g. author searches for small catalogues).
- **`page` is 0-indexed (catalog API only).** Despite the default value appearing to be 1, the API returns items `(page * num_results)` through `((page + 1) * num_results - 1)`. So `page=1` fetches items 51100, not 150. All catalog-API service methods accept a 1-indexed `page` and subtract 1 at the axios call. The symptom of getting this wrong is silent: queries whose `total_results ≤ num_results` return an empty `products` array while `total_results` is populated (e.g. author searches for small catalogues). HTML paths use Audible's native 1-indexed `page` query param and omit it on the first page.
## Rate Limiting & Resilience
- 503s still possible but dramatically less frequent than the HTML surface.
- `fetchWithRetry()` — jittered exponential backoff, 5 retries, retries on 503/429/5xx.
- `AdaptivePacer` circuit-breaker preserved.
- Inter-page base delay on API paths: **5001500ms** (down from 20004000ms for HTML).
- API responses include `Cache-Control: private, max-age=1800`.
- **Real-time JSON API paths:** 503s are uncommon. `fetchWithRetry()` uses jittered exponential backoff, 5 retries, retries on 503/429/5xx. API responses include `Cache-Control: private, max-age=1800`.
- **Nightly HTML refresh paths:** 503s are more likely (HTML storefront is more rate-sensitive). Same `fetchWithRetry()`, but with `HTML_MAX_RETRIES=12` and `HTML_MAX_BACKOFF_MS=180_000` (3-minute cap on jittered backoff). Healthy refreshes still complete fast (per-page success on attempt 0); users hit by sustained 503 storms grind through patiently rather than abandoning the refresh.
- **`AdaptivePacer`** — inter-page delay 24 s baseline, scales up multiplicatively under retry pressure, with a 4560 s circuit-breaker cooldown after 3 consecutive retry-pages.
- **Per-batch cooldowns** in `audible-refresh.processor.ts` — 1530 s between popular/new-releases, 1020 s between categories.
## Region Configuration
@@ -101,8 +112,8 @@ Configurable Audible region for accurate metadata matching across international
- Automatic refresh: Region change triggers `audible_refresh` job.
**Per-region HTTP clients (on init):**
- `apiClient``baseURL=apiBaseUrl`, `Accept: application/json`, `User-Agent: ReadMeABook/1.0`, no language/ipRedirect params.
- `htmlClient``baseURL=baseUrl`, browser headers, default params `ipRedirectOverride=true` + `language=<audibleLocaleParam>`. Used only by `audible-series.ts` and `getBaseUrl()`-based link generation.
- `apiClient``baseURL=apiBaseUrl`, `Accept: application/json`, `User-Agent: ReadMeABook/1.0`, no language/ipRedirect params. Used for the real-time JSON catalog operations (search, author books, categories listing, per-ASIN details fallback).
- `htmlClient``baseURL=baseUrl`, rotating browser headers (`pickUserAgent` + `getBrowserHeaders`), default params `ipRedirectOverride=true` + `language=<audibleLocaleParam>`. Used by the nightly discovery refresh (`/adblbestsellers`, `/newreleases`, `/search?node=...`), by `audible-series.ts`, and by `getBaseUrl()`-based link generation.
- Audnexus calls include `region=<audnexusParam>`.
**Files:**
@@ -130,6 +141,44 @@ Single matching algorithm used everywhere (search, popular, new-releases, jobs).
**Note:** Fuzzy matching (70% threshold) is preserved in `ranking-algorithm.ts` for Prowlarr torrent ranking. Library availability checks require exact ASIN matches only.
## Dedup & Works Table
**Status:** ✅ Implemented | Two-pass dedup on every discovery view + cross-batch identity via works table
Discovery views (search, author books, series detail) collapse duplicate Audible listings for the same recording (publisher re-listings, regional re-issues, full-cast vs single-narrator productions) into a single card. Two passes run in sequence:
1. **Local pass — `deduplicateAndCollectGroups()`** (`src/lib/utils/deduplicate-audiobooks.ts`)
- Stateless, in-memory. Keys books by normalized title + sorted narrator set + duration (±max(5%, 10 min) tolerance), with subtitle compatibility to keep distinct series entries separate.
- Picks a canonical representative per group by `metadataScore()` (cover + rating + duration + description + narrator + release date + genres).
- Emits `DedupGroup[]` describing every multi-ASIN collapse → handed to `persistDedupGroups()` for the works table.
2. **Works pass — `collapseByExistingWorks()`** (`src/lib/services/works.service.ts`)
- Async DB lookup. Reads `work_asins` for every ASIN in the local-passed list and collapses any books sharing a `workId` to one representative (same `metadataScore()` ranking).
- Catches duplicates the local pass misses: source-metadata divergence (e.g. HTML scraper captured different narrators), cross-page splits (paginated series), or non-matching field shapes.
- Degrades gracefully — returns the input unchanged on DB failure (view still renders).
### Works Table Schema
- `Work { id, title, author }` — one row per logical book
- `WorkAsin { id, workId, asin, narrator?, durationMinutes?, isCanonical, source, createdAt }` — many ASINs per Work
### Population Layers
- **Layer 1 (auto):** `persistDedupGroups()` writes whenever the local pass finds a duplicate. Merges across pre-existing works when a new group spans them.
- **Layer 2 (seed):** `seedAsin()` writes a single-ASIN work at request creation time, ensuring every requested ASIN has an entry to grow from.
### Read Paths
- **`collapseByExistingWorks()`** — view-level collapse (this section).
- **`getSiblingAsins()`** — library availability matching (`audiobook-matcher.ts`), request-creation duplicate prevention (`request-creator.service.ts`), ignored-audiobook expansion. Returns sibling ASINs grouped by input ASIN.
### Narrator Capture in HTML Scrapers
- HTML scrapers (`audible-series.ts`, the two `parse*Items` parsers in `audible.service.ts`) capture **all** narrator anchors via `extractAllNarrators()` (`src/lib/utils/extract-narrator.ts`). Multi-narrator productions render each name as its own `<a href="?searchNarrator=...">` link; capturing only the first (prior bug) made co-narrated audiobooks fail to dedup. Order is not significant — `normalizeNarrator()` sorts before comparison.
### Wired Routes
- `src/app/api/audiobooks/search/route.ts`
- `src/app/api/authors/[asin]/books/route.ts`
- `src/app/api/series/[asin]/route.ts`
Watched-list background jobs (`watched-lists.service.ts`) run the local pass only — they don't render a view, and the downstream `request-creator.service.ts` already does sibling-aware dedup at request creation time.
## Database-First Approach
**Status:** Implemented
@@ -137,12 +186,12 @@ Single matching algorithm used everywhere (search, popular, new-releases, jobs).
Discovery APIs serve cached data from DB with real-time matching.
**Flow:**
1. `audible_refresh` cron runs daily → fetches 200 popular + 200 new releases + user-configured categories via catalog API.
1. `audible_refresh` cron runs daily → fetches 200 popular + 200 new releases + user-configured categories by scraping Audible's curated HTML storefronts (`/adblbestsellers`, `/newreleases`, `/search?node=<id>&sort=popularity-rank`).
2. Downloads and caches cover thumbnails locally.
3. Stores metadata in `audible_cache`, ranked entries in `audible_cache_categories` with reserved IDs (`__popular__`, `__new_releases__`) and user category IDs.
4. Cleans up unused thumbnails after sync.
5. API routes query `AudibleCacheCategory` by categoryId → join with `AudibleCache` metadata → apply real-time matching → return enriched results.
6. Homepage loads instantly (no Audible API hits).
6. Homepage loads instantly (no Audible HTTP hits at request time).
## Thumbnail Caching
@@ -228,12 +277,25 @@ interface AuthorBooksResult {
## Tech Stack
- `axios` (HTTP, two clients: `apiClient` for JSON catalog, `htmlClient` for series-page scraping only)
- `axios` (HTTP, two clients: `apiClient` for JSON catalog API, `htmlClient` for HTML refresh + series scraping)
- `cheerio` (HTML parsing for refresh job and `audible-series.ts`)
- Audnexus API (per-ASIN details, primary)
- PostgreSQL (`audible_cache`, `audible_cache_categories`)
## Fixed Issues
**Series-page duplicates not collapsing across user views (2026-05-14)**
- **Problem:** Two re-listings of the same audiobook (same title, same narrator set, same duration, different ASINs) showed as two cards on series detail pages, even after the works table had already linked them via search-page dedup.
- **Root cause (two-part):** (1) HTML scrapers used `$el.find('a[href*="searchNarrator="]').first()` for multi-narrator productions, capturing only the first co-narrator. So two listings of the same recording landed in `deduplicateAndCollectGroups` with mismatched single-narrator strings and never merged. (2) `deduplicateAndCollectGroups` was stateless — it wrote to the works table but never read it back, so even when one path (e.g. search) successfully merged two ASINs and persisted the Work, every other path (series, author books) re-derived the dedup decision from scratch and split them again.
- **Fix:** (1) New `extractAllNarrators()` helper (`src/lib/utils/extract-narrator.ts`) captures every `searchNarrator=` anchor and joins them; all three HTML scrapers route through it. (2) New `collapseByExistingWorks()` consults the works table after the local pass and collapses any remaining books sharing a `workId`. Wired into the three user-facing discovery routes (search / author books / series detail). Skipped for watched-list background jobs — those feed `request-creator.service.ts` which already does sibling-aware dedup.
- **Location:** `src/lib/utils/extract-narrator.ts` (new); `src/lib/integrations/audible-series.ts` (parseSeriesBooks); `src/lib/integrations/audible.service.ts` (parseProductListItems + parseSearchResultItems); `src/lib/utils/deduplicate-audiobooks.ts` (`metadataScore` exported); `src/lib/services/works.service.ts` (`collapseByExistingWorks` added); three API routes updated.
**Discovery refresh reverted to curated HTML scraping (2026-05-14)**
- **Problem:** After switching all catalog ops to the JSON catalog API in `f564d0a`, the nightly discovery refresh (Popular / New Releases / user-configured Categories) started serving junk: New Releases became 100% preorders out to 2027, and Popular was dominated by launch-day no-name shovelware.
- **Root cause:** `products_sort_by=BestSellers` is a right-now sales velocity rank that spikes on launch promos and preorder windows; `-ReleaseDate` returns all catalog items in date order with no released-only filter. The catalog API exposes no server-side filter to exclude preorders or sort by established popularity (verified by exhaustively testing `release_time`, `availability_status`, `customer_rights`, `Reviewed`/`MostListened`/`SalesRank` sorts — all silently ignored or rejected). Doing the curation client-side would have made RMAB the editorial curator, which Audible's storefront pages already do well.
- **Fix:** Hybrid architecture — the three refresh-only methods (`getPopularAudiobooks`, `getNewReleases`, `getCategoryBooks`) went back to scraping Audible's curated HTML storefronts (`/adblbestsellers`, `/newreleases`, `/search?node=<id>&sort=popularity-rank`). All user-facing real-time paths (search, author books, categories listing, per-ASIN details) stayed on the JSON catalog API. To keep the higher-503-risk HTML traffic resilient on the unattended nightly job, `fetchWithRetry()` accepts an optional `maxBackoffMs` cap and HTML callers use `HTML_MAX_RETRIES=12` + `HTML_MAX_BACKOFF_MS=180_000` (3-min cap). Healthy users finish quickly; 503-blocked users grind through patiently.
- **Location:** `src/lib/integrations/audible.service.ts` (three methods + two private parsers `parseProductListItems` / `parseSearchResultItems`); `src/lib/utils/scrape-resilience.ts` (`jitteredBackoff` cap parameter).
**Audiobookshelf metadata matching not respecting configured region (2026-01-28)**
- **Problem:** `triggerABSItemMatch()` hardcoded `'audible'` provider (audible.com) instead of respecting user's configured Audible region.
- **Impact:** Users with non-US regions (CA, UK, AU, IN) had incorrect metadata matching in Audiobookshelf, causing wrong ASINs.
+7 -4
View File
@@ -7,7 +7,7 @@ import { NextRequest, NextResponse } from 'next/server';
import { getAudibleService } from '@/lib/integrations/audible.service';
import { enrichAudiobooksWithMatches } from '@/lib/utils/audiobook-matcher';
import { deduplicateAndCollectGroups } from '@/lib/utils/deduplicate-audiobooks';
import { persistDedupGroups } from '@/lib/services/works.service';
import { persistDedupGroups, collapseByExistingWorks } from '@/lib/services/works.service';
import { getCurrentUser } from '@/lib/middleware/auth';
import { RMABLogger } from '@/lib/utils/logger';
import { annotateWithIgnoreStatus } from '@/lib/utils/ignored-audiobooks';
@@ -41,16 +41,19 @@ export async function GET(request: NextRequest) {
const currentUser = getCurrentUser(request);
const userId = currentUser?.sub || undefined;
// Deduplicate before enrichment to avoid wasted DB queries on duplicate entries
// Two-pass dedup: local title/narrator/duration matching first, then collapse
// any remaining duplicates that the works table already knows are the same book
// (handles cases where source metadata diverges across paths or pages).
const { books: dedupedResults, groups } = deduplicateAndCollectGroups(results.results);
// Fire-and-forget: persist dedup groups to works table for cross-ASIN matching
if (groups.length > 0) {
persistDedupGroups(groups).catch(() => {});
}
const collapsedResults = await collapseByExistingWorks(dedupedResults);
// Enrich search results with availability and request status information
const enrichedResults = await enrichAudiobooksWithMatches(dedupedResults, userId);
const enrichedResults = await enrichAudiobooksWithMatches(collapsedResults, userId);
// Annotate with per-user ignore status
const annotatedResults = await annotateWithIgnoreStatus(enrichedResults, userId);
+7 -4
View File
@@ -7,7 +7,7 @@ import { NextRequest, NextResponse } from 'next/server';
import { getAudibleService } from '@/lib/integrations/audible.service';
import { enrichAudiobooksWithMatches } from '@/lib/utils/audiobook-matcher';
import { deduplicateAndCollectGroups } from '@/lib/utils/deduplicate-audiobooks';
import { persistDedupGroups } from '@/lib/services/works.service';
import { persistDedupGroups, collapseByExistingWorks } from '@/lib/services/works.service';
import { getCurrentUser } from '@/lib/middleware/auth';
import { RMABLogger } from '@/lib/utils/logger';
import { annotateWithIgnoreStatus } from '@/lib/utils/ignored-audiobooks';
@@ -56,17 +56,20 @@ export async function GET(
const audibleService = getAudibleService();
const result = await audibleService.searchByAuthorAsin(authorName.trim(), asin, page);
// Deduplicate before enrichment to avoid wasted DB queries on duplicate entries
// Two-pass dedup: local title/narrator/duration matching first, then collapse
// any remaining duplicates that the works table already knows are the same book
// (handles cases where source metadata diverges across paths or pages).
const { books: dedupedBooks, groups } = deduplicateAndCollectGroups(result.books);
// Fire-and-forget: persist dedup groups to works table for cross-ASIN matching
if (groups.length > 0) {
persistDedupGroups(groups).catch(() => {});
}
const collapsedBooks = await collapseByExistingWorks(dedupedBooks);
// Enrich with library availability and request status
const userId = currentUser.sub || undefined;
const enrichedBooks = await enrichAudiobooksWithMatches(dedupedBooks, userId);
const enrichedBooks = await enrichAudiobooksWithMatches(collapsedBooks, userId);
// Annotate with per-user ignore status
const annotatedBooks = await annotateWithIgnoreStatus(enrichedBooks, userId);
+7 -4
View File
@@ -9,7 +9,7 @@ import { RMABLogger } from '@/lib/utils/logger';
import { scrapeSeriesPage } from '@/lib/integrations/audible-series';
import { enrichAudiobooksWithMatches } from '@/lib/utils/audiobook-matcher';
import { deduplicateAndCollectGroups } from '@/lib/utils/deduplicate-audiobooks';
import { persistDedupGroups } from '@/lib/services/works.service';
import { persistDedupGroups, collapseByExistingWorks } from '@/lib/services/works.service';
import { annotateWithIgnoreStatus } from '@/lib/utils/ignored-audiobooks';
const logger = RMABLogger.create('API.Series.Detail');
@@ -52,17 +52,20 @@ export async function GET(
);
}
// Deduplicate before enrichment to avoid wasted DB queries on duplicate entries
// Two-pass dedup: local title/narrator/duration matching first, then collapse
// any remaining duplicates that the works table already knows are the same book
// (handles cases where source metadata diverges across paths or pages).
const { books: dedupedBooks, groups } = deduplicateAndCollectGroups(detail.books);
// Fire-and-forget: persist dedup groups to works table for cross-ASIN matching
if (groups.length > 0) {
persistDedupGroups(groups).catch(() => {});
}
const collapsedBooks = await collapseByExistingWorks(dedupedBooks);
// Enrich books with library availability and request status
const userId = currentUser.sub || undefined;
const enrichedBooks = await enrichAudiobooksWithMatches(dedupedBooks, userId);
const enrichedBooks = await enrichAudiobooksWithMatches(collapsedBooks, userId);
// Annotate with per-user ignore status
const annotatedBooks = await annotateWithIgnoreStatus(enrichedBooks, userId);
+3 -4
View File
@@ -19,6 +19,7 @@ import {
import { RMABLogger } from '../utils/logger';
import { parseRuntime } from '../utils/parse-runtime';
import { randomDelay } from '../utils/scrape-resilience';
import { extractAllNarrators } from '../utils/extract-narrator';
const logger = RMABLogger.create('Audible.Series');
@@ -442,10 +443,8 @@ function parseSeriesBooks(
const authorHref = authorLink.attr('href') || '';
const authorAsinMatch = authorHref.match(/\/author\/[^/]+\/([A-Z0-9]{10})/);
// Narrator
const narratorText = $el.find('a[href*="searchNarrator="]').first().text().trim() ||
$el.find('.narratorLabel').text().trim() ||
'';
// Narrator — capture all narrator links (multi-narrator productions are common)
const narratorText = extractAllNarrators($, $el);
// Cover art
const coverArtUrl = $el.find('img').first().attr('src')?.replace(/\._.*_\./, '._SL500_.') || '';
+226 -81
View File
@@ -4,21 +4,26 @@
*/
import axios, { AxiosInstance } from 'axios';
import * as cheerio from 'cheerio';
import { RMABLogger } from '../utils/logger';
import { getConfigService } from '../services/config.service';
import { AudibleRegion, AUDIBLE_REGIONS, DEFAULT_AUDIBLE_REGION } from '../types/audible';
import {
getLanguageForRegion,
isAcceptedLanguage,
stripPrefixes,
buildContainsSelector,
type LanguageConfig,
} from '../constants/language-config';
import {
pickUserAgent,
getBrowserHeaders,
jitteredBackoff,
randomDelay,
AdaptivePacer,
FetchResultMeta,
} from '../utils/scrape-resilience';
import { parseRuntime as parseRuntimeUtil } from '../utils/parse-runtime';
import { extractAllNarrators } from '../utils/extract-narrator';
const logger = RMABLogger.create('Audible');
@@ -27,6 +32,13 @@ const AUDIBLE_PAGE_SIZE = 50;
const CATALOG_RESPONSE_GROUPS =
'contributors,product_desc,product_attrs,product_extended_attrs,media,rating,series,category_ladders,product_details';
// Retry/backoff knobs for HTML scraping (nightly refresh job only).
// Healthy users still finish quickly — per-page success returns on attempt 0
// with a 2-4s inter-page delay. Struggling users grind through 503 storms
// patiently: up to ~12 retries per request, with each backoff capped at 3 min.
const HTML_MAX_RETRIES = 12;
const HTML_MAX_BACKOFF_MS = 180_000;
export interface AudibleAudiobook {
asin: string;
title: string;
@@ -298,6 +310,7 @@ export class AudibleService {
config: any = {},
maxRetries: number = 5,
client: AxiosInstance = this.htmlClient,
maxBackoffMs: number = Number.POSITIVE_INFINITY,
): Promise<{ data: any; meta: FetchResultMeta }> {
let lastError: Error | null = null;
let retriesUsed = 0;
@@ -324,7 +337,7 @@ export class AudibleService {
retriesUsed++;
const backoffMs = jitteredBackoff(attempt);
const backoffMs = jitteredBackoff(attempt, 1000, maxBackoffMs);
logger.info(
` Request failed (${status || 'network error'}), retrying in ${backoffMs}ms (attempt ${attempt + 1}/${maxRetries})...`,
);
@@ -379,6 +392,12 @@ export class AudibleService {
throw lastError || new Error('External API request failed after retries');
}
/**
* Popular audiobooks from Audible's curated /adblbestsellers HTML page.
* Uses HTML scraping (not the catalog API) because the API's BestSellers sort
* is a right-now velocity rank that surfaces launch-day shovelware and preorders;
* the HTML page reflects Audible's editorial curation.
*/
async getPopularAudiobooks(limit: number = 20): Promise<AudibleAudiobook[]> {
await this.initialize();
@@ -395,42 +414,36 @@ export class AudibleService {
logger.info(` Fetching page ${page}/${maxPages}...`);
const { data: response, meta } = await this.fetchWithRetry(
'/1.0/catalog/products',
'/adblbestsellers',
{
params: {
products_sort_by: 'BestSellers',
num_results: AUDIBLE_PAGE_SIZE,
page: page - 1,
response_groups: CATALOG_RESPONSE_GROUPS,
ipRedirectOverride: 'true',
pageSize: AUDIBLE_PAGE_SIZE,
...(page > 1 ? { page } : {}),
},
},
5,
this.apiClient,
HTML_MAX_RETRIES,
this.htmlClient,
HTML_MAX_BACKOFF_MS,
);
const envelope: CatalogProductsResponse = response.data;
const products = envelope.products ?? [];
const totalResults = envelope.total_results ?? 0;
const foundOnPage = this.parseProductListItems(
response.data,
audiobooks,
limit,
);
for (const product of products) {
if (audiobooks.length >= limit) break;
if (audiobooks.some((b) => b.asin === product.asin)) continue;
audiobooks.push(mapCatalogProduct(product));
logger.info(` Found ${foundOnPage} audiobooks on page ${page}`);
if (foundOnPage < AUDIBLE_PAGE_SIZE / 2) {
logger.info(` Reached end of available pages`);
break;
}
logger.info(` Found ${products.length} audiobooks on page ${page}`);
const hasMore =
totalResults > 0
? totalResults > page * AUDIBLE_PAGE_SIZE
: products.length >= AUDIBLE_PAGE_SIZE;
if (!hasMore) break;
page++;
if (page <= maxPages && audiobooks.length < limit) {
await this.delay(this.apiPageDelay(meta));
await this.delay(this.pacer.reportPageResult(meta));
}
} catch (error) {
logger.error(`Failed to fetch page ${page} of popular audiobooks`, {
@@ -445,6 +458,11 @@ export class AudibleService {
return audiobooks;
}
/**
* New release audiobooks from Audible's curated /newreleases HTML page.
* Uses HTML scraping (not the catalog API) because the API's -ReleaseDate sort
* returns 100% future preorders with no released-only filter available.
*/
async getNewReleases(limit: number = 20): Promise<AudibleAudiobook[]> {
await this.initialize();
@@ -461,42 +479,36 @@ export class AudibleService {
logger.info(` Fetching page ${page}/${maxPages}...`);
const { data: response, meta } = await this.fetchWithRetry(
'/1.0/catalog/products',
'/newreleases',
{
params: {
products_sort_by: '-ReleaseDate',
num_results: AUDIBLE_PAGE_SIZE,
page: page - 1,
response_groups: CATALOG_RESPONSE_GROUPS,
ipRedirectOverride: 'true',
pageSize: AUDIBLE_PAGE_SIZE,
...(page > 1 ? { page } : {}),
},
},
5,
this.apiClient,
HTML_MAX_RETRIES,
this.htmlClient,
HTML_MAX_BACKOFF_MS,
);
const envelope: CatalogProductsResponse = response.data;
const products = envelope.products ?? [];
const totalResults = envelope.total_results ?? 0;
const foundOnPage = this.parseProductListItems(
response.data,
audiobooks,
limit,
);
for (const product of products) {
if (audiobooks.length >= limit) break;
if (audiobooks.some((b) => b.asin === product.asin)) continue;
audiobooks.push(mapCatalogProduct(product));
logger.info(` Found ${foundOnPage} audiobooks on page ${page}`);
if (foundOnPage < AUDIBLE_PAGE_SIZE / 2) {
logger.info(` Reached end of available pages`);
break;
}
logger.info(` Found ${products.length} audiobooks on page ${page}`);
const hasMore =
totalResults > 0
? totalResults > page * AUDIBLE_PAGE_SIZE
: products.length >= AUDIBLE_PAGE_SIZE;
if (!hasMore) break;
page++;
if (page <= maxPages && audiobooks.length < limit) {
await this.delay(this.apiPageDelay(meta));
await this.delay(this.pacer.reportPageResult(meta));
}
} catch (error) {
logger.error(`Failed to fetch page ${page} of new releases`, {
@@ -791,6 +803,11 @@ export class AudibleService {
}
}
/**
* Category audiobooks from Audible's HTML /search?node=<categoryId> page,
* sorted by popularity-rank. Uses HTML scraping (not the catalog API) so
* results match Audible's curated category-storefront ordering.
*/
async getCategoryBooks(categoryId: string, limit: number = 200): Promise<AudibleAudiobook[]> {
await this.initialize();
@@ -805,43 +822,35 @@ export class AudibleService {
while (audiobooks.length < limit && page <= maxPages) {
try {
const { data: response, meta } = await this.fetchWithRetry(
'/1.0/catalog/products',
'/search',
{
params: {
category_id: categoryId,
products_sort_by: 'BestSellers',
num_results: AUDIBLE_PAGE_SIZE,
page: page - 1,
response_groups: CATALOG_RESPONSE_GROUPS,
ipRedirectOverride: 'true',
node: categoryId,
pageSize: AUDIBLE_PAGE_SIZE,
sort: 'popularity-rank',
...(page > 1 ? { page } : {}),
},
},
5,
this.apiClient,
HTML_MAX_RETRIES,
this.htmlClient,
HTML_MAX_BACKOFF_MS,
);
const envelope: CatalogProductsResponse = response.data;
const products = envelope.products ?? [];
const totalResults = envelope.total_results ?? 0;
const foundOnPage = this.parseSearchResultItems(
response.data,
audiobooks,
limit,
);
for (const product of products) {
if (audiobooks.length >= limit) break;
if (audiobooks.some((b) => b.asin === product.asin)) continue;
audiobooks.push(mapCatalogProduct(product));
}
logger.info(`Category ${categoryId}: found ${foundOnPage} books on page ${page}`);
logger.info(`Category ${categoryId}: found ${products.length} books on page ${page}`);
const hasMore =
totalResults > 0
? totalResults > page * AUDIBLE_PAGE_SIZE
: products.length >= AUDIBLE_PAGE_SIZE;
if (!hasMore) break;
if (foundOnPage < AUDIBLE_PAGE_SIZE / 2) break;
page++;
if (page <= maxPages && audiobooks.length < limit) {
await this.delay(this.apiPageDelay(meta));
await this.delay(this.pacer.reportPageResult(meta));
}
} catch (error) {
logger.error(`Failed to fetch category ${categoryId} page ${page}`, {
@@ -858,12 +867,148 @@ export class AudibleService {
return audiobooks;
}
private apiPageDelay(meta: FetchResultMeta): number {
if (meta.retriesUsed > 0) {
return this.pacer.reportPageResult(meta);
}
this.pacer.reportPageResult(meta);
return randomDelay(500, 1500);
private getLangConfig(): LanguageConfig {
return getLanguageForRegion(this.region);
}
private parseRuntime(runtimeText: string): number | undefined {
return parseRuntimeUtil(runtimeText, this.getLangConfig());
}
/**
* Parse the `.productListItem` blocks used by /adblbestsellers and /newreleases.
* Pushes matched books into `audiobooks` (skipping duplicates and respecting `limit`)
* and returns the count parsed from this page.
*/
private parseProductListItems(
html: string,
audiobooks: AudibleAudiobook[],
limit: number,
): number {
const $ = cheerio.load(html);
const langConfig = this.getLangConfig();
let foundOnPage = 0;
$('.productListItem').each((_index, element) => {
if (audiobooks.length >= limit) return false;
const $el = $(element);
const asin =
$el.find('li').attr('data-asin') ||
$el.find('a').attr('href')?.match(/\/(?:pd|ac)\/[^\/]+\/([A-Z0-9]{10})/)?.[1] ||
'';
if (!asin) return;
if (audiobooks.some((book) => book.asin === asin)) return;
const title =
$el.find('h3 a').text().trim() ||
$el.find('.bc-heading a').text().trim();
const authorText =
$el.find('.authorLabel').text().trim() ||
$el.find('.bc-size-small .bc-text-bold').first().text().trim();
const authorHref = $el.find('a[href*="/author/"]').first().attr('href') || '';
const authorAsinMatch = authorHref.match(/\/author\/[^\/]+\/([A-Z0-9]{10})/);
// Narrator — capture all narrator links (multi-narrator productions are common);
// fall back to .narratorLabel text, then to the bc-text-bold sibling for layouts
// that omit both anchor links and the .narratorLabel span.
const narratorText =
extractAllNarrators($, $el) ||
$el.find('.bc-size-small .bc-text-bold').eq(1).text().trim();
const coverArtUrl = $el.find('img').attr('src') || '';
const ratingText = $el.find('.ratingsLabel').text().trim();
const rating = ratingText ? parseFloat(ratingText.split(' ')[0]) : undefined;
audiobooks.push({
asin,
title,
author: stripPrefixes(authorText, langConfig.scraping.authorPrefixes),
authorAsin: authorAsinMatch?.[1] || undefined,
narrator: stripPrefixes(narratorText, langConfig.scraping.narratorPrefixes),
coverArtUrl: coverArtUrl.replace(/\._.*_\./, '._SL500_.'),
rating,
});
foundOnPage++;
});
return foundOnPage;
}
/**
* Parse the `.s-result-item` / `.productListItem` blocks used by
* /search?node=<categoryId>. Pushes matched books into `audiobooks`
* (skipping duplicates and respecting `limit`) and returns the count parsed
* from this page.
*/
private parseSearchResultItems(
html: string,
audiobooks: AudibleAudiobook[],
limit: number,
): number {
const $ = cheerio.load(html);
const langConfig = this.getLangConfig();
let foundOnPage = 0;
$('.s-result-item, .productListItem').each((_index, element) => {
if (audiobooks.length >= limit) return false;
const $el = $(element);
const asin =
$el.find('li').attr('data-asin') ||
$el.find('a').attr('href')?.match(/\/(?:pd|ac)\/[^\/]+\/([A-Z0-9]{10})/)?.[1] ||
'';
if (!asin) return;
if (audiobooks.some((b) => b.asin === asin)) return;
const title =
$el.find('h2').first().text().trim() ||
$el.find('h3 a').text().trim() ||
$el.find('.bc-heading a').text().trim();
const authorLink = $el.find('a[href*="/author/"]').first();
const authorText =
authorLink.text().trim() ||
$el.find('.authorLabel').text().trim();
const authorHref = authorLink.attr('href') || '';
const authorAsinMatch = authorHref.match(/\/author\/[^\/]+\/([A-Z0-9]{10})/);
// Narrator — capture all narrator links (multi-narrator productions are common)
const narratorText = extractAllNarrators($, $el);
const coverArtUrl = $el.find('img').attr('src') || '';
const runtimeText =
$el.find('.runtimeLabel').text().trim() ||
$el.find(buildContainsSelector('span', langConfig.scraping.lengthLabels)).text().trim();
const durationMinutes = this.parseRuntime(runtimeText);
const ratingText =
$el.find('.ratingsLabel').text().trim() ||
$el.find('.a-icon-star span').first().text().trim();
const rating = ratingText ? parseFloat(ratingText.split(' ')[0]) : undefined;
audiobooks.push({
asin,
title,
author: stripPrefixes(authorText, langConfig.scraping.authorPrefixes),
authorAsin: authorAsinMatch?.[1] || undefined,
narrator: stripPrefixes(narratorText, langConfig.scraping.narratorPrefixes),
coverArtUrl: coverArtUrl.replace(/\._.*_\./, '._SL500_.'),
durationMinutes,
rating,
});
foundOnPage++;
});
return foundOnPage;
}
private async delay(ms: number): Promise<void> {
@@ -138,16 +138,37 @@ async function persistSectionBooks(
logger: ReturnType<typeof RMABLogger.forJob>,
labelForErrors: string,
): Promise<number> {
// Defensive dedup: the (asin, categoryId) unique constraint means a duplicate ASIN
// in `books` crashes the second .create() with P2002. The HTML parser already dedupes
// per page and across pages against the cumulative accumulator, but a warn-on-fire
// signal here lets us detect upstream surprises (e.g. Audible serving the same item
// in both a carousel and the main grid) without the noisy duplicate-key Postgres
// errors. Keep the first occurrence so Audible's editorial ordering is preserved.
const seenAsins = new Set<string>();
const dedupedBooks = books.filter((b) => {
if (!b?.asin || seenAsins.has(b.asin)) return false;
seenAsins.add(b.asin);
return true;
});
const droppedCount = books.length - dedupedBooks.length;
if (droppedCount > 0) {
logger.warn(
`Dropped ${droppedCount} duplicate ASIN(s) from ${categoryId} input list before persist`,
);
}
// Wipe previous entries for this section
logger.info(`Clearing previous data for ${categoryId}...`);
await prisma.audibleCacheCategory.deleteMany({
where: { categoryId },
});
logger.info(`Cleared previous entries for ${categoryId}, saving ${books.length} books...`);
logger.info(
`Cleared previous entries for ${categoryId}, saving ${dedupedBooks.length} books...`,
);
let saved = 0;
for (let i = 0; i < books.length; i++) {
const book = books[i];
for (let i = 0; i < dedupedBooks.length; i++) {
const book = dedupedBooks[i];
try {
// Cache thumbnail if coverArtUrl exists
let cachedCoverPath: string | null = null;
+92 -1
View File
@@ -9,7 +9,8 @@
import { prisma } from '@/lib/db';
import { RMABLogger } from '@/lib/utils/logger';
import type { DedupGroup } from '@/lib/utils/deduplicate-audiobooks';
import { metadataScore, type DedupGroup } from '@/lib/utils/deduplicate-audiobooks';
import type { AudibleAudiobook } from '@/lib/integrations/audible.service';
const logger = RMABLogger.create('WorksService');
@@ -182,6 +183,96 @@ export async function seedAsin(
}
}
// ---------------------------------------------------------------------------
// View-level collapse (consult the works table after local dedup)
// ---------------------------------------------------------------------------
/**
* Collapse books that already share a Work record according to the works table.
*
* The local `deduplicateAndCollectGroups()` pass is title/narrator/duration-based
* and stateless — it can fail to merge ASINs whose source metadata diverges (e.g.
* a series-page scrape captures different "first narrators" for two ASINs of the
* same recording, or two paginated pages each contain one ASIN and never compare
* them). The works table is the durable source of truth for "same book" identity,
* populated by every prior dedup pass and by request-time seeding. This pass
* applies that knowledge to the current view.
*
* Behavior:
* - Books whose ASINs map to a shared workId collapse to a single representative
* chosen by `metadataScore()` (same ranking as local dedup).
* - Books not present in any work, or in single-ASIN works, pass through untouched.
* - Original ordering is preserved (the kept representative sits at the position
* of the first occurrence of its work in the input list).
* - DB failure is non-fatal: the input list is returned unchanged so the view
* still renders (degrades to local-dedup-only behavior).
*/
export async function collapseByExistingWorks(
books: AudibleAudiobook[],
): Promise<AudibleAudiobook[]> {
if (books.length <= 1) return books;
try {
const asins = books.map(b => b.asin);
const entries = await prisma.workAsin.findMany({
where: { asin: { in: asins } },
select: { asin: true, workId: true },
});
if (entries.length === 0) return books;
// Map ASIN → workId for fast lookup in the loop below
const asinToWorkId = new Map<string, string>();
for (const entry of entries) {
asinToWorkId.set(entry.asin, entry.workId);
}
// Walk the input once, preserving position. For each work seen, keep a
// running "best" book; for books not in any work, emit immediately.
const result: AudibleAudiobook[] = [];
const workIdToResultIndex = new Map<string, number>();
for (const book of books) {
const workId = asinToWorkId.get(book.asin);
if (!workId) {
result.push(book);
continue;
}
const existingIndex = workIdToResultIndex.get(workId);
if (existingIndex === undefined) {
workIdToResultIndex.set(workId, result.length);
result.push(book);
continue;
}
// A sibling from this work is already in the result. Keep whichever
// has the richer metadata; on tie, keep the earlier entry (already there).
const existing = result[existingIndex];
if (metadataScore(book) > metadataScore(existing)) {
result[existingIndex] = book;
}
}
const collapsed = books.length - result.length;
if (collapsed > 0) {
logger.debug('Collapsed books via works table', {
inputCount: books.length,
outputCount: result.length,
collapsed,
});
}
return result;
} catch (error) {
logger.error('collapseByExistingWorks failed; returning input unchanged', {
error: error instanceof Error ? error.message : String(error),
bookCount: books.length,
});
return books;
}
}
// ---------------------------------------------------------------------------
// Sibling ASIN lookup (for library matching expansion)
// ---------------------------------------------------------------------------
+6 -1
View File
@@ -109,7 +109,12 @@ export function areDurationsCompatible(a?: number, b?: number): boolean {
// Metadata scoring (for picking best representative)
// ---------------------------------------------------------------------------
function metadataScore(book: AudibleAudiobook): number {
/**
* Score a book by how much metadata it carries. Used as the tie-breaker when
* collapsing duplicates — the entry with the richest metadata wins. Exported
* so the works-table collapse pass can apply the same ranking.
*/
export function metadataScore(book: AudibleAudiobook): number {
let score = 0;
if (book.coverArtUrl) score++;
if (book.rating != null) score++;
+37
View File
@@ -0,0 +1,37 @@
/**
* Component: Narrator Extraction Utility
* Documentation: documentation/integrations/audible.md
*
* Shared helper for Audible HTML scrapers. Audible product listings render
* each narrator as a separate `<a href="?searchNarrator=...">` link; using
* `.first()` on that selector silently drops co-narrators and breaks dedup
* for multi-narrator productions (e.g. full-cast audiobooks). This helper
* captures every narrator link and joins them, falling back to the
* `.narratorLabel` span when no anchor links are present.
*/
import type * as cheerio from 'cheerio';
import type { AnyNode } from 'domhandler';
/**
* Extract a comma-joined narrator string from an Audible product list item.
*
* Order is not semantically significant — downstream `normalizeNarrator()`
* sorts before comparison — but document-order preserves a stable, legible
* value for caching and logging.
*/
export function extractAllNarrators(
$: cheerio.CheerioAPI,
$el: cheerio.Cheerio<AnyNode>,
): string {
const links = $el.find('a[href*="searchNarrator="]');
if (links.length > 0) {
const names: string[] = [];
links.each((_, link) => {
const name = $(link).text().trim();
if (name) names.push(name);
});
if (names.length > 0) return names.join(', ');
}
return $el.find('.narratorLabel').text().trim();
}
+9 -3
View File
@@ -38,12 +38,18 @@ export function getBrowserHeaders(userAgent: string): Record<string, string> {
}
/**
* Jittered exponential backoff: 2^attempt * baseMs * random(0.5, 1.5)
* Jittered exponential backoff: 2^attempt * baseMs * random(0.5, 1.5),
* optionally capped so high attempt counts don't produce absurd waits.
* Avoids predictable retry timing that is trivially fingerprinted.
*/
export function jitteredBackoff(attempt: number, baseMs: number = 1000): number {
export function jitteredBackoff(
attempt: number,
baseMs: number = 1000,
maxBackoffMs: number = Number.POSITIVE_INFINITY,
): number {
const jitter = 0.5 + Math.random(); // 0.5 1.5
return Math.round(Math.pow(2, attempt) * baseMs * jitter);
const raw = Math.pow(2, attempt) * baseMs * jitter;
return Math.round(Math.min(raw, maxBackoffMs));
}
/** Random integer in [minMs, maxMs] */
+371 -88
View File
@@ -81,6 +81,122 @@ function apiResponse(envelope: object) {
return { data: envelope };
}
// ---------------------------------------------------------------------------
// HTML fixture helpers (for getPopularAudiobooks / getNewReleases / getCategoryBooks,
// which scrape Audible's curated HTML pages)
// ---------------------------------------------------------------------------
interface HtmlBookOverrides {
asin?: string;
title?: string;
author?: string;
authorAsin?: string;
/** Single-narrator shorthand; mutually exclusive with `narrators`. */
narrator?: string;
/** Multi-narrator productions render each name as its own searchNarrator anchor. */
narrators?: string[];
coverArtUrl?: string;
rating?: number;
}
/** Render one or more narrator anchor links suitable for embedding in .narratorLabel. */
function renderNarratorLinks(names: string[]): string {
return names
.map(
(name) =>
`<a href="/search?searchNarrator=${encodeURIComponent(name)}">${name}</a>`,
)
.join(', ');
}
/**
* Produces a single .productListItem block matching the selectors parsed by
* parseProductListItems(). The parser looks for an `<li data-asin>` descendant,
* with an `<a href="/pd/...">` fallback — using a real `<li>` here both
* exercises the primary path and keeps the markup well-formed.
*/
function makeProductListItemHtml(overrides: HtmlBookOverrides = {}): string {
const {
asin = 'B000000001',
title = 'Test Book',
author = 'Test Author',
authorAsin = 'A000000001',
narrator = 'Test Narrator',
narrators,
coverArtUrl = 'https://images.example.com/cover._SL500_.jpg',
rating = 4.5,
} = overrides;
// Real Audible storefront markup embeds each narrator as its own anchor inside
// .narratorLabel for multi-narrator productions. The single-narrator case keeps
// the original plain-text span for backward compatibility with existing tests.
const narratorMarkup = narrators && narrators.length > 0
? `<span class="narratorLabel">Narrated by: ${renderNarratorLinks(narrators)}</span>`
: `<span class="narratorLabel">${narrator}</span>`;
return `
<div class="productListItem">
<ul>
<li data-asin="${asin}">
<img src="${coverArtUrl}" />
<h3><a href="/pd/test/${asin}">${title}</a></h3>
<a class="authorLabel" href="/author/test/${authorAsin}">${author}</a>
${narratorMarkup}
<span class="ratingsLabel">${rating} out of 5</span>
</li>
</ul>
</div>
`;
}
/**
* Produces a single .s-result-item block matching the selectors parsed by
* parseSearchResultItems(). Used for /search?node=<categoryId> category pages.
*/
function makeSearchResultItemHtml(overrides: HtmlBookOverrides = {}): string {
const {
asin = 'B000000001',
title = 'Test Book',
author = 'Test Author',
authorAsin = 'A000000001',
narrator = 'Test Narrator',
narrators,
coverArtUrl = 'https://images.example.com/cover._SL500_.jpg',
rating = 4.5,
} = overrides;
const narratorLinks = narrators && narrators.length > 0
? renderNarratorLinks(narrators)
: `<a href="/search?searchNarrator=${encodeURIComponent(narrator)}">${narrator}</a>`;
return `
<div class="s-result-item">
<ul>
<li data-asin="${asin}">
<img src="${coverArtUrl}" />
<h2><a href="/pd/test/${asin}">${title}</a></h2>
<a href="/author/test/${authorAsin}">${author}</a>
${narratorLinks}
<span class="ratingsLabel">${rating} out of 5</span>
</li>
</ul>
</div>
`;
}
/** Wrap one or more item-HTML strings in a minimal page document. */
function makeHtmlPage(items: string[]): string {
return `<html><body>${items.join('')}</body></html>`;
}
/**
* Produces the value that client.get() should resolve to for HTML responses.
* cheerio.load() is called on response.data, so .data must be the raw HTML string.
*/
function htmlResponse(html: string) {
return { data: html };
}
// ---------------------------------------------------------------------------
// Test setup
// ---------------------------------------------------------------------------
@@ -683,61 +799,66 @@ describe('AudibleService', () => {
});
// -------------------------------------------------------------------------
// getPopularAudiobooks()
// getPopularAudiobooks() — HTML scraping of /adblbestsellers
// -------------------------------------------------------------------------
describe('getPopularAudiobooks()', () => {
it('uses products_sort_by: BestSellers', async () => {
apiClientMock.get.mockResolvedValue(apiResponse(makeProductsResponse([])));
it('hits /adblbestsellers on the htmlClient with pageSize=50', async () => {
htmlClientMock.get.mockResolvedValue(htmlResponse(makeHtmlPage([makeProductListItemHtml()])));
const service = new AudibleService();
await service.getPopularAudiobooks(1);
expect(apiClientMock.get.mock.calls[0][1].params.products_sort_by).toBe('BestSellers');
expect(htmlClientMock.get).toHaveBeenCalledWith(
'/adblbestsellers',
expect.objectContaining({
params: expect.objectContaining({ pageSize: 50 }),
}),
);
});
it('subtracts 1 from public page=1 before calling the API', async () => {
apiClientMock.get.mockResolvedValue(apiResponse(makeProductsResponse([])));
it('does not include a page param on the first request (only from page 2 onward)', async () => {
htmlClientMock.get.mockResolvedValue(htmlResponse(makeHtmlPage([makeProductListItemHtml()])));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
await service.getPopularAudiobooks(1);
expect(apiClientMock.get.mock.calls[0][1].params.page).toBe(0);
expect(htmlClientMock.get.mock.calls[0][1].params.page).toBeUndefined();
delaySpy.mockRestore();
});
it('makes a second call with page=1 when paginating to page 2', async () => {
const page1Products = Array.from({ length: 50 }, (_, i) =>
makeProduct({ asin: `B${String(i).padStart(9, '0')}`, title: `Book ${i}` }),
it('includes page=2 on the second request when paginating', async () => {
const page1Items = Array.from({ length: 50 }, (_, i) =>
makeProductListItemHtml({ asin: `B${String(i).padStart(9, '0')}`, title: `Book ${i}` }),
);
const page2Products = Array.from({ length: 25 }, (_, i) =>
makeProduct({ asin: `B${String(i + 50).padStart(9, '0')}`, title: `Book ${i + 50}` }),
const page2Items = Array.from({ length: 25 }, (_, i) =>
makeProductListItemHtml({ asin: `B${String(i + 50).padStart(9, '0')}`, title: `Book ${i + 50}` }),
);
apiClientMock.get
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page1Products, 75)))
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page2Products, 75)));
htmlClientMock.get
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page1Items)))
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page2Items)));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
await service.getPopularAudiobooks(75);
expect(apiClientMock.get.mock.calls[1][1].params.page).toBe(1);
expect(htmlClientMock.get.mock.calls[1][1].params.page).toBe(2);
delaySpy.mockRestore();
});
it('paginates and returns up to the requested limit', async () => {
const page1Products = Array.from({ length: 50 }, (_, i) =>
makeProduct({ asin: `B${String(i).padStart(9, '0')}`, title: `Book ${i}` }),
it('paginates across pages and returns up to the requested limit', async () => {
const page1Items = Array.from({ length: 50 }, (_, i) =>
makeProductListItemHtml({ asin: `B${String(i).padStart(9, '0')}`, title: `Book ${i}` }),
);
const page2Products = Array.from({ length: 25 }, (_, i) =>
makeProduct({ asin: `B${String(i + 50).padStart(9, '0')}`, title: `Book ${i + 50}` }),
const page2Items = Array.from({ length: 25 }, (_, i) =>
makeProductListItemHtml({ asin: `B${String(i + 50).padStart(9, '0')}`, title: `Book ${i + 50}` }),
);
apiClientMock.get
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page1Products, 75)))
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page2Products, 75)));
htmlClientMock.get
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page1Items)))
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page2Items)));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
@@ -747,176 +868,338 @@ describe('AudibleService', () => {
delaySpy.mockRestore();
});
it('stops early when a page returns fewer than the page size', async () => {
const products = [makeProduct()];
apiClientMock.get.mockResolvedValueOnce(apiResponse(makeProductsResponse(products, 1)));
it('stops early when a page returns fewer than half the page size', async () => {
htmlClientMock.get.mockResolvedValueOnce(
htmlResponse(makeHtmlPage([makeProductListItemHtml()])),
);
const service = new AudibleService();
const results = await service.getPopularAudiobooks(50);
expect(results).toHaveLength(1);
expect(apiClientMock.get).toHaveBeenCalledTimes(1);
expect(htmlClientMock.get).toHaveBeenCalledTimes(1);
});
it('deduplicates by ASIN across pages', async () => {
const sharedProduct = makeProduct({ asin: 'BDUP000001', title: 'Duplicated Book' });
const uniqueProduct = makeProduct({ asin: 'BUNIQ000001', title: 'Unique Book' });
const sharedAsin = 'BDUP000001';
const uniqueAsin = 'BUNIQ000001';
apiClientMock.get
.mockResolvedValueOnce(
apiResponse(makeProductsResponse([sharedProduct], 51)),
)
.mockResolvedValueOnce(
// page 2 returns the same ASIN plus a new one
apiResponse(makeProductsResponse([sharedProduct, uniqueProduct], 51)),
);
// Build a "full" first page (50 items, all with the shared ASIN duplicated as filler)
// so the parser proceeds to page 2.
const page1Items = [
makeProductListItemHtml({ asin: sharedAsin, title: 'Duplicated Book' }),
...Array.from({ length: 49 }, (_, i) =>
makeProductListItemHtml({ asin: `BFILL${String(i).padStart(5, '0')}`, title: `Filler ${i}` }),
),
];
const page2Items = [
makeProductListItemHtml({ asin: sharedAsin, title: 'Duplicated Book' }),
makeProductListItemHtml({ asin: uniqueAsin, title: 'Unique Book' }),
...Array.from({ length: 48 }, (_, i) =>
makeProductListItemHtml({ asin: `BFILL2${String(i).padStart(4, '0')}`, title: `Filler2 ${i}` }),
),
];
htmlClientMock.get
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page1Items)))
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page2Items)));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
const results = await service.getPopularAudiobooks(100);
const results = await service.getPopularAudiobooks(150);
const asins = results.map((r) => r.asin);
expect(asins.filter((a) => a === 'BDUP000001')).toHaveLength(1);
expect(asins.filter((a) => a === sharedAsin)).toHaveLength(1);
expect(asins).toContain(uniqueAsin);
delaySpy.mockRestore();
});
it('returns empty array on error without throwing', async () => {
const error: Error & { response?: { status: number } } = new Error('Not Found');
error.response = { status: 404 };
apiClientMock.get.mockRejectedValue(error);
htmlClientMock.get.mockRejectedValue(error);
const service = new AudibleService();
const results = await service.getPopularAudiobooks(5);
expect(results).toEqual([]);
});
it('uses htmlClient (not apiClient) for the request', async () => {
htmlClientMock.get.mockResolvedValue(htmlResponse(makeHtmlPage([makeProductListItemHtml()])));
const service = new AudibleService();
await service.getPopularAudiobooks(1);
expect(htmlClientMock.get).toHaveBeenCalled();
expect(apiClientMock.get).not.toHaveBeenCalled();
});
it('maps title, author, narrator, and rating from the parsed item', async () => {
htmlClientMock.get.mockResolvedValue(
htmlResponse(
makeHtmlPage([
makeProductListItemHtml({
asin: 'B0HTMLMAP1',
title: 'Mapped Title',
author: 'Mapped Author',
authorAsin: 'A00MAPAUTH',
narrator: 'Mapped Narrator',
rating: 4.7,
}),
]),
),
);
const service = new AudibleService();
const [book] = await service.getPopularAudiobooks(1);
expect(book.asin).toBe('B0HTMLMAP1');
expect(book.title).toBe('Mapped Title');
expect(book.author).toBe('Mapped Author');
expect(book.authorAsin).toBe('A00MAPAUTH');
expect(book.narrator).toBe('Mapped Narrator');
expect(book.rating).toBeCloseTo(4.7);
});
it('captures every co-narrator on multi-narrator productions (regression: prior code took only the first link)', async () => {
htmlClientMock.get.mockResolvedValue(
htmlResponse(
makeHtmlPage([
makeProductListItemHtml({
asin: 'B0FULLCAST',
narrators: [
'Kristin Atherton',
'Roy McMillan',
'Clare Corbett',
'Tom Bateman',
'Patience Tomlinson',
'Shaheen Khan',
],
}),
]),
),
);
const service = new AudibleService();
const [book] = await service.getPopularAudiobooks(1);
// Every narrator must round-trip — order is not significant downstream,
// but document order should be preserved for stable cache values.
expect(book.narrator).toBe(
'Kristin Atherton, Roy McMillan, Clare Corbett, Tom Bateman, Patience Tomlinson, Shaheen Khan',
);
});
});
// -------------------------------------------------------------------------
// getNewReleases()
// getNewReleases() — HTML scraping of /newreleases
// -------------------------------------------------------------------------
describe('getNewReleases()', () => {
it('uses products_sort_by: -ReleaseDate', async () => {
apiClientMock.get.mockResolvedValue(apiResponse(makeProductsResponse([])));
it('hits /newreleases on the htmlClient with pageSize=50', async () => {
htmlClientMock.get.mockResolvedValue(htmlResponse(makeHtmlPage([makeProductListItemHtml()])));
const service = new AudibleService();
await service.getNewReleases(1);
expect(apiClientMock.get.mock.calls[0][1].params.products_sort_by).toBe('-ReleaseDate');
expect(htmlClientMock.get).toHaveBeenCalledWith(
'/newreleases',
expect.objectContaining({
params: expect.objectContaining({ pageSize: 50 }),
}),
);
});
it('subtracts 1 from public page=1 before calling the API', async () => {
apiClientMock.get.mockResolvedValue(apiResponse(makeProductsResponse([])));
it('does not include a page param on the first request', async () => {
htmlClientMock.get.mockResolvedValue(htmlResponse(makeHtmlPage([makeProductListItemHtml()])));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
await service.getNewReleases(1);
expect(apiClientMock.get.mock.calls[0][1].params.page).toBe(0);
expect(htmlClientMock.get.mock.calls[0][1].params.page).toBeUndefined();
delaySpy.mockRestore();
});
it('subtracts 1 from public page=2 when paginating to the second page', async () => {
const page1Products = Array.from({ length: 50 }, (_, i) =>
makeProduct({ asin: `B${String(i).padStart(9, '0')}` }),
it('includes page=2 on the second request when paginating', async () => {
const page1Items = Array.from({ length: 50 }, (_, i) =>
makeProductListItemHtml({ asin: `B${String(i).padStart(9, '0')}` }),
);
const page2Items = Array.from({ length: 50 }, (_, i) =>
makeProductListItemHtml({ asin: `B${String(i + 50).padStart(9, '0')}` }),
);
const page2Products = [makeProduct({ asin: 'BNEW000099' })];
apiClientMock.get
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page1Products, 51)))
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page2Products, 51)));
htmlClientMock.get
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page1Items)))
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page2Items)));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
await service.getNewReleases(51);
expect(apiClientMock.get.mock.calls[1][1].params.page).toBe(1);
await service.getNewReleases(100);
expect(htmlClientMock.get.mock.calls[1][1].params.page).toBe(2);
delaySpy.mockRestore();
});
it('deduplicates by ASIN across pages', async () => {
const sharedProduct = makeProduct({ asin: 'BDUP000002' });
apiClientMock.get
.mockResolvedValueOnce(apiResponse(makeProductsResponse([sharedProduct], 51)))
.mockResolvedValueOnce(apiResponse(makeProductsResponse([sharedProduct], 51)));
const sharedAsin = 'BDUP000002';
const page1Items = [
makeProductListItemHtml({ asin: sharedAsin }),
...Array.from({ length: 49 }, (_, i) =>
makeProductListItemHtml({ asin: `BNEW${String(i).padStart(6, '0')}` }),
),
];
const page2Items = [
makeProductListItemHtml({ asin: sharedAsin }),
...Array.from({ length: 49 }, (_, i) =>
makeProductListItemHtml({ asin: `BNEW2${String(i).padStart(5, '0')}` }),
),
];
htmlClientMock.get
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page1Items)))
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page2Items)));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
const results = await service.getNewReleases(100);
const results = await service.getNewReleases(150);
expect(results.filter((r) => r.asin === 'BDUP000002')).toHaveLength(1);
expect(results.filter((r) => r.asin === sharedAsin)).toHaveLength(1);
delaySpy.mockRestore();
});
it('returns empty array on error without throwing', async () => {
const error: Error & { response?: { status: number } } = new Error('Not Found');
error.response = { status: 404 };
apiClientMock.get.mockRejectedValue(error);
htmlClientMock.get.mockRejectedValue(error);
const service = new AudibleService();
const results = await service.getNewReleases(5);
expect(results).toEqual([]);
});
it('uses htmlClient (not apiClient) for the request', async () => {
htmlClientMock.get.mockResolvedValue(htmlResponse(makeHtmlPage([makeProductListItemHtml()])));
const service = new AudibleService();
await service.getNewReleases(1);
expect(htmlClientMock.get).toHaveBeenCalled();
expect(apiClientMock.get).not.toHaveBeenCalled();
});
});
// -------------------------------------------------------------------------
// getCategoryBooks()
// getCategoryBooks() — HTML scraping of /search?node=<categoryId>
// -------------------------------------------------------------------------
describe('getCategoryBooks()', () => {
it('sends category_id and BestSellers sort param', async () => {
apiClientMock.get.mockResolvedValue(apiResponse(makeProductsResponse([])));
it('hits /search on the htmlClient with node, pageSize, and popularity-rank sort', async () => {
htmlClientMock.get.mockResolvedValue(
htmlResponse(makeHtmlPage([makeSearchResultItemHtml()])),
);
const service = new AudibleService();
await service.getCategoryBooks('18685580011', 1);
const params = apiClientMock.get.mock.calls[0][1].params;
expect(params.category_id).toBe('18685580011');
expect(params.products_sort_by).toBe('BestSellers');
const params = htmlClientMock.get.mock.calls[0][1].params;
expect(htmlClientMock.get.mock.calls[0][0]).toBe('/search');
expect(params.node).toBe('18685580011');
expect(params.pageSize).toBe(50);
expect(params.sort).toBe('popularity-rank');
});
it('subtracts 1 from public page=1 before calling the API', async () => {
apiClientMock.get.mockResolvedValue(apiResponse(makeProductsResponse([])));
it('does not include a page param on the first request', async () => {
htmlClientMock.get.mockResolvedValue(
htmlResponse(makeHtmlPage([makeSearchResultItemHtml()])),
);
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
await service.getCategoryBooks('CAT001', 1);
expect(apiClientMock.get.mock.calls[0][1].params.page).toBe(0);
expect(htmlClientMock.get.mock.calls[0][1].params.page).toBeUndefined();
delaySpy.mockRestore();
});
it('subtracts 1 from public page=2 when paginating to the second page', async () => {
const page1Products = Array.from({ length: 50 }, (_, i) =>
makeProduct({ asin: `B${String(i).padStart(9, '0')}` }),
it('includes page=2 on the second request when paginating', async () => {
const page1Items = Array.from({ length: 50 }, (_, i) =>
makeSearchResultItemHtml({ asin: `B${String(i).padStart(9, '0')}` }),
);
const page2Items = Array.from({ length: 50 }, (_, i) =>
makeSearchResultItemHtml({ asin: `B${String(i + 50).padStart(9, '0')}` }),
);
const page2Products = [makeProduct({ asin: 'BCAT000099' })];
apiClientMock.get
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page1Products, 51)))
.mockResolvedValueOnce(apiResponse(makeProductsResponse(page2Products, 51)));
htmlClientMock.get
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page1Items)))
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page2Items)));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
await service.getCategoryBooks('CAT001', 51);
expect(apiClientMock.get.mock.calls[1][1].params.page).toBe(1);
await service.getCategoryBooks('CAT001', 100);
expect(htmlClientMock.get.mock.calls[1][1].params.page).toBe(2);
delaySpy.mockRestore();
});
it('deduplicates by ASIN across pages', async () => {
const sharedProduct = makeProduct({ asin: 'BDUP000003' });
apiClientMock.get
.mockResolvedValueOnce(apiResponse(makeProductsResponse([sharedProduct], 51)))
.mockResolvedValueOnce(apiResponse(makeProductsResponse([sharedProduct], 51)));
const sharedAsin = 'BDUP000003';
const page1Items = [
makeSearchResultItemHtml({ asin: sharedAsin }),
...Array.from({ length: 49 }, (_, i) =>
makeSearchResultItemHtml({ asin: `BCAT${String(i).padStart(6, '0')}` }),
),
];
const page2Items = [
makeSearchResultItemHtml({ asin: sharedAsin }),
...Array.from({ length: 49 }, (_, i) =>
makeSearchResultItemHtml({ asin: `BCAT2${String(i).padStart(5, '0')}` }),
),
];
htmlClientMock.get
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page1Items)))
.mockResolvedValueOnce(htmlResponse(makeHtmlPage(page2Items)));
const service = new AudibleService();
const delaySpy = vi.spyOn(service as any, 'delay').mockResolvedValue(undefined);
const results = await service.getCategoryBooks('CAT001', 100);
const results = await service.getCategoryBooks('CAT001', 150);
expect(results.filter((r) => r.asin === 'BDUP000003')).toHaveLength(1);
expect(results.filter((r) => r.asin === sharedAsin)).toHaveLength(1);
delaySpy.mockRestore();
});
it('uses htmlClient (not apiClient) for the request', async () => {
htmlClientMock.get.mockResolvedValue(
htmlResponse(makeHtmlPage([makeSearchResultItemHtml()])),
);
const service = new AudibleService();
await service.getCategoryBooks('CAT001', 1);
expect(htmlClientMock.get).toHaveBeenCalled();
expect(apiClientMock.get).not.toHaveBeenCalled();
});
it('captures every co-narrator on multi-narrator productions (regression: prior code took only the first link)', async () => {
htmlClientMock.get.mockResolvedValue(
htmlResponse(
makeHtmlPage([
makeSearchResultItemHtml({
asin: 'B0FULLCAST',
narrators: ['Alice', 'Bob', 'Carol', 'Dan'],
}),
]),
),
);
const service = new AudibleService();
const [book] = await service.getCategoryBooks('CAT001', 1);
expect(book.narrator).toBe('Alice, Bob, Carol, Dan');
});
});
// -------------------------------------------------------------------------
@@ -198,4 +198,69 @@ describe('processAudibleRefresh', () => {
const { processAudibleRefresh } = await import('@/lib/processors/audible-refresh.processor');
await expect(processAudibleRefresh({ jobId: 'job-2' })).rejects.toThrow('DB down');
});
it('deduplicates ASINs in the input list before persisting, preserving order', async () => {
// Two `A` entries should collapse to one. Final ranks must be contiguous
// (1, 2, 3) and follow Audible's editorial ordering (A, B, C).
const popular = [
{ asin: 'A', title: 'Book A', author: 'X', coverArtUrl: null },
{ asin: 'B', title: 'Book B', author: 'X', coverArtUrl: null },
{ asin: 'A', title: 'Book A (duplicate)', author: 'X', coverArtUrl: null },
{ asin: 'C', title: 'Book C', author: 'X', coverArtUrl: null },
];
audibleServiceMock.getPopularAudiobooks.mockResolvedValue(popular);
audibleServiceMock.getNewReleases.mockResolvedValue([]);
thumbnailCacheMock.cleanupUnusedThumbnails.mockResolvedValue(0);
prismaMock.audibleCache.upsert.mockResolvedValue({});
prismaMock.audibleCacheCategory.deleteMany.mockResolvedValue({ count: 0 });
prismaMock.audibleCacheCategory.create.mockResolvedValue({});
prismaMock.userHomeSection.findMany.mockResolvedValue([]);
prismaMock.audibleCache.findMany.mockResolvedValue([]);
const { processAudibleRefresh } = await import('@/lib/processors/audible-refresh.processor');
const result = await processAudibleRefresh({ jobId: 'job-dedup' });
expect(result.popularSaved).toBe(3);
// Only 3 category entries created — the duplicate `A` was dropped.
const popularCreates = (prismaMock.audibleCacheCategory.create.mock.calls as Array<[{ data: { asin: string; categoryId: string; rank: number } }]>)
.map((c) => c[0].data)
.filter((d) => d.categoryId === '__popular__');
expect(popularCreates).toHaveLength(3);
expect(popularCreates.map((d) => d.asin)).toEqual(['A', 'B', 'C']);
expect(popularCreates.map((d) => d.rank)).toEqual([1, 2, 3]);
// upsert called once per unique ASIN, not per input row.
expect(prismaMock.audibleCache.upsert).toHaveBeenCalledTimes(3);
});
it('drops entries with missing ASINs as part of dedup', async () => {
const popular = [
{ asin: 'A', title: 'Book A', author: 'X', coverArtUrl: null },
{ asin: '', title: 'Book with empty asin', author: 'X', coverArtUrl: null },
{ asin: null, title: 'Book with null asin', author: 'X', coverArtUrl: null },
{ asin: 'B', title: 'Book B', author: 'X', coverArtUrl: null },
];
audibleServiceMock.getPopularAudiobooks.mockResolvedValue(popular as any);
audibleServiceMock.getNewReleases.mockResolvedValue([]);
thumbnailCacheMock.cleanupUnusedThumbnails.mockResolvedValue(0);
prismaMock.audibleCache.upsert.mockResolvedValue({});
prismaMock.audibleCacheCategory.deleteMany.mockResolvedValue({ count: 0 });
prismaMock.audibleCacheCategory.create.mockResolvedValue({});
prismaMock.userHomeSection.findMany.mockResolvedValue([]);
prismaMock.audibleCache.findMany.mockResolvedValue([]);
const { processAudibleRefresh } = await import('@/lib/processors/audible-refresh.processor');
const result = await processAudibleRefresh({ jobId: 'job-empty-asin' });
expect(result.popularSaved).toBe(2);
const popularCreates = (prismaMock.audibleCacheCategory.create.mock.calls as Array<[{ data: { asin: string; categoryId: string; rank: number } }]>)
.map((c) => c[0].data)
.filter((d) => d.categoryId === '__popular__');
expect(popularCreates.map((d) => d.asin)).toEqual(['A', 'B']);
expect(popularCreates.map((d) => d.rank)).toEqual([1, 2]);
});
});
+189
View File
@@ -6,6 +6,15 @@
import { beforeEach, describe, expect, it, vi } from 'vitest';
import { createPrismaMock } from '../helpers/prisma';
import type { DedupGroup } from '@/lib/utils/deduplicate-audiobooks';
import type { AudibleAudiobook } from '@/lib/integrations/audible.service';
function makeBook(overrides: Partial<AudibleAudiobook> & { asin: string }): AudibleAudiobook {
return {
title: 'Test Book',
author: 'Test Author',
...overrides,
};
}
const prismaMock = createPrismaMock();
@@ -304,3 +313,183 @@ describe('getSiblingAsins', () => {
expect(result.has('ASIN_LONELY')).toBe(false);
});
});
describe('collapseByExistingWorks', () => {
beforeEach(() => {
vi.clearAllMocks();
vi.resetModules();
});
it('returns input unchanged when the list is empty or has one entry', async () => {
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
expect(await collapseByExistingWorks([])).toEqual([]);
expect(prismaMock.workAsin.findMany).not.toHaveBeenCalled();
const single = [makeBook({ asin: 'A1' })];
expect(await collapseByExistingWorks(single)).toEqual(single);
expect(prismaMock.workAsin.findMany).not.toHaveBeenCalled();
});
it('returns input unchanged when none of the ASINs are in any work', async () => {
prismaMock.workAsin.findMany.mockResolvedValue([]);
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
const books = [
makeBook({ asin: 'A1', title: 'Alpha' }),
makeBook({ asin: 'A2', title: 'Beta' }),
];
const result = await collapseByExistingWorks(books);
expect(result).toEqual(books);
});
it('collapses two ASINs that share a work to a single representative', async () => {
prismaMock.workAsin.findMany.mockResolvedValue([
{ asin: 'A1', workId: 'work-1' },
{ asin: 'A2', workId: 'work-1' },
]);
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
const books = [
makeBook({ asin: 'A1', title: 'The Passengers', coverArtUrl: 'cover.jpg' }),
makeBook({ asin: 'A2', title: 'The Passengers' }),
];
const result = await collapseByExistingWorks(books);
expect(result).toHaveLength(1);
// A1 wins — it has the cover URL (higher metadata score)
expect(result[0].asin).toBe('A1');
});
it('keeps the richest-metadata entry when collapsing, regardless of input order', async () => {
prismaMock.workAsin.findMany.mockResolvedValue([
{ asin: 'A1', workId: 'work-1' },
{ asin: 'A2', workId: 'work-1' },
]);
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
// A1 first (sparse), A2 second (rich) — A2 should win on score
const books = [
makeBook({ asin: 'A1', title: 'Book' }),
makeBook({
asin: 'A2',
title: 'Book',
coverArtUrl: 'cover.jpg',
rating: 4.5,
durationMinutes: 600,
narrator: 'Full Cast',
description: 'Rich book',
releaseDate: '2024-01-01',
genres: ['Fiction'],
}),
];
const result = await collapseByExistingWorks(books);
expect(result).toHaveLength(1);
expect(result[0].asin).toBe('A2');
});
it('preserves position of the work in the input order', async () => {
prismaMock.workAsin.findMany.mockResolvedValue([
{ asin: 'A2', workId: 'work-1' },
{ asin: 'A4', workId: 'work-1' },
]);
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
const books = [
makeBook({ asin: 'A1', title: 'Alpha' }),
makeBook({ asin: 'A2', title: 'Beta' }),
makeBook({ asin: 'A3', title: 'Gamma' }),
makeBook({ asin: 'A4', title: 'Beta' }),
makeBook({ asin: 'A5', title: 'Delta' }),
];
const result = await collapseByExistingWorks(books);
// A2 and A4 collapse to one entry at position 1 (the first occurrence)
expect(result.map(b => b.asin)).toEqual(['A1', 'A2', 'A3', 'A5']);
});
it('handles multiple independent works in the same batch', async () => {
prismaMock.workAsin.findMany.mockResolvedValue([
{ asin: 'A1', workId: 'work-1' },
{ asin: 'A2', workId: 'work-1' },
{ asin: 'B1', workId: 'work-2' },
{ asin: 'B2', workId: 'work-2' },
{ asin: 'B3', workId: 'work-2' },
]);
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
const books = [
makeBook({ asin: 'A1' }),
makeBook({ asin: 'B1' }),
makeBook({ asin: 'A2' }),
makeBook({ asin: 'B2' }),
makeBook({ asin: 'B3' }),
makeBook({ asin: 'C1' }),
];
const result = await collapseByExistingWorks(books);
expect(result.map(b => b.asin)).toEqual(['A1', 'B1', 'C1']);
});
it('passes through books that are not in any work alongside collapsed ones', async () => {
prismaMock.workAsin.findMany.mockResolvedValue([
{ asin: 'A1', workId: 'work-1' },
{ asin: 'A2', workId: 'work-1' },
]);
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
const books = [
makeBook({ asin: 'STANDALONE_1', title: 'Standalone 1' }),
makeBook({ asin: 'A1', title: 'Same Book' }),
makeBook({ asin: 'STANDALONE_2', title: 'Standalone 2' }),
makeBook({ asin: 'A2', title: 'Same Book' }),
];
const result = await collapseByExistingWorks(books);
expect(result).toHaveLength(3);
expect(result.map(b => b.asin)).toEqual(['STANDALONE_1', 'A1', 'STANDALONE_2']);
});
it('returns input unchanged on DB failure (does not throw)', async () => {
prismaMock.workAsin.findMany.mockRejectedValue(new Error('DB exploded'));
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
const books = [
makeBook({ asin: 'A1' }),
makeBook({ asin: 'A2' }),
];
const result = await collapseByExistingWorks(books);
expect(result).toEqual(books);
});
it('only queries the workAsin table once per call', async () => {
prismaMock.workAsin.findMany.mockResolvedValue([
{ asin: 'A1', workId: 'work-1' },
{ asin: 'A2', workId: 'work-1' },
]);
const { collapseByExistingWorks } = await import('@/lib/services/works.service');
await collapseByExistingWorks([
makeBook({ asin: 'A1' }),
makeBook({ asin: 'A2' }),
makeBook({ asin: 'A3' }),
]);
expect(prismaMock.workAsin.findMany).toHaveBeenCalledTimes(1);
expect(prismaMock.workAsin.findMany).toHaveBeenCalledWith({
where: { asin: { in: ['A1', 'A2', 'A3'] } },
select: { asin: true, workId: true },
});
});
});
+95
View File
@@ -0,0 +1,95 @@
/**
* Component: Narrator Extraction Utility Tests
* Documentation: documentation/integrations/audible.md
*/
import { describe, expect, it } from 'vitest';
import * as cheerio from 'cheerio';
import { extractAllNarrators } from '@/lib/utils/extract-narrator';
function load(html: string) {
const $ = cheerio.load(`<div id="item">${html}</div>`);
return { $, $el: $('#item') };
}
describe('extractAllNarrators', () => {
it('returns the single narrator name when only one searchNarrator link is present', () => {
const { $, $el } = load(
`<a href="/search?searchNarrator=Andy%20Serkis">Andy Serkis</a>`,
);
expect(extractAllNarrators($, $el)).toBe('Andy Serkis');
});
it('joins multiple narrator names from separate searchNarrator links', () => {
const { $, $el } = load(`
<a href="/search?searchNarrator=Kristin%20Atherton">Kristin Atherton</a>,
<a href="/search?searchNarrator=Roy%20McMillan">Roy McMillan</a>,
<a href="/search?searchNarrator=Clare%20Corbett">Clare Corbett</a>,
<a href="/search?searchNarrator=Tom%20Bateman">Tom Bateman</a>,
<a href="/search?searchNarrator=Patience%20Tomlinson">Patience Tomlinson</a>,
<a href="/search?searchNarrator=Shaheen%20Khan">Shaheen Khan</a>
`);
expect(extractAllNarrators($, $el)).toBe(
'Kristin Atherton, Roy McMillan, Clare Corbett, Tom Bateman, Patience Tomlinson, Shaheen Khan',
);
});
it('preserves document order (downstream sorts before comparing, but order should be stable)', () => {
const { $, $el } = load(`
<a href="/search?searchNarrator=Z">Zelda</a>
<a href="/search?searchNarrator=A">Alice</a>
<a href="/search?searchNarrator=M">Mallory</a>
`);
expect(extractAllNarrators($, $el)).toBe('Zelda, Alice, Mallory');
});
it('falls back to .narratorLabel text when no searchNarrator links exist', () => {
const { $, $el } = load(
`<span class="narratorLabel">Narrated by: Single Narrator</span>`,
);
expect(extractAllNarrators($, $el)).toBe('Narrated by: Single Narrator');
});
it('prefers searchNarrator links over .narratorLabel when both are present', () => {
const { $, $el } = load(`
<span class="narratorLabel">Narrated by: ONLY ONE</span>
<a href="/search?searchNarrator=First">First</a>
<a href="/search?searchNarrator=Second">Second</a>
`);
expect(extractAllNarrators($, $el)).toBe('First, Second');
});
it('returns empty string when neither links nor .narratorLabel exist', () => {
const { $, $el } = load(`<span>some other content</span>`);
expect(extractAllNarrators($, $el)).toBe('');
});
it('skips empty link text and joins only non-empty names', () => {
const { $, $el } = load(`
<a href="/search?searchNarrator=A"></a>
<a href="/search?searchNarrator=B">Bob</a>
<a href="/search?searchNarrator=C"> </a>
<a href="/search?searchNarrator=D">Diana</a>
`);
expect(extractAllNarrators($, $el)).toBe('Bob, Diana');
});
it('trims whitespace from each captured name', () => {
const { $, $el } = load(`
<a href="/search?searchNarrator=A"> Alice </a>
<a href="/search?searchNarrator=B">
Bob
</a>
`);
expect(extractAllNarrators($, $el)).toBe('Alice, Bob');
});
it('falls back to .narratorLabel when all searchNarrator links are empty', () => {
const { $, $el } = load(`
<a href="/search?searchNarrator=A"></a>
<a href="/search?searchNarrator=B"> </a>
<span class="narratorLabel">Fallback Narrator</span>
`);
expect(extractAllNarrators($, $el)).toBe('Fallback Narrator');
});
});
+18
View File
@@ -67,6 +67,24 @@ describe('jitteredBackoff', () => {
expect(value).toBeGreaterThanOrEqual(250);
expect(value).toBeLessThanOrEqual(750);
});
it('caps the result at maxBackoffMs when the raw backoff would exceed it', () => {
// attempt=10 with base=1000 produces 2^10 * 1000 * [0.5..1.5] = 512_000..1_536_000,
// all of which exceed a 60_000ms cap.
for (let i = 0; i < 50; i++) {
const value = jitteredBackoff(10, 1000, 60_000);
expect(value).toBeLessThanOrEqual(60_000);
}
});
it('returns the un-capped jittered value when below the cap', () => {
// attempt=0 with base=1000 produces 500..1500, all below a 60_000ms cap.
for (let i = 0; i < 50; i++) {
const value = jitteredBackoff(0, 1000, 60_000);
expect(value).toBeGreaterThanOrEqual(500);
expect(value).toBeLessThanOrEqual(1500);
}
});
});
describe('randomDelay', () => {