mirror of
https://github.com/kikootwo/ReadMeABook.git
synced 2026-06-20 21:20:10 +00:00
Audible: HTML refresh, multi-narrator & works dedup
Switch nightly discovery refresh to scrape Audible's curated HTML storefronts (popular, new releases, category pages) while keeping real-time user paths on the JSON catalog API. Add robust HTML resilience knobs (increased retries, capped jittered backoff, AdaptivePacer changes and per-batch cooldowns) to avoid failing nightly jobs during 503 storms. Implement multi-narrator capture via a new extractAllNarrators helper and update parsers to preserve all narrator anchors. Introduce two-pass dedup: in-memory deduplicateAndCollectGroups + collapseByExistingWorks that consults the works table, export metadataScore for consistent representative selection, and persist dedup groups (fire-and-forget). Wire collapseByExistingWorks into search/author/series routes and make defensive dedup in the refresh processor. Add HTML parsing helpers, runtime/lang-aware parsing, jitteredBackoff cap, and tests for the new behaviors.
This commit is contained in:
@@ -19,6 +19,7 @@ import {
|
||||
import { RMABLogger } from '../utils/logger';
|
||||
import { parseRuntime } from '../utils/parse-runtime';
|
||||
import { randomDelay } from '../utils/scrape-resilience';
|
||||
import { extractAllNarrators } from '../utils/extract-narrator';
|
||||
|
||||
const logger = RMABLogger.create('Audible.Series');
|
||||
|
||||
@@ -442,10 +443,8 @@ function parseSeriesBooks(
|
||||
const authorHref = authorLink.attr('href') || '';
|
||||
const authorAsinMatch = authorHref.match(/\/author\/[^/]+\/([A-Z0-9]{10})/);
|
||||
|
||||
// Narrator
|
||||
const narratorText = $el.find('a[href*="searchNarrator="]').first().text().trim() ||
|
||||
$el.find('.narratorLabel').text().trim() ||
|
||||
'';
|
||||
// Narrator — capture all narrator links (multi-narrator productions are common)
|
||||
const narratorText = extractAllNarrators($, $el);
|
||||
|
||||
// Cover art
|
||||
const coverArtUrl = $el.find('img').first().attr('src')?.replace(/\._.*_\./, '._SL500_.') || '';
|
||||
|
||||
Reference in New Issue
Block a user