mirror of
https://github.com/kikootwo/ReadMeABook.git
synced 2026-06-03 04:40:09 +00:00
Audible: HTML refresh, multi-narrator & works dedup
Switch the nightly discovery refresh to scrape Audible's curated HTML storefronts (popular, new releases, category pages) while keeping real-time user paths on the JSON catalog API. Add HTML resilience knobs (more retries, capped jittered backoff, AdaptivePacer changes, and per-batch cooldowns) so nightly jobs do not fail during 503 storms. Implement multi-narrator capture via a new extractAllNarrators helper and update the parsers to preserve all narrator anchors. Introduce two-pass dedup: an in-memory deduplicateAndCollectGroups pass followed by collapseByExistingWorks, which consults the works table; export metadataScore for consistent representative selection; and persist dedup groups (fire-and-forget). Wire collapseByExistingWorks into the search/author/series routes and add defensive dedup to the refresh processor. Add HTML parsing helpers, runtime/lang-aware parsing, a jitteredBackoff cap, and tests for the new behaviors.
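The message mentions a capped, jittered backoff, but the hunk shown here does not include its code. As a rough illustration only, a cap on exponential backoff with full jitter could look like the sketch below. The function name `cappedJitteredBackoff`, its parameters, and the default values are assumptions for this example, not the repository's actual `jitteredBackoff` signature.

```typescript
// Hypothetical sketch: names, defaults, and jitter strategy are assumptions,
// not the helper that this commit actually ships.
function cappedJitteredBackoff(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  // Exponential growth, clamped so late retries never wait longer than capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick a delay uniformly in [0, ceiling) so that many workers
  // retrying after the same 503 do not all wake up at the same instant.
  return Math.floor(random() * ceiling);
}

// Usage: compute the wait before each retry of a failed page fetch.
const delaysMs = [0, 1, 2, 3, 10].map((attempt) => cappedJitteredBackoff(attempt));
```

Without the cap, attempt 10 with a 1 s base would allow waits of up to about 17 minutes; with it, no single wait exceeds `capMs`, which keeps the total runtime of a nightly job bounded even when retries are increased.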
@@ -138,16 +138,37 @@ async function persistSectionBooks(
   logger: ReturnType<typeof RMABLogger.forJob>,
   labelForErrors: string,
 ): Promise<number> {
+  // Defensive dedup: the (asin, categoryId) unique constraint means a duplicate ASIN
+  // in `books` crashes the second .create() with P2002. The HTML parser already dedupes
+  // per page and across pages against the cumulative accumulator, but a warn-on-fire
+  // signal here lets us detect upstream surprises (e.g. Audible serving the same item
+  // in both a carousel and the main grid) without the noisy duplicate-key Postgres
+  // errors. Keep the first occurrence so Audible's editorial ordering is preserved.
+  const seenAsins = new Set<string>();
+  const dedupedBooks = books.filter((b) => {
+    if (!b?.asin || seenAsins.has(b.asin)) return false;
+    seenAsins.add(b.asin);
+    return true;
+  });
+  const droppedCount = books.length - dedupedBooks.length;
+  if (droppedCount > 0) {
+    logger.warn(
+      `Dropped ${droppedCount} duplicate ASIN(s) from ${categoryId} input list before persist`,
+    );
+  }
+
   // Wipe previous entries for this section
   logger.info(`Clearing previous data for ${categoryId}...`);
   await prisma.audibleCacheCategory.deleteMany({
     where: { categoryId },
   });
-  logger.info(`Cleared previous entries for ${categoryId}, saving ${books.length} books...`);
+  logger.info(
+    `Cleared previous entries for ${categoryId}, saving ${dedupedBooks.length} books...`,
+  );
 
   let saved = 0;
-  for (let i = 0; i < books.length; i++) {
-    const book = books[i];
+  for (let i = 0; i < dedupedBooks.length; i++) {
+    const book = dedupedBooks[i];
     try {
       // Cache thumbnail if coverArtUrl exists
       let cachedCoverPath: string | null = null;
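The keep-first dedup in this hunk can be exercised on its own. The sketch below pulls the same filter logic into a standalone function so its behavior is easy to check. `BookLike` and `dedupeByAsin` are hypothetical names introduced for illustration; the real `persistSectionBooks` does this inline and logs through `logger.warn`.

```typescript
// Hypothetical stand-in for the book records; only `asin` matters here.
interface BookLike {
  asin?: string | null;
}

// Keep the first occurrence of each ASIN so the input ordering is preserved.
// As in the hunk above, entries with a missing or empty ASIN are also removed,
// so `dropped` counts both true duplicates and ASIN-less records.
function dedupeByAsin<T extends BookLike>(
  books: ReadonlyArray<T | null | undefined>,
): { kept: T[]; dropped: number } {
  const seen = new Set<string>();
  const kept: T[] = [];
  for (const book of books) {
    if (!book || !book.asin || seen.has(book.asin)) continue;
    seen.add(book.asin);
    kept.push(book);
  }
  return { kept, dropped: books.length - kept.length };
}

// Usage: the same ASIN appears in a carousel and in the main grid.
const result = dedupeByAsin([
  { asin: "B001", title: "First" },
  { asin: "B002", title: "Second" },
  { asin: "B001", title: "First (carousel copy)" },
  { asin: "", title: "No ASIN" },
]);
```

Filtering before the write loop means the unique constraint on (asin, categoryId) is never hit by a repeated ASIN within one batch, and the dropped count gives a single warning in place of one database error per duplicate.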