ReadMeABook/documentation/fixes/file-hash-matching.md
kikootwo a97979358f Implement file hash-based library matching and remove fuzzy ASIN matching
Adds file hash-based matching for Audiobookshelf library items to ensure 100% accurate ASIN assignment for RMAB-organized content. Removes fuzzy matching from library availability checks, making all matching ASIN-only to eliminate false positives and race conditions. Updates database schema, processors, and matcher utilities; adds new tests and documentation for the new matching strategy. Removes obsolete scripts, Dockerfile, and related tests; updates docker-compose for test environments.
2026-01-28 11:42:00 -05:00

File Hash-Based Library Matching

Status: Implemented | Accurate ASIN matching for RMAB-organized audiobooks

Overview

Eliminates false-positive matches from Audiobookshelf's fuzzy Audible search by matching RMAB-downloaded content on a hash of its audio filenames instead.

Problem

  • New ABS items without ASIN → fuzzy Audible search by title/author
  • Risk: Wrong book matches (e.g., "Foundation" → "Foundation and Empire")
  • Result: Incorrect metadata, false positives

Solution

File Hash Matching Strategy:

  1. Generate SHA256 hash of audio filenames during organization
  2. Store hash in Audiobook.filesHash field
  3. During library scan: compare ABS item files against database hashes
  4. Match found → Use request's ASIN for 100% accurate metadata
  5. No match → Fallback to fuzzy search (external content)

How It Works

Organization Phase

File: src/lib/processors/organize-files.processor.ts

const filesHash = generateFilesHash(result.audioFiles);
await prisma.audiobook.update({
  where: { id: audiobookId },  // assumed: ID of the audiobook being organized
  data: {
    filesHash,  // SHA256 of sorted audio filenames
    // ... other fields
  }
});

Library Scan Phase

Files: scan-plex.processor.ts, plex-recently-added.processor.ts

Phase 1: File Hash Matching (Items WITHOUT ASIN)

const itemsWithoutAsin = libraryItems.filter(item => !item.asin && item.externalId);

for (const item of itemsWithoutAsin) {
  // 1. Fetch ABS item details
  const absItem = await getABSItem(item.externalId);

  // 2. Generate hash from ABS audio filenames
  const audioFilenames = absItem.media.audioFiles.map(f => f.metadata.filename);
  const itemHash = generateFilesHash(audioFilenames);

  // 3. Query for matching RMAB download
  const matched = await prisma.audiobook.findFirst({
    where: { filesHash: itemHash, status: 'completed' }
  });

  // 4. Trigger metadata match (with ASIN if matched, undefined if not)
  await triggerABSItemMatch(item.externalId, matched?.audibleAsin);
}

Phase 2: Request Matching

// Match requests to library items and mark as available
const match = await findPlexMatch({
  asin: audiobook.audibleAsin,
  title: audiobook.title,
  author: audiobook.author
});

if (match) {
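  // Update audiobook and request status
  // (the where clauses use assumed identifiers; Prisma's update() requires one)
  await prisma.audiobook.update({
    where: { id: audiobook.id },
    data: { absItemId: match.plexGuid }
  });
  await prisma.request.update({
    where: { id: request.id },
    data: { status: 'available' }
  });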

  // No metadata match triggering needed:
  // - Items without ASIN: Already handled in Phase 1
  // - Items with ASIN: Already have correct metadata
}

Hash Generation Algorithm

File: src/lib/utils/files-hash.ts

Process:

  1. Extract basenames from file paths
  2. Filter to audio extensions: .m4b, .m4a, .mp3, .mp4, .aa, .aax
  3. Normalize to lowercase (case-insensitive)
  4. Sort alphabetically (deterministic order)
  5. Generate SHA256: crypto.createHash('sha256').update(JSON.stringify(sorted)).digest('hex')

Properties:

  • Deterministic: Same files → same hash (regardless of order/path)
  • Path-agnostic: Only basenames matter
  • Case-insensitive: "CHAPTER 01.mp3" === "chapter 01.mp3"
  • Fast: Indexed database lookup (O(log n) with a default B-tree index)
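The five steps above can be sketched as follows. This is a minimal illustration built from the process description, not a copy of src/lib/utils/files-hash.ts; the real function may handle empty input or unusual paths differently, and the constant name AUDIO_EXTENSIONS is an assumption.

```typescript
import { createHash } from "node:crypto";
import { basename, extname } from "node:path";

// Audio extensions listed in step 2 (constant name is illustrative).
const AUDIO_EXTENSIONS = new Set([".m4b", ".m4a", ".mp3", ".mp4", ".aa", ".aax"]);

// Sketch only: mirrors the documented steps, not the actual implementation.
export function generateFilesHash(filePaths: string[]): string {
  const sorted = filePaths
    .map((p) => basename(p).toLowerCase()) // steps 1 and 3: basename, lowercase
    .filter((name) => AUDIO_EXTENSIONS.has(extname(name))) // step 2: audio files only
    .sort(); // step 4: deterministic order
  // Step 5: SHA256 over the JSON-encoded, sorted filename list.
  return createHash("sha256").update(JSON.stringify(sorted)).digest("hex");
}

// Same audio filenames under different folders, casing, and order.
const a = generateFilesHash([
  "/downloads/Book/Chapter 02.mp3",
  "/downloads/Book/CHAPTER 01.MP3",
  "/downloads/Book/cover.jpg", // ignored: not an audio extension
]);
const b = generateFilesHash([
  "/library/Author/Book/chapter 01.mp3",
  "/library/Author/Book/chapter 02.mp3",
]);
console.log(a === b); // true
```

Because only filenames are hashed, two different books whose files share identical names (for example "01.mp3", "02.mp3") would produce the same hash; the strategy relies on RMAB-organized filenames being distinctive.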

Database Schema

Model: Audiobook

model Audiobook {
  // ... existing fields
  filesHash String? @map("files_hash") @db.Text  // SHA256 hex digest (64 chars)

  @@index([filesHash])  // Indexed for fast lookups
}

Migration: 20260126100000_add_audiobook_files_hash

Implementation Details

Metadata Match Strategy

Phase 1 (File Hash): Handle NEW items WITHOUT ASIN

  • Filter: libraryItems.filter(item => !item.asin && item.externalId)
  • Trigger metadata match with file-hash-matched ASIN or undefined
  • This is the ONLY phase that triggers ABS metadata matching

Phase 2 (Request Match): Match requests, no metadata triggering

  • Match requests to library items by ASIN/title/author
  • Update request status to 'available'
  • No metadata match triggering - items either:
    • Were handled in Phase 1 (new items without ASIN)
    • Already have correct metadata (items with ASIN from ABS)

Why This Works:

  • Single source of truth: Only file hash phase triggers metadata matching
  • No redundant API calls: Items with ASIN already have correct metadata
  • Clean separation: Phase 1 = metadata, Phase 2 = request matching
  • Simple and efficient: No duplicate checks, no wasted API calls
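The separation of phases can be illustrated with a pure function that decides, for Phase 1 only, which items get a metadata match and with which ASIN. The types and the name planMetadataTriggers are hypothetical; the real processors fetch ABS item details and query the database rather than receiving maps.

```typescript
// Hypothetical shapes for illustration; the real types live in the RMAB codebase.
interface LibraryItem {
  externalId?: string;
  asin?: string;
}

interface MetadataTrigger {
  externalId: string;
  asin: string | undefined; // undefined means the fuzzy fallback is used
}

// itemHashes: externalId -> hash computed from the ABS item's audio filenames.
// knownHashes: filesHash -> ASIN of a completed RMAB download.
function planMetadataTriggers(
  items: LibraryItem[],
  itemHashes: Map<string, string>,
  knownHashes: Map<string, string>,
): MetadataTrigger[] {
  const triggers: MetadataTrigger[] = [];
  for (const item of items) {
    // Items that already carry an ASIN are never re-matched.
    if (item.asin || !item.externalId) continue;
    const hash = itemHashes.get(item.externalId);
    const asin = hash === undefined ? undefined : knownHashes.get(hash);
    triggers.push({ externalId: item.externalId, asin });
  }
  return triggers;
}

const triggers = planMetadataTriggers(
  [
    { externalId: "abs-1" }, // RMAB download, no ASIN yet
    { externalId: "abs-2" }, // external import, no known hash
    { externalId: "abs-3", asin: "B08G9PRS1K" }, // already has an ASIN
  ],
  new Map([["abs-1", "hash-a"], ["abs-2", "hash-b"]]),
  new Map([["hash-a", "B08G9PRS1K"]]),
);
console.log(triggers);
```

Here abs-1 is triggered with its hash-matched ASIN, abs-2 is triggered with undefined (fuzzy fallback), and abs-3 is skipped entirely, which is why Phase 2 never needs to trigger metadata matching.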

Edge Cases

Externally-Added Content

  • User manually imports audiobook to ABS (not via RMAB)
  • No matching filesHash in database
  • Fallback: Fuzzy metadata match (current behavior preserved)

Modified Files

  • User adds, removes, or renames audio files after organization
  • ABS hash won't match RMAB hash
  • Fallback: Fuzzy metadata match

Existing Content (Before Feature)

  • Audiobooks organized before hash feature
  • filesHash field is NULL
  • Behavior: Continues using fuzzy matching
  • Future: Admin job could backfill hashes (out of scope)

Chapter-Merged Files

  • 20 MP3s → 1 M4B via chapter merging
  • Hash generated AFTER merging
  • Works correctly: Hash reflects final organized state

Multiple Downloads (Same Book)

  • User re-downloads same audiobook (different edition/request)
  • Multiple records with same filesHash
  • Solution: findFirst() returns one matching record (which one is unspecified without an orderBy); acceptable when the duplicates share the same ASIN
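If duplicate records could ever carry different ASINs (for example, two editions with identical filenames), one possible hardening is to accept a hash match only when every candidate agrees. This is a hypothetical sketch, not the documented behavior, which uses findFirst(); the HashMatch type and resolveAsin name are assumptions.

```typescript
// Hypothetical record shape: completed audiobooks that share one filesHash.
interface HashMatch {
  id: string;
  audibleAsin: string;
}

// Returns the shared ASIN, or undefined when there is no match or the
// candidates disagree (the caller would then fall back to fuzzy matching).
function resolveAsin(matches: HashMatch[]): string | undefined {
  const first = matches[0];
  if (first === undefined) return undefined;
  const allAgree = matches.every((m) => m.audibleAsin === first.audibleAsin);
  return allAgree ? first.audibleAsin : undefined;
}

console.log(resolveAsin([
  { id: "1", audibleAsin: "B08G9PRS1K" },
  { id: "2", audibleAsin: "B08G9PRS1K" },
])); // "B08G9PRS1K"
```

With this variant the candidates would come from findMany() rather than findFirst(), at the cost of fetching a few extra rows.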

Performance

Storage:

  • New index: each entry stores the 64-character hash plus per-row overhead (small relative to the rest of the record)
  • SHA256 hash: 64 characters per record

API Calls:

  • One additional getABSItem() call per item without ASIN
  • Typical response: ~1-5KB JSON
  • Latency: ~50-100ms per call

Database:

  • Index lookup: O(log n) with a default B-tree index (negligible at typical library sizes)

Impact:

  • 10 items without ASIN → +500-1000ms per scan (acceptable)
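The impact figure follows directly from the per-call latency, since the scan loop awaits each getABSItem() call in sequence. A small worked example, assuming the 50-100ms range above and ignoring the database lookup and the metadata trigger itself (the function name is illustrative):

```typescript
// Returns [min, max] added scan time in milliseconds for sequential calls.
function estimateScanOverheadMs(
  itemsWithoutAsin: number,
  minLatencyMs = 50,
  maxLatencyMs = 100,
): [number, number] {
  return [itemsWithoutAsin * minLatencyMs, itemsWithoutAsin * maxLatencyMs];
}

console.log(estimateScanOverheadMs(10)); // [ 500, 1000 ]
```

The overhead grows linearly with the number of items lacking an ASIN, so a first scan of a large, freshly imported library would take proportionally longer than the 10-item case.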

Logging

Organization:

[INFO] Generated files hash: abc123def456... (5 audio files)

Library Scan (Match Found):

[INFO] File hash match found for "Foundation" → ASIN: B08G9PRS1K (from "Foundation (Unabridged)")
[INFO] Triggered metadata match with ASIN B08G9PRS1K for: "Foundation"

Library Scan (No Match):

[INFO] No file match found, triggering fuzzy metadata match for: "The Expanse"

Benefits

  • 100% Accurate Matching: RMAB-organized content always gets correct ASIN
  • Path-Agnostic: Works regardless of folder structure differences
  • Fast Lookups: Single database query against an indexed field
  • Graceful Fallback: External content still works via fuzzy matching
  • No Breaking Changes: Existing content continues working

Testing

Unit Tests: tests/utils/files-hash.test.ts

  • Hash generation correctness
  • Deterministic behavior
  • Edge case handling

Integration Tests: tests/processors/*.test.ts

  • Hash storage during organization
  • Hash matching during library scan
  • Fallback to fuzzy matching