Implement file hash-based library matching and remove fuzzy ASIN matching

Adds file hash-based matching for Audiobookshelf library items to ensure 100% accurate ASIN assignment for RMAB-organized content. Removes fuzzy matching from library availability checks, making all matching ASIN-only to eliminate false positives and race conditions. Updates database schema, processors, and matcher utilities; adds new tests and documentation for the new matching strategy. Removes obsolete scripts, Dockerfile, and related tests; updates docker-compose for test environments.
kikootwo
2026-01-28 10:32:14 -05:00
parent 497849f427
commit a97979358f
111 changed files with 6571 additions and 1426 deletions
+4 -3
@@ -43,9 +43,10 @@ Result: Douglas Adams/Stephen Fry/The Hitchhiker's Guide to the Galaxy/
5. **Copy** files (not move - originals stay for seeding)
6. **Tag metadata** (if enabled) - writes correct title, author, narrator, ASIN to audio files
7. Copy cover art if found, else download from Audible
8. Update request status to `downloaded`
9. **Trigger filesystem scan** (if enabled) - tells Plex/ABS to scan for new files
10. Originals remain until seeding requirements met
8. **Generate file hash** - SHA256 of sorted audio filenames for library matching (see: [fixes/file-hash-matching.md](../fixes/file-hash-matching.md))
9. Update request status to `downloaded` and store file hash in `audiobooks.files_hash`
10. **Trigger filesystem scan** (if enabled) - tells Plex/ABS to scan for new files
11. Originals remain until seeding requirements met
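
Step 8 could be sketched roughly as below. This is an illustration under stated assumptions, not the project's implementation: the function name `hashAudioFilenames`, the extension list, and the newline separator are all hypothetical. Only "SHA256 of sorted audio filenames" comes from the step itself.

```typescript
import { createHash } from "node:crypto";
import * as path from "node:path";

// Assumed set of audio extensions; the real list may differ.
const AUDIO_EXTENSIONS = new Set([".m4b", ".m4a", ".mp3", ".flac", ".ogg", ".opus"]);

// Hashes the sorted audio filenames (not the file contents), so the result
// does not depend on directory listing order or on the parent directory.
function hashAudioFilenames(filePaths: string[]): string {
  const names = filePaths
    .map((p) => path.basename(p))
    .filter((name) => AUDIO_EXTENSIONS.has(path.extname(name).toLowerCase()))
    .sort();
  // Joining with "\n" is an assumption; any unambiguous separator would do.
  return createHash("sha256").update(names.join("\n")).digest("hex");
}
```

Because only filenames feed the hash, two folders holding the same set of audio files produce the same value even when cover art or other non-audio files differ.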
## Filesystem Scan Triggering
+140 -11
@@ -1,9 +1,36 @@
# Intelligent Ranking Algorithm
**Status:** ✅ Implemented
**Status:** ✅ Implemented | Comprehensive edge case test coverage
**Tests:** tests/utils/ranking-algorithm.test.ts (73 test cases)
Evaluates and scores torrents to automatically select best audiobook download.
## Test Coverage
**Comprehensive edge case testing includes:**
- ✅ Parenthetical/bracketed content handling (4 tests)
- ✅ Structured metadata prefix validation (5 tests)
- ✅ Suffix validation (5 tests)
- ✅ Multi-author handling (6 tests)
- ✅ Bonus modifiers (indexer priority + flags, 7 tests)
- ✅ Tiebreaker sorting (2 tests)
- ✅ Word coverage edge cases (4 tests)
- ✅ Format detection (5 tests)
- ✅ **Author presence check (10 tests)**
- ✅ **Context-aware filtering (3 tests)**
- ✅ **API compatibility (2 tests)**
**Tested edge cases prevent regressions from previous tweaks:**
- "We Are Legion (We Are Bob)" matching with/without subtitle
- "This Inevitable Ruin Dungeon Crawler Carl" NOT matching "Dungeon Crawler Carl"
- "The Housemaid's Secret" NOT matching "The Housemaid"
- Multiple author splitting and role filtering
- Flag bonus stacking and case-insensitive matching
- Tiebreaker sorting by publish date
- **"Project Hail Mary" (no author) NOT matching when Andy Weir required (automatic mode)**
- **All results shown in interactive mode regardless of author**
- **Middle initials, name order, and role filtering for author matching**
## Scoring Criteria (100 points max)
**1. Title/Author Match (60 pts max) - MOST IMPORTANT**
@@ -15,13 +42,35 @@ Evaluates and scores torrents to automatically select best audiobook download.
- **Parenthetical/bracketed content is optional**: Content in () [] {} treated as subtitle (may be omitted from torrents)
- "We Are Legion (We Are Bob)" → Required: ["we", "are", "legion"], Optional: ["bob"]
- "Title [Series Name]" → Required: ["title"], Optional: ["series", "name"]
- "Book Title {Extra Info}" → Required: ["book", "title"], Optional: ["extra", "info"]
- Calculates coverage: % of **required** words found in torrent title
- **Hard requirement: 80%+ coverage of required words or automatic 0 score**
- Example: "The Wild Robot on the Island" → ["wild", "robot", "island"]
- "The Wild Robot" → ["wild", "robot"] → 2/3 = 67% → **REJECTED**
- "The Wild Robot on the Island" → 3/3 = 100% → **PASSES**
- Example: "We Are Legion (We Are Bob)" → Required: ["we", "are", "legion"]
- "Dennis E. Taylor - Bobiverse - 01 - We Are Legion" → 3/3 = 100% → **PASSES**
**Stage 1.5: Author Presence Check (CONTEXT-AWARE)**
- **Automatic mode (requireAuthor: true - default):** At least ONE author must be present with high confidence
- **Interactive mode (requireAuthor: false):** Check disabled, all results shown to user
- **High confidence = any of:**
1. Exact substring match: "dennis e. taylor" in torrent
2. High fuzzy similarity (≥ 0.85): handles spacing/punctuation
3. Core components present: First name + Last name within 30 chars
- Handles variations:
- Middle initials: "Dennis E. Taylor" ↔ "Dennis Taylor"
- Name order: "Brandon Sanderson" ↔ "Sanderson, Brandon"
- Multiple authors: Only ONE needs to match (OR logic)
- Filters roles: "translator", "narrator" ignored
- **If check fails in automatic mode → automatic 0 score**
- **Prevents wrong-author matches**: Stops "Project Hail Mary" (no author) from matching request for Andy Weir
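
The presence check above might look something like the following sketch. It covers only criteria 1 (exact substring) and 3 (first and last name within 30 characters); the fuzzy-similarity criterion is left out because the similarity function is not specified here. All names (`isAuthorPresent`, `anyAuthorPresent`, `MAX_NAME_GAP`) and the way the gap is measured are assumptions.

```typescript
const ROLE_WORDS = new Set(["translator", "narrator"]);
const MAX_NAME_GAP = 30; // max characters between first and last name

// Lowercases and collapses punctuation/whitespace runs into single spaces.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Returns every start index of `needle` within `haystack`.
function positionsOf(haystack: string, needle: string): number[] {
  const found: number[] = [];
  let index = haystack.indexOf(needle);
  while (index !== -1) {
    found.push(index);
    index = haystack.indexOf(needle, index + 1);
  }
  return found;
}

function isAuthorPresent(torrentTitle: string, author: string): boolean {
  const title = normalize(torrentTitle);
  const name = normalize(author);
  if (name === "") return false;
  if (title.includes(name)) return true; // criterion 1: exact substring

  const parts = name.split(" ");
  if (parts.length < 2) return false;
  const first = parts[0];
  const last = parts[parts.length - 1];
  const lastHits = positionsOf(title, last);
  // Criterion 3: first and last name close together, in either order.
  return positionsOf(title, first).some((f) =>
    lastHits.some((l) => {
      const gap = f < l ? l - (f + first.length) : f - (l + last.length);
      return gap >= 0 && gap <= MAX_NAME_GAP;
    }),
  );
}

// OR logic: one matching author is enough; role entries are ignored.
function anyAuthorPresent(torrentTitle: string, authors: string[]): boolean {
  return authors
    .filter((a) => !ROLE_WORDS.has(normalize(a)))
    .some((a) => isAuthorPresent(torrentTitle, a));
}
```

Plain substring search means a short first name can match inside a longer word, which a real implementation would likely guard against with word boundaries.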
**Edge Cases - Coverage Examples:**
- "The Wild Robot on the Island" → ["wild", "robot", "island"]
- ✅ "The Wild Robot on the Island" → 3/3 = 100% → **PASSES**
- ❌ "The Wild Robot" → 2/3 = 67% → **REJECTED**
- "We Are Legion (We Are Bob)" → Required: ["we", "are", "legion"]
- ✅ "Dennis E. Taylor - Bobiverse - 01 - We Are Legion" → 3/3 = 100% → **PASSES**
- ✅ "We Are Legion (We Are Bob)" → 3/3 = 100% → **PASSES**
- "Harry Potter and the Philosopher Stone" → ["harry", "potter", "philosopher", "stone"] (stop words filtered)
- ✅ "Harry Potter Philosopher Stone" → 4/4 = 100% → **PASSES**
- ❌ "Harry Potter" → 2/4 = 50% → **REJECTED**
- Prevents wrong series books from matching while handling common subtitle patterns
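
The coverage rule can be illustrated with a short sketch. The stop-word list and the function names are assumptions (the examples above only show "the", "on" and "and" being dropped); the 80% threshold and the treatment of bracketed content follow the description.

```typescript
// Assumed stop-word list; the real list may be longer or shorter.
const STOP_WORDS = new Set(["the", "a", "an", "and", "on", "of"]);
const MIN_COVERAGE = 0.8;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

// Required words: the title with (), [] and {} content removed, minus stop words.
function requiredWords(title: string): string[] {
  const withoutOptional = title.replace(/\([^)]*\)|\[[^\]]*\]|\{[^}]*\}/g, " ");
  return tokenize(withoutOptional).filter((w) => !STOP_WORDS.has(w));
}

// Fraction of required words that appear as whole words in the torrent title.
function coverage(requestTitle: string, torrentTitle: string): number {
  const required = requiredWords(requestTitle);
  if (required.length === 0) return 0; // nothing to match against
  const present = new Set(tokenize(torrentTitle));
  const found = required.filter((w) => present.has(w)).length;
  return found / required.length;
}

function passesCoverage(requestTitle: string, torrentTitle: string): boolean {
  return coverage(requestTitle, torrentTitle) >= MIN_COVERAGE;
}
```

A torrent that fails `passesCoverage` would receive the automatic 0 score described in Stage 1.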
**Stage 2: Title Matching (0-45 pts)**
@@ -35,22 +84,44 @@ Evaluates and scores torrents to automatically select best audiobook download.
- Title preceded by metadata separator (` - `, `: `, `—`) — handles "Author - Series - 01 - Title"
- Author name appears in prefix — handles "Author Name - Title"
- **Acceptable suffix**: Followed by metadata markers: " by", " [", " -", " (", " {", " :", "," or end of string
- Also accepts author name in suffix (e.g., "Title AuthorName Year")
- Complete match → 45 pts
- Unstructured prefix (words without separators) → fuzzy similarity (partial credit)
- Prevents: "This Inevitable Ruin Dungeon Crawler Carl" matching "Dungeon Crawler Carl"
- Suffix continues with non-metadata → fuzzy similarity (partial credit)
- Prevents: "The Housemaid's Secret" matching "The Housemaid"
- No substring match → fuzzy similarity (best score from full or required title)
**Edge Cases - Prefix Validation:**
- ✅ "Brandon Sanderson - Mistborn - 01 - The Final Empire" (structured metadata prefix)
- ✅ "Brandon Sanderson The Way of Kings" (author name in prefix)
- ✅ "Series Name: Book Title" (colon separator)
- ✅ "Author Name — Book Title" (em-dash separator)
- ❌ "This Inevitable Ruin Dungeon Crawler Carl" → REJECTED for "Dungeon Crawler Carl" (unstructured words before title)
**Edge Cases - Suffix Validation:**
- ✅ "The Great Book by Author Name" (metadata marker " by")
- ✅ "Book Title [Unabridged] (2024)" (bracketed metadata)
- ✅ "Book Title John Smith 2024" (author name in suffix)
- ✅ "Author - Book Title" (title at end of string)
- ❌ "The Housemaid's Secret - Freida McFadden" → REJECTED for "The Housemaid" (suffix continues with "'s Secret")
**Stage 3: Author Matching (0-15 pts)**
- Exact substring match → proportional credit
- No exact match → fuzzy similarity (partial credit)
- Splits authors on delimiters (comma, &, "and", " - ")
- Filters out roles ("translator", "narrator")
- Order-independent, no structure assumptions
- Ensures correct book is selected over wrong book with better format
**Edge Cases - Multi-Author Handling:**
- ✅ "Jane Doe, John Smith" → splits on comma
- ✅ "Jane Doe & John Smith" → splits on ampersand
- ✅ "Jane Doe and John Smith" → splits on "and"
- ✅ "Jane Doe, translator" → filters out "translator" role
- ✅ "Jane Doe, narrator" → filters out "narrator" role
- Proportional credit: If 1 of 3 authors matches → 5 pts (1/3 × 15)
- Proportional credit: If 2 of 3 authors match → 10 pts (2/3 × 15)
- Full credit: If all authors match → 15 pts
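
As a rough sketch of the splitting and proportional-credit rules, assuming hypothetical helper names and plain case-insensitive substring matching (the fuzzy fallback for non-exact matches is omitted):

```typescript
const AUTHOR_ROLES = new Set(["translator", "narrator"]);
const MAX_AUTHOR_POINTS = 15;

// Splits on comma, "&", the word "and", and " - ", then drops role entries.
function splitAuthors(authorField: string): string[] {
  return authorField
    .split(/\s*,\s*|\s*&\s*|\s+and\s+|\s+-\s+/i)
    .map((a) => a.trim())
    .filter((a) => a.length > 0 && !AUTHOR_ROLES.has(a.toLowerCase()));
}

// Proportional credit: (matched authors / total authors) x 15.
function authorScore(authorField: string, torrentTitle: string): number {
  const authors = splitAuthors(authorField);
  if (authors.length === 0) return 0;
  const haystack = torrentTitle.toLowerCase();
  const matched = authors.filter((a) => haystack.includes(a.toLowerCase())).length;
  return (matched * MAX_AUTHOR_POINTS) / authors.length;
}
```

The "and" delimiter requires whitespace on both sides, so names that merely contain those letters (such as "Sanderson") are not split.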
**2. Format Quality (25 pts max)**
- M4B with chapters: 25
- M4B without chapters: 22
@@ -93,6 +164,16 @@ Evaluates and scores torrents to automatically select best audiobook download.
- Universal across all indexers (not indexer-specific)
- Multiple flag bonuses stack (additive)
**Edge Cases - Flag Matching:**
- ✅ "FREELEECH" matches config "freeleech" (case-insensitive)
- ✅ " Freeleech " matches config " Freeleech " (whitespace-trimmed)
- ✅ Multiple flags: ["Freeleech", "Double Upload"] → both bonuses applied
- Example stacking: Freeleech (+50%) + Double Upload (+25%) on 80 base score
- Freeleech bonus: 80 × 0.5 = +40
- Double Upload bonus: 80 × 0.25 = +20
- Total bonus: +60 points
- Final score: 80 + 60 = 140
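
The stacking arithmetic above can be reproduced with a small sketch. The `FlagBonus` shape and the function name are hypothetical (in particular, the `name` field is assumed); the percentage modifier and additive stacking follow the description.

```typescript
// Hypothetical config shape; `name` is an assumed field.
interface FlagBonus {
  name: string;
  modifier: number; // -100 to 100 (percentage of the base score)
}

// Each matching flag adds (base x modifier%) and the bonuses are summed.
function applyFlagBonuses(
  baseScore: number,
  torrentFlags: string[],
  configs: FlagBonus[],
): number {
  // Matching is case-insensitive and ignores surrounding whitespace.
  const present = new Set(torrentFlags.map((f) => f.trim().toLowerCase()));
  let bonus = 0;
  for (const config of configs) {
    if (present.has(config.name.trim().toLowerCase())) {
      bonus += (baseScore * config.modifier) / 100;
    }
  }
  return baseScore + bonus;
}
```

Each bonus is computed from the base score rather than from the running total, which is why the two bonuses add up to +60 instead of compounding.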
**Future Modifiers (planned):**
- User preferences
- Custom rules
@@ -114,6 +195,14 @@ When multiple torrents have identical final scores:
- Ensures latest uploads are preferred when quality is equal
- Example: 3 torrents with 171 final score → newest upload ranks #1
**Edge Cases - Tiebreaker Examples:**
- ✅ Same score, different dates:
- Torrent A: Score 85, published 2024-06-01 → **Ranks #1**
- Torrent B: Score 85, published 2023-01-01 → Ranks #2
- ❌ Different scores (tiebreaker not applied, date ignored):
- Torrent A: Score 95, published 2020-01-01 → **Ranks #1** (better match wins despite older date)
- Torrent B: Score 75, published 2024-01-01 → Ranks #2
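
A comparator along these lines would produce the ordering in the examples. The field names `finalScore` and `publishDate` are assumptions made for this sketch.

```typescript
// Hypothetical minimal shape for a ranked result.
interface ScoredTorrent {
  title: string;
  finalScore: number;
  publishDate: Date;
}

// Sorts by score (highest first); equal scores fall back to newest upload first.
function sortByScoreThenDate(torrents: ScoredTorrent[]): ScoredTorrent[] {
  return [...torrents].sort(
    (a, b) =>
      b.finalScore - a.finalScore ||
      b.publishDate.getTime() - a.publishDate.getTime(),
  );
}
```

The date is only consulted when the score difference is exactly 0, so an older torrent with a higher score still ranks first.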
## Interface
```typescript
@@ -122,6 +211,12 @@ interface IndexerFlagConfig {
modifier: number; // -100 to 100 (percentage)
}
interface RankTorrentsOptions {
indexerPriorities?: Map<number, number>; // indexerId -> priority (1-25)
flagConfigs?: IndexerFlagConfig[]; // Flag bonus configurations
requireAuthor?: boolean; // Enforce author check (default: true)
}
interface BonusModifier {
type: 'indexer_priority' | 'indexer_flag' | 'custom';
value: number; // Multiplier (e.g., 0.4 for 40%)
@@ -149,12 +244,46 @@ interface RankedTorrent extends TorrentResult {
};
}
// New API (recommended)
function rankTorrents(
torrents: TorrentResult[],
audiobook: AudiobookRequest,
indexerPriorities?: Map<number, number>, // indexerId -> priority (1-25)
flagConfigs?: IndexerFlagConfig[] // Flag bonus configurations
options?: RankTorrentsOptions
): RankedTorrent[];
// Legacy API (backwards compatible)
function rankTorrents(
torrents: TorrentResult[],
audiobook: AudiobookRequest,
indexerPriorities?: Map<number, number>,
flagConfigs?: IndexerFlagConfig[]
): RankedTorrent[];
```
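
One way to support both call styles is to normalize the trailing arguments before ranking. The sketch below is an assumption about how that could be done, not the project's code; the `...Sketch` types and `normalizeRankOptions` are hypothetical stand-ins.

```typescript
// Stand-in types for this sketch only.
interface FlagConfigSketch {
  name: string;
  modifier: number;
}

interface RankOptionsSketch {
  indexerPriorities?: Map<number, number>;
  flagConfigs?: FlagConfigSketch[];
  requireAuthor?: boolean;
}

// Accepts either an options object (new style) or a priorities Map followed
// by flag configs (legacy style) and returns fully populated options.
function normalizeRankOptions(
  optionsOrPriorities?: RankOptionsSketch | Map<number, number>,
  legacyFlagConfigs?: FlagConfigSketch[],
): Required<RankOptionsSketch> {
  const options: RankOptionsSketch =
    optionsOrPriorities instanceof Map
      ? { indexerPriorities: optionsOrPriorities, flagConfigs: legacyFlagConfigs }
      : optionsOrPriorities ?? { flagConfigs: legacyFlagConfigs };
  return {
    indexerPriorities: options.indexerPriorities ?? new Map<number, number>(),
    flagConfigs: options.flagConfigs ?? [],
    requireAuthor: options.requireAuthor ?? true, // default: enforce author check
  };
}
```

Legacy callers cannot pass `requireAuthor`, so under this approach they would always get the strict default.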
## Usage Examples
**Automatic selection (strict author filtering):**
```typescript
// Background job - safe auto-download
const ranked = rankTorrents(torrents, audiobook, {
indexerPriorities,
flagConfigs,
requireAuthor: true // Default - prevents wrong authors
});
const topResult = ranked[0]; // Safe to auto-download
```
**Interactive search (show all results):**
```typescript
// User browsing - let user decide
const ranked = rankTorrents(torrents, audiobook, {
indexerPriorities,
flagConfigs,
requireAuthor: false // Show everything, including edge cases
});
return ranked; // User can see torrents without author info
```
## Tech Stack
+37 -1
@@ -24,7 +24,10 @@ Free, open-source Usenet/NZB download client with comprehensive Web API. Industr
**GET /api?mode=history&limit=100&output=json&apikey={key}** - Get completed/failed downloads
**GET /api?mode=pause&value={nzbId}&output=json&apikey={key}** - Pause download
**GET /api?mode=resume&value={nzbId}&output=json&apikey={key}** - Resume download
**GET /api?mode=queue&name=delete&value={nzbId}&del_files={0|1}&output=json&apikey={key}** - Delete download
**GET /api?mode=queue&name=delete&value={nzbId}&del_files={0|1}&output=json&apikey={key}** - Delete download from queue
**GET /api?mode=history&name=delete&value={nzbId}&del_files={0|1}&archive={0|1}&output=json&apikey={key}** - Delete/archive download from history
- `archive=1` (default): Move to hidden archive (preserves for troubleshooting)
- `archive=0`: Permanently delete from history
**GET /api?mode=get_config&output=json&apikey={key}** - Get configuration (categories)
**GET /api?mode=set_config&section=categories&keyword={cat}&value={path}&output=json&apikey={key}** - Create/update category
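
For illustration, the history delete/archive request listed above could be assembled as follows. The function and parameter names are hypothetical; the query parameters are the ones shown in the endpoint.

```typescript
interface HistoryDeleteParams {
  baseUrl: string; // e.g. "http://localhost:8080" (example value only)
  apiKey: string;
  nzbId: string;
  deleteFiles: boolean;
  archive: boolean; // true = move to hidden archive, false = delete permanently
}

// Builds the GET URL for deleting or archiving one job from history.
function buildHistoryDeleteUrl(params: HistoryDeleteParams): string {
  // Note: "/api" replaces any path already present in baseUrl.
  const url = new URL("/api", params.baseUrl);
  url.search = new URLSearchParams({
    mode: "history",
    name: "delete",
    value: params.nzbId,
    del_files: params.deleteFiles ? "1" : "0",
    archive: params.archive ? "1" : "0",
    output: "json",
    apikey: params.apiKey,
  }).toString();
  return url.toString();
}
```

Using `URLSearchParams` keeps the API key and job ID correctly percent-encoded if they contain reserved characters.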
@@ -179,6 +182,38 @@ interface HistoryItem {
**4. Queue vs History Logic** - Checks queue first, falls back to history
**5. SSL Certificate Errors** - Optional SSL verification disable for self-signed certs
## Automatic Cleanup
**Per-Indexer Configuration:**
- Usenet indexers have "Remove After Processing" option (default: enabled)
- When enabled, NZB downloads are automatically cleaned up after files are organized
- Saves disk space by removing completed download files
**Two-Stage Cleanup Process:**
1. **Filesystem Cleanup:** Manually deletes download directory/files using `fs.rm()`
- Removes extracted files from category download directory
- Handles both single files and directories recursively
- Gracefully handles already-deleted files (ENOENT)
2. **SABnzbd Archive:** Archives NZB from history (hides from UI)
- Uses SABnzbd's archive feature (default: `archive=1`)
- Preserves job in hidden archive for troubleshooting/auditing
- Does NOT permanently delete from history
- Does NOT attempt queue deletion (if still in queue, something went wrong)
**Implementation:**
- Location: `organize-files.processor.ts`
- After file organization completes, checks if indexer has `removeAfterProcessing` enabled
- Filesystem cleanup performed first (critical for disk space)
- SABnzbd archive performed second (UI cleanup)
- Non-blocking: logs warnings but doesn't fail the job if cleanup fails
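
The two stages and the non-blocking behaviour might be sketched like this. It is a simplified illustration, not the code in `organize-files.processor.ts`: `removeDownload`, `cleanupAfterOrganize`, and the `archiveJob` callback are hypothetical names.

```typescript
import { rm } from "node:fs/promises";

// Stage 1: delete the download path; a missing path (ENOENT) counts as success.
async function removeDownload(downloadPath: string): Promise<void> {
  try {
    await rm(downloadPath, { recursive: true });
  } catch (err) {
    if ((err as { code?: string }).code !== "ENOENT") throw err;
  }
}

// Runs both stages in order. Failures are logged as warnings and never
// rethrown, so the surrounding job does not fail because of cleanup.
async function cleanupAfterOrganize(
  downloadPath: string,
  archiveJob: () => Promise<void>, // hypothetical: calls the history archive API
  warn: (message: string) => void = console.warn,
): Promise<void> {
  try {
    await removeDownload(downloadPath);
  } catch (err) {
    warn(`Filesystem cleanup failed: ${String(err)}`);
  }
  try {
    await archiveJob();
  } catch (err) {
    warn(`SABnzbd archive failed: ${String(err)}`);
  }
}
```

Each stage has its own `try`/`catch`, so a filesystem error does not prevent the archive call from running.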
**Why Archive Instead of Delete:**
- Preserves download history for troubleshooting
- Maintains records for duplicate detection
- Allows reviewing past downloads if issues arise
- Can be viewed in SABnzbd by toggling "Show Archive" in history
## Comparison: SABnzbd vs qBittorrent
| Feature | SABnzbd | qBittorrent |
@@ -190,6 +225,7 @@ interface HistoryItem {
| Seeding | N/A (Usenet is not P2P) | Required (tracker) |
| Categories | Path-based | Path + tag-based |
| File Handling | Auto-extracts archives | Downloads as-is |
| Cleanup | Automatic (optional, per-indexer) | Seeding time based |
## Tech Stack