# Intelligent Ranking Algorithm

**Status:** ✅ Implemented | Comprehensive edge case test coverage
**Tests:** tests/utils/ranking-algorithm.test.ts (73 test cases)

Evaluates and scores torrents to automatically select the best audiobook download.

## Test Coverage

**Comprehensive edge case testing includes:**
- ✅ Parenthetical/bracketed content handling (4 tests)
- ✅ Structured metadata prefix validation (5 tests)
- ✅ Suffix validation (5 tests)
- ✅ Multi-author handling (6 tests)
- ✅ Bonus modifiers (indexer priority + flags, 7 tests)
- ✅ Tiebreaker sorting (2 tests)
- ✅ Word coverage edge cases (4 tests)
- ✅ Format detection (5 tests)
- ✅ **Author presence check (10 tests)**
- ✅ **Context-aware filtering (3 tests)**
- ✅ **API compatibility (2 tests)**

**Tested edge cases prevent regressions from previous tweaks:**
- "We Are Legion (We Are Bob)" matching with/without subtitle
- "This Inevitable Ruin Dungeon Crawler Carl" NOT matching "Dungeon Crawler Carl"
- "The Housemaid's Secret" NOT matching "The Housemaid"
- Multiple author splitting and role filtering
- Flag bonus stacking and case-insensitive matching
- Tiebreaker sorting by publish date
- **"Project Hail Mary" (no author) NOT matching when Andy Weir required (automatic mode)**
- **All results shown in interactive mode regardless of author**
- **Middle initials, name order, and role filtering for author matching**

## Scoring Criteria (100 points max)

**1. Title/Author Match (60 pts max) - MOST IMPORTANT**

**Multi-Stage Matching:**

**Stage 1: Word Coverage Filter (MANDATORY)**
- Extracts significant words from request (filters stop words: "the", "a", "an", "of", "on", "in", "at", "by", "for")
- **Parenthetical/bracketed content is optional**: Content in () [] {} treated as subtitle (may be omitted from torrents)
  - "We Are Legion (We Are Bob)" → Required: ["we", "are", "legion"], Optional: ["bob"]
  - "Title [Series Name]" → Required: ["title"], Optional: ["series", "name"]
  - "Book Title {Extra Info}" → Required: ["book", "title"], Optional: ["extra", "info"]
- Calculates coverage: % of **required** words found in torrent title
- **Hard requirement: 80%+ coverage of required words or automatic 0 score**
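
The Stage 1 filter can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names (`significantWords`, `splitTitleWords`, `wordCoverage`, `passesCoverage`) and the tokenisation rules (ASCII-only, apostrophes dropped) are assumptions. Only the stop-word list, the bracket handling, and the 80% threshold come from the description above.

```typescript
// Stop words as listed above; everything else here is an illustrative assumption.
const STOP_WORDS = new Set(["the", "a", "an", "of", "on", "in", "at", "by", "for"]);
const BRACKETED = /\([^)]*\)|\[[^\]]*\]|\{[^}]*\}/g;

// Lowercase, drop apostrophes, turn other punctuation into spaces, remove stop words.
function significantWords(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/['’]/g, "")
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w));
}

// Keep the first occurrence of each word.
function unique(words: string[]): string[] {
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Words outside brackets are required; bracketed words not already required are optional.
function splitTitleWords(title: string): { required: string[]; optional: string[] } {
  const bracketedParts = title.match(BRACKETED) || [];
  const required = unique(significantWords(title.replace(BRACKETED, " ")));
  const optional = unique(significantWords(bracketedParts.join(" "))).filter(
    (w) => required.indexOf(w) === -1
  );
  return { required, optional };
}

// Fraction of required words present in the torrent title (0 to 1).
function wordCoverage(requestTitle: string, torrentTitle: string): number {
  const { required } = splitTitleWords(requestTitle);
  if (required.length === 0) return 1; // assumed: nothing required means nothing missing
  const torrentWords = new Set(significantWords(torrentTitle));
  return required.filter((w) => torrentWords.has(w)).length / required.length;
}

// Stage 1 gate: below 80% coverage the torrent scores 0.
function passesCoverage(requestTitle: string, torrentTitle: string): boolean {
  return wordCoverage(requestTitle, torrentTitle) >= 0.8;
}
```

Matching on whole words rather than substrings is a design choice of this sketch; it keeps a short required word such as "are" from being satisfied by a longer word that merely contains it.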

**Stage 1.5: Author Presence Check (CONTEXT-AWARE)**
- **Automatic mode (requireAuthor: true - default):** At least ONE author must be present with high confidence
- **Interactive mode (requireAuthor: false):** Check disabled, all results shown to user
- **High confidence = any of:**
  1. Exact substring match: "dennis e. taylor" in torrent
  2. High fuzzy similarity (≥ 0.85): handles spacing/punctuation
  3. Core components present: First name + Last name within 30 chars
- Handles variations:
  - Middle initials: "Dennis E. Taylor" ↔ "Dennis Taylor"
  - Name order: "Brandon Sanderson" ↔ "Sanderson, Brandon"
  - Multiple authors: Only ONE needs to match (OR logic)
  - Filters roles: "translator", "narrator" ignored
- **If check fails in automatic mode → automatic 0 score**
- **Prevents wrong-author matches**: Stops "Project Hail Mary" (no author) from matching request for Andy Weir
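
A rough sketch of the author presence gate, covering checks 1 (exact substring) and 3 (core components) only. The fuzzy check (2) is left out because it depends on the similarity library. All names here (`normalizeText`, `coreNamePresent`, `authorPresent`, `passesAuthorCheck`) are hypothetical, and "within 30 chars" is interpreted as the gap between the two name parts, which is an assumption. Normalisation is ASCII-only for brevity.

```typescript
const ROLE_LABELS = ["translator", "narrator"];

// Lowercase and collapse punctuation/whitespace, so "Sanderson, Brandon" becomes "sanderson brandon".
function normalizeText(s: string): string {
  return s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Check 3: first and last name both appear as whole words,
// at most maxGap characters apart, in either order.
function coreNamePresent(author: string, torrentTitle: string, maxGap = 30): boolean {
  const parts = normalizeText(author).split(" ").filter((p) => p.length > 0);
  if (parts.length < 2) return false;
  const first = parts[0];
  const last = parts[parts.length - 1];
  const text = ` ${normalizeText(torrentTitle)} `;
  const i = text.indexOf(` ${first} `);
  const j = text.indexOf(` ${last} `);
  if (i === -1 || j === -1) return false;
  const gap = i < j ? j - (i + first.length + 1) : i - (j + last.length + 1);
  return gap <= maxGap;
}

// OR logic: one confidently matched author is enough. Role labels are ignored.
function authorPresent(authors: string[], torrentTitle: string): boolean {
  const text = ` ${normalizeText(torrentTitle)} `;
  return authors
    .map(normalizeText)
    .filter((a) => a.length > 0 && ROLE_LABELS.indexOf(a) === -1)
    .some((a) => text.indexOf(` ${a} `) !== -1 || coreNamePresent(a, torrentTitle));
}

// Stage 1.5 gate: only enforced when requireAuthor is true (automatic mode).
function passesAuthorCheck(authors: string[], torrentTitle: string, requireAuthor = true): boolean {
  return !requireAuthor || authorPresent(authors, torrentTitle);
}
```

With this sketch, `passesAuthorCheck(["Andy Weir"], "Project Hail Mary")` fails in automatic mode, while the same call with `requireAuthor` set to `false` lets the result through for the user to judge.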

**Edge Cases - Coverage Examples:**
- "The Wild Robot on the Island" → ["wild", "robot", "island"]
  - ✅ "The Wild Robot on the Island" → 3/3 = 100% → **PASSES**
  - ❌ "The Wild Robot" → 2/3 = 67% → **REJECTED**
- "We Are Legion (We Are Bob)" → Required: ["we", "are", "legion"]
  - ✅ "Dennis E. Taylor - Bobiverse - 01 - We Are Legion" → 3/3 = 100% → **PASSES**
  - ✅ "We Are Legion (We Are Bob)" → 3/3 = 100% → **PASSES**
- "Harry Potter and the Philosopher Stone" → ["harry", "potter", "philosopher", "stone"] (stop words filtered)
  - ✅ "Harry Potter Philosopher Stone" → 4/4 = 100% → **PASSES**
  - ❌ "Harry Potter" → 2/4 = 50% → **REJECTED**
- Prevents wrong series books from matching while handling common subtitle patterns

**Stage 2: Title Matching (0-45 pts)**
- Only scored if Stage 1 passes
- **Tries full title first, then required title (without parentheses)** if no match
  - Example: "We Are Legion (We Are Bob)" tries both full title and "We Are Legion"
  - Handles torrents that include subtitle AND those that omit it
- Complete title match requirements (both must be true):
  - **Acceptable prefix** (any of these):
    - No significant words before title (clean match)
    - Title preceded by metadata separator (` - `, `: `, `—`); handles "Author - Series - 01 - Title"
    - Author name appears in prefix; handles "Author Name - Title"
  - **Acceptable suffix**: Followed by metadata markers: " by", " [", " -", " (", " {", " :", "," or end of string
    - Also accepts author name in suffix (e.g., "Title AuthorName Year")
- Complete match → 45 pts
- Unstructured prefix (words without separators) → fuzzy similarity (partial credit)
- Suffix continues with non-metadata → fuzzy similarity (partial credit)
- No substring match → fuzzy similarity (best score from full or required title)

**Edge Cases - Prefix Validation:**
- ✅ "Brandon Sanderson - Mistborn - 01 - The Final Empire" (structured metadata prefix)
- ✅ "Brandon Sanderson The Way of Kings" (author name in prefix)
- ✅ "Series Name: Book Title" (colon separator)
- ✅ "Author Name — Book Title" (em-dash separator)
- ❌ "This Inevitable Ruin Dungeon Crawler Carl" → REJECTED for "Dungeon Crawler Carl" (unstructured words before title)

**Edge Cases - Suffix Validation:**
- ✅ "The Great Book by Author Name" (metadata marker " by")
- ✅ "Book Title [Unabridged] (2024)" (bracketed metadata)
- ✅ "Book Title John Smith 2024" (author name in suffix)
- ✅ "Author - Book Title" (title at end of string)
- ❌ "The Housemaid's Secret - Freida McFadden" → REJECTED for "The Housemaid" (suffix continues with "'s Secret")
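
The prefix and suffix rules for a complete title match could look roughly like this. It is a sketch under stated assumptions: `isCompleteTitleMatch` is a hypothetical name, only the first occurrence of the title is inspected, and an author name in the suffix is accepted only when it follows the title directly after whitespace (assumed here so that the "The Housemaid's Secret" case is still rejected). The fuzzy fallback for partial credit is not shown.

```typescript
// Separators and markers as listed above; "— " is added so a space after the dash is tolerated.
const PREFIX_SEPARATORS = [" - ", ": ", "—", "— "];
const SUFFIX_MARKERS = [" by", " [", " -", " (", " {", " :", ","];

function isCompleteTitleMatch(torrentTitle: string, title: string, author: string): boolean {
  const torrent = torrentTitle.toLowerCase();
  const wanted = title.toLowerCase();
  const authorLc = author.toLowerCase().trim();

  const start = torrent.indexOf(wanted); // first occurrence only (simplification)
  if (wanted.length === 0 || start === -1) return false;

  const prefix = torrent.slice(0, start);
  const suffix = torrent.slice(start + wanted.length);

  // Prefix: empty, ends in a metadata separator, or contains the author name.
  const prefixOk =
    prefix.trim() === "" ||
    PREFIX_SEPARATORS.some((sep) => prefix.endsWith(sep)) ||
    (authorLc.length > 0 && prefix.indexOf(authorLc) !== -1);

  // Suffix: empty, starts with a metadata marker, or starts with the author name.
  const suffixOk =
    suffix.trim() === "" ||
    SUFFIX_MARKERS.some((marker) => suffix.startsWith(marker)) ||
    (authorLc.length > 0 && /^\s/.test(suffix) && suffix.trim().startsWith(authorLc));

  return prefixOk && suffixOk;
}
```

A `true` result corresponds to the 45-point complete match; a `false` result would fall through to fuzzy similarity for partial credit.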

**Stage 3: Author Matching (0-15 pts)**
- Exact substring match → proportional credit
- No exact match → fuzzy similarity (partial credit)
- Splits authors on delimiters (comma, &, "and", " - ")
- Filters out roles ("translator", "narrator")
- Order-independent, no structure assumptions
- Ensures correct book is selected over wrong book with better format

**Edge Cases - Multi-Author Handling:**
- ✅ "Jane Doe, John Smith" → splits on comma
- ✅ "Jane Doe & John Smith" → splits on ampersand
- ✅ "Jane Doe and John Smith" → splits on "and"
- ✅ "Jane Doe, translator" → filters out "translator" role
- ✅ "Jane Doe, narrator" → filters out "narrator" role
- Proportional credit: If 1 of 3 authors matches → 5 pts (1/3 × 15)
- Proportional credit: If 2 of 3 authors match → 10 pts (2/3 × 15)
- Full credit: If all authors match → 15 pts
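
Author splitting and proportional credit can be illustrated like this. The names `splitAuthors` and `authorScore` are hypothetical, and the sketch only awards credit for exact substring matches; the fuzzy fallback for non-exact matches is omitted.

```typescript
const AUTHOR_ROLES = ["translator", "narrator"];

// Split on comma, ampersand, the word "and", or " - ", then drop role labels.
function splitAuthors(raw: string): string[] {
  return raw
    .split(/\s*,\s*|\s*&\s*|\s+and\s+|\s+-\s+/i)
    .map((a) => a.trim())
    .filter((a) => a.length > 0 && AUTHOR_ROLES.indexOf(a.toLowerCase()) === -1);
}

// Proportional credit: matched authors / total authors × maxPoints.
function authorScore(rawAuthors: string, torrentTitle: string, maxPoints = 15): number {
  const authors = splitAuthors(rawAuthors);
  if (authors.length === 0) return 0;
  const text = torrentTitle.toLowerCase();
  const matched = authors.filter((a) => text.indexOf(a.toLowerCase()) !== -1).length;
  return (matched * maxPoints) / authors.length;
}
```

For example, `authorScore("Jane Doe, John Smith, Alex Roe", "Jane Doe - Some Book")` gives one match out of three authors, so one third of the 15 available points.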

**2. Format Quality (25 pts max)**
- M4B with chapters: 25
- M4B without chapters: 22
- M4A: 16
- MP3: 10
- Other: 3

**3. Seeder Count (15 pts max)**
- Formula: `Math.min(15, Math.log10(seeders + 1) * 6)`
- 0 seeders: 0 pts, 1 seeder: ~1.8 pts, 10 seeders: ~6 pts, 100 seeders: ~12 pts, 1000+: 15 pts (the cap is reached at about 316 seeders)
- Note: Usenet/NZB results without seeders get full 15 pts (centralized availability)
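
The format and seeder components are simple enough to sketch directly. The point values and the formula come from the lists above; the `AudioFormat` labels and the use of `undefined` to model a Usenet/NZB result are assumptions made for this example, and how the format is detected from a release name is not shown.

```typescript
type AudioFormat = "m4b_chapters" | "m4b" | "m4a" | "mp3" | "other";

// Point values from the list above; the format labels themselves are hypothetical.
const FORMAT_POINTS: Record<AudioFormat, number> = {
  m4b_chapters: 25,
  m4b: 22,
  m4a: 16,
  mp3: 10,
  other: 3,
};

function formatScore(format: AudioFormat): number {
  return FORMAT_POINTS[format];
}

// seeders === undefined models a Usenet/NZB result with no seeder count (full points).
function seederScore(seeders?: number): number {
  if (seeders === undefined) return 15;
  return Math.min(15, Math.log10(Math.max(0, seeders) + 1) * 6);
}
```

Because the curve is logarithmic, going from 10 to 100 seeders adds about as many points as going from 0 to 10, so a torrent with an extremely high seeder count gains little over one that is merely well seeded.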

## Bonus Points System

**Extensible multiplicative bonus system** for external quality factors:

**Indexer Priority Bonus (configurable 1-25, default: 10)**
- Formula: `bonusPoints = baseScore × (priority / 25)`
- Priority 10/25 (40%) → 95 base score → +38 bonus = 133 final
- Priority 20/25 (80%) → 95 base score → +76 bonus = 171 final
- Priority 25/25 (100%) → 95 base score → +95 bonus = 190 final
- Designed so that a high-quality torrent from a low-priority indexer can still beat a low-quality torrent from a high-priority indexer
- Bonus scales with quality (better torrents get more benefit from priority)
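
The priority bonus formula can be written as a one-line helper. `indexerPriorityBonus` is a hypothetical name, and clamping the priority to the 1-25 range is an assumption of this sketch rather than documented behaviour.

```typescript
// bonusPoints = baseScore × (priority / 25), with priority clamped to 1..25 (assumed).
function indexerPriorityBonus(baseScore: number, priority = 10): number {
  const clamped = Math.min(25, Math.max(1, priority));
  return (baseScore * clamped) / 25;
}

// Final score before any flag bonuses are applied.
function scoreWithPriority(baseScore: number, priority = 10): number {
  return baseScore + indexerPriorityBonus(baseScore, priority);
}
```

Since the bonus is a fraction of the base score, a torrent with a base score of 0 (for example, one rejected by the word coverage filter) gains nothing from indexer priority.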

**Indexer Flag Bonus (configurable -100% to +100%, default: 0%)**
- Formula: `bonusPoints = baseScore × (modifier / 100)`
- Positive modifiers reward desired flags (e.g., "Freeleech" at +50%)
  - +50% modifier → 85 base score → +42.5 bonus = 127.5 final
- Negative modifiers penalize undesired flags (e.g., "Unwanted" at -60%)
  - -60% modifier → 85 base score → -51 penalty = 34 final
- Dual threshold filtering:
  - Base score must be ≥ 50 (quality minimum)
  - Final score must be ≥ 50 (not disqualified by negative bonuses)
  - Negative bonuses can disqualify otherwise good torrents
- Flag extraction from Prowlarr API:
  - `downloadVolumeFactor: 0` → "Freeleech"
  - `downloadVolumeFactor: <1` → "Partial Freeleech"
  - `uploadVolumeFactor: >1` → "Double Upload"
- Case-insensitive, whitespace-trimmed matching
- Universal across all indexers (not indexer-specific)
- Multiple flag bonuses stack (additive)

**Edge Cases - Flag Matching:**
- ✅ "FREELEECH" matches config "freeleech" (case-insensitive)
- ✅ " Freeleech " matches config " Freeleech " (whitespace-trimmed)
- ✅ Multiple flags: ["Freeleech", "Double Upload"] → both bonuses applied
- Example stacking: Freeleech (+50%) + Double Upload (+25%) on 80 base score
  - Freeleech bonus: 80 × 0.5 = +40
  - Double Upload bonus: 80 × 0.25 = +20
  - Total bonus: +60 points
  - Final score: 80 + 60 = 140
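
Flag matching and additive stacking might be implemented along these lines. `FlagConfig` mirrors the shape of `IndexerFlagConfig` from the Interface section; `normalizeFlag` and `flagBonus` are hypothetical helper names used only for this sketch.

```typescript
interface FlagConfig {
  name: string;     // Flag name, e.g. "Freeleech"
  modifier: number; // -100 to 100 (percentage)
}

// Case-insensitive, whitespace-trimmed comparison key.
const normalizeFlag = (flag: string): string => flag.trim().toLowerCase();

// Sum baseScore × (modifier / 100) over every configured flag present on the torrent.
function flagBonus(baseScore: number, flags: string[], configs: FlagConfig[]): number {
  const present = new Set(flags.map(normalizeFlag));
  return configs
    .filter((c) => present.has(normalizeFlag(c.name)))
    .reduce((sum, c) => sum + baseScore * (c.modifier / 100), 0);
}
```

Each modifier is applied to the base score rather than to the running total, which is what makes stacking additive: +50% and +25% on a base of 80 yield +40 and +20, not a compounded amount.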

**Future Modifiers (planned):**
- User preferences
- Custom rules

**Final Score Calculation:**
1. Calculate base score (0-100) using standard criteria
2. Calculate bonus modifiers (indexer priority, flag bonuses, etc.)
3. Sum bonus points
4. Final score = base score + bonus points
5. Apply dual threshold filter:
   - Base score ≥ 50 (quality minimum)
   - Final score ≥ 50 (not disqualified by negative bonuses)
6. Sort by final score (descending), then publish date (descending)
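
Steps 4 and 5 above can be sketched as a small filter. The `ScoredTorrent` shape and the `applyDualThreshold` name are assumptions for illustration; the field names `score`, `bonusPoints`, and `finalScore` follow the `RankedTorrent` interface documented below.

```typescript
interface ScoredTorrent {
  title: string;
  score: number;       // Base score (0-100)
  bonusPoints: number; // Sum of all bonus points (may be negative)
}

interface FinalTorrent extends ScoredTorrent {
  finalScore: number;  // score + bonusPoints
}

const MIN_SCORE = 50;

// Keep only torrents whose base score AND final score both reach the minimum.
function applyDualThreshold(torrents: ScoredTorrent[]): FinalTorrent[] {
  return torrents
    .map((t) => ({ ...t, finalScore: t.score + t.bonusPoints }))
    .filter((t) => t.score >= MIN_SCORE && t.finalScore >= MIN_SCORE);
}
```

Checking the base score separately means a weak match cannot be rescued by bonuses, and checking the final score means a strongly penalised flag can still remove an otherwise good torrent.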

## Tiebreaker Sorting

When multiple torrents have identical final scores:
- **Secondary sort:** Publish date descending (newest first)
- Ensures latest uploads are preferred when quality is equal
- Example: 3 torrents with 171 final score → newest upload ranks #1

**Edge Cases - Tiebreaker Examples:**
- ✅ Same score, different dates:
  - Torrent A: Score 85, published 2024-06-01 → **Ranks #1**
  - Torrent B: Score 85, published 2023-01-01 → Ranks #2
- ❌ Different scores, ignore date:
  - Torrent A: Score 95, published 2020-01-01 → **Ranks #1** (better match wins despite older date)
  - Torrent B: Score 75, published 2024-01-01 → Ranks #2
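
A comparator for this ordering could look like the sketch below. The `publishDate` field name and the `sortByScoreThenDate` helper are assumptions; the document does not state how the publish date is stored on a result.

```typescript
interface Sortable {
  title: string;
  finalScore: number;
  publishDate: Date; // hypothetical field name
}

// Final score descending; on equal scores, newest publish date first.
// Returns a new array and leaves the input untouched.
function sortByScoreThenDate<T extends Sortable>(torrents: T[]): T[] {
  return torrents.slice().sort(
    (a, b) =>
      b.finalScore - a.finalScore ||
      b.publishDate.getTime() - a.publishDate.getTime()
  );
}
```

The `||` only falls through to the date comparison when the score difference is exactly 0, so the publish date never overrides a better score.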

## Interface

```typescript
interface IndexerFlagConfig {
  name: string;      // Flag name (e.g., "Freeleech")
  modifier: number;  // -100 to 100 (percentage)
}

interface RankTorrentsOptions {
  indexerPriorities?: Map<number, number>;  // indexerId -> priority (1-25)
  flagConfigs?: IndexerFlagConfig[];        // Flag bonus configurations
  requireAuthor?: boolean;                  // Enforce author check (default: true)
}

interface BonusModifier {
  type: 'indexer_priority' | 'indexer_flag' | 'custom';
  value: number;   // Multiplier (e.g., 0.4 for 40%)
  points: number;  // Calculated bonus points
  reason: string;  // Human-readable explanation
}

interface TorrentResult {
  // ... existing fields
  flags?: string[];  // Extracted flags from Prowlarr API
}

interface RankedTorrent extends TorrentResult {
  score: number;        // Base score (0-100)
  bonusModifiers: BonusModifier[];
  bonusPoints: number;  // Sum of all bonus points
  finalScore: number;   // score + bonusPoints
  rank: number;
  breakdown: {
    formatScore: number;
    seederScore: number;
    matchScore: number;
    totalScore: number;  // Same as score
    notes: string[];
  };
}

// New API (recommended)
function rankTorrents(
  torrents: TorrentResult[],
  audiobook: AudiobookRequest,
  options?: RankTorrentsOptions
): RankedTorrent[];

// Legacy API (backwards compatible)
function rankTorrents(
  torrents: TorrentResult[],
  audiobook: AudiobookRequest,
  indexerPriorities?: Map<number, number>,
  flagConfigs?: IndexerFlagConfig[]
): RankedTorrent[];
```

## Usage Examples

**Automatic selection (strict author filtering):**
```typescript
// Background job - safe auto-download
const ranked = rankTorrents(torrents, audiobook, {
  indexerPriorities,
  flagConfigs,
  requireAuthor: true  // Default - prevents wrong authors
});

// Safe to auto-download; undefined when no torrent passes the filters
const topResult = ranked[0];
```

**Interactive search (show all results):**
```typescript
// User browsing - let user decide
const ranked = rankTorrents(torrents, audiobook, {
  indexerPriorities,
  flagConfigs,
  requireAuthor: false  // Show everything, including edge cases
});

return ranked;  // User can see torrents without author info
```

## Tech Stack

- string-similarity (fuzzy matching)
- Regex for format detection