ReadMeABook/documentation/fixes/file-hash-matching.md

# File Hash-Based Library Matching

**Status:** ✅ Implemented | Accurate ASIN matching for RMAB-organized audiobooks

## Overview
Solves false positive matches in Audiobookshelf fuzzy search by using file hash matching for RMAB-downloaded content.

## Problem
- New ABS items without ASIN → fuzzy Audible search by title/author
- Risk: Wrong book matches (e.g., "Foundation" → "Foundation and Empire")
- Result: Incorrect metadata, false positives

## Solution
**File Hash Matching Strategy:**
1. Generate SHA256 hash of audio filenames during organization
2. Store hash in `Audiobook.filesHash` field
3. During library scan: compare ABS item files against database hashes
4. Match found → Use request's ASIN for 100% accurate metadata
5. No match → Fallback to fuzzy search (external content)

## How It Works

### Organization Phase
**File:** `src/lib/processors/organize-files.processor.ts`

```typescript
const filesHash = generateFilesHash(result.audioFiles);
await prisma.audiobook.update({
  data: {
    filesHash: filesHash,  // SHA256 of sorted audio filenames
    // ... other fields
  }
});
```

### Library Scan Phase
**Files:** `scan-plex.processor.ts`, `plex-recently-added.processor.ts`

**Phase 1: File Hash Matching (Items WITHOUT ASIN)**
```typescript
const itemsWithoutAsin = libraryItems.filter(item => !item.asin && item.externalId);

for (const item of itemsWithoutAsin) {
  // 1. Fetch ABS item details
  const absItem = await getABSItem(item.externalId);

  // 2. Generate hash from ABS audio filenames
  const audioFilenames = absItem.media.audioFiles.map(f => f.metadata.filename);
  const itemHash = generateFilesHash(audioFilenames);

  // 3. Query for matching RMAB download
  const matched = await prisma.audiobook.findFirst({
    where: { filesHash: itemHash, status: 'completed' }
  });

  // 4. Trigger metadata match (with ASIN if matched, undefined if not)
  await triggerABSItemMatch(item.externalId, matched?.audibleAsin);
}
```

**Phase 2: Request Matching**
```typescript
// Match requests to library items and mark as available
const match = await findPlexMatch({
  asin: audiobook.audibleAsin,
  title: audiobook.title,
  author: audiobook.author
});

if (match) {
  // Update audiobook and request status
  await prisma.audiobook.update({ data: { absItemId: match.plexGuid } });
  await prisma.request.update({ data: { status: 'available' } });

  // No metadata match triggering needed:
  // - Items without ASIN: Already handled in Phase 1
  // - Items with ASIN: Already have correct metadata
}
```

## Hash Generation Algorithm
**File:** `src/lib/utils/files-hash.ts`

**Process:**
1. Extract basenames from file paths
2. Filter to audio extensions: `.m4b`, `.m4a`, `.mp3`, `.mp4`, `.aa`, `.aax`
3. Normalize to lowercase (case-insensitive)
4. Sort alphabetically (deterministic order)
5. Generate SHA256: `crypto.createHash('sha256').update(JSON.stringify(sorted)).digest('hex')`

**Properties:**
- Deterministic: Same files → same hash (regardless of order/path)
- Path-agnostic: Only basenames matter
- Case-insensitive: "CHAPTER 01.mp3" === "chapter 01.mp3"
- Fast: O(1) database lookup with indexed field

## Database Schema

**Model:** `Audiobook`

```prisma
model Audiobook {
  // ... existing fields
  filesHash String? @map("files_hash") @db.Text  // SHA256 (64 chars)

  @@index([filesHash])  // Fast O(1) lookups
}
```

**Migration:** `20260126100000_add_audiobook_files_hash`

## Implementation Details

### Metadata Match Strategy

**Phase 1 (File Hash):** Handle NEW items WITHOUT ASIN
- Filter: `libraryItems.filter(item => !item.asin)`
- Trigger metadata match with file-hash-matched ASIN or undefined
- **This is the ONLY phase that triggers ABS metadata matching**

**Phase 2 (Request Match):** Match requests, no metadata triggering
- Match requests to library items by ASIN/title/author
- Update request status to 'available'
- **No metadata match triggering** - items either:
  - Were handled in Phase 1 (new items without ASIN)
  - Already have correct metadata (items with ASIN from ABS)

**Why This Works:**
- **Single source of truth**: Only file hash phase triggers metadata matching
- **No redundant API calls**: Items with ASIN already have correct metadata
- **Clean separation**: Phase 1 = metadata, Phase 2 = request matching
- **Simple and efficient**: No duplicate checks, no wasted API calls

## Edge Cases

### Externally-Added Content
- User manually imports audiobook to ABS (not via RMAB)
- No matching `filesHash` in database
- **Fallback:** Fuzzy metadata match (current behavior preserved)

### Modified Files
- User adds/removes chapters after organization
- ABS hash won't match RMAB hash
- **Fallback:** Fuzzy metadata match

### Existing Content (Before Feature)
- Audiobooks organized before hash feature
- `filesHash` field is NULL
- **Behavior:** Continues using fuzzy matching
- **Future:** Admin job could backfill hashes (out of scope)

### Chapter-Merged Files
- 20 MP3s → 1 M4B via chapter merging
- Hash generated AFTER merging
- **Works correctly:** Hash reflects final organized state

### Multiple Downloads (Same Book)
- User re-downloads same audiobook (different edition/request)
- Multiple records with same `filesHash`
- **Solution:** `findFirst()` returns first match (acceptable - same ASIN)

## Performance

**Storage:**
- New index: ~8 bytes per row (minimal)
- SHA256 hash: 64 characters per record

**API Calls:**
- One additional `getABSItem()` call per item without ASIN
- Typical response: ~1-5KB JSON
- Latency: ~50-100ms per call

**Database:**
- Index lookup: O(1) with hash index (extremely fast)

**Impact:**
- 10 items without ASIN → +500-1000ms per scan (acceptable)

## Logging

**Organization:**
```
[INFO] Generated files hash: abc123def456... (5 audio files)
```

**Library Scan (Match Found):**
```
[INFO] File hash match found for "Foundation" → ASIN: B08G9PRS1K (from "Foundation (Unabridged)")
[INFO] Triggered metadata match with ASIN B08G9PRS1K for: "Foundation"
```

**Library Scan (No Match):**
```
[INFO] No file match found, triggering fuzzy metadata match for: "The Expanse"
```

## Benefits

✅ **100% Accurate Matching** - RMAB-organized content always gets correct ASIN
✅ **Path-Agnostic** - Works regardless of folder structure differences
✅ **Fast Lookups** - O(1) database query with indexed field
✅ **Graceful Fallback** - External content still works via fuzzy matching
✅ **No Breaking Changes** - Existing content continues working

## Testing

**Unit Tests:** `tests/utils/files-hash.test.ts`
- Hash generation correctness
- Deterministic behavior
- Edge case handling

**Integration Tests:** `tests/processors/*.test.ts`
- Hash storage during organization
- Hash matching during library scan
- Fallback to fuzzy matching

## Related
- [Audiobookshelf Integration](../integrations/audiobookshelf.md) - Backend mode
- [File Organization](../phase3/file-organization.md) - Organization flow
- [Database Schema](../backend/database.md) - Audiobook model