Explore perceptual hashing for image spam detection

## Problem

Spammers can evade duplicate message detection by varying attachment filenames, compression settings, or binary data while posting visually identical images across channels.

Current behavior:
- ✅ Catches identical text with varying attachments (content hash is text-only)
- ❌ Misses identical images with varied text (e.g., "check this", "see this", "look here")
- ❌ Misses near-identical images (re-compressed, slightly cropped, filtered)

## Proposed Solution

Implement **perceptual hashing** (pHash/dHash) to detect visually similar images.

### How it works
1. Generate a perceptual hash for each image attachment (estimated ~10-50ms per image for hashing alone; download time is extra)
2. Store the hashes on each `RecentMessage` entry
3. Compare incoming image hashes to recent hashes using Hamming distance (the number of differing bits)
4. Flag messages whose images fall within a distance threshold as potential duplicates
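
To make step 3 concrete, here is a minimal sketch of a Hamming distance comparison. It assumes hashes are stored as 16-character hex strings (64 bits); `hexToBytes` and `hammingDistanceBytes` are hypothetical helper names, and the hash values are made up for illustration.

```typescript
// Hypothetical helper: decodes a hex string into raw bytes.
function hexToBytes(hex: string): Uint8Array {
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

// Hypothetical helper: counts the bits that differ between two hashes.
function hammingDistanceBytes(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) {
    throw new Error("hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i];
    while (diff !== 0) {
      diff &= diff - 1; // clear the lowest set bit
      distance++;
    }
  }
  return distance;
}

// Made-up hash values, not taken from real images.
const original = hexToBytes("f0e1d2c3b4a59687");
const recompressed = hexToBytes("f0e1d2c3b4a59684"); // only the last byte differs
const unrelated = hexToBytes("0f1e2d3c4b5a6978"); // every bit differs

console.log(hammingDistanceBytes(original, recompressed)); // 2
console.log(hammingDistanceBytes(original, unrelated)); // 64
```

A distance of 2 out of 64 bits would fall under a threshold of 5, while a distance of 64 would not.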

### Algorithm Options

**dHash (Difference Hash)** - Recommended
- Fast, simple, good accuracy for near-duplicates
- Compares adjacent pixel differences
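- Node.js: `imghash` is a candidate, but it appears to implement block-mean-value hashing (blockhash) rather than dHash, so a small custom dHash on top of `sharp` may be needed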
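
As a rough illustration of the adjacent-pixel comparison, the sketch below computes a 64-bit dHash from a 9x8 grayscale thumbnail. It assumes resizing and grayscale conversion happen upstream (for example with an image library such as `sharp`); `dHashFromGrayscale` is a hypothetical name, and the bit convention (1 when brightness drops from left to right) is one of several in use.

```typescript
// Hypothetical helper: computes a 64-bit dHash as a 16-character hex string.
// `pixels` holds 72 luminance values in row-major order (9 columns, 8 rows).
function dHashFromGrayscale(pixels: ArrayLike<number>): string {
  const width = 9;
  const height = 8;
  if (pixels.length !== width * height) {
    throw new Error(`expected ${width * height} pixels, got ${pixels.length}`);
  }
  let hex = "";
  for (let row = 0; row < height; row++) {
    let byte = 0;
    for (let col = 0; col < width - 1; col++) {
      const left = pixels[row * width + col];
      const right = pixels[row * width + col + 1];
      // One bit per adjacent pair: 1 when brightness drops left to right.
      byte = (byte << 1) | (left > right ? 1 : 0);
    }
    // Each row of 8 comparisons becomes one byte (two hex digits).
    hex += byte.toString(16).padStart(2, "0");
  }
  return hex;
}

// Synthetic thumbnail: every row darkens from left (200) to right (40).
const darkening = Array.from({ length: 72 }, (_, i) => 200 - (i % 9) * 20);
// The same thumbnail with uniformly increased brightness.
const brighter = darkening.map((value) => value + 10);

console.log(dHashFromGrayscale(darkening)); // "ffffffffffffffff"
console.log(dHashFromGrayscale(brighter) === dHashFromGrayscale(darkening)); // true
```

Because only the relative order of neighboring pixels matters, a uniform brightness change leaves the hash unchanged, which is part of why dHash tolerates re-encoding.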

**pHash (Perceptual Hash)**
- More robust, uses DCT (like JPEG compression)
- Slightly slower but better for modified images
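
For comparison, a simplified DCT-based hash can be sketched as follows. This is not the reference pHash implementation: it assumes a 32x32 grayscale input prepared upstream, keeps the 8x8 low-frequency block, and thresholds at the median. Real implementations differ in details such as pre-filtering and how the DC term is handled. `dct1d` and `pHashFromGrayscale` are hypothetical names.

```typescript
// Hypothetical helper: unnormalized 1D DCT-II of a sequence.
function dct1d(input: number[]): number[] {
  const n = input.length;
  const output: number[] = [];
  for (let k = 0; k < n; k++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += input[i] * Math.cos((Math.PI * (2 * i + 1) * k) / (2 * n));
    }
    output.push(sum);
  }
  return output;
}

// Hypothetical helper: DCT-based hash of a size x size grayscale image
// given in row-major order. Returns lowFreq * lowFreq bits as hex.
function pHashFromGrayscale(pixels: number[], size = 32, lowFreq = 8): string {
  if (pixels.length !== size * size) {
    throw new Error(`expected ${size * size} pixels, got ${pixels.length}`);
  }
  // Pass 1: transform every row.
  const rows: number[][] = [];
  for (let r = 0; r < size; r++) {
    rows.push(dct1d(pixels.slice(r * size, (r + 1) * size)));
  }
  // Pass 2: transform the first lowFreq columns, keep the top-left block.
  const coeffs: number[] = new Array<number>(lowFreq * lowFreq).fill(0);
  for (let u = 0; u < lowFreq; u++) {
    const transformed = dct1d(rows.map((row) => row[u]));
    for (let v = 0; v < lowFreq; v++) {
      coeffs[v * lowFreq + u] = transformed[v];
    }
  }
  // Threshold each low-frequency coefficient against the median.
  const sorted = [...coeffs].sort((a, b) => a - b);
  const mid = sorted.length / 2;
  const median = (sorted[mid - 1] + sorted[mid]) / 2;
  let hex = "";
  for (let i = 0; i < coeffs.length; i += 4) {
    let nibble = 0;
    for (let j = 0; j < 4; j++) {
      nibble = (nibble << 1) | (coeffs[i + j] > median ? 1 : 0);
    }
    hex += nibble.toString(16);
  }
  return hex;
}

// Synthetic 32x32 pattern with values in 0..127 (not a real image).
const pattern = Array.from(
  { length: 32 * 32 },
  (_, i) => (i * 37 + Math.floor(i / 32) * 11) % 128,
);
// The same pattern with every pixel value doubled.
const scaled = pattern.map((value) => value * 2);

console.log(pHashFromGrayscale(pattern).length); // 16
console.log(pHashFromGrayscale(pattern) === pHashFromGrayscale(scaled)); // true
```

Scaling every pixel scales every coefficient and the median by the same factor, so the bit pattern is unchanged. The extra transform work is the cost relative to dHash.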

**PDQ Hash (Facebook)**
- Industry standard, open source
- Best accuracy but more complex
- https://github.com/facebook/ThreatExchange/tree/main/pdq

### Example Implementation

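```typescript
// Add to RecentMessage interface
export interface RecentMessage {
  messageId: string;
  channelId: string;
  contentHash: string;
  timestamp: number;
  hasLink: boolean;
  imageHashes?: string[]; // NEW: perceptual hashes (assumed hex-encoded, equal length)
}

// Number of differing bits between two hex-encoded hashes
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) return Number.POSITIVE_INFINITY;
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    // XOR one hex digit (4 bits) at a time and count the set bits
    let diff = parseInt(a[i], 16) ^ parseInt(b[i], 16);
    for (; diff !== 0; diff >>= 1) distance += diff & 1;
  }
  return distance;
}

// In velocityAnalyzer.ts
function countSimilarImages(
  recentMessages: RecentMessage[],
  currentHashes: string[],
  windowMs: number,
  threshold: number = 5, // Hamming distance threshold
): number {
  // Assumes `timestamp` is in epoch milliseconds
  const cutoff = Date.now() - windowMs;
  let count = 0;
  for (const message of recentMessages) {
    if (message.timestamp < cutoff) continue; // outside the time window
    for (const previous of message.imageHashes ?? []) {
      // Count each recent image that is close to any incoming image
      if (currentHashes.some((current) => hammingDistance(previous, current) <= threshold)) {
        count++;
      }
    }
  }
  return count;
}

// New spam signal (inside the analyzer, where `signals` and the inputs are in scope)
const similarImageCount = countSimilarImages(recentMessages, currentHashes, windowMs);
if (similarImageCount >= 3) {
  signals.push({
    name: "similar_image_spam",
    score: 5,
    description: `${similarImageCount} visually similar images`,
  });
}
```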

### Benefits

- ✅ Catches spammers varying filenames/compression
- ✅ Catches slightly modified images (cropped, filtered, re-encoded)
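- ✅ Expected to be fast enough for real-time use (~10-50ms per image for hashing, excluding download; to be confirmed by benchmarking)
- ✅ Low memory footprint (64-bit hash per image for dHash/pHash; PDQ uses 256 bits)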
- ✅ Works with existing velocity detection system

### Trade-offs

- ⚠️ Adds latency to message processing (need to download + hash images)
- ⚠️ Won't catch heavily edited images (mirrored, rotated 90°, etc.)
- ⚠️ Requires new dependency (`imghash` or similar)
- ⚠️ False positives possible (legitimate users posting similar memes)

### Recommended Approach

1. **Research phase**: Test `imghash` library with sample spam images to validate accuracy
2. **Prototype**: Implement basic dHash comparison without spam scoring
3. **Tune thresholds**: Determine the optimal Hamming distance threshold (likely 5-10 bits for a 64-bit hash)
4. **Add signal**: Create `similar_image_spam` signal with appropriate scoring
5. **Monitor**: Watch for false positives in production
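
One possible way to approach the threshold-tuning step is to label a set of image pairs by hand and measure how each candidate threshold performs. The sketch below assumes such a labeled set exists; `LabeledPair`, `ThresholdStats`, and `evaluateThresholds` are hypothetical names, and the sample distances are placeholders rather than measurements.

```typescript
// Hypothetical shape for a hand-labeled pair of images.
interface LabeledPair {
  distance: number; // Hamming distance between the two image hashes
  isDuplicate: boolean; // human judgment: same image or not
}

// Hypothetical result shape for one candidate threshold.
interface ThresholdStats {
  threshold: number;
  truePositiveRate: number; // share of real duplicates that get flagged
  falsePositiveRate: number; // share of distinct images that get flagged
}

function evaluateThresholds(
  pairs: LabeledPair[],
  thresholds: number[],
): ThresholdStats[] {
  const duplicates = pairs.filter((pair) => pair.isDuplicate);
  const distinct = pairs.filter((pair) => !pair.isDuplicate);
  return thresholds.map((threshold) => {
    const flaggedDuplicates = duplicates.filter((p) => p.distance <= threshold).length;
    const flaggedDistinct = distinct.filter((p) => p.distance <= threshold).length;
    return {
      threshold,
      truePositiveRate: duplicates.length > 0 ? flaggedDuplicates / duplicates.length : 0,
      falsePositiveRate: distinct.length > 0 ? flaggedDistinct / distinct.length : 0,
    };
  });
}

// Placeholder data for illustration only, not real measurements.
const samplePairs: LabeledPair[] = [
  { distance: 0, isDuplicate: true },
  { distance: 2, isDuplicate: true },
  { distance: 4, isDuplicate: true },
  { distance: 7, isDuplicate: true },
  { distance: 12, isDuplicate: true },
  { distance: 9, isDuplicate: false },
  { distance: 18, isDuplicate: false },
  { distance: 25, isDuplicate: false },
  { distance: 30, isDuplicate: false },
];

const stats = evaluateThresholds(samplePairs, [5, 10]);
console.log(stats[0]); // threshold 5: TPR 0.6, FPR 0
console.log(stats[1]); // threshold 10: TPR 0.8, FPR 0.25
```

With real labeled samples, a table like this would show where raising the threshold starts to trade missed spam for false positives on legitimate memes.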

### Libraries to Evaluate

- `imghash` - Simple wrapper around sharp for perceptual hashing (appears to use block-mean-value hashing, not dHash; verify during evaluation)
- `sharp` - Fast image processing (could implement custom dHash)
- `blockhash-js` - Alternative perceptual hash implementation
- PDQ hash - Facebook's open source algorithm (more complex but better)

### Not Recommended

- AI embeddings (CLIP, etc.) - Too slow and resource-intensive
- PhotoDNA - Requires Microsoft partnership, overkill for spam

## Next Steps

- [ ] Research and test perceptual hashing libraries
- [ ] Collect sample spam images to validate approach
- [ ] Prototype implementation
- [ ] Determine appropriate scoring and thresholds

Explore perceptual hashing for image spam detection #282
