Problem
Spammers can evade duplicate message detection by varying attachment filenames, compression settings, or binary data while posting visually identical images across channels.
Current behavior:
- ✅ Catches identical text with varying attachments (content hash is text-only)
- ❌ Misses identical images with varied text (e.g., "check this", "see this", "look here")
- ❌ Misses near-identical images (re-compressed, slightly cropped, filtered)
Proposed Solution
Implement perceptual hashing (pHash/dHash) to detect visually similar images.
How it works
- Generate perceptual hash for each image attachment (~10-50ms per image)
- Store hashes in the RecentMessage interface
- Compare incoming image hashes to recent hashes using Hamming distance
- Flag messages with similar images as potential duplicates
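A minimal sketch of the comparison step, assuming hashes are stored as equal-length hex strings (a 64-bit hash is 16 hex digits). The hammingDistance helper, the extra now parameter, and the choice to count messages rather than individual images are assumptions made for illustration, not settled design. The RecentMessage shape is repeated from the Example Implementation section so the snippet stands on its own.

```typescript
// Repeated from the proposal so this sketch compiles on its own.
interface RecentMessage {
  messageId: string;
  channelId: string;
  contentHash: string;
  timestamp: number;
  hasLink: boolean;
  imageHashes?: string[];
}

// Hypothetical helper: number of differing bits between two hex-encoded hashes.
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) {
    throw new Error("hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    // XOR one hex digit (4 bits) at a time, then count the set bits.
    let diff = parseInt(a.charAt(i), 16) ^ parseInt(b.charAt(i), 16);
    while (diff > 0) {
      distance += diff & 1;
      diff >>= 1;
    }
  }
  return distance;
}

// Counts recent messages (inside the time window) that carry at least one
// image within `threshold` bits of any image in the current message.
// `now` is an assumed extra parameter that makes the function testable.
function countSimilarImages(
  recentMessages: RecentMessage[],
  currentHashes: string[],
  windowMs: number,
  threshold: number = 5,
  now: number = Date.now(),
): number {
  const cutoff = now - windowMs;
  let count = 0;
  for (const message of recentMessages) {
    if (message.timestamp < cutoff) continue; // outside the window
    const hashes = message.imageHashes ?? [];
    const isSimilar = hashes.some((recent) =>
      currentHashes.some(
        (current) =>
          recent.length === current.length &&
          hammingDistance(recent, current) <= threshold,
      ),
    );
    if (isSimilar) count++;
  }
  return count;
}
```

Counting per message keeps a single post with many attachments from inflating the score; counting per image would be an equally plausible choice if testing shows it works better.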
Algorithm Options
dHash (Difference Hash) - Recommended
- Fast, simple, good accuracy for near-duplicates
- Compares adjacent pixel differences
- Node.js: imghash library
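To make "compares adjacent pixel differences" concrete, here is a rough sketch of the dHash core. It assumes the image has already been decoded, converted to grayscale, and resized to 9x8 pixels by some image library (that step is not shown), and that the hash is emitted as 16 hex digits. The function name dHashFromGrayscale is made up for this example.

```typescript
// Computes a 64-bit difference hash from a 9x8 grayscale image.
// `pixels` is row-major: 8 rows of 9 brightness values each (72 total).
function dHashFromGrayscale(pixels: number[]): string {
  const width = 9;
  const height = 8;
  if (pixels.length !== width * height) {
    throw new Error(`expected ${width * height} pixels, got ${pixels.length}`);
  }

  // One bit per horizontal neighbor pair: 1 if the left pixel is brighter.
  let bits = "";
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width - 1; x++) {
      const left = pixels[y * width + x];
      const right = pixels[y * width + x + 1];
      bits += left > right ? "1" : "0";
    }
  }

  // Pack the 64 bits into 16 hex digits (4 bits per digit).
  let hex = "";
  for (let i = 0; i < bits.length; i += 4) {
    hex += parseInt(bits.slice(i, i + 4), 2).toString(16);
  }
  return hex;
}
```

Because each bit only records whether one pixel is brighter than its neighbor, a uniform brightness shift leaves the hash unchanged, and re-compression typically flips only a few bits. That is the property the Hamming distance comparison relies on.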
pHash (Perceptual Hash)
- More robust, uses DCT (like JPEG compression)
- Slightly slower but better for modified images
PDQ Hash (Facebook)
- Industry standard, open source
- Best accuracy but more complex
- https://github.com/facebook/ThreatExchange/tree/main/pdq
Example Implementation
// Add to RecentMessage interface
export interface RecentMessage {
  messageId: string;
  channelId: string;
  contentHash: string;
  timestamp: number;
  hasLink: boolean;
  imageHashes?: string[]; // NEW: perceptual hashes
}

// In velocityAnalyzer.ts
function countSimilarImages(
  recentMessages: RecentMessage[],
  currentHashes: string[],
  windowMs: number,
  threshold: number = 5, // Hamming distance threshold
): number {
  // Compare hashes, count similar images
}

// New spam signal
if (similarImageCount >= 3) {
  signals.push({
    name: "similar_image_spam",
    score: 5,
    description: `${similarImageCount} visually similar images`,
  });
}
Benefits
- ✅ Catches spammers varying filenames/compression
- ✅ Catches slightly modified images (cropped, filtered, re-encoded)
- ✅ Fast enough for real-time (~10-50ms per image)
- ✅ Low memory footprint (64-bit hash per image)
- ✅ Works with existing velocity detection system
Trade-offs
- ⚠️ Adds latency to message processing (need to download + hash images)
- ⚠️ Won't catch heavily edited images (mirrored, rotated 90°, etc.)
- ⚠️ Requires new dependency (imghash or similar)
- ⚠️ False positives possible (legitimate users posting similar memes)
Recommended Approach
- Research phase: Test imghash library with sample spam images to validate accuracy
- Prototype: Implement basic dHash comparison without spam scoring
- Tune thresholds: Determine optimal Hamming distance threshold (likely 5-10)
- Add signal: Create similar_image_spam signal with appropriate scoring
- Monitor: Watch for false positives in production
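For the threshold-tuning step, one possible approach is to hand-label pairs of sample images as duplicate or not, record the Hamming distance of each pair, and compare catch rate against false-positive rate across candidate thresholds. The sketch below assumes that labeled data exists; LabeledPair, ThresholdStats, and evaluateThresholds are hypothetical names, not part of the current codebase.

```typescript
// One hand-labeled pair of sample images and the distance between their hashes.
interface LabeledPair {
  distance: number; // Hamming distance between the two hashes
  isDuplicate: boolean; // true if a human judged the images to be the same
}

interface ThresholdStats {
  threshold: number;
  truePositiveRate: number; // share of real duplicates that would be flagged
  falsePositiveRate: number; // share of distinct images that would be flagged
}

// For each candidate threshold, reports how many labeled duplicates and how
// many labeled non-duplicates fall at or under it.
function evaluateThresholds(
  pairs: LabeledPair[],
  thresholds: number[],
): ThresholdStats[] {
  const duplicates = pairs.filter((p) => p.isDuplicate);
  const distinct = pairs.filter((p) => !p.isDuplicate);

  return thresholds.map((threshold) => {
    const caught = duplicates.filter((p) => p.distance <= threshold).length;
    const wronglyFlagged = distinct.filter((p) => p.distance <= threshold).length;
    return {
      threshold,
      // Guard against empty groups to avoid dividing by zero.
      truePositiveRate: duplicates.length > 0 ? caught / duplicates.length : 0,
      falsePositiveRate: distinct.length > 0 ? wronglyFlagged / distinct.length : 0,
    };
  });
}
```

Running this over thresholds 5 through 10 on the collected spam samples would show where the false-positive rate starts to climb, which is probably a better basis for picking the default than guessing.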
Libraries to Evaluate
- imghash - Simple wrapper around sharp for perceptual hashing
- sharp - Fast image processing (could implement custom dHash)
- blockhash-js - Alternative perceptual hash implementation
- PDQ hash - Facebook's open source algorithm (more complex but better)
Not Recommended
- AI embeddings (CLIP, etc.) - Too slow and resource-intensive
- PhotoDNA - Requires Microsoft partnership, overkill for spam
Next Steps
- Research and test perceptual hashing libraries
- Collect sample spam images to validate approach
- Prototype implementation
- Determine appropriate scoring and thresholds