Is it possible to check for duplicate documents when indexing into the search engine system? – RChilli HelpDesk

Yes. You can check for duplicate documents when indexing into RChilli’s Search Engine System, but how well this works depends on how you implement the solution. The guide below covers RChilli's capabilities and how to manage duplicates effectively:

1. Built-In De-Duplication in RChilli Search & Match

RChilli's Search & Match Engine includes a deduplication mechanism that helps prevent duplicate entries during indexing and searching.

Key Features:

Detects duplicates based on unique attributes such as:
- Candidate Name
- Email Address
- Mobile Number
- Resume Title
Ensures that only one instance of near-identical documents is retained in the index.
Available through the Search & Match APIs, in particular when you index with ParseAndIndex.

See: Index Documents documentation

2. Custom Duplicate Detection Logic

For more advanced or specific deduplication needs, you can implement custom logic before indexing.

Common Pre-Index Checks:

Hash the resume content (SHA-256 or MD5) and store the hash in your database, then compare each new upload against the stored hashes.
Use a unique identifier such as candidate ID, email, or phone number to filter duplicates.
Enable real-time duplicate detection during parsing with the Resume Parser:
- This detects whether the resume has already been parsed within your environment.

If you are using enhanced dynamic API configurations, you can enable settings such as `duplicatecheck: true` in the API.
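The content-hash check from the list above can be sketched as follows. This is a minimal illustration, not RChilli code: `content_hash`, `HashStore`, and `check_and_add` are hypothetical names, and the in-memory dictionary stands in for whatever database table you use to store hashes.

```python
import hashlib
from typing import Dict, Optional


def content_hash(file_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw resume bytes."""
    return hashlib.sha256(file_bytes).hexdigest()


class HashStore:
    """Hypothetical in-memory stand-in for a database table of known hashes."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # content hash -> document ID

    def check_and_add(self, doc_id: str, file_bytes: bytes) -> Optional[str]:
        """Return the ID of an already stored identical file, if any.

        When the content is new, record its hash under doc_id and return None.
        """
        digest = content_hash(file_bytes)
        existing = self._seen.get(digest)
        if existing is not None:
            return existing
        self._seen[digest] = doc_id
        return None


store = HashStore()
first = store.check_and_add("doc-1", b"resume A")   # new content
second = store.check_and_add("doc-2", b"resume A")  # same bytes as doc-1
```

A content hash only catches byte-identical files. The same resume re-saved or exported in another format produces a different hash, so this check works best alongside an identifier-based check.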

3. Managing Duplicates via API Endpoints

Delete Existing Indexed Documents

Use the `/deleteAllDocuments` API if you need to reset your index and remove all previously stored (and possibly duplicated) documents.

See: DeleteAllDocuments API

Update Existing Documents

Some systems prefer to:

Index each document with a unique `documentId`.
Overwrite or update the existing indexed entry using that ID when a duplicate is detected.

Check whether your integration passes a unique ID. This is often the key to identifying duplicates.
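The effect of a stable ID can be shown with a small local model. The sketch below does not call any RChilli endpoint: `LocalIndex` and `upsert` are hypothetical names, and the dictionary only imitates an index keyed by `documentId`.

```python
from typing import Dict


class LocalIndex:
    """Hypothetical in-memory stand-in for a search index keyed by documentId."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def upsert(self, document_id: str, parsed: dict) -> str:
        """Store the document under its ID and report what happened.

        Returns "created" for a new ID and "updated" when an entry with
        the same ID already existed and was overwritten.
        """
        action = "updated" if document_id in self._docs else "created"
        self._docs[document_id] = parsed
        return action

    def count(self) -> int:
        """Number of distinct documents currently held."""
        return len(self._docs)


index = LocalIndex()
index.upsert("cand-42", {"name": "Jane Doe"})
index.upsert("cand-42", {"name": "Jane Doe", "title": "Data Engineer"})
```

Because both calls use the same ID, the second one replaces the first entry instead of adding a second copy. If the integration generated a fresh ID on every upload, the same resume would be stored twice.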

What Doesn't Work

Indexing raw resumes without parsing makes duplicate detection harder, because there are no structured fields to compare.
If you bypass structured data checks, documents with small differences may be indexed as new entries.

Best Practices to Avoid Duplication

Strategy	Description
Use consistent `documentId`	Prevents re-indexing of the same resume
Enable duplicate detection in Resume Parser	Identifies repeated documents at parse time
Store content hash	Pre-checks for file content duplication
Configure deduplication rules	Customize what constitutes a duplicate (e.g., same phone + email)
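A rule such as "same phone + email" only works if both fields are normalized before they are compared. The sketch below shows one way to do that on your side before indexing. The function names and the `email` / `phone` dictionary keys are assumptions for illustration and do not reflect RChilli's output schema.

```python
import re
from typing import Optional, Tuple


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase (a common simplification for matching)."""
    return email.strip().lower()


def normalize_phone(phone: str) -> str:
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)


def dedup_key(email: str, phone: str) -> Optional[Tuple[str, str]]:
    """Build the (email, phone) key; None when either field is missing."""
    e, p = normalize_email(email), normalize_phone(phone)
    if not e or not p:
        return None  # the rule requires both fields to be present
    return (e, p)


def is_duplicate(a: dict, b: dict) -> bool:
    """Apply the 'same phone + email' rule to two parsed records."""
    key_a = dedup_key(a.get("email", ""), a.get("phone", ""))
    key_b = dedup_key(b.get("email", ""), b.get("phone", ""))
    return key_a is not None and key_a == key_b


first = {"email": " Jane.Doe@Example.com ", "phone": "+1 (555) 010-0000"}
second = {"email": "jane.doe@example.com", "phone": "15550100000"}
same_candidate = is_duplicate(first, second)
```

Records with a missing email or phone are never treated as duplicates here, which avoids merging unrelated candidates on an empty value. Whether that is the right trade-off depends on how complete your data is.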

Summary

Question	Answer
Can RChilli check for duplicates during indexing?	Yes, via built-in or custom logic
Can duplicates be auto-prevented?	Yes, with proper API configuration
Can I delete or reset indexed data?	Yes, using `/deleteAllDocuments`

If you need help implementing deduplication logic or setting up your environment for clean indexing, reach out to support@rchilli.com.