Yes, you can check for duplicate documents when indexing into RChilli’s Search Engine System, but how this works depends on your implementation. Below is a detailed guide to RChilli's capabilities and how to manage duplicates effectively:
1. Built-In De-Duplication in RChilli Search & Match
RChilli's Search & Match Engine includes a deduplication mechanism that helps prevent duplicate entries during indexing and searching.
Key Features:
- Detects duplicates based on unique attributes such as:
  - Candidate Name
  - Email Address
  - Mobile Number
  - Resume Title
- Ensures only one instance of similar documents is retained in the index.
- Available through Search & Match APIs, especially if using `ParseAndIndex`.
2. Custom Duplicate Detection Logic
For more advanced or specific deduplication needs, you can implement custom logic before indexing.
Common Pre-Index Checks:
- Hash the resume content (SHA256/MD5) and store it in your database. Compare new uploads against existing hashes.
- Use a unique identifier like candidate ID, email, or phone number to filter duplicates.
- Enable real-time duplicate detection during parsing using Resume Parser’s capabilities:
  - This detects if the resume has already been parsed within your environment.
  - You can enable settings in the API such as `duplicatecheck: true` if you're using enhanced dynamic API configurations.
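The content-hash check from the list above can be sketched as follows. This is a minimal illustration, not RChilli code: `HashStore` is a hypothetical in-memory stand-in for your own database table of known hashes, and `content_hash` is a helper name chosen for this example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw resume bytes."""
    return hashlib.sha256(data).hexdigest()


class HashStore:
    """Hypothetical in-memory stand-in for a database table of known hashes."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, data: bytes) -> bool:
        """Report whether this content was seen before; record it if not."""
        digest = content_hash(data)
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


store = HashStore()
first_upload = store.is_duplicate(b"resume A")   # new content
second_upload = store.is_duplicate(b"resume A")  # same bytes again
```

A hash comparison only catches byte-identical files. The same resume re-saved in another format or with changed metadata produces a different hash, so this check is best combined with an identifier-based check.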
3. Managing Duplicates via API Endpoints
Delete Existing Indexed Documents
Use the /deleteAllDocuments API if you need to reset your index to remove previously stored (and possibly duplicated) documents.
Update Existing Documents
Some systems prefer to:
- Index documents with a unique `documentId`
- If a duplicate is detected, overwrite or update the existing indexed entry using that ID.
Check whether your integration passes a unique ID; this is often the key to identifying duplicates.
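The overwrite-by-ID idea can be illustrated with a small sketch. `DocumentIndex` and its `upsert` method are hypothetical names for a toy in-memory index; they are not part of the RChilli API and only show why a stable `documentId` prevents a second copy from being stored.

```python
class DocumentIndex:
    """Toy in-memory index keyed by documentId (hypothetical, not the RChilli API)."""

    def __init__(self):
        self._docs = {}

    def upsert(self, document_id: str, document: dict) -> str:
        """Insert a new document, or overwrite the entry that has the same ID."""
        action = "updated" if document_id in self._docs else "created"
        self._docs[document_id] = document
        return action

    def __len__(self):
        return len(self._docs)


index = DocumentIndex()
first = index.upsert("cand-001", {"name": "Jane Doe", "title": "Engineer"})
second = index.upsert("cand-001", {"name": "Jane Doe", "title": "Senior Engineer"})
```

Because both calls use the same ID, the second call replaces the first entry and the index still holds a single document.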
What Doesn't Work
- Indexing raw resumes without parsing will make duplication detection harder.
- If you bypass structured data checks, documents with small differences may be indexed as new entries.
Best Practices to Avoid Duplication
| Strategy | Description |
|---|---|
| Use consistent `documentId` | Prevents re-indexing of the same resume |
| Enable duplicate detection in Resume Parser | Identifies repeated documents at parse time |
| Store content hash | Pre-checks for file content duplication |
| Configure deduplication rules | Customize what constitutes a duplicate (e.g., same phone + email) |
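As an illustration of a "same phone + email" rule, the sketch below builds a normalized comparison key on your side before indexing. The function name `dedup_key` and the normalization choices (lowercased email, digits-only phone) are assumptions made for this example, not RChilli's built-in rule.

```python
import re


def dedup_key(email: str, phone: str) -> tuple:
    """Build a comparison key from a lowercased email and a digits-only phone."""
    normalized_email = email.strip().lower()
    normalized_phone = re.sub(r"\D", "", phone)  # drop spaces, dashes, parentheses, "+"
    return (normalized_email, normalized_phone)


key_a = dedup_key(" Jane.Doe@Example.com ", "+1 (555) 010-2030")
key_b = dedup_key("jane.doe@example.com", "1-555-010-2030")
same_candidate = key_a == key_b
```

Normalizing before comparing means formatting differences in the phone number or letter case in the email do not hide a duplicate.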
Summary
| Question | Answer |
|---|---|
| Can RChilli check for duplicates during indexing? | Yes, via built-in or custom logic |
| Can duplicates be auto-prevented? | With proper API configuration |
| Can I delete or reset indexed data? | Yes, using `/deleteAllDocuments` |
If you need help implementing deduplication logic or setting up your environment for clean indexing, reach out to support@rchilli.com.