iMerit Scholar Payment Analyzer

Splink Probabilistic Record Linkage

We use Splink to identify potential duplicate vendors based on name similarity, email domains, and location patterns. Splink assigns each candidate vendor pair a match probability, using a model trained on known duplicate patterns, and pairs that reach the 70% threshold are flagged as potential duplicates.

Key Finding: The 70% threshold balances precision against recall. It captures obvious duplicates while limiting false positives from common names.
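One way to check the precision and recall trade-off at a given threshold is to score a hand-labelled sample of candidate pairs. The sketch below assumes each pair carries a match probability and a reviewer verdict. The sample values and the `precision_recall` helper are hypothetical and are not output from this payrun.

```python
from typing import List, Tuple

# (match_probability, is_true_duplicate) for a hand-labelled sample.
# These values are made up for illustration only.
LABELLED_PAIRS: List[Tuple[float, bool]] = [
    (0.98, True),
    (0.91, True),
    (0.84, False),
    (0.74, True),
    (0.68, True),
    (0.66, False),
    (0.40, False),
    (0.12, False),
]


def precision_recall(pairs, threshold):
    """Return (precision, recall) when pairs at or above threshold are flagged."""
    flagged = [is_dup for prob, is_dup in pairs if prob >= threshold]
    true_positives = sum(flagged)
    total_duplicates = sum(is_dup for _, is_dup in pairs)
    # By convention, report 1.0 when the denominator is empty.
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_duplicates if total_duplicates else 1.0
    return precision, recall


for threshold in (0.65, 0.70):
    p, r = precision_recall(LABELLED_PAIRS, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy sample, moving the threshold from 0.70 to 0.65 raises recall and lowers precision. A real labelled sample would be needed to see how the trade-off looks for actual vendor data.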

NetworkX Graph Clustering

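Vendors are grouped into clusters using NetworkX graph analysis. Each vendor is a node, and an edge joins two vendors that share a linking factor such as a tax identifier or a similar name. Each cluster is a connected component of that graph: a set of potentially related vendors linked to one another directly or through intermediate vendors.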

Cluster Insights: 47 total clusters identified, ranging from simple two-vendor pairs to complex networks of four or more related profiles.
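The clustering step can be illustrated with a small connected-components sketch. To keep the example free of dependencies it uses a plain breadth-first search rather than NetworkX itself; with NetworkX, the same grouping would come from `networkx.connected_components` on a graph built from the same edges. The vendor IDs and links are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical links between vendor IDs. Each tuple is one piece of evidence
# (shared tax identifier, similar name, and so on) connecting two vendors.
LINKS = [
    ("V001", "V002"),
    ("V003", "V004"),
    ("V004", "V005"),
    ("V005", "V006"),
    ("V007", "V008"),
]


def connected_components(links):
    """Group vendors into clusters, one cluster per connected component."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects every vendor reachable from `start`.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(sorted(component))
    return clusters


clusters = connected_components(LINKS)
for cluster in clusters:
    print(len(cluster), cluster)
```

Note that V003 and V006 land in the same cluster even though no single link joins them directly. This is how larger networks of related profiles can form from pairwise evidence.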

Risk Classification

Risk levels are assigned based on multiple factors:

CRITICAL: Clear evidence of fraud or major name mismatches
HIGH: Shared tax identifiers between different individuals
MEDIUM: Name similarities or minor discrepancies requiring review
LOW: Technical duplicates or minor administrative issues
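The four levels can be read as a precedence order in which the most severe applicable rule wins. The sketch below assumes that ordering. The `ClusterFlags` fields are hypothetical names for the signals listed above, not the analyzer's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ClusterFlags:
    """Hypothetical per-cluster signals; the field names are illustrative."""
    fraud_evidence: bool = False
    major_name_mismatch: bool = False
    shared_tax_id_across_individuals: bool = False
    name_similarity: bool = False
    minor_discrepancy: bool = False


def classify_risk(flags: ClusterFlags) -> str:
    """Return the highest applicable level, checked from most to least severe."""
    if flags.fraud_evidence or flags.major_name_mismatch:
        return "CRITICAL"
    if flags.shared_tax_id_across_individuals:
        return "HIGH"
    if flags.name_similarity or flags.minor_discrepancy:
        return "MEDIUM"
    # Nothing above applies: treat as a technical duplicate or admin issue.
    return "LOW"


print(classify_risk(ClusterFlags(shared_tax_id_across_individuals=True,
                                 name_similarity=True)))
```

Checking from CRITICAL downward means a cluster with both a shared tax identifier and a name similarity is reported once, at the higher level.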

Known Limitations

False Positive Alert: Common Kenyan and Indian names (Kumar, Otieno, Ali) may trigger false matches. Manual review is recommended for South Asian and East African vendor clusters.

Additional limitations include:

Transliteration variations not fully captured
Name changes after marriage may appear as mismatches
Legitimate bank account updates may be flagged as suspicious
Cultural naming patterns may confuse the algorithm
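One possible aid for the manual review is to mark matches that rest on a surname that is common within the vendor population itself. The sketch below is an assumption about how that could be done and is not part of the current pipeline. The surname list, the `min_share` cut-off, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical vendor surnames; in practice these would come from vendor data.
SURNAMES = ["Kumar", "Kumar", "Kumar", "Otieno", "Otieno", "Ali", "Ali",
            "Fernandez", "Nakamura", "Okafor"]


def common_surnames(surnames, min_share=0.15):
    """Return case-folded surnames whose share of all vendors is at least min_share."""
    counts = Counter(name.casefold() for name in surnames)
    total = len(surnames)
    return {name for name, n in counts.items() if n / total >= min_share}


def needs_manual_review(surname_a, surname_b, common):
    """True when both surnames are equal and that surname is common in the data."""
    a, b = surname_a.casefold(), surname_b.casefold()
    return a == b and a in common


COMMON = common_surnames(SURNAMES)
print(sorted(COMMON))
print(needs_manual_review("Kumar", "KUMAR", COMMON))
print(needs_manual_review("Nakamura", "Nakamura", COMMON))
```

Deriving the list from observed frequencies, rather than from a fixed list of names, avoids hard-coding assumptions about any one region.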

Threshold Tuning Recommendations

Based on this payrun's results:

Consider lowering the Splink threshold from 70% to 65% to catch more subtle duplicates
Implement cultural name pattern exceptions for common false positives
Add bank account change detection with temporal analysis
Develop region-specific validation rules
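The bank account change detection recommended above could start from a simple time-window rule: flag any account change that is followed by a payment to the same vendor within a few days. This is a minimal sketch under that assumption. The record layout, the seven-day window, and the sample dates are all hypothetical.

```python
from datetime import datetime, timedelta


def account_changes_before_payment(changes, payments, window_days=7):
    """Return (vendor_id, changed_at, paid_at) tuples where a bank account
    change is followed by a payment to the same vendor within window_days."""
    window = timedelta(days=window_days)
    hits = []
    for vendor_id, changed_at in changes:
        for pay_vendor, paid_at in payments:
            gap = paid_at - changed_at
            if pay_vendor == vendor_id and timedelta(0) <= gap <= window:
                hits.append((vendor_id, changed_at, paid_at))
    return hits


# Illustrative records only: (vendor_id, timestamp).
CHANGES = [
    ("V010", datetime(2026, 3, 1, 9, 0)),
    ("V011", datetime(2026, 1, 5, 9, 0)),
]
PAYMENTS = [
    ("V010", datetime(2026, 3, 4, 12, 0)),
    ("V011", datetime(2026, 3, 4, 12, 0)),
]

for vendor_id, changed_at, paid_at in account_changes_before_payment(CHANGES, PAYMENTS):
    print(vendor_id, "changed", changed_at.date(), "paid", paid_at.date())
```

Here only V010 is reported, because V011 changed accounts about two months before the payment. A rule like this separates recent changes from long-standing ones instead of treating every update as suspicious.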

Data Freshness

Last Updated: 2026-03-08 15:41:58 UTC

Analysis includes all vendor data, payment history, and tax information as of the generation timestamp. Changes made after that time are not reflected in this static analysis.
