We use Splink with a 70% match-probability threshold to identify potential duplicate vendors by comparing names, email domains, and location patterns. The model, trained on known duplicate patterns, assigns each candidate vendor pair a match probability; pairs scoring at or above the threshold are flagged as potential duplicates.
Key Finding: The 70% threshold balances precision against recall, capturing obvious duplicates while limiting false positives from common names.
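As a minimal sketch of how the threshold is applied, the snippet below filters scored pairs at 0.70. The vendor IDs and scores are hypothetical; the field names unique_id_l, unique_id_r, and match_probability follow the columns Splink typically emits in its pairwise predictions, but treat them as assumptions about this pipeline's output.

```python
# Hypothetical pairwise scores. Field names mirror the columns Splink
# typically emits (assumption about this pipeline's output format).
THRESHOLD = 0.70

scored_pairs = [
    {"unique_id_l": "V001", "unique_id_r": "V002", "match_probability": 0.93},
    {"unique_id_l": "V003", "unique_id_r": "V004", "match_probability": 0.71},
    {"unique_id_l": "V005", "unique_id_r": "V006", "match_probability": 0.66},
]


def pairs_above(pairs, threshold):
    """Keep pairs whose match probability meets or exceeds the threshold."""
    return [p for p in pairs if p["match_probability"] >= threshold]


candidates = pairs_above(scored_pairs, THRESHOLD)
for pair in candidates:
    print(pair["unique_id_l"], pair["unique_id_r"], pair["match_probability"])
```

Whether the comparison is inclusive (>=) or strict (>) is a design choice that should match the production configuration.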
NetworkX Graph Clustering
Vendors are grouped into clusters using NetworkX graph analysis. Each cluster represents a connected component of potentially related vendors, whether through shared tax identifiers, similar names, or other linking factors.
Cluster Insights: 47 clusters were identified, ranging from simple 2-vendor pairs to complex networks of four or more related profiles.
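The grouping step can be illustrated with a small connected-components routine. This sketch uses a standard-library union-find as a stand-in rather than NetworkX itself; networkx.connected_components on an undirected graph built from the same links would yield the same grouping. The vendor IDs and links are hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Merge the components of every linked pair.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members under their final root.
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical links: each tuple is a pair of vendors judged to be related
# (shared tax identifier, similar name, or another linking factor).
links = [("V001", "V002"), ("V002", "V003"), ("V010", "V011")]
clusters = connected_components(links)
print(clusters)
```

Here V001 and V003 land in the same cluster even though they are never directly linked, which is how a chain of pairwise matches grows into a network of several related profiles.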
Risk Classification
Risk levels are assigned based on multiple factors:
CRITICAL: Clear evidence of fraud or major name mismatches
HIGH: Shared tax identifiers between different individuals
MEDIUM: Name similarities or minor discrepancies requiring review
LOW: Technical duplicates or minor administrative issues
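One way to read the four levels is as a precedence rule in which the most severe matching condition wins. The sketch below assumes that ordering and uses hypothetical boolean signal names; the report does not specify how the signals are derived or combined.

```python
def classify_risk(signals):
    """Map a dict of boolean signals to a risk level.

    Signal names are hypothetical; the most severe matching level wins
    (an assumption, not a documented rule).
    """
    if signals.get("fraud_evidence") or signals.get("major_name_mismatch"):
        return "CRITICAL"
    if signals.get("shared_tax_id_different_individuals"):
        return "HIGH"
    if signals.get("name_similarity") or signals.get("minor_discrepancy"):
        return "MEDIUM"
    # Technical duplicates and minor administrative issues fall through here.
    return "LOW"


print(classify_risk({"shared_tax_id_different_individuals": True,
                     "name_similarity": True}))
```

A cluster with both a shared tax identifier and a name similarity is rated HIGH under this ordering, because the check for the more severe condition runs first.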
Known Limitations
False Positive Alert: Common Kenyan and Indian names (for example Kumar, Otieno, Ali) may trigger false matches between unrelated vendors. Manual review is recommended for South Asian and East African vendor clusters.
Additional limitations include:
Transliteration variations not fully captured
Name changes after marriage may appear as mismatches
Legitimate bank account updates may be flagged as suspicious
Cultural naming patterns may confuse the algorithm
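The manual-review recommendation for common names could be supported by a simple pre-check that marks clusters containing a frequently occurring name. The sketch below is illustrative only: the name list contains just the three examples given in this report, and the whitespace tokenization is a simplifying assumption.

```python
# Only the three example names from this report; a real list would be
# curated separately (assumption).
COMMON_NAMES = {"kumar", "otieno", "ali"}


def needs_manual_review(vendor_names):
    """Return True if any vendor name in the cluster contains a common name."""
    for name in vendor_names:
        tokens = name.lower().replace(".", " ").split()
        if any(token in COMMON_NAMES for token in tokens):
            return True
    return False


print(needs_manual_review(["Rajesh Kumar", "R. Kumar"]))
print(needs_manual_review(["Acme Supplies Ltd"]))
```

A flag from this check only routes the cluster to a reviewer; it does not change the match probability or the risk level.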
Threshold Tuning Recommendations
Based on this payrun's results:
Consider lowering the Splink threshold from 70% to 65% to catch more subtle duplicates, accepting that more false positives will need review
Implement cultural name pattern exceptions for common false positives
Add bank account change detection with temporal analysis
Develop region-specific validation rules
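To make the bank account recommendation concrete, the sketch below flags vendors that received a payment shortly after their bank account details changed. The 30-day window, the record layout, and all dates and vendor IDs are hypothetical assumptions, not values taken from this payrun.

```python
from datetime import date, timedelta

# Assumption: a payment within this window after a change warrants review.
REVIEW_WINDOW = timedelta(days=30)

# Hypothetical records: (vendor_id, date the bank account was changed)
account_changes = [("V001", date(2026, 2, 20)), ("V002", date(2025, 11, 3))]
# Hypothetical records: (vendor_id, payment date)
payments = [("V001", date(2026, 3, 1)), ("V002", date(2026, 3, 1))]


def recent_change_flags(changes, payments, window):
    """Return vendor IDs paid within `window` after a bank account change."""
    flagged = set()
    for vendor_id, changed_on in changes:
        for paid_vendor, paid_on in payments:
            gap = paid_on - changed_on
            if paid_vendor == vendor_id and timedelta(0) <= gap <= window:
                flagged.add(vendor_id)
    return sorted(flagged)


flagged_vendors = recent_change_flags(account_changes, payments, REVIEW_WINDOW)
print(flagged_vendors)
```

Comparing change dates against payment dates, rather than flagging every change, is what would let routine account updates pass without review.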
Data Freshness
Last Updated: 2026-03-08 15:41:58 UTC
Analysis includes all vendor data, payment history, and tax information as of the generation timestamp. Real-time updates are not reflected in this static analysis.