2025-04-19
Geospatial Credit Risk Modeling: DBSCAN Clustering for Subprime Segmentation
Tech Stack: Python, DBSCAN, scikit-learn, pandas, GeoPandas, Matplotlib, SHAP, XGBoost, Silhouette Analysis
Project Overview
The Geospatial Credit Risk Modeling project applies advanced unsupervised learning techniques to segment subprime and thin-file credit applicants using geospatial clustering. By implementing DBSCAN on customer geolocation data enriched with socioeconomic features, this project identified four distinct risk profiles with a silhouette score of 0.53. Integration with XGBoost predictive modeling achieved an AUC-ROC of 0.74, representing a 9% improvement over baseline models. The implementation reduced portfolio default rates by 18% while increasing approval rates for lower-risk segments by 23%, demonstrating effective balance between risk management and financial inclusion.
Motivation
Traditional credit scoring systematically excludes thin-file and subprime applicants due to insufficient credit history, despite significant risk variation within these populations. Geographic and socioeconomic factors strongly correlate with credit behavior but remain underutilized in conventional models. This project addresses the gap by leveraging spatial clustering algorithms to discover risk patterns in alternative data sources, enabling differentiated underwriting strategies that improve both portfolio performance and credit accessibility for underbanked populations.
Technical Details
Data Engineering and Preprocessing
The analysis used 127,000 credit applications spanning 18 months, integrating internal application data with census demographics, ZIP-level economic indicators, and geolocation services. The preprocessing pipeline applied a logarithmic transformation to right-skewed income distributions, target encoding of categorical variables based on historical default rates, and ZIP code centroid imputation for the 8% of records with missing coordinates. Geographic coordinates were projected using appropriate CRS transformations so that Euclidean distance calculations in the clustering algorithm remained accurate.
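As an illustration of two of the preprocessing steps described above (the log transform and ZIP centroid imputation), here is a minimal sketch. The column names `zip`, `income`, `lat`, and `lon` and the toy records are assumptions made for the example, not the project's schema or data. The centroid is approximated by the mean of known coordinates in the same ZIP code, whereas a production pipeline would more likely use a reference centroid table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the project's schema.
apps = pd.DataFrame({
    "zip": ["30301", "30301", "30301", "73301", "73301"],
    "income": [28000.0, 54000.0, 31000.0, 410000.0, 36000.0],
    "lat": [33.75, 33.77, np.nan, 30.27, np.nan],
    "lon": [-84.39, -84.41, np.nan, -97.74, np.nan],
})

# Log transform compresses the right tail of the income distribution.
apps["log_income"] = np.log1p(apps["income"])

# ZIP centroid imputation: fill missing coordinates with the mean of the
# known coordinates in the same ZIP code (stand-in for a centroid lookup).
centroids = apps.groupby("zip")[["lat", "lon"]].transform("mean")
apps["coord_imputed"] = apps["lat"].isna()  # keep a flag for later auditing
apps[["lat", "lon"]] = apps[["lat", "lon"]].fillna(centroids)
```

Keeping the `coord_imputed` flag makes it possible to check later whether imputed records behave differently from records with observed coordinates.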
Feature Engineering Strategy
Engineered features augmented raw geolocation data with risk-relevant context. Population density calculations within 5-mile radii quantified urbanization levels. Haversine distance computations measured proximity to financial institutions as a banking access proxy. Neighborhood credit health scores aggregated anonymized performance metrics within geographic boundaries. These derived features created a multidimensional space combining spatial proximity with socioeconomic risk indicators, optimizing DBSCAN's ability to identify meaningful density-based clusters.
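The banking access proxy can be sketched with the standard haversine formula. The function names, the brute-force nearest-branch search, and the coordinates below are illustrative assumptions; the text does not say how the project indexed branch locations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def nearest_branch_miles(applicant, branches):
    """Distance from one applicant to the closest branch (brute-force scan)."""
    return min(haversine_miles(*applicant, *branch) for branch in branches)

# Hypothetical coordinates for illustration only.
branches = [(33.749, -84.388), (33.950, -83.990)]
applicant = (33.760, -84.400)
d = nearest_branch_miles(applicant, branches)
```

A brute-force scan is fine for a sketch. At the scale of 127,000 applicants, a spatial index such as a ball tree with a haversine metric would likely be the more practical choice.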
DBSCAN Implementation and Hyperparameter Optimization
DBSCAN was selected for its ability to identify arbitrary-shaped clusters and handle noise points without predetermined cluster counts. Hyperparameter tuning employed grid search across epsilon values (0.05 to 0.5 degrees) with silhouette score optimization. The optimal configuration set epsilon at 0.15 degrees (approximately 10 miles) and minimum samples at 50, balancing geographic coherence with statistical significance. StandardScaler normalization addressed feature scale disparities between geographic coordinates and socioeconomic variables, ensuring balanced distance metric contributions. The algorithm partitioned the dataset into four primary clusters plus a noise category. Silhouette analysis validated cluster quality at 0.53, while the Davies-Bouldin Index of 0.89 confirmed effective separation. Intra-cluster variance reduction of 34% relative to population variance demonstrated successful homogeneity within segments. Chi-square tests verified statistically significant inter-cluster default rate differences (p < 0.001).
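The tuning loop described above might look roughly like the following sketch. It runs on synthetic blobs rather than the project's data, and `grid_search_eps` is a hypothetical helper. Because the features are standardized first, `eps` here is expressed in standard-deviation units rather than degrees, and `min_samples=20` is chosen only to suit the small synthetic sample.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption for illustration).
centers = [(-5, -5), (-5, 5), (5, -5), (5, 5)]
X, _ = make_blobs(n_samples=1200, centers=centers, cluster_std=0.8, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

def grid_search_eps(X, eps_grid, min_samples):
    """Return (best_eps, best_score, labels), scoring silhouette on non-noise points."""
    best = (None, -1.0, None)
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1                    # DBSCAN labels noise points as -1
        if len(set(labels[mask])) < 2:         # silhouette needs at least 2 clusters
            continue
        score = silhouette_score(X[mask], labels[mask])
        if score > best[1]:
            best = (float(eps), score, labels)
    return best

eps_grid = np.round(np.linspace(0.05, 0.5, 10), 2)
best_eps, best_score, best_labels = grid_search_eps(X_scaled, eps_grid, min_samples=20)
```

Scoring silhouette only on non-noise points is one common convention. It can favor settings that discard many points as noise, so the noise share is worth tracking alongside the score.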
Risk Profile Analytics
Cluster characterization revealed distinct risk segments. Urban Established (34,000 applicants, 12% default rate) concentrated in high-density metros with strong banking access. Suburban Transitional (41,000 applicants, 19% default rate) exhibited moderate density and income variability. Rural Constrained (28,000 applicants, 27% default rate) showed limited financial infrastructure access and elevated risk. Economic Opportunity Zones (19,000 applicants, 16% default rate) demonstrated improving employment trajectories despite current lower incomes. The 5,000 noise-classified applicants required individualized assessment due to geographic dispersion.
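The segment figures above can be cross-checked with a few lines of arithmetic, and the same counts give a rough Pearson chi-square statistic for the differences in default rate. This is a sketch built only from the rounded numbers reported in the text, not the project's actual test. The noise segment is left out of the statistic because its default rate is not reported.

```python
# Cluster sizes and default rates as reported in the text above.
clusters = {
    "Urban Established": (34_000, 0.12),
    "Suburban Transitional": (41_000, 0.19),
    "Rural Constrained": (28_000, 0.27),
    "Economic Opportunity Zones": (19_000, 0.16),
}
NOISE = 5_000
TOTAL_APPLICATIONS = 127_000

clustered = sum(n for n, _ in clusters.values())
assert clustered + NOISE == TOTAL_APPLICATIONS  # segments account for every application
noise_share = NOISE / TOTAL_APPLICATIONS        # about 4% of applicants

# Approximate default counts implied by the rounded rates.
defaults = {name: round(n * rate) for name, (n, rate) in clusters.items()}
pooled_rate = sum(defaults.values()) / clustered

# Pearson chi-square statistic for the 4x2 table (cluster x default / no default).
chi2 = 0.0
for name, (n, _) in clusters.items():
    for observed, expected in (
        (defaults[name], n * pooled_rate),
        (n - defaults[name], n * (1 - pooled_rate)),
    ):
        chi2 += (observed - expected) ** 2 / expected

CRITICAL_DF3_P001 = 16.266  # chi-square critical value for 3 d.o.f. at alpha = 0.001
significant = chi2 > CRITICAL_DF3_P001
```

With segments this large, even modest rate differences produce a very large statistic, so the effect size (a 12% versus 27% default rate) is the more informative quantity.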
Supervised Learning Integration
Cluster labels were added as a categorical feature to an XGBoost gradient boosting model predicting binary default outcomes. A 70/30 train-validation split was used for performance evaluation. SHAP value analysis quantified feature importance and ranked cluster membership as the third most predictive feature, after debt-to-income ratio and employment duration, which supports the clustering's incremental predictive power. The model achieved an AUC-ROC of 0.74, a 9% lift over models excluding geographic segmentation, demonstrating effective integration of unsupervised and supervised techniques.
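The with-and-without comparison behind the reported lift can be sketched as follows. To keep the example dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in for XGBoost. The data are synthetic: the per-cluster base rates reuse the default rates quoted earlier, while the noise-segment rate (0.20), the debt-to-income effect, and all names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 6000

# Synthetic applicants (assumption); cluster label -1 represents DBSCAN noise.
cluster = rng.choice([-1, 0, 1, 2, 3], size=n)
dti = rng.uniform(0.1, 0.6, size=n)  # debt-to-income ratio
rates = {-1: 0.20, 0: 0.12, 1: 0.19, 2: 0.27, 3: 0.16}
base = np.array([rates[int(c)] for c in cluster])
p_default = np.clip(base + 0.5 * (dti - 0.35), 0.01, 0.99)
y = rng.binomial(1, p_default)

features = pd.DataFrame({"dti": dti, "cluster": cluster})
X_with = pd.get_dummies(features, columns=["cluster"], dtype=float)  # one-hot labels
X_without = features[["dti"]]

def validation_auc(X, target):
    """Fit on 70% of the rows and report AUC-ROC on the held-out 30%."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, target, test_size=0.30, random_state=0, stratify=target
    )
    model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

auc_with = validation_auc(X_with, y)
auc_without = validation_auc(X_without, y)
```

One-hot encoding treats the noise label as its own category, which avoids implying any ordering among cluster IDs.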
Visualization and Deployment
GeoPandas choropleth maps visualized spatial risk distributions, overlaying cluster assignments with default rate heat maps. Interactive dashboards prototyped in Tableau integrated real-time application scoring with cluster-based risk signals, enabling dynamic underwriting workflows. Automated cluster assignment reduced decision latency by 22% through pre-computed risk stratification.
Technical Challenges and Solutions
• Epsilon Sensitivity: DBSCAN performance depends critically on epsilon parameter selection. Systematic grid search with silhouette score evaluation across candidate values identified epsilon = 0.15 as optimal. Sensitivity analysis confirmed robustness within the 0.12 to 0.18 range, ensuring stability against minor parameter variations.
• Feature Scaling: Disparate measurement scales between geographic degrees and socioeconomic units required careful normalization. StandardScaler transformation ensured proportional distance metric contributions while preserving relative feature importance for interpretability.
• Noise Handling: Four percent of applicants exhibited insufficient local density for cluster assignment. Rather than degrading cluster purity through forced assignment, a dedicated underwriting pathway processed noise cases individually, maintaining analytical rigor.
• Regulatory Compliance: Geographic segmentation underwent legal review for fair lending compliance. Disparate impact analysis confirmed approval rate variations remained within regulatory thresholds when controlling for legitimate creditworthiness factors, validating the methodology's legal defensibility.
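For the regulatory compliance item, the text does not specify which disparate impact test or threshold was applied. One common screening statistic is the adverse impact ratio, sketched below with the four-fifths heuristic as an assumed threshold. The counts are hypothetical, and this unadjusted ratio does not control for creditworthiness factors as the analysis described above did.

```python
def adverse_impact_ratio(approved_a, total_a, approved_b, total_b):
    """Ratio of the lower approval rate to the higher one between two groups."""
    rate_a = approved_a / total_a
    rate_b = approved_b / total_b
    low, high = sorted((rate_a, rate_b))
    return low / high

# Hypothetical counts for illustration only; not figures from this project.
ratio = adverse_impact_ratio(approved_a=420, total_a=1000,
                             approved_b=500, total_b=1000)

THRESHOLD = 0.80  # four-fifths heuristic, used here as an assumed screening level
passes_screen = ratio >= THRESHOLD
```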
Results and Impact
• Implementation delivered quantifiable improvements across risk and business metrics. The 18% default rate reduction translated directly to decreased charge-offs and improved portfolio quality. Approval rate increases of 23% for low-risk clusters expanded credit access to 7,820 previously excluded applicants. Revenue per approval increased 12% through risk-aligned interest rate stratification. Underwriting efficiency gains of 22% in decision time demonstrated operational scalability.
• Strategically, the framework enabled targeted marketing in favorable geographies, cluster-specific product development, and data-driven partnership strategies with community development financial institutions. The analytical approach provided competitive differentiation in underserved markets where traditional lenders applied overly conservative uniform policies.
Conclusion
This project demonstrates the effective application of density-based clustering algorithms to alternative credit data for risk segmentation. The technical implementation successfully combined DBSCAN's spatial pattern recognition with XGBoost's predictive modeling, validated through rigorous statistical metrics and business outcomes. By uncovering geographic risk structure in thin-file populations, the methodology enabled more precise, equitable underwriting while maintaining portfolio performance, establishing a reproducible framework for financial inclusion through advanced analytics.