Loan Prediction App
Building a simple loan prediction web app
I designed this project to use AWS RDS for hosting the data, AutoGluon for building the machine learning model, and Flask for the app. The use case is to predict whether a loan will be paid off.
App
GitHub
Tech Stack
- Frontend: Flask
- Database: AWS RDS (MariaDB)
- File Storage: AWS S3
- Machine Learning: AutoGluon
- Deployment: Render
Flask was chosen for the frontend as it is a popular web framework for Python. RDS was chosen for the database as it is a popular managed database service; the MariaDB variant was chosen due to my own familiarity with it. AutoGluon was chosen for its robust ensemble methods and ease of baseline benchmarking. Render was chosen as the deployment platform for its low cost and support for recent Python libraries.
Architecture
Step 1: Searching for Open Source Data
I started by searching for open-source data for this project and found the PKDD’99 Financial dataset in the CTU Relational Dataset Repository. The dataset consists of 606 loans that were paid off and 76 that were not, along with their account information and transactions, spread across 8 tables. For this project, the dataset was filtered to finished loans only.
Step 2: Download Dataset
I used MySQL Workbench to connect to the CTU Relational database and exported the dataset in SQL dump format.
Step 3: Setting up RDS Instance
Next, I created a db.t4g.micro instance running MariaDB, as this is covered by the AWS free tier and is sufficient for this project. Because the RDS instance is provisioned in a private network, I set up an EC2 instance as a bastion host to connect to it. This is best practice: an RDS instance should not be reachable over the public internet for security purposes. I connected to the RDS instance through the bastion and uploaded the dataset.
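The bastion connection boils down to an SSH local port forward. The sketch below builds that command with Python's standard library; the key path, bastion address, and RDS endpoint are placeholders, not the project's actual values:

```python
import subprocess

# Placeholders -- substitute your own bastion address and RDS endpoint.
BASTION = "ec2-user@ec2-x-x-x-x.compute.amazonaws.com"
RDS_ENDPOINT = "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com"
LOCAL_PORT, DB_PORT = 3307, 3306

# Equivalent to: ssh -i key.pem -N -L 3307:<rds-endpoint>:3306 <bastion>
# Forwards localhost:3307 to the private RDS instance via the bastion.
tunnel_cmd = [
    "ssh", "-i", "key.pem", "-N",
    "-L", f"{LOCAL_PORT}:{RDS_ENDPOINT}:{DB_PORT}",
    BASTION,
]

# subprocess.run(tunnel_cmd)  # opens the tunnel; runs until interrupted
```

With the tunnel open, any MySQL/MariaDB client can connect to `localhost:3307` as if the database were local.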
Step 4: EDA
To design the machine learning model, I performed EDA in a Jupyter notebook to understand the dataset and identify candidate features. I connected from my machine through the EC2 bastion to the RDS instance. This path is only practical for querying small amounts of data, so only aggregations and dataset previews were feasible. The EDA focused on understanding the distribution of the data, for example the count of card types.
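The card-type count mentioned above is a simple group-by. The snippet below sketches it with pandas on a toy stand-in for the `card` table (column names follow the PKDD schema; the real data was read over the tunneled RDS connection, e.g. via `pd.read_sql`):

```python
import pandas as pd

# Toy stand-in for the PKDD `card` table.
card = pd.DataFrame({
    "card_id": [1, 2, 3, 4, 5],
    "type": ["classic", "classic", "junior", "gold", "classic"],
})

# Distribution of card types -- the kind of lightweight aggregation
# that is feasible over the bastion tunnel.
type_counts = card["type"].value_counts()
print(type_counts)
```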
Example EDA: Loan Amount by Status
Step 5: Export Data from RDS to S3
To train the model on the full dataset, the data was exported from RDS to S3: a database snapshot was created, and the snapshot was then exported to an S3 bucket.
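The snapshot export can also be scripted via boto3's `start_export_task`. Below is a hedged sketch of the request parameters; every ARN, bucket, and role name is a placeholder, and the actual call is left commented out:

```python
# Parameters for an RDS snapshot export to S3. All identifiers below
# are placeholders -- substitute your own snapshot ARN, bucket, IAM
# role, and KMS key.
export_params = {
    "ExportTaskIdentifier": "financial-dataset-export",
    "SourceArn": "arn:aws:rds:us-east-1:123456789012:snapshot:financial-snapshot",
    "S3BucketName": "my-loan-data-bucket",
    "IamRoleArn": "arn:aws:iam::123456789012:role/rds-s3-export-role",
    "KmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/example-key-id",
}

# import boto3
# boto3.client("rds", region_name="us-east-1").start_export_task(**export_params)
```

The export lands in S3 as Parquet files, which can then be read directly into pandas for training.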
Step 6: Feature Engineering
Initial manual feature engineering yielded poor separation. I adopted a proven feature set (attributed to Zhou Xu), focusing on temporal aggregations like days_between (loan date vs. account creation) and financial velocity metrics like avg_trans_amount.
Below are the features used to create the prediction model:
- loan duration
- loan amount
- loan payments
- day between loan date and account creation date
- account frequency
- average order amount
- average transaction amount
- average transaction balance
- number of transactions
- card type
- average salary in the district
- gender
- age
The Loan table was joined with the Account, Disposition, Card, Client, District, Transaction, and Order tables to create these features.
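Two of those features can be sketched with pandas on toy rows; the column names below follow the PKDD schema, but the rows are illustrative and the full pipeline joins all eight tables:

```python
import pandas as pd

# Toy loan, account, and transaction rows (PKDD-style column names).
loan = pd.DataFrame({
    "account_id": [1, 2],
    "loan_date": pd.to_datetime(["1996-04-29", "1997-12-08"]),
    "amount": [80952, 30276],
})
account = pd.DataFrame({
    "account_id": [1, 2],
    "creation_date": pd.to_datetime(["1995-03-24", "1996-02-21"]),
})
trans = pd.DataFrame({
    "account_id": [1, 1, 2],
    "amount": [700, 1100, 500],
})

features = loan.merge(account, on="account_id")
# Days between loan date and account creation date.
features["days_between"] = (features["loan_date"] - features["creation_date"]).dt.days
# Average transaction amount per account.
avg_trans = trans.groupby("account_id")["amount"].mean().rename("avg_trans_amount")
features = features.merge(avg_trans, on="account_id")
```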
Step 7: Model Training
The engineered dataset was split 70/30 into 202 rows for training and 87 rows for testing. AutoGluon's TabularPredictor class was used to train the model with the hyperparameters below.
```python
hyperparameters = {
    'NN_TORCH': [{}],
    # 'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, {'learning_rate': 0.03, 'num_leaves': 128, 'feature_fraction': 0.9, 'min_data_in_leaf': 3, 'ag_args': {'name_suffix': 'Large', 'priority': 0, 'hyperparameter_tune_kwargs': None}}],
    'CAT': [{}],
    # 'XGB': [{}],
    'FASTAI': [{}],
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}

predictor = (
    TabularPredictor(
        label='loan_status',
        path='AutogluonModels/final'
    )
    .fit(
        train_data,
        hyperparameters=hyperparameters,
        presets='optimize_for_deployment'
    )
)
```
Training was fast due to the small data size. The trained model was saved under the AutogluonModels/final folder and evaluated against the test dataset, with the results below.
```python
{'accuracy': 0.8850574712643678,
 'balanced_accuracy': 0.5435064935064935,
 'mcc': 0.185184647595632,
 'roc_auc': 0.5701298701298704,
 'f1': 0.16666666666666666,
 'precision': 0.5,
 'recall': 0.1}
```
Based on the predictor leaderboard, the models kept are NeuralNetFastAI and WeightedEnsemble_L2.
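The reported scores are mutually consistent with a single confusion matrix: taking "not paid off" as the positive class, precision 0.5 and recall 0.1 imply TP=1, FP=1, FN=9, and with 87 test rows that leaves TN=76. The check below recomputes the headline metrics from those counts, using only arithmetic:

```python
# Confusion-matrix counts implied by the reported precision and recall
# ("not paid off" taken as the positive class; 87 test rows total).
tp, fp, fn, tn = 1, 1, 9, 76

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
mcc = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5

print(round(accuracy, 4), round(balanced_accuracy, 4), round(f1, 4), round(mcc, 4))
# accuracy ~ 0.8851, balanced_accuracy ~ 0.5435, f1 ~ 0.1667, mcc ~ 0.1852
```

In other words, the model caught only 1 of the 10 bad loans in the test set; the high accuracy comes almost entirely from the majority (paid-off) class.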
Step 8: App Development
The web app was created with the Flask framework and Pico CSS, chosen for their ease of use. The app consists of a form that collects the inputs used to predict whether the loan will be paid off. The form was built with FlaskForm due to its native integration with Flask.
After filling in the form and clicking the “Predict loan status!” button, the app uses the trained model to predict the loan status.
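A stripped-down sketch of that form-to-prediction flow is below. The field name and the predict stub are illustrative only; the real app uses FlaskForm and loads the trained AutoGluon model via `TabularPredictor.load(...)`:

```python
from flask import Flask, request

app = Flask(__name__)

def predict_loan_status(features) -> str:
    # Stand-in for predictor.predict on a one-row DataFrame; the real
    # app calls the trained AutoGluon TabularPredictor here. The
    # threshold below is purely illustrative.
    return "paid off" if float(features["loan_amount"]) < 100000 else "not paid off"

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        result = predict_loan_status(request.form)
        return f"Predicted loan status: {result}"
    return '''<form method="post">
        <input name="loan_amount" placeholder="Loan amount">
        <button type="submit">Predict loan status!</button>
    </form>'''
```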
Step 9: App Deployment
Initially, I tried to make this app serverless using AWS SAM (zip method), but the package was too big due to the inclusion of the AutoGluon library. Next, I tried to deploy the app on DigitalOcean using both zip and Docker methods; the zip method failed due to Python library incompatibilities, and the Docker method failed because the health check could not reach the container port. Eventually, I found the Render platform through Reddit threads.
I followed the Render Deploy Flask documentation to deploy my app, then the Custom Domains on Render documentation to add my domain.
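For reference, a Render deployment can also be declared as a render.yaml blueprint. The fragment below is an assumed setup for a Flask app served by gunicorn; the service name and the `app:app` module path are placeholders:

```yaml
services:
  - type: web
    name: loan-prediction-app          # assumed service name
    runtime: python
    buildCommand: pip install -r requirements.txt
    startCommand: gunicorn app:app     # assumes the Flask object is `app` in app.py
```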
Now the app is reachable from here! 🎉
Afterthoughts
Full-Stack ML Engineering Challenges
Integrating a secure cloud database with a modern ML pipeline revealed the complexity of end-to-end development. The primary challenge was deployment constraints: high-performance AutoML libraries like AutoGluon generate large model artifacts that exceed the size limits of standard serverless functions (AWS Lambda).
Lessons Learned
- Infrastructure Security: Configuring a Bastion Host to access a private RDS instance is critical for production security, though it introduces latency during the EDA phase.
- Model Trade-offs: While AutoGluon accelerated training, its heavy footprint required a containerized deployment (Render/Docker) rather than a serverless one.
- Imbalanced Data: The high accuracy but low recall score highlights the need for advanced handling of class imbalance (e.g., SMOTE) in future iterations.
Future Enhancements
- Apply SMOTE (Synthetic Minority Over-sampling Technique) or adjust the class weights to prioritize recall over accuracy
- Improve app UI & UX
- Improve the robustness of prediction model
- Migrate to a serverless architecture (AWS Lambda) by refactoring the model from AutoGluon to a lightweight framework like scikit-learn or ONNX Runtime to meet package size constraints
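As an illustration of the class-weight option, the sketch below uses scikit-learn rather than AutoGluon, on synthetic imbalanced data with roughly the same minority-class share as the loan dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced binary problem (~10% positives), mimicking the
# paid-off vs. not-paid-off skew in the loan data.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# class_weight="balanced" reweights errors inversely to class frequency,
# trading some overall accuracy for recall on the minority class.
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X, y)
```

In AutoGluon the analogous lever is choosing an evaluation metric that penalizes missed positives (e.g. balanced accuracy or F1) rather than plain accuracy.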