
Wes Kim (Unyoung)


Synthetic Data Generation for Anomaly Detection

Machine Learning · Synthetic Data · Fraud Detection · Data Augmentation

Synthetic data generation and data augmentation to increase the robustness of blockchain fraud detection.

Anomaly detection problems typically suffer from extreme class imbalance: "normal" events dominate, while anomalies (e.g., fraud) are rare but costly to miss. We investigate whether synthetic minority data augmentation improves supervised fraud detection. Using the public credit-card transactions dataset (features V1–V28, Time, Amount; label Class), we compare (i) baseline Logistic Regression and (ii) Isolation Forest, then evaluate two augmentation strategies—SMOTE and CTGAN—applied only to the training set to avoid leakage.

Our primary objective is to increase recall on the minority class (fraud) while maintaining reasonable precision. CTGAN augmentation increases minority recall from 0.71 (baseline) to as high as 0.85 (with synthetic anomalies equal to 100% of the original dataset size), with modest precision trade-offs and an overall F1 improvement from 0.79 to 0.82–0.83 at ratios of 10% and above. We outline implications for blockchain anomaly detection and discuss best practices for robust evaluation under heavy imbalance.

1. Problem Framing & Metrics

Goal: Maximize detection of fraudulent transactions (positive class = Class=1) at acceptable false-positive rates.

Why accuracy is insufficient: With ~0.17% positives, a trivial "all normal" classifier attains high accuracy but zero utility. Report Precision, Recall, and F1; PR curves/PR-AUC are recommended.
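
As a minimal sketch (variable names here are illustrative, not taken from the project code), the evaluation can report accuracy alongside PR-based metrics so the imbalance problem stays visible:

```python
# Minimal evaluation sketch for a heavily imbalanced classifier.
# y_true: ground-truth labels, y_pred: hard predictions, scores: fraud probabilities.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, average_precision_score)

def report_metrics(y_true, y_pred, scores):
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}  # ~0.999 even for 'predict all normal'")
    print(f"Precision: {precision_score(y_true, y_pred):.2f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.2f}")
    print(f"F1       : {f1_score(y_true, y_pred):.2f}")
    # Average precision summarizes the full precision-recall curve (PR-AUC).
    print(f"PR-AUC   : {average_precision_score(y_true, scores):.3f}")
```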

2. Data & Preprocessing

  • Features: V1..V28, Time, Amount; label Class
  • Scaling: RobustScaler for Time and Amount
  • Split: 80/20 train/test with random_state=2
  • No leakage: augmentation is applied only to the training set (see the sketch below)
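
A minimal preprocessing sketch under these choices; the creditcard.csv path and the stratified split are assumptions rather than confirmed details of the experiments:

```python
# Preprocessing sketch: scale Time/Amount robustly and split 80/20.
# The scaler is fit on the training split only, so no test-set statistics leak in.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

df = pd.read_csv("creditcard.csv")                      # assumed file name
X, y = df.drop(columns=["Class"]), df["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y)    # stratify is an assumption

scaler = RobustScaler()
X_train[["Time", "Amount"]] = scaler.fit_transform(X_train[["Time", "Amount"]])
X_test[["Time", "Amount"]] = scaler.transform(X_test[["Time", "Amount"]])
```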

3. Models

  • Logistic Regression (supervised) — simple, strong baseline
  • Isolation Forest (unsupervised) — anomaly scoring (fit on X only); performance depends on the contamination/threshold choice (see the sketch below)
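
A sketch of both baselines, continuing from the preprocessing sketch above; hyperparameters such as max_iter and contamination are illustrative assumptions (contamination ≈ 0.0017 mirrors the positive rate noted earlier):

```python
# Baseline models: a supervised logistic regression and an unsupervised
# Isolation Forest fit on the features only.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import IsolationForest

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)

iso = IsolationForest(contamination=0.0017, random_state=2)   # implicit decision threshold
iso.fit(X_train)                                              # fit on X only, no labels
y_pred_iso = (iso.predict(X_test) == -1).astype(int)          # -1 (anomaly) -> 1 (fraud)
```

The contamination setting effectively picks the anomaly-score threshold, which is why Isolation Forest's precision/recall is so sensitive to it.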

4. Baseline Results (No Augmentation)

  • Logistic Regression: Precision 0.90, Recall 0.71, F1 0.79 (Accuracy 0.999—uninformative here)
  • Isolation Forest: Precision 0.30, Recall 0.44, F1 0.36 (Accuracy 0.999)

Takeaway: Supervised baseline outperforms out-of-the-box Isolation Forest.

5. Data Augmentation Methods

5.1 SMOTE: Interpolates between minority neighbors to balance classes. Outcome: Recall ≈ 0.90 but Precision ≈ 0.05 → minority F1 ≈ 0.10 (too many false positives).
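
A minimal SMOTE sketch with imbalanced-learn, continuing from the sketches above; by default SMOTE balances the two classes 1:1, which is what drives the precision collapse reported here:

```python
# SMOTE oversampling applied to the training split only.
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

smote = SMOTE(random_state=2)                     # default: fully balance the classes
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

logreg_sm = LogisticRegression(max_iter=1000)
logreg_sm.fit(X_train_sm, y_train_sm)
```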

5.2 CTGAN: Generates synthetic frauds; we add synthetic anomalies to training at ratios 0.5%, 1%, 5%, 10%, 20%, 50%, 100% of dataset size.
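
A CTGAN augmentation sketch; the standalone ctgan package and the epoch count are tooling assumptions, not confirmed details of the experiments. The generator is fit on the training-split fraud rows only, and synthetic frauds are appended at the chosen ratio:

```python
# CTGAN-based minority augmentation of the training split.
import pandas as pd
from ctgan import CTGAN

fraud_train = X_train[y_train == 1]               # real fraud rows from the training split
ctgan = CTGAN(epochs=300)                         # epoch count is an assumption
ctgan.fit(fraud_train)

ratio = 0.10                                      # e.g. synthetic frauds = 10% of dataset size
n_synth = int(ratio * len(df))
synth_fraud = ctgan.sample(n_synth)

X_train_aug = pd.concat([X_train, synth_fraud], ignore_index=True)
y_train_aug = pd.concat([y_train, pd.Series([1] * n_synth)], ignore_index=True)
```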

6. Results

All metrics below are on the same held-out real test set (no synthetic data).

Baselines:

  • Logistic Regression: P=0.90, R=0.71, F1=0.79
  • Isolation Forest: P=0.30, R=0.44, F1=0.36

SMOTE (Logistic Regression): Minority Recall ≈ 0.90, Precision ≈ 0.05, F1 ≈ 0.10. Interpretation: excessive false positives; poor precision–recall balance.

CTGAN (Logistic Regression) — representative points:

  • 0.5% synthetic: P=0.86, R=0.74, F1=0.79
  • 1% synthetic: P=0.87, R=0.74, F1=0.80
  • 5% synthetic: P=0.84, R=0.77, F1=0.81
  • 10% synthetic: P=0.87, R=0.79, F1=0.82
  • 20% synthetic: P=0.86, R=0.81, F1=0.83
  • 50% synthetic: P=0.82, R=0.83, F1=0.83
  • 100% synthetic: P=0.82, R=0.85, F1=0.83

Summary: Versus the baseline, CTGAN improves F1 by 2–4 points and recall by 6–14 points, with precision remaining reasonable (0.82–0.87). Gains hold across sample ratios from 5% to 100%, with a sweet spot around 10–20% where recall rises while precision stays strong. Metric variance is expected given the small number of positives in the test set; repeated runs and stratified splits are advised.

7. Discussion

CTGAN captures complex minority structure better than linear interpolation (SMOTE), enriching the minority region near the decision boundary where it matters. It substantially boosts recall without collapsing precision. Guard against generator artifacts with realism checks (TSTR/TRTS, nearest-neighbor distances, UMAP visualization).
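
One possible nearest-neighbor check, sketched below with the same assumed variables as earlier; it is illustrative rather than the exact realism check used:

```python
# Distance from each synthetic fraud to its nearest real fraud.
# Near-zero distances hint at memorization; very large ones hint at off-manifold samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=1).fit(fraud_train.values)
dist, _ = nn.kneighbors(synth_fraud.values)
print(f"NN distance to real frauds: median={np.median(dist):.3f}, "
      f"5th pct={np.percentile(dist, 5):.3f}, 95th pct={np.percentile(dist, 95):.3f}")
```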

8. Extension to Blockchain Anomaly Detection

Feature families: transaction-level, address-level, graph-structural, temporal, and cross-entity.

Preprocessing: robust scaling/log transforms; avoid temporal/causal leakage.

Augmentation: CTGAN per segment (token/protocol/behavior cluster); treat categorical/discrete columns explicitly; validate realism so a classifier can't trivially separate real vs synthetic.
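
A hypothetical "real vs. synthetic" discriminator check along these lines (illustrative names; any sufficiently flexible classifier would do):

```python
# Classifier two-sample test: can a model tell real frauds from synthetic ones?
# Cross-validated AUC near 0.5 means the synthetic rows blend in;
# AUC near 1.0 flags obvious generator artifacts.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_rvs = pd.concat([fraud_train, synth_fraud], ignore_index=True)
y_rvs = np.r_[np.zeros(len(fraud_train)), np.ones(len(synth_fraud))]

auc = cross_val_score(GradientBoostingClassifier(), X_rvs, y_rvs,
                      cv=5, scoring="roc_auc").mean()
print(f"Real-vs-synthetic AUC: {auc:.2f}")
```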

9. Conclusion

On an imbalanced credit-card dataset, CTGAN-based minority augmentation consistently improves recall and F1 over a strong logistic baseline while keeping precision high—unlike SMOTE, which induces excessive false positives. With careful temporal validation, threshold selection, and realism checks, this approach is promising for blockchain anomaly detection, where labeled anomalies are scarce and heterogeneous.
