<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://blog.karanjkar.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.karanjkar.com/" rel="alternate" type="text/html" /><updated>2026-06-04T16:21:03+00:00</updated><id>https://blog.karanjkar.com/feed.xml</id><title type="html">Vijay Karanjkar’s Blog</title><subtitle>Software, Machine Learning, &amp; Quantitative Finance. Writing and thoughts by Vijay Karanjkar.</subtitle><entry><title type="html">Catching Fraudsters with Graphs: Credit Card Fraud Detection</title><link href="https://blog.karanjkar.com/machine-learning/data-science/2026/06/04/credit-card-fraud-detection.html" rel="alternate" type="text/html" title="Catching Fraudsters with Graphs: Credit Card Fraud Detection" /><published>2026-06-04T16:14:00+00:00</published><updated>2026-06-04T16:14:00+00:00</updated><id>https://blog.karanjkar.com/machine-learning/data-science/2026/06/04/credit-card-fraud-detection</id><content type="html" xml:base="https://blog.karanjkar.com/machine-learning/data-science/2026/06/04/credit-card-fraud-detection.html"><![CDATA[<p>Credit card fraud is an ever-evolving arms race. As security measures improve, fraudsters develop increasingly sophisticated techniques to bypass them, often operating in organized, complex networks. Recently, I embarked on a project to tackle this challenge using the <strong>IEEE-CIS Fraud Detection dataset</strong>—and the journey took me from standard machine learning algorithms all the way to advanced Graph Neural Networks (GNNs).</p>

<p>Here is a look into how I built the pipeline and why modeling transactions as a graph fundamentally changed the game.</p>

<h2 id="the-challenge--the-data">The Challenge &amp; The Data</h2>

<p>The core of the problem lies in predicting a binary target: <code class="language-plaintext highlighter-rouge">isFraud</code>. The dataset provides a rich playground, split across two main tables:</p>
<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">transaction</code></strong>: Contains the core details of the payment (amounts, timestamps as timedeltas, card information).</li>
  <li><strong><code class="language-plaintext highlighter-rouge">identity</code></strong>: Contains network and device information associated with the transaction.</li>
</ul>

<p>Joined on <code class="language-plaintext highlighter-rouge">TransactionID</code>, the two tables give a detailed but extremely imbalanced view of online transactions. The biggest hurdle is that fraudulent transactions make up a tiny fraction of the dataset, so a model that simply guesses “not fraud” every time still achieves a high accuracy while catching no fraud at all.</p>
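<p>To make the join and the accuracy trap concrete, here is a minimal sketch in plain Python. The rows and values are invented for illustration, and it assumes that not every transaction has a matching identity row; it is not the project’s preprocessing code. With pandas, the same join would typically be a single <code class="language-plaintext highlighter-rouge">merge</code> call with <code class="language-plaintext highlighter-rouge">on="TransactionID"</code> and <code class="language-plaintext highlighter-rouge">how="left"</code>.</p>

```python
# Toy rows standing in for the two tables; all values are invented.
transactions = [
    {"TransactionID": 1, "TransactionAmt": 59.0, "isFraud": 0},
    {"TransactionID": 2, "TransactionAmt": 120.5, "isFraud": 0},
    {"TransactionID": 3, "TransactionAmt": 999.0, "isFraud": 1},
    {"TransactionID": 4, "TransactionAmt": 15.0, "isFraud": 0},
]
identity = [
    {"TransactionID": 2, "DeviceType": "mobile"},
    {"TransactionID": 3, "DeviceType": "desktop"},
]


def left_join(left, right, key):
    """Keep every left row; attach the right-hand columns when a match exists."""
    right_by_key = {row[key]: row for row in right}
    right_cols = sorted({col for row in right for col in row if col != key})
    joined = []
    for row in left:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for col in right_cols:
            merged[col] = match.get(col)  # None marks a missing identity row
        joined.append(merged)
    return joined


rows = left_join(transactions, identity, "TransactionID")

# A "model" that always predicts 0 (not fraud).
labels = [row["isFraud"] for row in rows]
majority_accuracy = sum(1 for y in labels if y == 0) / len(labels)
majority_recall = 0.0  # it never flags anything, so it catches no fraud

print(rows[0])
print(f"accuracy={majority_accuracy:.2f}, recall={majority_recall:.2f}")
```

<p>Even on this toy data the always-negative guess scores 75% accuracy with zero recall, which is why accuracy alone is a poor yardstick here.</p>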

<h2 id="building-the-standard-pipeline">Building the Standard Pipeline</h2>

<p>Before jumping into complex architectures, it’s crucial to establish a strong baseline. My initial approach was a standard end-to-end machine learning pipeline:</p>

<ol>
  <li><strong>Extensive Preprocessing</strong>: Cleaning data, handling missing values, feature engineering, and dimensionality reduction.</li>
  <li><strong>Handling Imbalance</strong>: I used <strong>SMOTE (Synthetic Minority Over-sampling Technique)</strong> to generate synthetic examples of fraud, so that the models learn the patterns of fraudulent behavior rather than ignoring the minority class.</li>
  <li><strong>Training &amp; Evaluation</strong>: I trained a suite of models including SVMs, K-Nearest Neighbors, AdaBoost, Random Forests, XGBoost, and LightGBM.</li>
</ol>
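<p>The core idea behind SMOTE in step 2 can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The function name, the toy points, and the parameters below are assumptions made for illustration; this is a simplified sketch, not the implementation used in the project. In practice a library such as imbalanced-learn offers this through <code class="language-plaintext highlighter-rouge">SMOTE().fit_resample(X, y)</code>.</p>

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Invented 2-D feature vectors for the fraud (minority) class.
fraud = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(fraud, n_new=6)
print(len(fraud) + len(new_points), "fraud samples after oversampling")
```

<p>Because every synthetic point lies between two real fraud samples, oversampling should be applied to the training split only; resampling the evaluation data would change the class balance the metrics are measured on.</p>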

<h3 id="the-baseline-results">The Baseline Results</h3>

<p>Evaluating these models on an unsampled 50k validation set gave the ranking I expected, with gradient boosting frameworks leading the tabular models:</p>

<ul>
  <li><strong>Random Forest</strong> achieved a respectable 43.85% F1-Score with $111k in projected savings.</li>
  <li><strong>LightGBM</strong> pushed the F1-Score to 49.55%.</li>
  <li><strong>XGBoost</strong> topped the standard models with an F1-Score of 50.52% and a projected savings of <strong>$174,060</strong>.</li>
</ul>

<p>While an F1-score of about 50% might sound low in other domains, in the highly imbalanced world of fraud detection it is a solid baseline. But with XGBoost’s recall at only 36.63%, I knew a lot of fraud was still slipping through the cracks.</p>
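<p>The F1-score and recall together pin down the precision, which shows where the baseline is weak. The short calculation below inverts the F1 formula using the XGBoost figures reported in this post (F1 50.52%, recall 36.63%); the function names are my own, and the result is derived arithmetic rather than a separately measured number.</p>

```python
def precision_from_f1_and_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to solve for precision P."""
    return f1 * recall / (2 * recall - f1)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


xgb_recall = 0.3663  # XGBoost recall reported in the comparison table
xgb_f1 = 0.5052      # XGBoost F1-score reported in the comparison table

xgb_precision = precision_from_f1_and_recall(xgb_f1, xgb_recall)
print(f"implied precision: {xgb_precision:.1%}")
```

<p>The implied precision is roughly 81%: when the baseline flags a transaction it is usually right, but it flags too few of them, so the low F1 is driven by missed fraud rather than false alarms.</p>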

<h2 id="the-secret-weapon-graph-neural-networks-gnns">The Secret Weapon: Graph Neural Networks (GNNs)</h2>

<p>Standard tabular models treat every transaction as an isolated event. But fraud rarely happens in a vacuum. Fraudsters use shared devices, similar IP addresses, and connected email networks. To capture these <em>relationships</em>, I turned to <strong>Graph Neural Networks</strong>.</p>

<p>By modeling the data as a graph—where nodes represent entities (like a specific credit card or an IP address) and edges represent the transactions between them—the model can learn to identify complex fraud rings.</p>
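<p>As a rough illustration of that graph view, the sketch below links cards to devices through invented transactions and then groups entities that are reachable from one another. The field names and values are hypothetical, and the connected-components step is only a simple stand-in for the kind of structure a GNN can exploit; it is not the graph construction used in the project.</p>

```python
from collections import defaultdict

# Invented transactions: each one links a card to a device.
transactions = [
    {"TransactionID": 1, "card": "card_A", "device": "dev_1"},
    {"TransactionID": 2, "card": "card_B", "device": "dev_1"},
    {"TransactionID": 3, "card": "card_B", "device": "dev_2"},
    {"TransactionID": 4, "card": "card_C", "device": "dev_3"},
]

# Adjacency list: entity -> set of entities it shares a transaction with.
graph = defaultdict(set)
for tx in transactions:
    graph[tx["card"]].add(tx["device"])
    graph[tx["device"]].add(tx["card"])


def connected_components(adjacency):
    """Group entities that are reachable from one another via transactions."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


components = connected_components(graph)
for component in components:
    print(sorted(component))
```

<p>Here <code class="language-plaintext highlighter-rouge">card_A</code> and <code class="language-plaintext highlighter-rouge">card_B</code> end up in the same cluster because they share a device, a link that a row-by-row tabular model never sees.</p>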

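<p>I evaluated the GNN on a SMOTE-resampled graph with over 1.1 million relationships. This differs from the unsampled 50k validation set used for the tabular baselines, so the two sets of figures are not strictly like-for-like.</p>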
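<p>To show what a GNN layer does with such a graph, here is a minimal sketch of one message-passing step: each node replaces its feature with the average of its own value and its neighbours’ values. The node names, the scalar “risk” feature, and the function are invented for illustration. Real GNN layers add learned weights and non-linearities, and this is not the architecture used in the project.</p>

```python
def mean_aggregate(features, adjacency):
    """One message-passing round: each node's new feature is the average of
    its own feature and its neighbours' features."""
    updated = {}
    for node, value in features.items():
        neighbourhood = [value] + [features[n] for n in adjacency.get(node, ())]
        updated[node] = sum(neighbourhood) / len(neighbourhood)
    return updated


# Invented scalar "risk" feature per node; 1.0 marks a card with known fraud.
features = {
    "card_A": 1.0, "card_B": 0.0, "dev_1": 0.0,
    "card_C": 0.0, "dev_3": 0.0,
}
adjacency = {
    "card_A": ["dev_1"],
    "card_B": ["dev_1"],
    "dev_1": ["card_A", "card_B"],
    "card_C": ["dev_3"],
    "dev_3": ["card_C"],
}

after_one = mean_aggregate(features, adjacency)
after_two = mean_aggregate(after_one, adjacency)
print(after_two["card_B"], after_two["card_C"])
```

<p>After two rounds, risk from <code class="language-plaintext highlighter-rouge">card_A</code> has reached <code class="language-plaintext highlighter-rouge">card_B</code> through the shared device, while the unconnected <code class="language-plaintext highlighter-rouge">card_C</code> stays at zero. Stacking layers is what lets information travel across several hops of a fraud ring.</p>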

<h3 id="the-gnn-impact">The GNN Impact</h3>

<p>The improvement was dramatic.</p>

<p>Accuracy dropped slightly, to 85.00%, because chasing recall aggressively raises the false-positive rate, but <strong>recall rose to 84.00%</strong>.</p>
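<p>The trade-off between recall and accuracy is easy to reproduce on toy data. The sketch below scores ten invented transactions and compares a strict and a lenient decision threshold; the scores, labels, and thresholds are made up for illustration and say nothing about the actual models.</p>

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN when flagging every score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall_and_accuracy(scores, labels, threshold):
    """Return (recall, accuracy) at the given decision threshold."""
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(labels)
    return recall, accuracy


# Invented model scores: 2 fraud cases among 10 transactions.
scores = [0.95, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.08, 0.05, 0.02]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

strict = recall_and_accuracy(scores, labels, threshold=0.5)
lenient = recall_and_accuracy(scores, labels, threshold=0.25)
print("strict: ", strict)
print("lenient:", lenient)
```

<p>Lowering the threshold catches the second fraud case, doubling recall, but two legitimate transactions are now flagged as well, so accuracy falls from 90% to 80%.</p>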

<p>Because high-dollar fraud rings were no longer slipping through unnoticed, the <strong>projected savings rose to $68,169,432.00</strong>.</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Recall</th>
      <th>F1-Score</th>
      <th>ROC-AUC</th>
      <th>Projected Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>XGBoost</strong></td>
      <td>36.63%</td>
      <td>50.52%</td>
      <td>89.00%</td>
      <td>$174,060.43</td>
    </tr>
    <tr>
      <td><strong>Graph Neural Network</strong></td>
      <td><strong>84.00%</strong></td>
      <td><strong>85.00%</strong></td>
      <td><strong>92.76%</strong></td>
      <td><strong>$68,169,432.00</strong></td>
    </tr>
  </tbody>
</table>

<p><em>Note: The GNN’s ability to link transactions through shared identities led to a substantial improvement in recall, which makes it much better suited than tabular models to catching high-dollar fraud.</em></p>

<h2 id="conclusion">Conclusion</h2>

<p>This project underscored a vital lesson in data science: <strong>context matters</strong>. Tabular models are fast and reliable. However, when the problem fundamentally revolves around relationships and networks, as organized credit card fraud does, reframing it as a graph can unlock performance that isolated data points cannot reach.</p>

<p>If you are interested in exploring the codebase, including the data preprocessing scripts, the PyTorch Multi-Layer Perceptrons, and the GNN implementation, you can check out the source code in my repository: <a href="https://github.com/Vijay-K-2003/CSE_575_Fraud_Detection">Vijay-K-2003/CSE_575_Fraud_Detection</a>.</p>

<hr />

<p><em>To run the code yourself, simply install the dependencies via <code class="language-plaintext highlighter-rouge">pip install -r requirements.txt</code> and run <code class="language-plaintext highlighter-rouge">python train_models.py</code> for the tabular models, or explore the <code class="language-plaintext highlighter-rouge">gnn/</code> directory for the graph-based approach.</em></p>]]></content><author><name></name></author><category term="machine-learning" /><category term="data-science" /><summary type="html"><![CDATA[A deep dive into using standard ML models and Graph Neural Networks (GNNs) to uncover complex fraud rings in the IEEE-CIS dataset.]]></summary></entry><entry><title type="html">Welcome to Jekyll!</title><link href="https://blog.karanjkar.com/jekyll/update/2026/06/04/welcome-to-jekyll.html" rel="alternate" type="text/html" title="Welcome to Jekyll!" /><published>2026-06-04T15:28:42+00:00</published><updated>2026-06-04T15:28:42+00:00</updated><id>https://blog.karanjkar.com/jekyll/update/2026/06/04/welcome-to-jekyll</id><content type="html" xml:base="https://blog.karanjkar.com/jekyll/update/2026/06/04/welcome-to-jekyll.html"><![CDATA[<p>You’ll find this post in your <code class="language-plaintext highlighter-rouge">_posts</code> directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run <code class="language-plaintext highlighter-rouge">jekyll serve</code>, which launches a web server and auto-regenerates your site when a file is updated.</p>

<p>Jekyll requires blog post files to be named according to the following format:</p>

<p><code class="language-plaintext highlighter-rouge">YEAR-MONTH-DAY-title.MARKUP</code></p>

<p>Where <code class="language-plaintext highlighter-rouge">YEAR</code> is a four-digit number, <code class="language-plaintext highlighter-rouge">MONTH</code> and <code class="language-plaintext highlighter-rouge">DAY</code> are both two-digit numbers, and <code class="language-plaintext highlighter-rouge">MARKUP</code> is the file extension representing the format used in the file. After that, include the necessary front matter. Take a look at the source for this post to get an idea about how it works.</p>

<p>Jekyll also offers powerful support for code snippets:</p>

<figure class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">def</span> <span class="nf">print_hi</span><span class="p">(</span><span class="nb">name</span><span class="p">)</span>
  <span class="nb">puts</span> <span class="s2">"Hi, </span><span class="si">#{</span><span class="nb">name</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="n">print_hi</span><span class="p">(</span><span class="s1">'Tom'</span><span class="p">)</span>
<span class="c1">#=&gt; prints 'Hi, Tom' to STDOUT.</span></code></pre></figure>

<p>Check out the <a href="https://jekyllrb.com/docs/home">Jekyll docs</a> for more info on how to get the most out of Jekyll. File all bugs/feature requests at <a href="https://github.com/jekyll/jekyll">Jekyll’s GitHub repo</a>. If you have questions, you can ask them on <a href="https://talk.jekyllrb.com/">Jekyll Talk</a>.</p>]]></content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[You’ll find this post in your _posts directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run jekyll serve, which launches a web server and auto-regenerates your site when a file is updated.]]></summary></entry></feed>