Reviving My M-PESA Statement Analysis Project — A Year Later

How a dormant repo turned into a working M-PESA statement analysis tool built with Flask and Python — from PDF parsing struggles to building a clean analytics UI.

Posted Oct 15, 2025

By Joseph Ndungi

12 min read

Reviving My M-PESA Statement Analysis Project — A Year Later

Introduction

(Just a quick note: The figures and cashflow summaries you see in the sample graphs and screenshots do not reflect my actual income, expenditure, or net worth. I wish they didn’t!)

The Origin Story

If you live in Kenya, M-PESA isn’t just a mobile wallet; it’s the financial central nervous system of your life. Some earnings, every grocery run, every bill payment flows through it.

But here’s the challenge: Ever downloaded your official M-PESA statement and squinted at the wall of text, thinking, “There has to be an easier way to see exactly where my money vanished this month?”

That thought was the spark for this project—a mission to create a personal finance tool that could automatically analyze expenditures directly from M-PESA PDF statements. The core goal was straightforward: extract, categorize, and visualize spending patterns to bring clarity to chaos. Simple in theory, but as is often the case in programming, the devil was waiting patiently in the details.

Background

I created the GitHub repository for this project about a year ago, fueled by initial enthusiasm. Then, the usual culprits—work deadlines, a surge of procrastination, and the general complexity of life—kicked in. The code sat dormant, collecting digital dust.

The Catalyst

Fast-forward to a few months ago. I stumbled upon David Amunga’s mpesa2csv, an impressive project that converts M-PESA statement PDFs into a structured CSV/Excel format complete with basic analytics.

His work was the push I needed. My original intention was just to extract the raw CSV data and analyze it separately, but when I ran my own statements through a few tools, I kept running into frustrating parsing errors. These were mostly small, pesky issues like stray footer text lines or subtle formatting differences that broke the data export.

I found myself at a crossroads:

The Easy Route: Manually clean up the exported data in a spreadsheet every time, or…
The Challenge Route: Dive deep and build a robust parser to handle the entire PDF extraction process in Python.

I chose the long, messy, and ultimately more rewarding path.

The Toolkit and Approach

I committed to building the solution using Python, a language perfect for data manipulation and quick web application development. The core libraries I relied on were:

PyPDF2: The workhorse for reading and unlocking the password-protected M-PESA PDFs (usually emailed ones).
re (Regular Expressions): My secret weapon for finding and extracting the tiny, specific patterns that make up a transaction, like dates, receipt numbers, and amounts.
pandas: The undisputed champion for data analysis, grouping transactions, and exporting the clean data.
Flask: A lightweight web framework I used to create a simple front-end interface, making the tool accessible outside the terminal.

The Messy Reality of Parsing

At a glance, an M-PESA statement looks like a neat table. Once you try to automate reading it, you realize it’s a structural nightmare! To tame this beast, I had to construct a dedicated parsing pipeline with several static methods to handle the following sequential steps:

Safe Loading: Get the raw PDF content into memory.
Unlocking: Handle the usual password encryption of emailed statements.
Text Consolidation: Extract and merge fragmented text blocks that the PDF reader sees as separate chunks.
Transaction Line Parsing: Use regular expressions to cleanly isolate each transaction’s details.
Edge Case Handling: This was the toughest part—managing transactions with multi-line descriptions and dealing with repeated or ambiguous transaction IDs.

Parsing felt chaotic at times, but every bug fixed forced me to structure the code better, ultimately leading to a more reliable system.

The Core Struggle

The most crucial challenge was accurately identifying and extracting the amount of a transaction and determining if it was a deposit (money Paid In) or a withdrawal (money Withdrawn).

I discovered a subtle but crucial difference between the two main sources of statements:

Statement Source	Withdrawn Column	Encryption	Key Difference for Logic
Email Statement	Uses negative values (e.g., -1,500.00) to denote withdrawals.	Usually Encrypted.	The negative sign makes it easy for the regex to correctly identify a withdrawal.
M-PESA App Statement	Uses absolute values (e.g., 1,500.00) for all transactions.	Usually Not Encrypted.	Lacks a negative sign, making it extremely difficult to distinguish between a Paid In and a Withdrawn amount.

My initial logic, which works perfectly for the email version, relies on identifying that negative sign:

  
# The logic to identify PaidIn vs. Withdrawn:
# 1. Find all monetary values in a transaction line.
# 2. Separate them into 'positives' (Paid In) and 'negatives' (Withdrawn).

paidIn, withdrawn, balance = None, None, None

if amounts:
    balance = amounts[-1] # The last amount is usually the running balance
    negatives = [a for a in amounts[:-1] if a < 0] # Withdrawals
    positives = [a for a in amounts[:-1] if a > 0] # Deposits
    withdrawn = negatives[-1] if negatives else None
    paidIn = positives[-1] if positives else None

This elegant system utterly fails with the app-generated PDF because the negative values are stripped. Fixing this app compatibility issue remains one of my top priorities.

From Terminal Scripts to an Interactive UI

My analysis started as a bare-bones Python script, executed from the terminal with command-line arguments. But soon, I realized running a script wasn’t convenient for regular use.

“Why not wrap this powerful engine in a simple user interface?” I thought.

That decision led to the creation of the Flask application. It’s lightweight, fast, and does exactly what I need: allows for easy PDF upload, processes the data on the server, and presents the spending trends clearly. All the raw data is currently stored in a simple uploads/ folder structure.

Key Analytics & Features Implemented

The application currently provides an immediate, visual picture of my finances:

Cashflow Summary: A clear breakdown of Total Inflow, Total Outflow, and the resulting Net Flow over the statement period.
Spending Trends: Granular views showing transaction activity and amounts for Daily, Weekly, and Monthly periods.
Interactive Tables: A clean preview of all parsed transactions, complete with filtering and pagination for easy browsing.
Responsive Layout: A user interface designed to be usable on both desktop and mobile devices.

Crucial Lessons Learned on the Journey

Every programmer knows the path to a finished project is paved with painful but necessary lessons. Here are the most valuable takeaways:

PDF Parsing is a Trap: Never underestimate the difficulty of parsing arbitrary PDFs. The same M-PESA statement can look different depending on the day it was downloaded, and those small formatting differences will break your extraction logic.
Validation is Your Sanity Check: Relying on Regular Expressions and rigorous data validation saved countless hours of manually debugging broken rows. A few well-thought-out patterns are worth their weight in gold.
Refactor Early and Often: Structuring the code with clear static methods for functions like unlocking, reading, and parsing simplified my life immensely and made future maintenance manageable.
Sometimes, Momentum Wins: While I always advocate for planning, there are times when rapid development driven by pure momentum is the best way to overcome a difficult challenge and keep the project moving.

Future Work and Recommendations for a Robust Tool

Moving forward, the goal is to transform this tool from a functional script with a UI into a truly robust and production-ready application.

1. The Duplicate Data Nightmare

When a user merges two statement periods (e.g., January-March and March-June), avoiding duplicate transactions is absolutely critical. Logically, the Receipt Number (Transaction ID) should be the unique key.

I was wrong.

Transactions involving Fuliza debt payment and its immediate credit/debit transaction often share the exact same receipt number, yet they represent two distinct money movements. A simple check using just the Receipt ID will incorrectly filter out a legitimate transaction.

Future Fix: I’m exploring creating a more sophisticated Composite Key using a combination of the Receipt Number, the Amount, and the Transaction Type to ensure genuine uniqueness across all imported statements.

2. Upgrading Data Persistence

The current file-based storage in the uploads/ folder is simple but unsustainable.

Future Fix: Migrate to a lightweight, embedded database like SQLite or a more powerful relational system like PostgreSQL to handle long-term data storage, faster querying, and better management of user sessions.

These refactored sections focus on improving data utility through better categorization and enhancing user engagement through automation.

3. Refining the Financial Map: The Categorization Challenge

Once the raw data is successfully parsed, the next critical step is making it meaningful through categorization. Currently, my logic relies on basic, static keyword matching against the transaction details. While this works for common movements like Merchant Payment or Withdrawal, it is not intelligent enough to capture nuance. This limitation often results in many valid transactions falling into an ambiguous “Other” category, which significantly limits true expenditure tracking.

Here is a glimpse of the current categorization structure, which needs a major upgrade:

  
def categorize_transaction(details: str) -> str:
    details = str(details).lower()

    if "merchant payment" in details or "buy goods" in details:
        return "Merchant Payment"
    elif "customer send money" in details or "transfer to" in details:
        return "Person-to-Person"
    elif "withdraw" in details:
        return "Withdrawal"
    elif ...
    else:
        return "Other"
    # ... some transactions are currently falling into the generic 'Other' category
    # Looking for a better, scalable way to categorize this!

Future Fix: The plan is to explore smarter categorization methods. This might involve using advanced natural language processing (NLP) or machine learning to intelligently infer the transaction’s purpose when keywords are insufficient. The goal is to turn the vague “Other” overview into a true, detailed financial expenditure map.

4. Enhancing User Experience & Automation

Beyond the core data features, the user experience needs to become more proactive. While the current web UI is excellent for on-demand checks, personal finance is about consistency and timely reminders.

Future Fixes:

Automated Reporting: Work on integrating simple communication systems to send automated, summarized financial reports directly to users via channels like email, SMS, or WhatsApp. This feature will allow users to receive a weekly snapshot of their Net Flow or major spending categories without needing to manually log into the application.
Data Visualizations: Implement richer data visualizations (charts, customizable dashboards) beyond the current table-based summaries.

Visuals

1. The Entry Point

A screenshot of the simple Flask upload page. This image should show a minimal interface with a file input button labeled “Choose File” and a “Upload & Parse” button. This clearly demonstrates the simplicity of starting the analysis.

2. The Data Preview

A screenshot of the parsed transaction table. This image should showcase a clean, interactive table with columns such as: Date, Details/Description, Paid In (Ksh), Withdrawn (Ksh), and Balance (Ksh). This proves that the core parsing logic is working and turning unstructured data into clean rows.

3. Cashflow Summary

A screenshot of the summary dashboard. This visual should present a small, bold table or card view summarizing the financial movements over the period, highlighting the Total Inflow, Total Outflow, and the final Net Flow. This gives the reader the “big picture” at a glance.

4. The Deep Dive

Now, let’s talk about what actually happens once you upload that M-PESA statement (and yes — I’d love to show the actual numbers, but they’re a little too “personal finance 101” for the internet ).

The dashboard is divided into a few main sections — each serving a specific purpose in making sense of your money flow:

💡 Insights

This is where the story of your transactions begins. You’ll find:

Category Breakdown & Counterparties — who you send money to (and who sends it back 👀).
Category Summary — totals for Paid In, Withdrawn, and the resulting Net In–Out per category.
Top Counterparties Count and Top Counterparties Value — showing your most frequent and highest-value contacts.
Visualizations that make it all clearer: Inflow vs Outflow by Category and Top 10 Counterparties by Transaction Value.

📈 Analysis

For the data nerds (me included), this section dives into:

Periodic Cashflow Analysis — tracking how money moves across days, weeks, or months.
Charts for Cashflow, Category Trends, and Merchant Trends, giving you that “aha” moment when you realize where most of your spending goes.

🔄 Trends

Finally, the Trends section ties everything together with:

Balance Summary — how your M-PESA balance evolved over time.
Spending Efficiency Summary — comparing inflows vs outflows.
Visuals for the selected period: Spending Trends and M-PESA Balance Evolution Over Time.

All in all, it’s a neat, private, and slightly humbling look at where your hard-earned cash actually goes — but hey, at least now the analysis part looks good 😅.

Closing Thoughts

What began as a small weekend curiosity and then became a dormant repository successfully transformed into a year-long revival project.

It was a powerful reminder that in the world of code, unfinished work is not a failure; it’s simply paused progress. Sometimes, all a good idea needs is a little patience and a spark of inspiration from the community to become a reality.

The repo is now open source—I welcome you to explore, fork, and contribute!

👉GitHub Repository

Happy Coding!

Data Analysis, Python, Personal Projects

This post is licensed under CC BY 4.0 by the author.