
Lleven V1: The Genesis

FastAPI · Pandas · Redis

The first version of Lleven was born out of a simple need: to make sense of Mobile Money (MoMo) statements without the manual hassle. It was a lean, focused tool designed to transform a raw PDF statement into a "Wrapped" experience, much like Spotify Wrapped, but for your spending.

Architecture Overview

The V1 architecture was a classic synchronous processing model: the API parsed the PDF inside the request and responded only once extraction had finished. It prioritized simplicity and immediate feedback for small-to-medium statements.

Tech Stack

  • Framework: FastAPI
  • Processing: Pandas & pdfplumber
  • Caching: Redis
  • Security: Fernet symmetric encryption (AES-128 in CBC mode with an HMAC-SHA256 signature) for cached data

The Processing Engine

Lleven V1 used a specialized parsing engine built on top of pdfplumber.

1. Validation Logic

Before processing, the system checked for specific "magic strings" on the first page of the PDF to ensure it was a valid MTN MoMo statement. These included:

  • MSISDN:
  • Time Run:
  • TRANSACTION DATE
  • ACCOUNT HOLDER NAME:
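
A minimal sketch of that check, assuming the first page's text has already been extracted to a string and that every marker must be present (the source lists the markers but does not say whether all are required). The function names are hypothetical.

```python
# Marker strings taken from the list above.
REQUIRED_MARKERS = (
    "MSISDN:",
    "Time Run:",
    "TRANSACTION DATE",
    "ACCOUNT HOLDER NAME:",
)


def missing_markers(first_page_text: str) -> list[str]:
    """Return the markers that do not appear in the page text."""
    return [m for m in REQUIRED_MARKERS if m not in first_page_text]


def is_momo_statement(first_page_text: str) -> bool:
    """Assumption: a statement is valid only if no marker is missing."""
    return not missing_markers(first_page_text)
```

Returning the missing markers separately makes it easy to tell the user why a file was rejected.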

2. Data Extraction

The engine extracted tables with pdfplumber's vertical_strategy and horizontal_strategy table settings both set to "lines", so cell boundaries came from the ruling lines drawn in the PDF. It mapped raw PDF columns to a structured internal format:

  • Raw Mapping: TRANSACTION DATE, FROM ACCT, FROM NO., TRANS. TYPE, AMOUNT, TO NO., TO NAME, REF, OVA.
  • Direction Detection: A regex pulled the ACCOUNT_HOLDER_NO from the header text; matching that number against each row distinguished incoming from outgoing funds.
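
The extraction step might be sketched as follows. The table settings and raw column names come from the description above; everything else is an assumption made for illustration: the helper names, the header pattern (that the account number follows "MSISDN:"), the treatment of the final column name as "OVA", and the rule for skipping repeated header rows.

```python
import re

# Raw column names as listed above (trailing period on "OVA." read as punctuation).
RAW_COLUMNS = ["TRANSACTION DATE", "FROM ACCT", "FROM NO.", "TRANS. TYPE",
               "AMOUNT", "TO NO.", "TO NAME", "REF", "OVA"]

# Both strategies set to "lines", as described above.
TABLE_SETTINGS = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}

# Assumed header layout: the account holder's number follows "MSISDN:".
ACCOUNT_NO_PATTERN = re.compile(r"MSISDN:\s*(\d+)")


def extract_rows(pdf_path):
    """Collect table rows from every page as dicts keyed by raw column name."""
    import pdfplumber  # third-party; imported lazily so the helpers below run without it

    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(table_settings=TABLE_SETTINGS)
            for row in table or []:
                # Skip repeated header rows and rows of unexpected width.
                if len(row) != len(RAW_COLUMNS) or row[0] == RAW_COLUMNS[0]:
                    continue
                rows.append(dict(zip(RAW_COLUMNS, row)))
    return rows


def find_account_holder_no(header_text):
    """Return the account holder's number from the header text, or None."""
    match = ACCOUNT_NO_PATTERN.search(header_text)
    return match.group(1) if match else None


def direction(row, account_holder_no):
    """Classify a row relative to the account holder."""
    if row.get("TO NO.") == account_holder_no:
        return "incoming"
    if row.get("FROM NO.") == account_holder_no:
        return "outgoing"
    return "unknown"
```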

3. Data Cleaning Pipeline

  • Date Normalization: Converted string dates (e.g., 21-May-2023 10:30:00 AM) into proper Python datetime objects.
  • Type Casting: Converted currency strings into floats for arithmetic operations.
  • Text Normalization: Removed newline characters and extra whitespace from names and references using custom regex cleaning.
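
The three cleaning steps can be sketched with the standard library alone. The date format string matches the example above; the helper names and the amount format (comma thousands separators, an optional currency label) are assumptions.

```python
import re
from datetime import datetime

# Matches strings such as "21-May-2023 10:30:00 AM".
DATE_FORMAT = "%d-%b-%Y %I:%M:%S %p"


def clean_text(raw: str) -> str:
    """Collapse newlines and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_date(raw: str) -> datetime:
    """Convert a statement date string into a datetime object."""
    return datetime.strptime(clean_text(raw), DATE_FORMAT)


def parse_amount(raw: str) -> float:
    """Strip everything except digits, the decimal point and a minus sign."""
    return float(re.sub(r"[^\d.\-]", "", raw))
```

Running clean_text before strptime guards against dates that wrapped across two lines inside a PDF cell.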

The Upload Workflow

  1. Request: User uploads a PDF to /process-file.
  2. Deduplication: A SHA-256 hash of the file is generated. If the hash exists in Redis, the system returns the existing file_hash immediately.
  3. Parsing: If new, the system extracts the table, cleans the data, and filters it to the requested year (e.g., 2023).
  4. Encrypted Caching:
    • The resulting DataFrame is serialized using pickle.
    • It is then encrypted using Fernet (symmetric encryption).
    • The encrypted blob is stored in Redis with a 1-hour TTL.
  5. Response: Returns the file_hash and an expiry timestamp.
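
The workflow above can be sketched end to end. To keep the example runnable without Redis or the cryptography package, the cache and cipher are passed in as parameters, and a list of dicts stands in for the DataFrame. InMemoryCache, process_upload, the "cached" and "expires_at" keys, and the "DATE" field are all hypothetical names.

```python
import hashlib
import pickle
import time
from datetime import datetime, timedelta, timezone

TTL_SECONDS = 3600  # the 1-hour TTL described above


class InMemoryCache:
    """Hypothetical stand-in exposing the get/setex calls of a redis-py client."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, deadline = item
        if time.monotonic() >= deadline:  # entry has expired
            del self._items[key]
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._items[key] = (value, time.monotonic() + ttl_seconds)


def process_upload(pdf_bytes, year, cache, encrypt, parse):
    """Hash, deduplicate, parse, encrypt and cache one uploaded statement."""
    file_hash = hashlib.sha256(pdf_bytes).hexdigest()
    if cache.get(file_hash) is not None:
        # Seen before: skip parsing and return the existing hash.
        return {"file_hash": file_hash, "cached": True}

    # Keep only the requested year's transactions.
    records = [r for r in parse(pdf_bytes) if r["DATE"].year == year]
    blob = encrypt(pickle.dumps(records))
    cache.setex(file_hash, TTL_SECONDS, blob)

    expires_at = datetime.now(timezone.utc) + timedelta(seconds=TTL_SECONDS)
    return {"file_hash": file_hash, "cached": False,
            "expires_at": expires_at.isoformat()}
```

In V1's terms, encrypt would be Fernet(key).encrypt from the cryptography package and cache a redis.Redis client, both of which fit these call shapes. Note that pickle should only ever deserialize data the service wrote itself, which is one reason the blob is encrypted and signed before caching.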

The Retrieval Workflow (/get-wrapped)

When the user requests their "Wrapped" results:

  1. The system pulls the encrypted blob from Redis.
  2. It decrypts and deserializes the DataFrame.
  3. On-the-Fly Analytics: It runs a series of summary algorithms:
    • Spending Summary: Aggregates totals for PAYMENT, CASH_OUT, TRANSFER, and DEBIT.
    • Frequency Analysis: Calculates the top 5 recipients by amount and frequency.
    • Monthly Trends: Groups transactions by month to visualize spending patterns.
    • Credit Summary: Identifies salary or incoming transfers by filtering for the user's ACCOUNT_HOLDER_NO in the TO_NO column.
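
A rough sketch of these four summaries, written in plain Python over a list of dicts rather than Pandas so that it runs without dependencies. Only TO_NO is named in the description above; the other field names (DATE, TRANS_TYPE, AMOUNT, TO_NAME) and the rule that "outgoing" means any row not addressed to the account holder are assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime

SPENDING_TYPES = ("PAYMENT", "CASH_OUT", "TRANSFER", "DEBIT")


def _outgoing(rows, account_no):
    # Assumption: a row is outgoing when it is not addressed to the account holder.
    return [r for r in rows if r["TO_NO"] != account_no]


def spending_summary(rows, account_no):
    """Total outgoing amount per spending transaction type."""
    totals = {t: 0.0 for t in SPENDING_TYPES}
    for r in _outgoing(rows, account_no):
        if r["TRANS_TYPE"] in totals:
            totals[r["TRANS_TYPE"]] += r["AMOUNT"]
    return totals


def top_recipients(rows, account_no, n=5):
    """Top n recipients ranked by total amount and by number of transactions."""
    by_amount, by_count = Counter(), Counter()
    for r in _outgoing(rows, account_no):
        by_amount[r["TO_NAME"]] += r["AMOUNT"]
        by_count[r["TO_NAME"]] += 1
    return {"by_amount": by_amount.most_common(n),
            "by_frequency": by_count.most_common(n)}


def monthly_trends(rows, account_no):
    """Outgoing totals keyed by "YYYY-MM"."""
    months = defaultdict(float)
    for r in _outgoing(rows, account_no):
        months[r["DATE"].strftime("%Y-%m")] += r["AMOUNT"]
    return dict(months)


def credit_summary(rows, account_no):
    """Count and total of rows whose TO_NO is the account holder."""
    credits = [r for r in rows if r["TO_NO"] == account_no]
    return {"count": len(credits), "total": sum(r["AMOUNT"] for r in credits)}
```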

Limitations of V1

  • The "Timeout" Wall: Large PDFs (50+ pages) often caused HTTP timeouts because the API waited for the entire extraction to finish before responding.
  • Memory Pressure: Since processing happened on the API workers, many concurrent uploads could lead to OOM (Out of Memory) errors.
  • Ephemeral Storage: Data lived only in Redis. Once the 1-hour cache entry expired, the user had to re-upload the file.
  • Lack of Identity: No user accounts meant users couldn't see a history of their past uploads without keeping the file hashes themselves.

Lleven V1 proved the concept, but the stage was set for a more robust, scalable, and secure V2.