Lleven V1: The Genesis
The first version of Lleven was born out of a simple need: to make sense of Mobile Money (MoMo) statements without the manual hassle. It was a lean, focused tool designed to transform a raw PDF into a "Wrapped" experience, much like Spotify Wrapped, but for your spending.
Architecture Overview
The V1 architecture was a classic Synchronous Processing model. It prioritized simplicity and immediate feedback for small-to-medium statements.
Tech Stack
- Framework: FastAPI
- Processing: Pandas & PDFPlumber
- Caching: Redis
- Security: Fernet (AES-128) Encryption for cached data
The Processing Engine
Lleven V1 used a specialized parsing engine built on top of pdfplumber.
1. Validation Logic
Before processing, the system checked for specific "magic strings" on the first page of the PDF to ensure it was a valid MTN MoMo statement. These included:
- MSISDN:
- Time Run:
- TRANSACTION DATE
- ACCOUNT HOLDER NAME:
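The check above can be sketched as a plain substring test. This is a minimal sketch, not the V1 code: the function name `looks_like_momo_statement` is hypothetical, and it assumes that all four markers must be present (the description does not say whether any or all were required).

```python
# Marker strings listed above.
MAGIC_STRINGS = (
    "MSISDN:",
    "Time Run:",
    "TRANSACTION DATE",
    "ACCOUNT HOLDER NAME:",
)


def looks_like_momo_statement(first_page_text: str) -> bool:
    """Return True if every marker appears in the first page's text.

    Assumption: all markers are required; V1 may have used a looser rule.
    """
    return all(marker in first_page_text for marker in MAGIC_STRINGS)


# Made-up first-page text for illustration only.
sample = (
    "ACCOUNT HOLDER NAME: Jane Doe\n"
    "MSISDN: 233240000000\n"
    "Time Run: 01-Jan-2024\n"
    "TRANSACTION DATE FROM ACCT"
)
is_valid = looks_like_momo_statement(sample)
is_invalid = looks_like_momo_statement("Invoice #42")
```

With pdfplumber, the input text would typically come from `pdf.pages[0].extract_text()`.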
2. Data Extraction
The engine targeted tables with a vertical_strategy and horizontal_strategy set to "lines". It mapped raw PDF columns to a structured internal format:
- Raw Mapping: TRANSACTION DATE, FROM ACCT, FROM NO., TRANS. TYPE, AMOUNT, TO NO., TO NAME, REF, OVA.
- Cleaning: It used regex to identify the ACCOUNT_HOLDER_NO from the header text to distinguish between incoming and outgoing funds.
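The extraction step might look roughly like the sketch below. Several things here are assumptions rather than facts from V1: the helper names, the rule that derives internal column names (only TO_NO and ACCOUNT_HOLDER_NO are named in this write-up), the idea that the account holder's number follows the "MSISDN:" label, and the idea that every page's table repeats the header row. The `table_settings` keys are the pdfplumber settings named above.

```python
import re
from typing import Dict, List, Optional

# Table settings described in the text.
TABLE_SETTINGS = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}


def to_internal_name(raw_header: str) -> str:
    """Map a raw PDF header such as 'TRANS. TYPE' to 'TRANS_TYPE'.

    Assumption: internal names are the raw headers with punctuation and
    whitespace collapsed to underscores.
    """
    return re.sub(r"[^A-Za-z0-9]+", "_", raw_header).strip("_").upper()


def find_account_holder_no(header_text: str) -> Optional[str]:
    """Pull the account holder's number out of the header text.

    Assumption: the number directly follows the 'MSISDN:' label.
    """
    match = re.search(r"MSISDN:\s*(\d+)", header_text)
    return match.group(1) if match else None


def direction(row: Dict[str, str], account_holder_no: str) -> str:
    """Classify a row as incoming if the holder is the recipient."""
    return "incoming" if row.get("TO_NO") == account_holder_no else "outgoing"


def extract_rows(pdf_path: str) -> List[Dict[str, str]]:
    """Read every page's table into a list of dicts keyed by internal names."""
    import pdfplumber  # third-party; imported lazily so the helpers above run without it

    rows: List[Dict[str, str]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            table = page.extract_table(table_settings=TABLE_SETTINGS)
            if not table:
                continue
            # Assumption: the first row of each table is the header row.
            header = [to_internal_name(cell or "") for cell in table[0]]
            rows.extend(dict(zip(header, row)) for row in table[1:])
    return rows
```

For example, `to_internal_name("FROM NO.")` yields `FROM_NO`, and a row whose TO_NO equals the extracted account holder number is treated as incoming.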
3. Data Cleaning Pipeline
- Date Normalization: Converted string dates (e.g., 21-May-2023 10:30:00 AM) into proper Python datetime objects.
- Type Casting: Converted currency strings into floats for arithmetic operations.
- Normalization: Removed newline characters and extra whitespace from names and references using custom regex cleaning.
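The three cleaning steps can be illustrated with standard-library helpers. This is a sketch under assumptions: the function names are hypothetical, the date format string is inferred from the single example above, and the amount parser assumes values may carry thousands separators or a currency prefix.

```python
import re
from datetime import datetime

# Inferred from the example 21-May-2023 10:30:00 AM.
# Note: %b and %p are locale-dependent; this assumes an English/C locale.
DATE_FORMAT = "%d-%b-%Y %I:%M:%S %p"


def clean_text(raw: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_date(raw: str) -> datetime:
    """Convert a statement date string into a datetime object."""
    return datetime.strptime(clean_text(raw), DATE_FORMAT)


def parse_amount(raw: str) -> float:
    """Convert a currency string into a float.

    Assumption: anything other than digits, '.', and '-' is decoration
    (currency code, thousands separators) and can be dropped.
    """
    return float(re.sub(r"[^0-9.\-]", "", raw))


parsed_date = parse_date("21-May-2023\n10:30:00 AM")
parsed_amount = parse_amount("GHS 1,250.50")
parsed_name = clean_text("KOFI\n  MENSAH ")
```

Running `clean_text` before `strptime` matters because table cells extracted from PDFs often contain line breaks inside a single value.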
The Upload Workflow
- Request: User uploads a PDF to /process-file.
- Deduplication: A SHA-256 hash of the file is generated. If the hash exists in Redis, the system returns the existing file_hash immediately.
- Parsing: If new, the system extracts the table, cleans the data, and filters it to the requested year (e.g., 2023).
- Encrypted Caching:
  - The resulting DataFrame is serialized using pickle.
  - It is then encrypted using Fernet (symmetric encryption).
  - The encrypted blob is stored in Redis with a 1-hour TTL.
- Response: Returns the file_hash and an expiry timestamp.
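The upload flow above can be sketched without external services. In this sketch, `InMemoryCache` is a hypothetical stand-in that mimics the two redis-py methods the flow needs (`exists` and `setex`), and `parse` and `encrypt` are injected callables so the example runs with the standard library alone. The function name, the response keys beyond file_hash, and the placeholder "encryption" are illustrative assumptions, not the V1 code.

```python
import hashlib
import pickle
import time
from typing import Any, Callable, Dict, Tuple

CACHE_TTL_SECONDS = 3600  # the 1-hour TTL described above


class InMemoryCache:
    """Hypothetical stand-in for Redis with exists() and setex()."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[bytes, float]] = {}

    def exists(self, key: str) -> int:
        entry = self._store.get(key)
        return int(entry is not None and entry[1] > time.time())

    def setex(self, key: str, ttl_seconds: int, value: bytes) -> None:
        self._store[key] = (value, time.time() + ttl_seconds)


def process_upload(
    pdf_bytes: bytes,
    cache: Any,
    parse: Callable[[bytes], Any],
    encrypt: Callable[[bytes], bytes],
) -> Dict[str, Any]:
    """Deduplicate by SHA-256, then parse, serialize, encrypt, and cache."""
    file_hash = hashlib.sha256(pdf_bytes).hexdigest()
    if cache.exists(file_hash):
        # Already processed: skip parsing and return the existing hash.
        return {"file_hash": file_hash}
    table = parse(pdf_bytes)  # extraction, cleaning, and year filter live here
    blob = encrypt(pickle.dumps(table))
    cache.setex(file_hash, CACHE_TTL_SECONDS, blob)
    return {"file_hash": file_hash, "expires_at": int(time.time()) + CACHE_TTL_SECONDS}


parse_calls = []


def fake_parse(data: bytes) -> list:
    parse_calls.append(len(data))
    return [{"AMOUNT": 10.0}]


def placeholder_encrypt(data: bytes) -> bytes:
    return data[::-1]  # NOT encryption; stands in for Fernet in this demo


cache = InMemoryCache()
first = process_upload(b"%PDF-demo", cache, fake_parse, placeholder_encrypt)
second = process_upload(b"%PDF-demo", cache, fake_parse, placeholder_encrypt)
```

With the real dependencies, `cache` would be a `redis.Redis` client and `encrypt` would be the `encrypt` method of a `cryptography.fernet.Fernet` instance. The second call returns without invoking the parser, which is the point of hashing the file first.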
The Retrieval Workflow (/get-wrapped)
When the user requests their "Wrapped" results:
- The system pulls the encrypted blob from Redis.
- It decrypts and deserializes the DataFrame.
- On-the-Fly Analytics: It runs a series of summary algorithms:
  - Spending Summary: Aggregates totals for PAYMENT, CASH_OUT, TRANSFER, and DEBIT.
  - Frequency Analysis: Calculates the top 5 recipients by amount and frequency.
  - Monthly Trends: Groups transactions by month to visualize spending patterns.
  - Credit Summary: Identifies salary or incoming transfers by filtering for the user's ACCOUNT_HOLDER_NO in the TO_NO column.
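The four summaries can be sketched in plain Python over a list of row dictionaries, standing in for the DataFrame operations V1 ran after decryption. The function names and the column names other than TO_NO are assumptions, as is the rule that rows addressed to the account holder are excluded from spending.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Any, Dict, List

SPENDING_TYPES = ("PAYMENT", "CASH_OUT", "TRANSFER", "DEBIT")
Row = Dict[str, Any]


def is_spending(row: Row, holder_no: str) -> bool:
    # Assumption: money sent to the holder's own number is a credit, not spending.
    return row["TRANS_TYPE"] in SPENDING_TYPES and row["TO_NO"] != holder_no


def spending_summary(rows: List[Row], holder_no: str) -> Dict[str, float]:
    totals = {t: 0.0 for t in SPENDING_TYPES}
    for row in rows:
        if is_spending(row, holder_no):
            totals[row["TRANS_TYPE"]] += row["AMOUNT"]
    return totals


def top_recipients(rows: List[Row], holder_no: str, n: int = 5) -> Dict[str, list]:
    by_amount: Counter = Counter()
    by_frequency: Counter = Counter()
    for row in rows:
        if is_spending(row, holder_no):
            by_amount[row["TO_NAME"]] += row["AMOUNT"]
            by_frequency[row["TO_NAME"]] += 1
    return {
        "by_amount": by_amount.most_common(n),
        "by_frequency": by_frequency.most_common(n),
    }


def monthly_trends(rows: List[Row], holder_no: str) -> Dict[str, float]:
    months: Dict[str, float] = defaultdict(float)
    for row in rows:
        if is_spending(row, holder_no):
            months[row["TRANSACTION_DATE"].strftime("%Y-%m")] += row["AMOUNT"]
    return dict(sorted(months.items()))


def credit_summary(rows: List[Row], holder_no: str) -> Dict[str, float]:
    credits = [r for r in rows if r["TO_NO"] == holder_no]
    return {"count": len(credits), "total": sum(r["AMOUNT"] for r in credits)}


def make_row(date: datetime, kind: str, amount: float, to_no: str, to_name: str) -> Row:
    return {"TRANSACTION_DATE": date, "TRANS_TYPE": kind, "AMOUNT": amount,
            "TO_NO": to_no, "TO_NAME": to_name}


# Made-up rows for illustration; "999" plays the account holder.
rows = [
    make_row(datetime(2023, 1, 5), "PAYMENT", 50.0, "111", "SHOP A"),
    make_row(datetime(2023, 1, 20), "TRANSFER", 30.0, "222", "KOFI"),
    make_row(datetime(2023, 2, 2), "PAYMENT", 20.0, "111", "SHOP A"),
    make_row(datetime(2023, 2, 10), "TRANSFER", 500.0, "999", "ME"),
]
summary = spending_summary(rows, "999")
top = top_recipients(rows, "999")
trends = monthly_trends(rows, "999")
credits = credit_summary(rows, "999")
```

Because these summaries are recomputed on every request, their cost scales with the number of rows in the cached statement.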
Limitations of V1
- The "Timeout" Wall: Large PDFs (50+ pages) often caused HTTP timeouts because the API waited for the entire extraction to finish before responding.
- Memory Pressure: Since processing happened on the API workers, high concurrent uploads could lead to OOM (Out of Memory) errors.
- Ephemeral Persistence: Data lived only in Redis. Once the 1-hour cache TTL expired, the user had to re-upload the file.
- Lack of Identity: No user accounts meant users couldn't see a history of their past uploads without keeping the file hashes themselves.
Lleven V1 proved the concept, but the stage was set for a more robust, scalable, and secure V2.