Automate Shopify Financials: Payment Reconciliation, Receipt Scanning, and Tax Filing
Automate financial workflows for Shopify stores processing payments through multiple systems — reconcile credit card terminals, Twint, and Shopify Payments; scan receipts with AI vision; pull Google Ads costs; and produce tax-ready financial reports.
Ask any retail business owner how they do their end-of-month accounts and you will likely hear the same story: they export a CSV from Shopify Payments, download another spreadsheet from their physical card terminal, log into Twint or PayPal for a third set of numbers, and — if they are disciplined — pull their Google Ads spend from a separate dashboard entirely. Each of these files has its own date format, its own fee structure, its own column naming, and its own definition of what a completed transaction actually is.
The result is a monthly ritual. Someone opens all the spreadsheets side by side. They copy totals across sheets. They try to match line items that should add up but often do not, because one processor reports gross and another reports net, or because the terminal batch settled two days after the Shopify payout. They build a combined report that took eight hours to produce and is already out of date by the time they email it to the accountant. Receipts — paper ones from the post office, PDF invoices from software vendors, photos of taxi chits on a phone — sit uncategorised in a folder that nobody touches until the week before the tax deadline.
This is not disorganisation. It is a data integration problem.
This guide solves that problem with a single Python pipeline. It pulls transaction data from every payment system your store uses and normalises every record into one consistent schema, regardless of how the source chooses to format its exports. It sends every receipt image — photograph, PDF invoice, or scanned document — through a large language model that reads the vendor name, the total, the tax amount, and the correct expense category, returning structured JSON instead of a blurry photo in a folder. It imports advertising costs directly from the Google Ads API so that your profit-and-loss statement includes the cost of acquiring each sale. And it produces reports — Excel workbooks, Word documents, terminal summaries — structured exactly the way your tax authority expects, not the way your payment processors happened to export them.
Who This Is For
- Store owners running a Shopify store alongside a physical retail location, where payments flow through a terminal, a mobile wallet, bank transfers, and the online checkout — all generating separate reports that need to be reconciled
- Ecommerce operators who have built a manual workflow of exporting CSVs and matching them against bank statements each month, and who know that workflow will not scale
- Finance teams and bookkeepers who receive a folder of receipts at year-end and spend days typing vendor names and amounts into accounting software by hand
- Developers building internal tooling for a retail business or a client, who need a pipeline that connects payment APIs, receipt scanning, and tax rules without reinventing the data model from scratch
- Vibe coders who have written scripts before and want a complete blueprint for replacing the quarterly spreadsheet scramble with something that runs on its own
You do not need to be an accountant to use this. You need Python, access to the APIs and export files described, and the willingness to connect them once.
The Real Problem: Three Payment Systems, Three Formats, Zero Integration
To understand why this pipeline exists, consider a real scenario. A client runs a Shopify store selling textile products online and from a physical shop. Customers arrive through three channels:
- Online — they check out through Shopify Payments, paying by credit card or digital wallet. Each payout lands in a Shopify CSV export with columns labelled Payout Date, Charges, Refunds, Fees, Total, Currency, and a bank reference number. Charges represent gross customer payments for that settlement period. Fees are what Shopify deducts as a processing charge. The total is what actually reaches the bank account.
- In the shop — they tap a credit card on a SumUp terminal. SumUp provides a PDF revenue report organised by card type (Visa, Mastercard, American Express) and by product category, with gross amounts printed using non-standard thousand separators that break simple number parsing.
- Mobile — a growing share of local customers prefer Twint, the Swiss mobile payment system. Twint exports a semicolon-delimited CSV with four distinct transaction types: Payment (a customer sale), Credit Merchant (a fee refund), Cancellation (a reversed charge), and Closing Merchant (a settlement batch). If you treat all four types as revenue, you double-count transactions.
Before automation, the client’s team spent two days each month in this routine:
- Open the Shopify payouts CSV, find the rows for the month, sum the Charges column.
- Open the SumUp PDF, manually type the gross amounts from the table into a spreadsheet.
- Open the Twint CSV, filter to only Payment rows marked Settled, sum those.
- Compare the combined total against the bank statement line by line.
- Photograph paper receipts, save them in a folder, and forget about them until the accountant asks.
Each step introduced room for error. A column transposed. A thousand separator interpreted as a decimal. A Cancellation counted as revenue. A receipt that never got photographed.
The pipeline below replaced every one of those manual steps. A single command runs daily, reconciles all sources, scans any new receipts it finds, and appends everything to a financial database that can answer any question — monthly revenue, annual expenses by category, tax estimate — in seconds.
The Architecture
The architecture has four layers, each a dedicated module that operates independently:
Ingestion — Three source-specific parsers, each responsible for understanding one payment provider’s export format. The Shopify parser reads the standard payout CSV and also connects to the Shopify Admin API for live order-level data when available. The SumUp parser extracts tables from PDFs using pdfplumber and sends the unstructured text to GPT-4o for field extraction, because SumUp’s PDF formatting uses spacing tricks that break standard regex. The Twint parser handles the semicolon-delimited format, filtering out settlement batches and cancellations so only actual customer payments enter the database. Each parser outputs a list of raw transaction dictionaries.
Normalisation — A single normaliser receives raw transactions from every parser and maps them to a unified schema. Every transaction gets the same fields regardless of source: source, date, type, amount_gross, amount_fee, amount_net, currency, payment_method, and a source-specific source_ref. The normaliser applies sensible defaults for missing fields — if a source does not report fees, the fee field remains zero without breaking downstream reports.
Expense capture — The receipt scanner handles both images and PDFs. For images, it sends the file directly to GPT-4o’s vision endpoint. For PDFs, it extracts text first with pdfplumber, then sends the text to GPT-4o for structured extraction. The scanner uses a SHA-256 file-hash cache so that re-running the pipeline does not re-scan documents that have already been processed — this is critical for interrupt-and-resume workflows and keeps API costs predictable.
Reporting — The reporting layer queries the unified database and produces three kinds of output: Excel workbooks with multiple sheets and styled formatting via openpyxl, Word documents with embedded charts via python-docx and matplotlib, and terminal summaries for quick checks via a Click-based CLI.
Step 1: Ingest and Normalise Payment Data from Every Source
Each payment source speaks a different language. The ingestion layer translates all of them into a single, predictable format.
First, define the unified transaction record. Everything the pipeline produces conforms to this schema:
TRANSACTION_SCHEMA = {
"source": "shopify", # One of: shopify, twint, sumup, manual
"payment_method": "card", # card, twint, bank_transfer, cash
"source_ref": "010F243...", # Original ID from the source system
"date": "2026-06-16", # ISO format YYYY-MM-DD
"time": "14:32", # Optional, for intra-day resolution
"type": "income", # income, payout, expense, fee, refund
"amount_gross": 281.50, # Gross amount in CHF (before fees)
"amount_fee": 12.51, # Processing fee deducted
"amount_net": 268.99, # Net amount after fees
"currency": "CHF", # ISO 4217 currency code
"description": "...", # Human-readable reference
}
Shopify: CSV Payout Parsing and Live API Integration
Shopify exports two types of financial data. The payouts CSV is a high-level settlement report that arrives via email or admin download. Each row represents one payout batch — the accumulated charges, refunds, adjustments, and fees for a settlement period, with the net total that landed in the bank account.
The parser reads the CSV, skips rows without a payout date, filters to only rows with status paid, and extracts every column into the unified format:
import csv
from datetime import datetime
def parse_shopify_csv(filepath):
"""Parse Shopify payout CSV — one row per settlement batch."""
transactions = []
with open(filepath, "r", encoding="utf-8-sig") as f:
reader = csv.DictReader(f)
for row in reader:
if not row.get("Payout Date", "").strip():
continue
if row.get("Status", "").strip().lower() != "paid":
continue
charges = _amount(row.get("Charges", "0"))
fees = _amount(row.get("Fees", "0"))
refunds = _amount(row.get("Refunds", "0"))
total = _amount(row.get("Total", "0"))
bank_ref = row.get("Bank Reference", "").strip()
payout_date = _iso_date(row["Payout Date"])
transactions.append({
"source": "shopify",
"payment_method": "bank_transfer",
"source_ref": bank_ref,
"date": payout_date,
"type": "payout",
"amount_gross": charges,
"amount_fee": fees,
"amount_net": total,
"currency": row.get("Currency", "CHF").strip(),
"description": (
f"Shopify payout — gross CHF {charges:.2f}, "
f"fees CHF {fees:.2f}, refunds CHF {refunds:.2f}"
),
"raw": {
"charges": charges,
"refunds": refunds,
"fees": fees,
"total": total,
"bank_ref": bank_ref,
},
})
return transactions
def _amount(value):
"""Parse a monetary string to float. Handles comma separators."""
if not value or not str(value).strip():
return 0.0
return round(float(str(value).strip().replace(",", "")), 2)
def _iso_date(value):
"""Normalise any date format to YYYY-MM-DD."""
value = value.strip()
for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"):
try:
return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
except ValueError:
continue
return value
For stores that need order-level granularity alongside payout-level settlements, the pipeline also connects to the Shopify Admin API. This fetches individual orders with their payment gateway, shipping cost, and tax breakdown, giving you the ability to track revenue before it reaches the settlement batch:
def fetch_orders(created_at_min, created_at_max, status="any", limit=250):
"""Pull paid orders from Shopify Admin API with full financial details."""
url = f"{BASE_URL}/orders.json"
params = {"status": status, "limit": limit, "financial_status": "paid"}
orders = []
while url:
resp = requests.get(url, headers=_shopify_headers(), params=params)
resp.raise_for_status()
data = resp.json()
for o in data.get("orders", []):
orders.append({
"order_id": o.get("id"),
"order_number": o.get("order_number"),
"date": o.get("created_at", "")[:10],
"gross": float(o.get("total_price", 0)),
"shipping": float(o.get("total_shipping_price_set", {}).get(
"shop_money", {}).get("amount", 0)),
"tax": float(o.get("total_tax", 0)),
"currency": o.get("currency", "CHF"),
"payment_gateway": ", ".join(o.get("payment_gateway_names", [])),
"source_name": o.get("source_name", ""),
})
# Paginate via Link header
link = resp.headers.get("Link", "")
url = None
if 'rel="next"' in link:
for part in link.split(","):
if 'rel="next"' in part:
url = part.split(";")[0].strip(" <>")
return orders
Twint: Filtering Settlement Batches from Real Transactions
The Twint CSV is semicolon-delimited and contains four transaction types — Payment, Credit Merchant, Cancellation, and Closing Merchant. Only Payments marked as Settled are actual customer purchases. Closing Merchant rows are settlement summaries that arrive later and would double-count the same money if imported as separate transactions.
The parser skips everything except type Payment with state Settled:
import csv
def parse_twint_csv(filepath):
"""Parse Twint semicolon-delimited CSV — only settled payments."""
transactions = []
in_data_section = False
with open(filepath, "r", encoding="utf-8-sig") as f:
reader = csv.reader(f, delimiter=";")
for row in reader:
if not row or all(c.strip() == "" for c in row):
continue
# Skip header rows until we hit the column headers
first_cell = row[0].strip().strip('"')
if first_cell == "Date" and not in_data_section:
in_data_section = True
continue
if not in_data_section or len(row) < 13:
continue
txn_type = row[2].strip().strip('"')
state = row[3].strip().strip('"')
if txn_type != "Payment" or state.lower() != "settled":
continue
date_str = row[0].strip().strip('"')
time_str = row[1].strip().strip('"')
amount_gross = _amount(row[4])
amount_fee = _amount(row[6])
order_id = row[9].strip().strip('"') if len(row) > 9 else ""
merchant_ref = row[10].strip().strip('"') if len(row) > 10 else ""
transactions.append({
"source": "twint",
"payment_method": "twint",
"source_ref": order_id or merchant_ref,
"date": _iso_date(date_str),
"time": time_str,
"type": "income",
"amount_gross": amount_gross,
"amount_fee": amount_fee,
"amount_net": round(amount_gross - amount_fee, 2),
"currency": "CHF",
"description": f"TWINT payment — ref: {merchant_ref}",
})
return transactions
def _iso_date(value):
"""Parse TWINT date format (YYYY.MM.DD) to ISO (YYYY-MM-DD)."""
value = value.strip()
for fmt in ("%Y.%m.%d", "%Y-%m-%d", "%d.%m.%Y"):
try:
return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
except ValueError:
continue
return value
SumUp: Extracting Tables from PDFs with LLM Vision
SumUp provides revenue reports as PDFs, not CSVs. The PDF contains a table organised by card type — Visa, Mastercard, American Express — with sales counts, refund counts, and totals. It also includes a category breakdown by product type.
Parsing this with traditional regex is fragile because SumUp uses non-standard spacing (thin spaces and non-breaking spaces that render differently in different PDF readers). Instead, the parser extracts all text from the PDF using pdfplumber and sends the complete text to GPT-4o with a strict JSON output format:
import pdfplumber
from llm_adapter import extract_json
SYSTEM_PROMPT = """You extract structured data from a SumUp credit card revenue report PDF.
The report contains payment-method breakdowns (by card type) and product
category breakdowns. Return ONLY this JSON:
{
"merchant": "string",
"period_start": "YYYY-MM-DD",
"period_end": "YYYY-MM-DD",
"total_sales": 0.00,
"payment_methods": [
{"type": "Visa", "sales_count": 5, "total": 1603.76}
],
"categories": [
{"name": "Product name", "qty": 3, "total": 500.00}
]
}
All amounts in CHF as floats. Handle numbers with non-standard spacing
(e.g. "1 603.76" = 1603.76). Only include entries with non-zero totals."""
def parse_sumup_pdf(filepath):
"""Extract SumUp credit card transaction data from PDF via GPT-4o."""
all_text = ""
with pdfplumber.open(filepath) as pdf:
for page in pdf.pages:
text = page.extract_text()
if text:
all_text += text + "\n"
if not all_text.strip():
return []
data = extract_json(SYSTEM_PROMPT, all_text)
if not data:
return []
end_date = data.get("period_end", datetime.now().strftime("%Y-%m-%d"))
transactions = []
for pm in data.get("payment_methods", []):
if pm.get("total", 0) <= 0:
continue
transactions.append({
"source": "sumup",
"payment_method": "card",
"source_ref": f"sumup-{pm['type'].lower().replace(' ', '_')}",
"date": end_date,
"type": "payout",
"amount_gross": pm["total"],
"amount_fee": 0.0,
"amount_net": pm["total"],
"currency": "CHF",
"description": (
f"SumUp {pm['type']} — {pm['sales_count']} sales, "
f"CHF {pm['total']:.2f}"
),
})
return transactions
The extract_json helper is shared across the project — a thin wrapper around the OpenAI client that sends a system prompt plus the extracted text and parses the JSON response:
def extract_json(system_prompt, user_content, temperature=0.1):
"""Send content to GPT-4o and parse the JSON response."""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_content},
],
max_tokens=1000,
temperature=temperature,
)
raw = response.choices[0].message.content.strip()
if raw.startswith("```"):
raw = raw.split("```")[1]
if raw.startswith("json"):
raw = raw[4:]
return json.loads(raw)
This approach works for any PDF-based financial report. The LLM handles the layout ambiguity that regular expression parsers cannot, and the JSON output format ensures the result is always machine-readable.
The Normaliser: One Schema to Rule Them All
Each parser outputs transactions in the same shape. The normaliser receives them all and fills in any gaps with sensible defaults before storage:
def normalize_transactions(raw_transactions, batch_id=None):
"""Map raw parser output to the unified transaction schema."""
now = datetime.now().isoformat()
normalized = []
for raw in raw_transactions:
txn = {
"source": raw.get("source", "unknown"),
"payment_method": raw.get("payment_method"),
"source_ref": raw.get("source_ref"),
"date": raw.get("date", ""),
"time": raw.get("time"),
"type": raw.get("type", "income"),
"amount_gross": raw.get("amount_gross", 0.0),
"amount_fee": raw.get("amount_fee", 0.0),
"amount_net": round(raw.get("amount_gross", 0.0) - raw.get("amount_fee", 0.0), 2),
"currency": raw.get("currency", "CHF"),
"description": raw.get("description"),
"batch_id": batch_id,
"imported_at": now,
}
normalized.append(txn)
return normalized
Step 2: Receipt Scanning with LLM Vision and Hash-Based Caching
Receipts are the hardest part of financial tracking because they have no API. You get a thermal-printed slip from the post office, a PDF invoice from Adobe, a photographed lunch receipt, a domain renewal confirmation from Hostpoint. Each has its own layout, language, and relevant fields.
The scanner handles every format the same way: send the file to GPT-4o and ask for structured JSON. But it adds a critical optimisation: before making the API call, it hashes the file and checks a local cache. If the file was scanned before, the cached result returns instantly with zero API cost. This makes batch scanning idempotent — you can run the scanner on an entire receipts folder every day and it only processes new files.
import base64
import json
import hashlib
import pdfplumber
from openai import OpenAI
CACHE_PATH = "data/scan_cache.json"
CATEGORIES = [
"Payment Processing Fees", "Shipping & Postage",
"Web Hosting & Domains", "Software & Tools",
"Marketing & Advertising", "Office Supplies",
"Professional Services", "Inventory & Supplies",
"Bank Fees", "Travel & Transport", "Insurance",
"Telecom & Internet", "Rent & Utilities",
"Equipment & Hardware", "Other",
]
SYSTEM_PROMPT = """You are a receipt scanner for a business expense tracking system.
Analyze this document and return JSON:
{
"date": "YYYY-MM-DD",
"amount": 0.00,
"currency": "CHF",
"merchant": "Company or vendor name",
"description": "Brief summary under 80 chars",
"suggested_category": "One of the valid categories",
"confidence": "high|medium|low"
}
- Date: the invoice/receipt date. Swiss formats accepted (DD.MM.YYYY).
- Amount: the total due. Positive, no currency symbols.
- Merchant: the vendor/company name from the document header.
- Category: pick the BEST match from: """ + ", ".join(CATEGORIES) + """
- Confidence: "high" if clear, "medium" if some guessing, "low" if unclear.
Return ONLY JSON, no markdown."""
def scan_receipt(file_path):
"""Scan a receipt or invoice using GPT-4o with file-hash caching."""
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
# Check cache first — avoids re-scanning previously processed files
cached = get_cached(file_path)
if cached:
cached["cached"] = True
return cached
ext = os.path.splitext(file_path)[1].lower()
client = OpenAI()
if ext == ".pdf":
data = _scan_pdf(client, file_path)
else:
data = _scan_image(client, file_path, ext)
# Cache the result for future runs
cache_entry = {
k: data[k] for k in (
"date", "amount", "currency", "merchant",
"description", "suggested_category", "confidence"
)
}
set_cached(file_path, cache_entry)
return data
def _scan_pdf(client, file_path):
"""Extract text from PDF, then parse with GPT-4o."""
with pdfplumber.open(file_path) as pdf:
pages_text = []
for page in pdf.pages:
text = page.extract_text()
if text:
pages_text.append(text)
if not pages_text:
raise ValueError(f"No extractable text: {file_path}")
full_text = "\n---PAGE BREAK---\n".join(pages_text)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": full_text},
],
max_tokens=500,
temperature=0.1,
)
return _parse_response(response)
def _scan_image(client, file_path, ext):
"""Send image directly to GPT-4o vision."""
with open(file_path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("utf-8")
mime_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": [
{"type": "text", "text": "Extract the receipt data from this image."},
{"type": "image_url", "image_url": {
"url": f"data:{mime_map.get(ext, 'image/jpeg')};base64,{b64}"
}},
]},
],
max_tokens=500,
temperature=0.1,
)
return _parse_response(response)
def get_cached(file_path):
"""Return cached result for a file, or None if not yet scanned."""
file_hash = hashlib.sha256(open(file_path, "rb").read()).hexdigest()
cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
return cache.get(file_hash)
def set_cached(file_path, data):
"""Store scan result keyed by SHA-256 file hash."""
file_hash = hashlib.sha256(open(file_path, "rb").read()).hexdigest()
cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
cache[file_hash] = data
os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
with open(CACHE_PATH, "w") as f:
json.dump(cache, f, indent=2)
What the scanner extracts
A photographed post office shipping receipt becomes:
{
"date": "2025-09-20",
"amount": 33.40,
"currency": "CHF",
"merchant": "Die Schweizerische Post",
"description": "International parcel shipping, 2 packages",
"suggested_category": "Shipping & Postage",
"confidence": "high"
}
An Adobe Creative Cloud monthly invoice becomes:
{
"date": "2025-10-14",
"amount": 35.55,
"currency": "CHF",
"merchant": "Adobe Systems Software Ireland Ltd",
"description": "Creative Cloud All Apps 100GB subscription",
"suggested_category": "Software & Tools",
"confidence": "high"
}
The scanner handles receipts in German, French, Italian, and English — the LLM reads the layout and extracts fields regardless of language. The fifteen built-in categories cover the expense types most ecommerce businesses encounter: shipping, hosting, software, marketing, supplies, professional services, travel, and more. Each is stored with a flag indicating whether it is tax-deductible.
Step 3: Import Marketing Spend from Google Ads
Financial reports are incomplete without the cost side of customer acquisition. The Google Ads API provides daily campaign spend at the micros level (1 micro = 1/1,000,000 of the currency unit), along with impressions, clicks, and conversions:
from google.ads.googleads.client import GoogleAdsClient
def daily_ad_spend(customer_id, days=30):
"""Pull daily cost, impressions, clicks, and conversions from Google Ads."""
client = GoogleAdsClient.load_from_storage("google-ads.yaml")
ga_service = client.get_service("GoogleAdsService")
query = f"""
SELECT
segments.date,
metrics.cost_micros,
metrics.impressions,
metrics.clicks,
metrics.conversions
FROM campaign
WHERE segments.date BETWEEN
'{_n_days_ago(days)}' AND '{_n_days_ago(0)}'
ORDER BY segments.date DESC
"""
response = ga_service.search_stream(customer_id=customer_id, query=query)
rows = []
for batch in response:
for row in batch.results:
rows.append({
"date": row.segments.date,
"cost": row.metrics.cost_micros / 1_000_000,
"impressions": row.metrics.impressions,
"clicks": row.metrics.clicks,
"conversions": row.metrics.conversions,
})
return rows
def _n_days_ago(n):
return (datetime.now() - timedelta(days=n)).strftime("%Y-%m-%d")
These rows feed into the same database as payment and expense data, giving you a complete profit-and-loss statement that includes exactly how much you paid to acquire each sale.
Step 4: Building Tax-Ready Financial Reports
With all transactions and expenses in a unified database, reporting becomes a set of SQL queries. The schema tracks every dollar in, every dollar out, and every fee deducted, grouped by source, period, and category.
import pandas as pd
class TaxReportBuilder:
"""Produce financial reports from the unified transaction database."""
def __init__(self, transactions, expenses, ad_costs):
self.txns = pd.DataFrame(transactions)
self.expenses = pd.DataFrame(expenses)
self.ads = pd.DataFrame(ad_costs)
def revenue_by_source(self):
"""Net revenue broken down by payment processor."""
return (
self.txns
.groupby("source")
.agg(
count=("source_ref", "count"),
gross=("amount_gross", "sum"),
fees=("amount_fee", "sum"),
net=("amount_net", "sum"),
)
.round(2)
.reset_index()
)
def expenses_by_category(self):
"""Group scanned expenses into tax-return categories."""
return (
self.expenses
.groupby("category")
.agg(
count=("amount", "count"),
total=("amount", "sum"),
)
.round(2)
.reset_index()
.sort_values("total", ascending=False)
)
def monthly_profit_summary(self, year):
"""Month-by-month profit including all revenue, fees, expenses, and ads."""
monthly = self.txns.copy()
monthly["month"] = pd.to_datetime(monthly["date"]).dt.to_period("M")
monthly = monthly[monthly["month"].dt.year == year]
revenue = (
monthly.groupby("month")
.agg(gross_revenue=("amount_gross", "sum"), fees=("amount_fee", "sum"))
.round(2)
)
exp_monthly = self.expenses.copy()
exp_monthly["month"] = pd.to_datetime(exp_monthly["date"]).dt.to_period("M")
exp_monthly = exp_monthly[exp_monthly["month"].dt.year == year]
expenses = (
exp_monthly.groupby("month")
.agg(total_expenses=("amount", "sum"))
.round(2)
)
ad_monthly = self.ads.copy()
ad_monthly["month"] = pd.to_datetime(ad_monthly["date"]).dt.to_period("M")
ad_monthly = ad_monthly[ad_monthly["month"].dt.year == year]
ads = (
ad_monthly.groupby("month")
.agg(marketing_spend=("cost", "sum"))
.round(2)
)
summary = revenue.join(expenses, how="outer").join(ads, how="outer").fillna(0)
summary["net_profit"] = (
summary["gross_revenue"] - summary["fees"]
- summary["total_expenses"] - summary["marketing_spend"]
)
return summary.round(2).reset_index()
Export to Excel and Word
The reporting layer uses openpyxl to write formatted Excel workbooks — blue header rows, auto-width columns, number formatting with thousands separators — and python-docx to generate Word documents with cover pages, embedded charts, and multi-year comparison sections. Both export formats are structured to match what Swiss cantonal tax offices expect.
For Swiss businesses specifically, the reporting module also tracks the VAT threshold (CHF 100,000 annual turnover — registration becomes mandatory at this level), computes estimated cantonal and federal tax liability using Geneva’s progressive rate brackets, and estimates social security contributions (AVS/AI/APG) for self-employed proprietors.
Schedule
# Run daily to keep the financial database current
0 6 * * * cd /app && python financial_pipeline.py
For production scheduling with error recovery, retry logic, and a web dashboard, see Schedule and Orchestrate Workflows with Prefect.
Before vs After
| Task | Before | After |
|---|---|---|
| Payment reconciliation | Export CSV from 3+ systems, manually match totals against bank statements. Two days per month. | Single CLI command. All sources ingested, normalised, and deduplicated in under a minute. |
| Expense tracking | Photos in a folder. Typed into a spreadsheet at year-end by hand. Categories guessed from memory. | LLM scans each receipt in under a second. Vendor, date, amount, category captured to database. Re-running costs nothing for already-scanned files. |
| Marketing cost import | Pulled from Google Ads dashboard manually, typed into a separate sheet, sometimes forgotten. | API imports daily spend alongside revenue and expenses. P&L includes customer acquisition cost automatically. |
| Tax report preparation | Accountant rebuilds P&L from bank statements and paper files. Days of billable time. | Structured Excel and Word exports with revenue by source, expenses by category, and monthly profit summaries. Ready for upload. |
| Data accuracy | Missed transactions. Double-counted settlement batches. Transposed columns. | Every transaction has a source_ref traceable to the original export. Deduplication prevents double-counting. |
Common Pitfalls
-
Settlement transactions that are not revenue — Twint’s Closing Merchant and Shopify’s payout summaries both represent aggregated settlements of earlier transactions. Importing them alongside individual sales creates phantom revenue. Filter them out at the parser level, not at the report level.
-
Receipt image quality — Blurry or low-light photos produce lower confidence scores from the LLM scanner. Train anyone who photographs receipts to lay them flat on a dark background in good light. For critical receipts — lease agreements, insurance invoices — use a proper document scanner.
-
Date format inconsistency — Shopify uses ISO dates. Twint uses dot-separated Swiss format (2025.12.10). PDFs use whatever the creator preferred. Write a date parser that tries multiple formats in order of likelihood for that source.
-
LLM API costs at scale — Scanning 300 receipts with GPT-4o costs roughly CHF 3-5 at current rates. The SHA-256 cache means you only pay once per file. Batch all new receipts once a week rather than calling the API every day for the same files.
-
Currency mixing — If your store sells in multiple currencies, store both the original currency and the converted amount. Apply exchange rates at the time of each transaction, not at the time of the report, or your profit calculations will drift.
-
Google Ads reports accrued, not cash — Google Ads reports costs on an accrual basis. Your tax return may use cash accounting. Track both methods in the database and align your exports with whichever your jurisdiction requires.
Next Steps
For the fundamentals of pulling Shopify transaction data for reporting, see How to Automate Shopify Reports with Python. For building a dashboard that visualises the financial data this pipeline produces, see Shopify Automated Reporting: Build a Self-Updating Sales Dashboard.
For a complete comparison of every Shopify reporting approach and how financial reporting fits into the larger picture, see The Complete Guide to Shopify Reporting and Analytics. For scheduling this pipeline with retries, monitoring, and error recovery, see Schedule and Orchestrate Workflows with Prefect.
For securing the API keys used across payment processors, receipt scanning, and Google Ads, see Secure Python Automation: Managing Secrets and Keys.
Ecommerce optimisation services include building automated financial reporting pipelines tailored to your store’s payment systems, expense categories, and tax jurisdiction.
Get in touch to discuss automating your financial reporting.
FAQ
What payment processors does this pipeline support?
The architecture handles any system that provides an API or a structured CSV export. The guide includes complete parser implementations for Shopify Payments (CSV + API), Twint (semicolon-delimited CSV), and SumUp (PDF extracted via LLM). Adding a new source — a different card terminal, a mobile wallet, a buy-now-pay-later provider — requires implementing one parser class that produces the same unified transaction schema.
How accurate is LLM receipt scanning?
With a clear photo and GPT-4o, accuracy is above 95% for merchant name, date, and total amount. Categories are slightly less reliable — the model may classify a payment processor invoice as Software rather than Payment Processing Fees. The scanner surfaces a confidence level (high, medium, low) with each extraction so you can verify low-confidence entries during the first month. After that, the cache ensures verified entries never need re-scanning.
Can this replace my accountant?
No. This pipeline produces structured, categorised data that saves your accountant hours of manual compilation. It generates reports in the format they need for tax filing. It does not replace professional tax advice — particularly around deductions, depreciation schedules, or jurisdiction-specific filing rules.
Does this handle multi-currency stores?
The transaction schema stores both the original currency and the reporting currency. The normaliser converts all amounts to your base currency at import time. For stores that need to report in multiple currencies simultaneously, extend the schema with a currency_original and exchange_rate pair.
How does the cache work, and why does it matter?
Every file scanned by the receipt pipeline — image, PDF, or otherwise — is hashed with SHA-256 before the API call. The hash-keyed result is stored in a local JSON cache. On subsequent runs, the pipeline checks the cache before making any API call. This means: (a) you pay the GPT-4o cost only once per file, (b) batch scans can be interrupted and resumed without losing progress, and (c) running the pipeline daily processes only new files.
What if my tax jurisdiction has different reporting requirements?
The reporting layer is separated from the ingestion layer. You can swap the export functions — the ones that produce Excel and Word files — to match your local tax authority’s format without touching any of the parsing or scanning logic. For Swiss businesses, VAT threshold tracking, cantonal tax brackets, and AVS social security estimation are built in as an example of jurisdiction-specific reporting.
Enjoyed this article?
Get notified when I publish new articles on automation, ecommerce, and data engineering.
Get in touch