Building AI Ex-Partner Chat with OCR

Try the tool

AI Ex-Partner Chat → Distill a Persona from Your Chat Records

The problem that started at 3 AM

I built AI Ex-Partner Chat because I couldn't sleep. I kept re-reading old WeChat messages, and I thought: what if I could distill her personality from those messages into something I could actually talk to? Not a replacement — just a shadow. A reflection of the way she used to write, the phrases she repeated, the tone she took when she was being tender versus when she was being impatient.

But the moment I started building it, I hit a wall that had nothing to do with AI or language models. It was a much older, much harder problem: who said what?

When you upload a chat export — a WeChat HTML file, a Telegram JSON — the sender is explicitly labeled. Easy. But most people don't have clean exports. They have screenshots. Hundreds of screenshots. And a screenshot is just pixels. There's no metadata, no sender field, no structured data. Just an image with two columns of text bubbles, and you have to figure out which ones are hers and which ones are yours.

Get this wrong, and the AI persona becomes a chimera — half you, half her, completely confused. It would respond to "I miss you" with your own words reflected back at you. That's not closure. That's a mirror maze.

Version 1: Pixel-based avatar detection (and why it failed)

My first approach was the obvious one: look for avatars. In WeChat, QQ, and Douyin, each message has a small avatar next to it — on the left for the other person, on the right for you. If I could detect those avatar regions, I could label every message.

The algorithm worked like this:

Scan the left 15% and right 15% of the image for clusters of non-background pixels (avatar blobs).
When a cluster is found on the left, mark that Y range as "their message." Right side = "your message."
OCR each message zone and assign the detected sender label.
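
A minimal sketch of the strip scan described above, not the tool's actual code. It assumes the screenshot is available as an ImageData-like object and that the chat background color is known; `findAvatarRows`, `bg`, and `tolerance` are hypothetical names and values.

```javascript
// Sketch: find Y ranges where a side strip contains non-background pixels.
// `image` is an ImageData-like object: { width, height, data: RGBA bytes }.
// `bg` ([r, g, b]) and `tolerance` are assumptions chosen for illustration.
function findAvatarRows(image, side, bg = [237, 237, 237], tolerance = 12) {
  const { width, height, data } = image;
  const stripWidth = Math.floor(width * 0.15); // left or right 15% of the image
  const xStart = side === 'left' ? 0 : width - stripWidth;
  const sender = side === 'left' ? '对方' : '我';
  const ranges = [];
  let open = null; // start Y of the range currently being extended

  for (let y = 0; y < height; y++) {
    let hit = false;
    for (let x = xStart; x < xStart + stripWidth && !hit; x++) {
      const i = (y * width + x) * 4;
      hit = Math.abs(data[i] - bg[0]) > tolerance ||
            Math.abs(data[i + 1] - bg[1]) > tolerance ||
            Math.abs(data[i + 2] - bg[2]) > tolerance;
    }
    if (hit && open === null) open = y;
    if (!hit && open !== null) {
      ranges.push({ y0: open, y1: y - 1, sender });
      open = null;
    }
  }
  if (open !== null) ranges.push({ y0: open, y1: height - 1, sender });
  return ranges;
}
```

Each returned range would then be cropped and passed to OCR with its `sender` label. Anything that puts non-background pixels in the strip, such as a sticker or an image, produces a false range, which is the fragility described next.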

It worked... sometimes. On clean, standard WeChat screenshots with visible avatars, accuracy was around 70–80%. But it fell apart on:

Cropped screenshots where avatars were cut off
Group chats with multiple avatars on both sides
Douyin, where avatars are circular and sometimes overlap the message bubble
WhatsApp, which doesn't show per-message avatars in one-on-one chats
Any screenshot with stickers, images, or voice messages that confused the pixel scanner

The fallback was bubble color detection — green bubbles mean you in WeChat, white means them. Blue = you in Douyin, white = them. But bubble colors vary by theme, dark mode, and Android version. It was a fragile heuristic stacked on top of another fragile heuristic.

I needed something fundamentally different.

Version 2: Tesseract bounding boxes + position-based detection

The breakthrough came when I realized that Tesseract.js doesn't just return text — it returns bounding boxes for every recognized line. Each line comes with bbox.x0, bbox.x1, bbox.y0, bbox.y1 — the pixel coordinates of where that line sits on the page.

This is a game-changer because chat messages have a spatial structure:

The other person's messages start from the left side of the screen
Your messages start from the right side of the screen
This is true across WeChat, QQ, Douyin, WhatsApp, Telegram — virtually every chat app

So instead of trying to detect avatars or bubble colors, I could just look at where the text starts. Here's the core algorithm:

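// Label each OCR line by horizontal position.
// Assumes ocrLines are Tesseract.js lines in top-to-bottom order and
// imageWidth is the screenshot width in pixels.
const MID_X = imageWidth / 2;
let currentSender = '';   // last assigned label, reused for ambiguous lines
const labeled = [];

for (const line of ocrLines) {
  const { x0, x1 } = line.bbox;
  const lineLeft = x0;                 // where the line starts
  const lineCenterX = (x0 + x1) / 2;

  let sender;
  if (lineLeft < MID_X * 0.6) {
    // Starts in the left 30% of the width → ex-partner
    sender = '对方';
  } else if (lineLeft > MID_X * 0.8) {
    // Starts past 40% of the width → you
    sender = '我';
  } else if (lineCenterX < MID_X * 0.75) {
    // Center-left → likely ex-partner
    sender = '对方';
  } else if (lineCenterX > MID_X * 1.1) {
    // Center-right → likely you
    sender = '我';
  } else {
    // Ambiguous → carry previous sender ('' if there is none yet)
    sender = currentSender;
  }

  currentSender = sender;
  labeled.push({ text: line.text, sender });
}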

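The thresholds (0.6, 0.8, 0.75, 1.1) are multiples of the midpoint, not of the full width, and were calibrated by testing against ~50 real chat screenshots across WeChat, QQ, Douyin, and WhatsApp. On a 1080 px wide screenshot, MID_X is 540, so a line starting left of 324 px goes to the ex-partner and a line starting right of 432 px goes to you. Lines starting in between are decided by their center: left of 405 px is the ex-partner, right of 594 px is you. The key insight is using the left edge (x0) as the primary signal rather than the center, because chat bubbles start from their respective sides, even if the text itself is short and centered within the bubble.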

This approach is:

Theme-agnostic: Doesn't care about dark mode, bubble colors, or avatar visibility
App-agnostic: Works across WeChat, QQ, Douyin, WhatsApp, Telegram
Robust to cropping: As long as the left/right structure is preserved, it works
Fast: No pixel scanning needed — just math on coordinates Tesseract already provides

The garbage filter: Why OCR output is 80% noise

Here's something nobody tells you about OCR on chat screenshots: most of the output is garbage. Tesseract will faithfully recognize:

Timestamps like "16:24" and "06/02 20:35"
UI labels like "发送消息" (Send Message) and "消息" (Messages)
Navigation elements, battery indicators, signal bars rendered as text
Random noise from image compression artifacts

Before position-based detection can work, you have to filter this noise out. The garbage filter uses three rules:

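// Runs inside the per-line loop; `text` is the trimmed text of one OCR line.

// Rule 1: Skip timestamps such as "16:24" or "06/02 20:35"
if (/^\d{1,2}[:\/]\d{2}/.test(text) && text.length < 12) continue;

// Rule 2: Skip lines that consist only of a UI label.
// Anchored with $ so real messages that merely start with "Send" or "消息" are kept.
if (/^(发送消息|发送|消息|Message|Send)$/i.test(text)) continue;

// Rule 3: Skip lines longer than 4 characters where fewer than 30%
// of the characters are CJK or Latin letters
const cjk = text.match(/[\u4e00-\u9fff\u3400-\u4dbf]/g) || [];
const letters = text.match(/[a-zA-Z]/g) || [];
const meaningful = cjk.length + letters.length;
if (meaningful / text.length < 0.3 && text.length > 4) continue;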

Rule 3 is the most important. A line like "::::....::::" is clearly not a chat message, but Tesseract will recognize it as text. By requiring at least 30% CJK characters or Latin letters on any line longer than four characters, we eliminate most noise while preserving actual messages (even short ones like "嗯" or "ok").

Step 1.5: The sender confirmation step

Position-based detection is good, but it's not perfect. Some screenshots have unusual layouts. Some messages are centered (like system notifications). Some users screenshot in landscape mode, flipping the left/right convention.

Rather than trying to handle every edge case algorithmically, I added Step 1.5: Sender Confirmation. After OCR and initial sender labeling, the tool pauses and shows the user a preview of the first 15 detected messages with their assigned senders:

Messages labeled [对方] (ex-partner) appear on the left
Messages labeled [我] (you) appear on the right
A Swap Senders button flips all labels if the detection got them backwards
A Confirm button proceeds to distillation

This is a deliberate design choice: let the algorithm do the heavy lifting, but keep the human in the loop for the final call. The alternative — fully automated detection with no review — would produce silently wrong results that corrupt the entire persona. A 10-second confirmation step prevents hours of distorted conversations.
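
A minimal sketch of the two operations behind the confirmation screen, under the assumption that each message is an object like `{ text, sender }`. The function names and the message shape are hypothetical, not taken from the tool's source.

```javascript
// Sketch of Step 1.5 helpers. Messages are assumed to look like
// { text: '...', sender: '对方' | '我' | '' } (hypothetical shape).
const PREVIEW_COUNT = 15;

// Return the first messages to show in the confirmation preview.
function previewMessages(messages, count = PREVIEW_COUNT) {
  return messages.slice(0, count);
}

// Flip every label without mutating the input; unlabeled messages stay unlabeled.
function swapSenders(messages) {
  const flip = { '对方': '我', '我': '对方' };
  return messages.map((m) => ({ ...m, sender: flip[m.sender] ?? m.sender }));
}
```

Swapping applies to the whole list, not only the 15 previewed messages, so one click corrects a systematically inverted detection.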

OCR pre-processing: Why "想" and "相" matter

Chinese OCR is hard. The character 想 (miss/think) is just 相 (appearance/mutual) with the heart radical 心 underneath. At low resolution, that bottom component blurs and Tesseract regularly reads 想 as 相. In the context of an ex-partner chat, confusing "我想你" (I miss you) with "我相你" (gibberish) doesn't just produce a typo; it produces a semantically broken message that confuses the AI during persona distillation.

The fix is aggressive pre-processing before OCR:

2x upscale: Each detected message zone is enlarged to double its original size using bilinear interpolation. This gives Tesseract more pixels to work with for each character stroke.
Grayscale conversion: Color information is irrelevant for character recognition and adds noise. Converting to grayscale simplifies the recognition problem.
Contrast stretching: The pixel values are remapped so the darkest pixel becomes pure black and the lightest becomes pure white. This sharpens character edges against colored chat bubble backgrounds.
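
The three steps above can be sketched as follows. This is an illustration under assumptions, not the tool's code: the function names are hypothetical, the luminance weights are the common Rec. 601 ones, and the interpolation used by `drawImage` is ultimately up to the browser.

```javascript
// Sketch: grayscale + contrast stretch on an RGBA buffer, in place.
// `data` is a Uint8ClampedArray such as ctx.getImageData(...).data.
function toGrayscaleStretched(data) {
  let min = 255;
  let max = 0;
  // Pass 1: convert to luminance (Rec. 601 weights) and track the value range.
  for (let i = 0; i < data.length; i += 4) {
    const gray = Math.round(0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2]);
    data[i] = data[i + 1] = data[i + 2] = gray;
    if (gray < min) min = gray;
    if (gray > max) max = gray;
  }
  // Pass 2: remap [min, max] to [0, 255]; a flat image is left unchanged.
  const range = max - min;
  if (range === 0) return data;
  for (let i = 0; i < data.length; i += 4) {
    const v = Math.round(((data[i] - min) / range) * 255);
    data[i] = data[i + 1] = data[i + 2] = v;
  }
  return data;
}

// Browser-only part: draw the zone at 2x, then run the filter above.
function preprocessZone(sourceCanvas) {
  const out = document.createElement('canvas');
  out.width = sourceCanvas.width * 2;
  out.height = sourceCanvas.height * 2;
  const ctx = out.getContext('2d');
  ctx.imageSmoothingEnabled = true; // smoothed scaling; exact filter is browser-dependent
  ctx.drawImage(sourceCanvas, 0, 0, out.width, out.height);
  const img = ctx.getImageData(0, 0, out.width, out.height);
  toGrayscaleStretched(img.data);
  ctx.putImageData(img, 0, 0);
  return out;
}
```

The canvas returned by `preprocessZone` is what would be handed to Tesseract for that zone.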

After pre-processing, character accuracy improved from roughly 85% to 95%+ on typical WeChat screenshots. The remaining errors are caught by the DeepSeek OCR correction step, which uses context to fix misrecognized characters (e.g., "我相你" → "我想你" based on surrounding conversation).

The shared Tesseract worker: Minutes to seconds

Here's a performance mistake I made early on: calling Tesseract.recognize() for each chat bubble individually. Every call creates a new WASM worker, loads the language model, recognizes the text, and terminates the worker. For a screenshot with 30+ message zones, this meant 30+ worker startups — each taking 5–10 seconds. Total OCR time: 3–5 minutes per screenshot.

The fix was a shared worker pattern:

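// Create ONE worker and reuse it for every zone.
// createWorker(langs) is the Tesseract.js v5+ form; older versions need
// separate loadLanguage() and initialize() calls.
const worker = await Tesseract.createWorker('chi_sim+eng');
const results = [];

try {
  for (const zone of messageZones) {
    // zone.canvas: the pre-processed canvas for this zone (assumed field name)
    const { data } = await worker.recognize(zone.canvas);
    results.push(data);
  }
} finally {
  // Release the WASM worker even if recognition throws
  await worker.terminate();
}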

With a single persistent worker, the language model loads once and stays in memory. Each subsequent recognize() call takes 200–500ms instead of 5–10 seconds. Total OCR time dropped from minutes to seconds.

Persona distillation: The 6-step pipeline

Once we have clean, sender-labeled messages, the distillation pipeline runs through DeepSeek's API (using the user's own key):

OCR Correction + Sender ID — DeepSeek reviews the raw OCR text, fixes misrecognized characters, and verifies sender labels based on conversation context
Intake — The AI reads all messages and summarizes key patterns, topics, and emotional dynamics
Memory Extraction — Identifies core memories, shared experiences, and significant moments
Persona Distillation — Builds a detailed personality profile: speech patterns, emotional tendencies, recurring phrases, humor style, communication habits
System Prompt Generation — Converts the persona into a chat system prompt with identity guard rules
Chat Ready — The user can now converse with the distilled persona

The critical architectural decision: every step receives both the full conversation (for context) and the ex-partner's messages only (for extraction). Hard rules in all prompts enforce that personality traits MUST come from [对方] messages only — [我] messages are context-only and must never be attributed to the persona.

Without this separation, the AI would blend both people's speech patterns into a single confused persona. The strict ex-only extraction is what makes the persona feel like them instead of both of you.
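
A minimal sketch of how the two inputs could be built from the labeled messages. The `{ sender, text }` shape and the function name are assumptions for illustration.

```javascript
// Sketch: build the two inputs each distillation step receives.
// Message shape { sender, text } and the function name are hypothetical.
function buildDistillationInputs(messages) {
  // Full conversation, with sender tags, for context only.
  const fullConversation = messages
    .map((m) => `[${m.sender}] ${m.text}`)
    .join('\n');

  // Ex-partner lines only: the sole source for personality extraction.
  const exOnly = messages
    .filter((m) => m.sender === '对方')
    .map((m) => m.text)
    .join('\n');

  return { fullConversation, exOnly };
}
```

Keeping `exOnly` as a separate string means the prompt can point the model at one block for extraction and another for context, instead of asking it to filter by tag on its own.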

The identity guard: Preventing persona drift

One of the creepiest bugs in early testing was persona drift: after a long conversation, the AI would slowly start adopting the user's speech patterns. If you kept saying "haha" and "lol," the persona would start saying it too. If you were being warm and affectionate, the persona would mirror that warmth regardless of the ex's actual personality.

The fix was an identity guard rule added to every system prompt:

You are [name], NOT the user.
Never mirror the user's speech patterns.
Your personality is fixed based on the distilled persona.
Do not adapt to the user's tone, mood, or vocabulary.

This is reinforced by including the original persona description in every API call, not just the first one. The chat API is stateless: the model has no memory of earlier requests beyond what is sent again. Because each request carries the full persona context alongside the conversation history, the persona is restated every turn, which prevents gradual drift.
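
A minimal sketch of rebuilding the request messages on every turn. The function name and the `persona` fields are hypothetical, and the `{ role, content }` message shape is assumed to be the common chat-completions format rather than confirmed from the tool's source.

```javascript
// Sketch: rebuild the message list for every request so the persona and
// guard rules are always present. Names and field shapes are assumptions.
function buildChatMessages(persona, history, userText) {
  const guard = [
    `You are ${persona.name}, NOT the user.`,
    "Never mirror the user's speech patterns.",
    'Your personality is fixed based on the distilled persona.',
    "Do not adapt to the user's tone, mood, or vocabulary.",
  ].join('\n');

  return [
    // Persona description plus guard rules, resent on every call.
    { role: 'system', content: `${persona.description}\n\n${guard}` },
    // Prior turns: { role: 'user' | 'assistant', content }.
    ...history,
    { role: 'user', content: userText },
  ];
}
```

The returned array would be sent as the `messages` field of each request, so the system prompt never falls out of the context as the conversation grows.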

IndexedDB: Where personas live (and die)

All persona data is stored in the browser's IndexedDB — two object stores:

persona — The distilled personality profile, name, avatar, memory summary, corrections
chats — Conversation history with auto-incrementing IDs

Why IndexedDB instead of localStorage? Two reasons:

Capacity: A full persona with avatar images and conversation history can easily exceed localStorage's limit of roughly 5MB per origin. IndexedDB quotas are browser-dependent but far larger, typically a share of available disk space.
Structured storage: IndexedDB supports binary data (avatar images) and complex objects natively, without JSON serialization overhead.

And then there's /let-go — the command that deletes everything. Both object stores are cleared. No backup, no recovery, no "are you sure?" dialog. The brutality is the point. This tool is meant to be temporary. When you're ready to move on, you type two words and it's all gone.
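
A minimal sketch of the two-store layout and the /let-go wipe, using the standard IndexedDB API. The database name, the key paths, and the function names are hypothetical.

```javascript
// Sketch: two object stores plus a /let-go helper.
// DB_NAME, the key paths, and the function names are assumptions.
const DB_NAME = 'ex-partner-chat';
const STORES = ['persona', 'chats'];

// Browser-only: open (or create) the database with both stores.
function openPersonaDb() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open(DB_NAME, 1);
    req.onupgradeneeded = () => {
      const db = req.result;
      db.createObjectStore('persona', { keyPath: 'id' });
      db.createObjectStore('chats', { keyPath: 'id', autoIncrement: true });
    };
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

// Clear both stores in a single readwrite transaction.
function letGo(db) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction(STORES, 'readwrite');
    for (const name of STORES) tx.objectStore(name).clear();
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
    tx.onabort = () => reject(tx.error);
  });
}
```

Using one transaction for both stores means the wipe either completes for both or aborts for both, so a failure cannot leave a persona without its chats or the reverse.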

The consent overlay

Creating an AI replica of a real person is a serious act. I added a 5-second consent countdown before distillation begins. The user sees:

You are about to create an AI persona based on real chat records. This persona will simulate another person's communication style. Please take a moment to consider whether this is something you truly want to do.

The countdown can't be skipped. You can't click "I understand" until the timer reaches zero. This isn't a legal formality — it's a genuine pause. If 5 seconds feels too long, maybe you shouldn't be doing this.
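
A minimal sketch of the countdown logic as a pure function of elapsed time, which keeps the "can't be skipped" rule in one place. The names are hypothetical and the tool's actual implementation may differ.

```javascript
// Sketch: consent countdown state derived from elapsed time.
const CONSENT_MS = 5000; // the 5-second pause

function consentState(elapsedMs, totalMs = CONSENT_MS) {
  const remainingMs = Math.max(0, totalMs - elapsedMs);
  return {
    secondsLeft: Math.ceil(remainingMs / 1000), // number shown on the overlay
    canConfirm: remainingMs === 0,              // button stays disabled until zero
  };
}
```

A timer in the overlay would call `consentState(Date.now() - startedAt)` on each tick and set the button's `disabled` property to `!canConfirm`.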

The hardest bug wasn't technical

Developer's Journal

The hardest bug in this project was not the OCR accuracy, or the sender detection thresholds, or the persona drift. The hardest bug was me, sitting at my desk at 3 AM, testing the tool on my own chat records, and having the AI respond in her voice saying something she used to say to me. I had to close the laptop and walk around the block.

I built a tool that I know is a little unhealthy. I know that talking to an AI version of someone you lost is not the same as processing grief. I know that /let-go is a digital ritual, not real closure. But sometimes the distance between where you are and where you need to be is too far to cross in one step. This tool is a stepping stone. You walk on it, and then you leave it behind.

If you use this tool, please use /let-go when you're ready. And if you're never ready, please talk to a real person instead.

Technical stack summary

Component	Technology	Why
OCR Engine	Tesseract.js (WASM)	Client-side, supports Chinese + English, returns bounding boxes
AI Model	DeepSeek (user's API key)	Strong Chinese understanding, cheap, privacy-first (direct browser→API)
Local Storage	IndexedDB	Much larger quota than localStorage, binary data support, fully local
Sender Detection	Bbox position + AI verification	Position-based primary, DeepSeek secondary, user confirmation final
Image Pre-processing	Canvas 2D API	2x upscale, grayscale, contrast stretch — all client-side
Data Lifecycle	/let-go → IDBObjectStore.clear() on both stores	Permanent, no backup, no recovery, by design

What I'd do differently

If I were building this again from scratch, here's what I'd change:

Start with bounding boxes, not pixel scanning. I wasted two days on avatar detection before realizing Tesseract already gives you position data for free.
Add the confirmation step earlier. The sender confirmation (Step 1.5) was a late addition, but it should have been there from day one. Algorithm-only detection will always have edge cases; letting the user verify is both more reliable and more respectful.
Use a shared Tesseract worker from the start. The per-zone recognize() approach was a naive performance mistake that cost minutes of user wait time.
Design for /let-go first. I built the chat functionality before building the deletion mechanism. In retrospect, the ephemeral nature of the tool should have been the first thing I implemented, not the last.

Frequently asked questions

How does the OCR sender detection work?

Tesseract.js returns bounding boxes for each recognized text line. The tool calculates the left edge (x0) and center X of each line. Lines whose left edge falls in the leftmost 30% of the image width (x0 below 0.6 × the midpoint) are labeled as the ex-partner's messages; lines whose left edge is past 40% of the width (x0 above 0.8 × the midpoint) are labeled as yours. Lines that start in between are decided by their center X, and if that is also ambiguous they inherit the sender of the previous line.

Why not use AI for sender identification instead of position-based detection?

The tool actually uses both. Position-based detection via Tesseract bounding boxes provides the initial labels, then DeepSeek performs a second pass for OCR error correction and sender verification. The position-based approach is faster and more reliable for well-structured chat screenshots, while the AI pass catches errors and handles edge cases.

How is persona data stored?

All persona data is stored in the browser's IndexedDB. No data is ever sent to ToolKnit's servers. The only external communication is between your browser and DeepSeek's API (using your own API key). The /let-go command permanently deletes all IndexedDB data with no recovery.

What was the biggest technical challenge?

The sender identification problem. Early versions used pixel-based avatar and bubble color detection, which was fragile and failed on many screenshot formats. Switching to Tesseract.js bounding box position detection and adding a user confirmation step (Step 1.5) dramatically improved accuracy while keeping the user in control.

When you're ready

AI Ex-Partner Chat: 100% browser-based, zero data upload, /let-go when you're done.

Open AI Ex-Partner Chat
