The Name That Broke the Import

How testing with my mom's name caught a data corruption bug





Hi, I'm Lily Chen.

Ten months into my role at PayStream, I was building a merchant onboarding tool. Small business owners could upload a CSV of their customer list, and our system would import it—names, emails, purchase history—to help them track loyalty rewards.

The feature was straightforward. Parse CSV, validate data, store in PostgreSQL. I'd tested it thoroughly with sample data from our QA team. Everything worked perfectly.

Friday afternoon, I was about to mark the ticket as ready for review when I thought: I should test this with real-world data.

I created a test CSV with my family's names.

That's when I saw it.

The Test

Our QA team had given me clean test data:

name,email,total_purchases
John Smith,john@example.com,847.50
Sarah Johnson,sarah@example.com,1249.00

My code handled it flawlessly. Import, validate, store. All green.

But real merchants don't just have customers named John and Sarah. I thought about the dim sum restaurant down the street from our office, the boba shop I go to every week. Their customer lists wouldn't look like our test data.

I opened a new CSV file and typed:

name,email,total_purchases
陈美华,meihua@example.com,125.00

陈美华. Chen Meihua. My mom's name.

I uploaded the file.

The import succeeded. The success message appeared: "1 customer imported."

I opened the database viewer to check the result.

My stomach dropped.

Where my mom's name should have been, I saw: é™ˆç¾ŽåŽ

The Realization

I stared at the screen.

That wasn't a name. That was mojibake—corrupted text that happens when you decode bytes with the wrong character encoding.

I tried again with my grandmother's name: 李秀英 (Li Xiuying).

Result: æŽç§€è‹±

My cousin's name: 王志强 (Wang Zhiqiang).

Result: çŽ‹å¿—å¼º

Every Chinese name turned into garbage.

I felt a chill. This wasn't just a technical bug. This was erasing people's identities.

I checked my code:

import csv

def import_customers(csv_file):
    customers = []
    with open(csv_file, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            customer = {
                'name': row['name'],
                'email': row['email'],
                'total_purchases': float(row['total_purchases'])
            }
            customers.append(customer)
    return save_to_database(customers)

Clean code. Standard Python CSV reading. What was wrong?

Then I saw it. The open() call didn't specify an encoding.

Python's default text encoding is system-dependent: when you don't pass one, open() quietly falls back to the locale's preferred encoding. On my Mac, that's UTF-8. But on our Linux servers? It was falling back to ASCII or Latin-1, depending on locale settings.
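
You can check what any given machine will pick, since locale.getpreferredencoding(False) is what open() consults for text files:

import locale

# 'UTF-8' on my Mac; something like 'ANSI_X3.4-1968' (plain ASCII)
# on a Linux box running under the bare C locale.
print(locale.getpreferredencoding(False))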

Any character outside the basic ASCII range—like Chinese characters—was getting mangled during the file read.
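
The corruption itself takes two lines to reproduce. Decoding the UTF-8 bytes with Windows-1252 (a close relative of Latin-1, and the codec whose output matches what I was seeing) yields the exact garbage from my screen:

name = '陈美华'
garbled = name.encode('utf-8').decode('cp1252', errors='ignore')
print(garbled)  # é™ˆç¾ŽåŽ  (bytes with no cp1252 mapping vanish entirely)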

The Scope

My hands were shaking as I opened our production database.

We'd been in beta for two months with a handful of pilot merchants. I wrote a quick query to check for corruption:

SELECT name FROM customers WHERE name LIKE '%�%' OR name LIKE '%?%';

The results loaded.

347 rows.

347 customer names corrupted. Not just Chinese names—Vietnamese names with diacritics (Nguyễn), Korean names (김), Japanese names (田中), even Spanish names with accents (José, María).

Every merchant with international customers. Every name with non-ASCII characters. Corrupted.

I thought about what this meant. A customer walks into their favorite restaurant, and the owner pulls up the loyalty program: "Sorry, I don't see your name in our system." The customer spells it out. The owner shows them the screen: é™ˆç¾ŽåŽ.

"That's not my name."

The Fix

I immediately opened a critical bug ticket and pinged Jake.

"We're corrupting non-ASCII customer names. Production database has 347 corrupted records."

He called me within two minutes.

I showed him the test case with my mom's name. Showed him the production data. Showed him the code fix:

import csv

def import_customers(csv_file):
    customers = []
    with open(csv_file, 'r', encoding='utf-8') as f:  # Explicit UTF-8
        reader = csv.DictReader(f)
        for row in reader:
            customer = {
                'name': row['name'],
                'email': row['email'],
                'total_purchases': float(row['total_purchases'])
            }
            customers.append(customer)
    return save_to_database(customers)

One parameter: encoding='utf-8'.
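
One related wrinkle worth knowing about: spreadsheet apps like Excel often prepend a byte-order mark (BOM) to UTF-8 CSV exports. The 'utf-8-sig' codec reads plain UTF-8 just fine but also strips the BOM, so the first header doesn't come back as '\ufeffname'. If your users export from Excel, something like this is the safer variant:

with open(csv_file, 'r', encoding='utf-8-sig', newline='') as f:  # BOM-tolerant UTF-8
    reader = csv.DictReader(f)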

"Okay," Jake said. "That fixes new imports. But we need to recover the corrupted names already in production."

"The original CSV files," I said. "We store them in S3, right?"

His eyes lit up. "For audit purposes. Every uploaded file gets archived with a timestamp."

The Recovery

Within an hour, I had written a migration script. The logic was straightforward:

  1. Pull all corrupted customer records from the database (the 347 rows with mojibake)
  2. Retrieve their associated original CSV files from S3
  3. Re-process each file with proper UTF-8 encoding
  4. Match records by email address (emails are pure ASCII, so they didn't corrupt)
  5. Update the name fields with the correctly decoded data

Email addresses were the key. Since they only use ASCII characters, they'd survived the encoding mishap intact. I could use them as reliable identifiers to match corrupted records with their correct data.
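
For the curious, here is a condensed sketch of what the script did. The bucket name, the shape of the corrupted-row tuples, and the db helper are stand-ins for our real internals, but the logic is the same:

import csv
import io

import boto3  # AWS SDK, used to fetch the archived CSVs

s3 = boto3.client('s3')
BUCKET = 'paystream-import-archive'  # stand-in bucket name

def recover_names(corrupted_rows, db):
    """corrupted_rows: (customer_id, email, s3_key) tuples from the mojibake query."""
    # Group corrupted customers by source file so each archive is fetched once.
    by_file = {}
    for customer_id, email, s3_key in corrupted_rows:
        by_file.setdefault(s3_key, {})[email] = customer_id

    recovered = 0
    for s3_key, email_to_id in by_file.items():
        try:
            body = s3.get_object(Bucket=BUCKET, Key=s3_key)['Body'].read()
        except s3.exceptions.NoSuchKey:
            continue  # file aged out of the 90-day retention window
        # Decode the archived bytes correctly this time, then re-parse.
        for row in csv.DictReader(io.StringIO(body.decode('utf-8'))):
            customer_id = email_to_id.get(row['email'])
            if customer_id is not None:  # emails survived intact and match reliably
                db.update_customer_name(customer_id, row['name'])  # stand-in helper
                recovered += 1
    return recovered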

The script ran in about ten minutes. When it finished:

  • 312 names successfully recovered from original CSV files
  • 35 names still corrupted (files older than our 90-day S3 retention policy)

For those 35, we contacted the merchants directly with a brief explanation and a simple re-upload link. Most responded within a day. One restaurant owner wrote back: "Thank you for caring enough to fix this. Most software just tells us to use 'American names.'"

That hit me hard.

The Prevention

I didn't stop at fixing the import function. I did a full audit of our codebase for any file operations that might have the same issue.

Found twelve other places where we read or wrote text files without explicit encoding. Updated them all.
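
If you're on Python 3.10 or newer, the interpreter can run this audit for you: PEP 597 added EncodingWarning, which fires whenever open() silently falls back to the platform default. Emission has to be switched on (run with -X warn_default_encoding, or set PYTHONWARNDEFAULTENCODING=1), and then you can promote the warning to a hard failure:

import warnings

# With warn_default_encoding enabled, any open() call that omits
# encoding= now raises instead of silently guessing a codec.
warnings.simplefilter('error', EncodingWarning)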

Then I completely rewrote our test data. Instead of "John Smith" and "Sarah Johnson," our standard test suite now includes:

  • 陈美华 (Chinese)
  • María García (Spanish)
  • Nguyễn Văn An (Vietnamese)
  • Müller (German)
  • Παπαδόπουλος (Greek)

If our code can't handle these names, it doesn't pass tests. Period.
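
In practice that looks something like the sketch below, where parse_customers stands in for a variant of import_customers that returns rows instead of writing to the database:

import pytest

NAMES = ['陈美华', 'María García', 'Nguyễn Văn An', 'Müller', 'Παπαδόπουλος']

@pytest.mark.parametrize('name', NAMES)
def test_import_preserves_name(tmp_path, name):
    csv_file = tmp_path / 'customers.csv'
    csv_file.write_text(
        f'name,email,total_purchases\n{name},test@example.com,10.00\n',
        encoding='utf-8',
    )
    customers = parse_customers(str(csv_file))  # hypothetical parse-only helper
    assert customers[0]['name'] == name  # the name must survive byte-for-byte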

What I Learned

The bug wasn't in the logic. It was in the assumption.

I'd assumed Python's defaults would handle international text correctly. I'd assumed our test data represented real users. I'd assumed that if tests passed with "John Smith," they'd pass with "陈美华."

All wrong.

Now I follow one rule: your identity is my test case.

If the system can't handle my mom's name correctly, it's broken. If it can't handle my grandmother's name, my cousin's name, names with accents and diacritics and characters outside ASCII—it's not production-ready.

The best test data isn't sanitized and simple. It's messy and real and representative of the actual humans who will use your software.

And sometimes, the most important test is just asking: Would this work for my family?


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
