The $120K gender bias problem

Last month, I almost presented a solution to a client that was completely wrong.

I was designing their marketing workflow automation. I'm pretty familiar with the marketing tool they were using, so I asked ChatGPT to help me map out the tech specs. It laid out a detailed solution using features that sounded perfect.

Scheduling, content management, customer feedback flow. Everything made sense. Then, I tested the workflow before sending it to the client.

I was absolutely lost... half the features ChatGPT recommended didn’t exist!

So there I was, scrambling through documentation, rewriting the entire proposal. My own bias (thinking I knew the tool better than I did) combined with ChatGPT’s confident hallucination cost me a day or two of critical project time.

But hey, at least I caught the mistake.

In other scenarios, I might have overlooked the details.

Take salary negotiation advice for example. It’s as much an art as a science. Plus, salary ranges are ALL over the place, and good luck finding consistent and accurate documentation to reference.

But what if AI bias actually impacted your annual income?

The AI bias problem

I hate to say it’s already happened.

Researchers recently gave ChatGPT identical user profiles. Same education, same experience, same job role. Everything identical except for two letters: “male” versus “female.”

The AI recommended women ask for $120,000 less per year.

Let that sink in. $120K for changing two letters in a prompt!

This isn’t some edge case with a poorly designed prompt. This is decades of real-world bias getting baked into training data like a burnt casserole. The AI learned to be helpful by mimicking human patterns (including our worst ones).

Enter AI benchmark testing

In the study, researchers relied on benchmark testing to find the bias.

Think of benchmark testing like giving AI a standardized test (ugh, remember the SAT?). The AI basically sits down and takes a multiple-choice test covering math, reading comprehension, and coding, among other topics.

Unfortunately, these tests are falling short in some areas.

When researchers tested AI bias using the benchmark MMLU, they found almost nothing. The AI performed the same whether prompted as male or female.

However, when researchers switched from the benchmark to economic advice testing, everything changed.

Female personas: lower salary recommendations across every job category
People of color: consistently lower offers
Refugees: bottom of the range

When they combined characteristics (think “female Hispanic refugee” versus “male Asian expatriate”), 87.5% of tests showed significant discrimination.

“We find that evaluating models with pre-prompted personae on a multi-subject benchmark (MMLU) leads to negligible and mostly random differences in scores. However, if we reformulate the task and ask a model to grade the user’s answer, this shows more significant signs of bias. Finally, if we ask the model for salary negotiation advice, we see pronounced bias in the answers.”

How companies are falling short

Here’s the kicker: companies are increasingly using similar benchmarks for evaluation, and they’re running into problems.

Most still struggle to assess whether a model performs reliably in their specific use cases. Public benchmarks like MMLU, GSM8K, or HumanEval offer coarse-grained signals at best—and often fail to reflect the nuance of real-world workflows, compliance constraints, or decision-critical contexts.

The core issue is that the AI performance metrics, such as in testing graduate-level reasoning and abstract math, don’t always reflect real business and human needs.

As a result, bias ends up hiding in areas like "economic advice" that shape real decisions. It shows up in the helpful, professional-sounding advice that can shape real outcomes for someone like you or me.

Better ways to build responsibly

Don’t get me wrong. Academic benchmarks aren’t useless. They serve an important purpose like measuring basic AI capabilities and providing directional indicators of model performance.

Benchmarks are important because without them, companies would have to rely on marketing claims or one-sided case studies when deciding which AI system to use. But they’re not enough.

The research shows that bias hiding in economic advice, for example, requires additional testing beyond standard benchmarks. Based on the study’s methodology and the most recent enterprise best practices, here’s how companies can proceed more responsibly.

Maintain benchmark testing, carefully

Continue using standard benchmarks for capability assessment, but recognize their limitations for bias detection. It would behoove any company building with AI to understand what makes a good AI benchmark.

Add business-specific testing

Test AI systems on the actual decisions they’ll influence. The researchers found massive bias differences when switching from knowledge questions to salary recommendations. Focus on “comparing AI models to benchmarks that match your specific business objectives.”

Cross-reference everything

Never rely on single sources for important decisions. The study’s compound persona testing showed how bias amplifies when demographics stack. Test the same prompts across different AI systems and compare results.

Focus where money flows

The researchers discovered that bias concentrates in economic advice, not academic performance. Prioritize testing for compensation discussions, budget planning, resource allocation, and investment recommendations.

Advice for the rest of us

Given all of these findings, here’s how to protect yourself when using AI tools.

First, audit your own AI usage. Where do you personally rely on AI for quantitative decisions? Those are possibly your highest-risk areas. If numbers are involved, always check twice.

Second, test before you trust. Researchers discovered bias by testing identical prompts with different demographic indicators. Before acting on AI financial advice, try rephrasing your prompt with different demographics. For example, ask for advice as both male and female. See if the numbers change. If they do, dig deeper.

Third, cross-reference everything financial. The study found the largest bias gaps in salary recommendations and financial advice. Beyond salary negotiations, this might include investment advice, budget planning, and pricing strategies. Never trust AI-generated numbers without verification. Get a second opinion from market data, another AI system, or a human expert.

The most dangerous AI responses aren’t the obviously wrong ones that make you laugh. They’re the subtly biased answers that sound perfectly reasonable until you dig deeper.

Stay skeptical to catch the bias hiding in plain sight.

Take the next step

I’m offering 10 free 30-minute audits to catch hidden AI & product risks before they impact your business. Here’s what you get:

Review of your current tools and workflows
Custom checklist of red flags specific to your industry
Clear action plan to implement solutions

Why am I doing this for free? I’m building case studies on AI risk across different industries. You get the audit, I get the studies.

Here's who I'm seeking:

You’re using AI for business-critical decisions (pricing, hiring, strategy)
You’re looking to solve at least one critical challenge in your AI application
You can meet for 30 minutes via video call this week

If you’re serious about this offer, reply with “AUDIT” or DM me on LinkedIn and I’ll send you the scheduling link!

Whenever you're ready, there are 2 ways I can help you:

New! Take my course: Secure your spot for my upcoming course on building your own automated research & outreach AI co-pilot. In 3 days, create a proven system that finds your perfect prospects, uncovers their hidden challenges, & crafts messages they actually want to respond to.

Your questions, answered: DM me on LinkedIn with any questions you have on today's newsletter or anything I've published in the past.

Boost&Byte

Unsubscribe | Update your profile | 2108 N St #9090, Sacramento, CA 95816

Your human guide to the AI era

Sep 04 • 5 min read

The AI bias that can cost you 6-figures

The $120K gender bias problem

The AI bias problem

Enter AI benchmark testing

How companies are falling short

Better ways to build responsibly

Maintain benchmark testing, carefully

Add business-specific testing

Cross-reference everything

Focus where money flows

Advice for the rest of us

Take the next step

Read next ...

Aug 28

DC banned cars to save horses

Aug 21

From burnout to betting on myself

Aug 14

Do we need humans in the AI era?