Why Our Market Regime Engine Lost to SPY


Pop Macro has been working on a recurring online video series and newsletter segment called Bulls vs Bears, which analyzes the market and macro environment to determine whether the bulls or the bears are in charge. But who won one day does not determine who will win the next, and anyone can see a red candle and tell the bears were in charge. The primary point of the series is to deliver macroeconomic news in a way that is fun and educational. But we don't want it to just be fun. We want it to be useful. The honest goal isn't prediction; it's building tools that help you think more clearly under uncertainty.

How It Works

Bulls vs Bears scores the macro environment across 11 rounds, each covering a different market dimension, including breadth, metals, energy, volatility, credit, labor, financial conditions, liquidity, and sentiment. Each round produces a verdict: Bulls win, Bears win, or Neutral. When 60% or more of the rounds score bullish, the macro environment is in what we call Hyper-drive: broad structural advance, green lights across the dashboard. When 60% or more score bearish, we're in the Trash Compactor: structural decay, capital preservation mode. Everything in between is the Chop, the uncertain middle ground where most trading days actually live.
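To make the 60% rule concrete, here is a minimal sketch of how round verdicts could roll up into a regime label. It is an illustration, not the engine's actual code: the function name, the verdict strings, and the choice to count Neutral rounds in the denominator are all our assumptions for this example.

```python
from collections import Counter

def classify_regime(verdicts, threshold=0.60):
    """Map per-round verdicts to a regime label.

    verdicts: list of "bull", "bear", or "neutral", one per scored round.
    Assumption: neutral rounds count toward the total number of rounds.
    """
    if not verdicts:
        raise ValueError("need at least one scored round")
    counts = Counter(verdicts)
    total = len(verdicts)
    if counts["bull"] / total >= threshold:
        return "Hyper-drive"
    if counts["bear"] / total >= threshold:
        return "Trash Compactor"
    return "Chop"

# Hypothetical scorecard: 7 bullish, 2 bearish, 2 neutral rounds.
scorecard = ["bull"] * 7 + ["bear"] * 2 + ["neutral"] * 2
print(classify_regime(scorecard))
```

With 7 of 11 rounds bullish (about 64%), this scorecard clears the 60% bar. Six of 11 (about 55%) would not, and would land in the Chop.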

So we decided to test how a trading strategy based on these regime calls would perform against buying and holding the SPY. And the results should surprise no one: the SPY won.

But the point of this article isn't the score. It's what we learned about why defensive strategies struggle, where our model broke, and why you should be suspicious of anyone who claims their system beats the market.

The Setup

We looked at a 2016 to 2026 historical window — roughly ten years covering the late bull market, COVID, the 2022 rate hike bear market, and the AI-driven rally. The strategy was simple: when the engine calls Hyper-drive, go 100% SPY. When it calls Chop, allocate 60% SPY, 30% intermediate Treasury bonds (IEF), and 10% gold (GLD). When it calls Trash Compactor, go defensive: 30% SPY, 50% short-term Treasury bonds (SHY), and 20% gold.

The benchmark was a buy-and-hold SPY strategy. We included transaction costs at 5 basis points per rebalance, which is realistic for ETF trading.
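The allocation rules above can be written down as a small lookup table. This is a sketch under stated assumptions: the names are hypothetical, and we assume the 5 basis points are charged once on the whole portfolio value whenever the regime changes, which may differ from how a real backtest books costs.

```python
# Target weights per regime, taken from the setup described above.
ALLOCATIONS = {
    "Hyper-drive": {"SPY": 1.00},
    "Chop": {"SPY": 0.60, "IEF": 0.30, "GLD": 0.10},
    "Trash Compactor": {"SPY": 0.30, "SHY": 0.50, "GLD": 0.20},
}
COST_PER_REBALANCE = 0.0005  # 5 basis points

def apply_rebalance(portfolio_value, old_regime, new_regime):
    """Return (value_after_costs, target_weights) for a regime call.

    Assumption: a flat 5 bp charge on total value, only when the regime changes.
    """
    weights = ALLOCATIONS[new_regime]
    if new_regime == old_regime:
        return portfolio_value, weights
    return portfolio_value * (1 - COST_PER_REBALANCE), weights

value, weights = apply_rebalance(100_000.0, "Hyper-drive", "Chop")
print(round(value, 2), weights)
```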

Here's what came back:

Metric	Bulls vs Bears Defensive Strategy	SPY Buy & Hold
Total Return	+164%	+242%
Max Drawdown	-22%	-34%
Sharpe Ratio	0.89	0.76
Volatility	11.5%	18%

Our strategy succeeded and failed in the way that most defensive strategies do. On total return, it underperformed SPY by 78 percentage points. In exchange, it cut max drawdown from 34% to 22%, a reduction of about 35% in relative terms, and improved risk-adjusted returns. That's the classic defensive strategy profile: you trade return for risk reduction. If you're someone who can't stomach watching your portfolio drop 34%, the strategy's 22% drawdown is meaningfully more survivable. And the Sharpe ratio, which measures return per unit of risk, was 17% higher for the strategy than for SPY alone.
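For readers who want to check the comparisons, the arithmetic is short. The figures below are the ones from the table; only the variable names are ours.

```python
# Figures from the results table (percent values, Sharpe is unitless).
strategy = {"total_return": 164.0, "max_drawdown": 22.0, "sharpe": 0.89}
spy = {"total_return": 242.0, "max_drawdown": 34.0, "sharpe": 0.76}

# Gap in total return, in percentage points.
return_gap_points = spy["total_return"] - strategy["total_return"]

# Relative reduction in max drawdown versus SPY.
drawdown_cut = (spy["max_drawdown"] - strategy["max_drawdown"]) / spy["max_drawdown"]

# Relative improvement in Sharpe ratio versus SPY.
sharpe_gain = strategy["sharpe"] / spy["sharpe"] - 1

print(f"return gap: {return_gap_points:.0f} points")
print(f"drawdown cut: {drawdown_cut:.0%}")
print(f"sharpe gain: {sharpe_gain:.0%}")
```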

But here's the paradox that every defensive strategy runs into: the market's worst days and best days tend to cluster together. Miss the crash and you also miss the snapback. Missing the 10 worst days over a decade dramatically improves your returns. But missing the 10 best days dramatically hurts them. And in practice, any strategy that avoids bad days will also miss some of the good ones, because they're often within days or weeks of each other.

Where It Broke

After reviewing the first results, we dug into the data and found something unexpected. During the 2020 COVID crash — the fastest equity decline in modern history, where SPY fell 34% in 33 days — our engine's maximum bearish reading was only 6 out of 10 scoring rounds. During the 2022 bear market, same story. The engine identified stress, but never quite reached the level needed to trigger full defensive mode.

Why? Because three of our scoring rounds were producing contradictory signals during crises.

Bug 1: Emergency Liquidity Scored As Bullish.

When the Fed flooded the system with emergency liquidity in March 2020, our plumbing round scored it as a bullish signal — because "liquidity going up" was coded as good news. But emergency Fed intervention isn't a sign of market health. It's triage. The patient is on the operating table and the surgeon is working. The fact that the surgeon showed up is not evidence that the patient is healthy. A more honest scoring would cross-reference rising liquidity with credit stress to distinguish between "gradual accommodative flow" (actually bullish) and "emergency crisis response" (not bullish, just stabilizing).
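One way that cross-reference might look is sketched below. The inputs, the cutoff, and the verdict for falling liquidity are hypothetical choices for illustration, not the engine's real rules or calibrated values.

```python
def score_liquidity(liquidity_change, credit_spread_change, stress_cutoff=1.0):
    """Score the plumbing round with a credit-stress check.

    liquidity_change: change in a liquidity measure (positive = rising).
    credit_spread_change: change in a credit spread, in percentage points.
    stress_cutoff: hypothetical level of spread widening that signals crisis.
    """
    if liquidity_change <= 0:
        return "bear"  # assumption: shrinking liquidity scores bearish
    if credit_spread_change >= stress_cutoff:
        return "neutral"  # emergency response: stabilizing, not bullish
    return "bull"  # gradual accommodative flow

# Illustrative inputs, not real data.
print(score_liquidity(liquidity_change=500.0, credit_spread_change=3.0))
print(score_liquidity(liquidity_change=50.0, credit_spread_change=-0.1))
```

The first call models a liquidity surge alongside blown-out credit spreads and scores neutral. The second models a calm, gradual increase and scores bull.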

Bug 2: Collapsing Oil Scored As Consumer Relief.

Our energy round treats low oil prices as bullish — and in normal conditions, that's correct. Cheap gas is good for consumers and margins. But in March 2020, oil wasn't cheap because supply was abundant. Oil was cheap because demand was collapsing. The economy was shutting down. Low oil from demand destruction is a bearish signal, not a bullish one. The same data point meant opposite things depending on context.
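A context check for this round could look something like the sketch below. The demand proxy, the cutoff, and the handling of rising oil are all assumptions made up for the example; the engine would need a real demand series and a tested threshold.

```python
def score_energy(oil_change_pct, demand_change_pct, demand_collapse_pct=-10.0):
    """Score the energy round with a demand-context check.

    Both inputs are percent changes over the same lookback window.
    demand_collapse_pct is a hypothetical cutoff for demand destruction.
    """
    if oil_change_pct >= 0:
        return "neutral"  # assumption: rising oil is not scored in this sketch
    if demand_change_pct <= demand_collapse_pct:
        return "bear"  # cheap oil because demand is collapsing
    return "bull"  # cheap oil with demand intact: consumer relief

# Illustrative inputs, not real data.
print(score_energy(oil_change_pct=-50.0, demand_change_pct=-25.0))
print(score_energy(oil_change_pct=-20.0, demand_change_pct=1.0))
```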

Bug 3: Decelerating Jobless Claims Scored As Recovery.

Our labor round uses a second-derivative signal: when jobless claims are high but decelerating, that often marks market bottoms. This is well-documented in the data. But during COVID, weekly claims went from 280,000 to 6.8 million. The following week they dropped to 5.5 million. Technically, that's "decelerating." But 5.5 million weekly claims is still an extinction-level event for the labor market. The improvement was real, but the level was still catastrophic. Our round scored it bullish anyway.
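A level guard is one possible patch. The sketch below uses the claims figures quoted above; the crisis cutoff of one million weekly claims and the neutral verdict for rising claims are hypothetical choices, not calibrated values.

```python
def score_labor(prev_claims, curr_claims, crisis_level=1_000_000):
    """Score the labor round, checking the level as well as the direction.

    crisis_level: hypothetical weekly-claims level above which an
    improvement is not treated as a recovery signal.
    """
    if curr_claims >= crisis_level:
        return "bear"  # still catastrophic, even if improving
    if curr_claims < prev_claims:
        return "bull"  # claims falling from a non-crisis level
    return "neutral"  # assumption: flat or rising claims are not scored here

# The COVID example: 6.8 million claims, then 5.5 million the next week.
print(score_labor(6_800_000, 5_500_000))
```

With the guard in place, the drop from 6.8 million to 5.5 million scores bearish instead of bullish.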

The common thread: every signal has context. A data point that means one thing in a normal market can mean the opposite during a crisis. Simple threshold rules — "if X is above Y, score bullish" — work most of the time. But they break at the extremes, which is exactly when you need them most.

The Fix That Made Things Worse

We found a technical issue in how the engine handled missing data. One of our 11 rounds — the newest one, built on CBOE options data — had no historical data to work with. So historical rows were effectively being scored on 10 rounds, not 11. But the threshold for triggering defensive mode was calibrated for 11 rounds. The math: we required 7 bearish rounds to trigger defensive mode, but with only 10 rounds scoring, that meant 70% of rounds had to agree — significantly stricter than the 60% we'd designed for.

We fixed it. Made the threshold dynamic — 60% of whatever rounds actually scored. Reran the backtest.
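The difference between the fixed and dynamic thresholds is easy to see in a few lines. This is a sketch of the idea rather than our production code, and the function name is made up for the example.

```python
def bearish_rounds_needed(rounds_scored, threshold_pct=60):
    """Smallest number of bearish rounds that reaches the threshold.

    Integer ceiling division avoids floating-point edge cases.
    """
    return (threshold_pct * rounds_scored + 99) // 100

FIXED_REQUIREMENT = 7  # calibrated for 11 rounds

for rounds_scored in (11, 10):
    dynamic = bearish_rounds_needed(rounds_scored)
    fixed_share = FIXED_REQUIREMENT / rounds_scored
    print(f"{rounds_scored} rounds: fixed needs 7 ({fixed_share:.0%}), dynamic needs {dynamic}")
```

With 11 rounds both rules require 7. With 10 rounds the dynamic rule requires 6, which is why the 6-out-of-10 readings during the crises fell short under the old rule.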

The strategy performed worse.

Metric	Before Fix	After Fix
Total Return	+164%	+138%
Max Drawdown	-22%	-24%
Sharpe Ratio	0.89	0.78

More accurate signals. Worse results. How?

Because the fix made the engine more responsive, which meant it switched between defensive and aggressive allocations more often. And every switch during a choppy bear market is a potential whipsaw: you sell SPY as it drops, switch to bonds, SPY bounces, you switch back, SPY drops again, you switch to bonds again. Each round trip costs you a little. Over a year of choppy markets, those small losses compound into a meaningful drag.
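A toy example shows how the whipsaw works. The prices and signals below are invented for illustration, and we assume the out-of-market leg earns nothing, which is a simplification of holding bonds.

```python
def simulate(prices, in_market, cost=0.0005):
    """Grow 1.0 through a price path, holding only where in_market is True.

    in_market[i] says whether we hold the asset from prices[i] to prices[i + 1].
    Assumptions: we start invested, cash earns nothing, each switch costs 5 bp.
    """
    value = 1.0
    previous = True
    for i, holding in enumerate(in_market):
        if holding != previous:
            value *= 1 - cost  # pay to switch
        if holding:
            value *= prices[i + 1] / prices[i]
        previous = holding
    return value

prices = [100.0, 95.0, 100.0, 95.0, 100.0]  # choppy, ends where it started

buy_and_hold = simulate(prices, [True, True, True, True])
# A lagging signal: exits after each drop, re-enters after each bounce.
whipsawed = simulate(prices, [True, False, True, False])

print(f"buy and hold: {buy_and_hold:.4f}")
print(f"whipsawed:    {whipsawed:.4f}")
```

Buy and hold ends roughly flat. The whipsawed path takes both drops and misses both bounces, ending near 0.90, and almost all of that damage comes from timing rather than from the transaction costs.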

This is the core problem with tactical asset allocation. A more accurate model isn't automatically a better model. If your signals are noisy, flipping back and forth during volatile periods, acting on every signal is worse than not acting at all. The buy-and-hold investor who did nothing took a 34% max drawdown but recovered within 181 days. Our strategy took a 24% max drawdown (better) but needed 862 days to recover (much worse), because it kept getting whipsawed during the choppy recovery.

What This Actually Means

So the strategy didn't beat SPY. That's the honest result, and we'd be lying if we told you otherwise. But the exercise wasn't pointless. Here's what we actually learned.

Lesson 1: Modeling The Future Is Hard.

Not impossible, but genuinely difficult. Our engine correctly identified stress during 2020 and 2022 — the bear scores went up meaningfully during both episodes. But "correctly identifying stress" and "correctly timing your response to stress" are two different skills. The engine has the first one. It doesn't have the second one yet.

Lesson 2: Beating The System Has Hidden Costs.

Every time you try to outsmart the market — sell before the drop, buy before the rally — you introduce friction. Transaction costs, timing errors, whipsaws, and the psychological toll of watching a strategy underperform during the recovery. Buy-and-hold is boring, but boring has compounding on its side.

Lesson 3: The Value Of A Regime Engine Isn’t In The Trading Signals. It’s In The Awareness.

Knowing that 8 out of 11 macro dimensions are bearish doesn't tell me to sell everything. But it tells me to pay attention. It tells me this is not the time to add leverage, chase momentum, or ignore risk management. That kind of awareness — structured, repeatable, data-driven — is worth more than any trading signal.

Lesson 4: Beware Anyone Who Sells You A Black Box.

If someone offers you a "regime indicator" or "macro signal" without showing you the scoring logic, the thresholds, the backtest limitations, and the honest drawdowns — they're selling you hope, not a tool. You should be suspicious of anyone whose backtest only shows wins.

What’s Next

The backtest revealed three things that will shape the next iteration of Pop Macro’s forecast work. Some of these things were already in the pipeline and this project helped show us where to spend our energy.

Regime transition detection. The goal is for the system to function as a leading indicator rather than a lagging one.
Smarter defensive allocations. Holding bonds as a defensive position doesn’t work in a rising-rate environment. 2022 proved that — interest rates went up, bonds went down, and our defensive allocation lost money on both the equity and bond legs. A better approach might use ultra-short-term Treasury instruments, or rotate into sectors that benefit from the prevailing regime rather than hiding in bonds.
Post-game analysis. In basketball, the final score tells you who won, but the box score tells you why. We want to build a layer that sits on top of the scoring engine and asks: how confident are we in this regime call? Is any round sitting right on its threshold? Are the rounds agreeing with each other or is it a split decision? That kind of context doesn’t change the score, but it changes how much you should trust the score when trying to predict the outcome of future games or playoff matchups.
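As a rough idea of what a box score could contain, here is a sketch that reports how much cushion a regime call has and how many rounds disagree with it. The function and field names are hypothetical, and this layer does not exist in the engine yet.

```python
def box_score(verdicts, threshold_pct=60):
    """Summarize how decisive a set of round verdicts is.

    verdicts: list of "bull", "bear", or "neutral", one per scored round.
    """
    total = len(verdicts)
    needed = (threshold_pct * total + 99) // 100  # integer ceiling
    bulls = verdicts.count("bull")
    bears = verdicts.count("bear")
    leader, lead_count = ("bull", bulls) if bulls >= bears else ("bear", bears)
    return {
        "leader": leader,
        "rounds_needed": needed,
        # Negative means the leader falls short and the call is Chop.
        "rounds_to_spare": lead_count - needed,
        "dissenting_rounds": min(bulls, bears),
    }

# The example from Lesson 3: 8 of 11 rounds bearish.
print(box_score(["bear"] * 8 + ["bull"] * 2 + ["neutral"]))
```

In this example the bearish call has one round to spare: if one bearish round flipped, the call would still hold at 7 of 11, and a second flip would drop it into the Chop.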

The engine doesn’t tell us what to do. It’s not for that. It gives us a reproducible frame for noticing when conditions have changed. Last week net liquidity was negative. Today it's positive. Something shifted. What we do about it is still our judgment call. And we think that's the honest relationship anyone should have with a tool like this: it extends your awareness, it doesn't replace your judgment.

If this helps one person build their own better question-asking tool, that's the whole point.

The Pop Macro Protocol is strictly an educational and entertainment platform. The content provided in this newsletter, including all macroeconomic telemetry and pop-culture translations, does not constitute personalized financial, legal, or investment advice. Do not execute trades based on this publication. The authors are narrators of quantitative data, not registered financial advisors, brokers, or fiduciaries. All trading involves a significant risk of loss. What works in our simulation may not align with your personal risk tolerance. Always perform your own independent due diligence or consult with a licensed financial professional before making any investment decisions. By consuming this content, you acknowledge that the Pop Macro Protocol and its creators are not liable for any financial losses, damages, or shocks to your portfolio.