UCB-NOM reaches near-logarithmic regret in Markovian bandits with non-observable states

A new regret bound for bandits with hidden dynamics

Researchers introduced UCB-NOM, an algorithm for Markovian bandits in which the underlying state cannot be observed directly and decision epochs may be constrained. The problem class, called self-degrading Markovian bandits, generalizes rested Markovian bandits by allowing arm quality to degrade over time when an arm is not selected.

Without prior knowledge of the bandit structure, any algorithm that switches arms rarely must incur super-logarithmic regret, that is, regret growing as $\omega(\log(T))$, where $T$ is the learning horizon. Despite this barrier, UCB-NOM achieves nearly logarithmic regret. Given prior knowledge in the form of a bound on the bias function of each arm, a tuned version of UCB-NOM reaches $O(\log(T))$ regret, the same order as the optimal rate for standard stochastic bandits.
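
For readers less familiar with the asymptotic notation, the two growth conditions in the paragraph above can be written out explicitly. Here $R(T)$ stands for cumulative regret after $T$ steps; the symbol is introduced only for this illustration and is not taken from the paper.

$$
R(T) = \omega(\log T) \iff \lim_{T \to \infty} \frac{R(T)}{\log T} = \infty,
\qquad
R(T) = O(\log T) \iff \limsup_{T \to \infty} \frac{R(T)}{\log T} < \infty .
$$

The two statements do not conflict: the lower bound holds when no prior knowledge is available, while the $O(\log(T))$ upper bound relies on the bound on the bias functions.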

A notable property: the regret bounds do not scale with the number of states in the underlying Markov chains. This decoupling suggests that hidden states are "a mild inconvenience" rather than a fundamental obstacle in this setting.

State observability is often a false luxury

Most deployed bandit algorithms assume you can see or infer the relevant context: user segment in recommendations, traffic load in routing, market regime in pricing. In practice, you often cannot. Competitor behavior, supply-chain disruptions, or user intent shift silently.

This work narrows the gap between what you can guarantee when you have full observability and what you can achieve when you do not. The result matters most for systems where switching costs are high (cold-starting a new arm is expensive) or where decision frequency is constrained (you cannot explore every hour).

The constraint on decision epochs also aligns with real-world friction: you may be rate-limited by infrastructure, business rules, or stakeholder approval cycles.

Before adopting UCB-NOM, validate the self-degrading assumption

The algorithm performs best in domains where not selecting an arm degrades its value over time. This matches some real problems (a recommender arm trained on sparse feedback decays; a supplier becomes less reliable if ignored) but not others (an ad creative stays equally effective whether you show it or not). If your arms do not self-degrade, the theory does not apply and empirical validation becomes mandatory.
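
One rough way to check the self-degrading assumption on logged data is to group rewards by how long an arm sat idle before it was pulled. The sketch below is only an illustration and does not come from the paper: the `(step, arm, reward)` log schema and the function names `idle_gap_profile` and `looks_self_degrading` are hypothetical, and the monotonicity check is a crude descriptive test, not a statistical one.

```python
from collections import defaultdict
from statistics import mean


def idle_gap_profile(log):
    """Average reward grouped by how long the arm sat idle before the pull.

    `log` is a list of (step, arm, reward) tuples sorted by step.
    This schema is an assumption made for the sketch.
    """
    last_pull = {}              # arm -> step of its previous pull
    by_gap = defaultdict(list)  # (arm, idle gap) -> rewards observed
    for step, arm, reward in log:
        if arm in last_pull:
            gap = step - last_pull[arm]
            by_gap[(arm, gap)].append(reward)
        last_pull[arm] = step
    return {key: mean(vals) for key, vals in by_gap.items()}


def looks_self_degrading(profile, arm):
    """True if the arm's mean reward never increases as the idle gap grows."""
    gaps = sorted(g for a, g in profile if a == arm)
    means = [profile[(arm, g)] for g in gaps]
    # Need at least two distinct gaps to say anything at all.
    return len(means) >= 2 and all(x >= y for x, y in zip(means, means[1:]))


# Toy log: arm "a" earns less the longer it was left unselected.
log = [(0, "a", 1.0), (1, "a", 0.9), (2, "b", 0.5),
       (5, "a", 0.4), (6, "b", 0.5), (7, "a", 0.8)]
profile = idle_gap_profile(log)
print(looks_self_degrading(profile, "a"))
```

On real, noisy logs a strict monotonicity check will be too brittle; fitting a trend of reward against idle gap with confidence intervals would be a more defensible version of the same idea.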

If a bound on the bias functions is available, for example from historical offline data or domain expertise, prefer the tuned version of UCB-NOM. Its $O(\log(T))$ guarantee is stronger than the nearly logarithmic one and is worth the upfront calibration work.

Test the algorithm on a small, controlled slice of your production traffic first. Hidden-state bandits are harder to debug than fully-observable ones because you cannot easily verify that the state model is correct.
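
A common way to carve out a small, stable slice of traffic is deterministic hash bucketing, so that the same user or request key always lands in the same group. The sketch below shows that general technique under stated assumptions: the function name `in_experiment_slice`, the salt string, and the idea of keying on a string identifier are all hypothetical choices, not part of UCB-NOM or the paper.

```python
import hashlib


def in_experiment_slice(unit_id: str, fraction: float,
                        salt: str = "ucb-nom-pilot") -> bool:
    """Deterministically assign roughly `fraction` of units to the pilot slice.

    The salt (a hypothetical name) keeps this split independent of other
    experiments that hash the same identifiers.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


# Route about 5% of units to the new algorithm, the rest to the baseline.
share = sum(in_experiment_slice(f"user-{i}", 0.05) for i in range(10_000)) / 10_000
print(f"pilot share: {share:.3f}")
```

Because assignment depends only on the identifier and the salt, the slice stays fixed across restarts, which makes before-and-after comparisons easier when the hidden state itself cannot be inspected.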
