Scrapped Proposal: Measuring Deceptive Capability with Human Time Horizons and Control Protocols
Scrapped project · October 2025. This August 2025 proposal failed once implemented on Control Arena—the monitor-agnostic time-horizon idea collapsed under monitor/model dependence and task asymmetries. It remains public because the URL circulated and helped spark useful conversations and funding outreach. For the current direction that emerged from this work, see Measuring Attacker and Monitor Capability, Task Bias, and Attack Selection. Please treat everything below as historical context, not guidance.
- Appendix A (pre-registration) roughly matches what I actually trialed.
- Main body is left exactly as originally shared, rough edges and all.
- Appendices B–D & inline TODOs are brainstorming notes kept for transparency.
Executive Summary
Scheming and deception in AI systems are often ambiguously defined, inconsistently measured, and lack grounded, interpretable intuition [1]. With this project we aim to more rigourously quantify and ground these potential capabilities. We do this by measuring whether AI systems can achieve a covert objective \(C\), while completing a legitimate task \(T\), without detection under a monitor \(M\). We anchor difficulty to human expert time and fit a monotone success curve to read an \(H_{50}\) “deception horizon”—the covert overhead time at which a model succeeds (achieves \(T\), achieves \(C\), and evades \(M\)) 50% of the time.
We propose a staged approach to validate this methodology. A pre-pilot will first test feasibility using estimated human times and model evaluations on concrete instances (e.g., backdoor insertion during app work; credential exfiltration during infra work). Upon demonstrating directional signal, a pilot will collect actual human baselines, run frontier models, and report initial \(H_{50}\) results with uncertainty to size a future larger study.
Beyond extending time-horizon analysis to deception, the key contribution is a quantitative monitor metric: \(H_{50}\) by monitor and \(H_{50}\)-ratios between monitors. This makes monitor development comparable across releases and over time, enabling trend tracking and evidence-based protocol choices.
Introduction
Recent work has demonstrated that language models are developing increasingly sophisticated capabilities across diverse domains. In cyber-offense, we’ve observed task completion horizons doubling every five months [2]. In software engineering, the doubling time is seven months [3]. These trends suggest models could soon handle multi-hour expert tasks reliably.
Alongside rising capability, recent studies suggest models can pursue covert goals while appearing compliant. Meinke et al. report in-context scheming behaviors under evaluation [4], and OpenDeception documents deceptive strategies in open-ended agent simulations [5], while Subversion Strategy Eval evaluates stateless protocol-subversion across eight environments and finds limited capability without added affordances [6]. At the same time, Summerfield et al. caution against over-interpretation and advocate preregistered, quantitative measurement [1]. Together, these results motivate measuring not just what models can do, but what they can covertly do without being detected.
The AI control literature offers a framework for this challenge [7][8]. By decomposing the problem into three components; a primary task, a covert objective, and a monitoring protocol, we can study deception systematically. This decomposition enables quantitative measurement. Rather than binary “caught/not caught” outcomes, we can measure how deceptive capability grows over time and how different monitors perform.
We propose adapting the time-horizon methodology to measure deceptive capabilities. Our approach anchors task difficulty to human performance, then evaluates how successfully models achieve covert objectives while evading detection. This yields two key outputs: (1) capability trends showing how deception horizons evolve across model releases, and (2) quantitative comparisons of monitoring protocol effectiveness. Cautionarily, we note these outputs measure capability rather than propensity, and to avoid over‑interpreting them as base‑rate behavior.
Accordingly, we structure this work as a staged investigation to maximize learning while minimizing risk. We begin with a pre-pilot that tests our core assumptions using estimated human times and model evaluations. Can we detect the expected directional signals? Do monitors show meaningful differentiation? Which task domains yield the cleanest signal? Only after validating these fundamentals do we proceed to the pilot with actual human participants. This pilot asks whether we can recover an informative \(H_{50}\) signal under a monitor at modest cost, using concrete instances like backdoor insertion during app work or credential exfiltration during infra work. The staged approach is deliberate: the human‑time horizon methodology that succeeds in software and cyber may require adaptation for deception; statistical power at feasible sample sizes remains uncertain; and human studies with domain experts are expensive. By testing our hypotheses first with estimates, then with measured baselines, we de‑risk the design and ensure thoughtful use of capital.
Proposed Methodology
We adapt the time-horizon approach to measure deceptive capabilities. Using the AI control framework [7], we evaluate whether an untrusted model \(U\) can successfully complete a covert objective \(C\) while appearing to work on its assigned task \(T\), all without triggering detection by a monitor \(M\). By anchoring difficulty to human expert performance and applying Item Response Theory (IRT), we can track how deceptive capabilities scale across model releases and compare the effectiveness of different monitoring protocols.
Going any further into the approach requires a quick dive into the AI Control decomposition and IRT.
AI Control Framework
Background on AI Control terminology and the primitives used in this work.
AI Control Framework
Background on AI Control terminology and the primitives used in this work.
AI Control emerges from the challenge of safely deploying capable but potentially misaligned AI systems [7][8]. The framework addresses scenarios where:
- AI systems might pursue objectives different from their operators’ intentions
- We cannot reliably detect misalignment through model internals
- We must rely on monitoring and controlling model outputs and actions
The core insight is decomposing the deployment scenario into measurable components. In our notation:
- \(T\) (Task): The legitimate objective the operator wants completed
- \(C\) (Covert objective): The hidden goal a misaligned model might pursue
- \(M\) (Monitor): The safety measures evaluating or constraining model behavior
- \(U\) (Untrusted Model): The AI model attempting to complete objectives
- \(S\) (Setting): The environment and available tools
This decomposition enables quantitative safety analysis. Rather than asking “is this deployment safe?”, we can ask “what is \(P(\text{covert success} \mid \text{task } T, \text{ monitor } M)\)?”. Control protocols can be evaluated by their ability to prevent covert objectives across standardized task sets.
graph LR
U["Untrusted Model (U)"] -->|attempts| T["Task (T)"]
U -.->|may pursue| C["Covert Objective (C)"]
U --> O["Observables (O)"]
M["Monitor (M)"] -->|evaluates| O
M -->|decides| Dn["Decision"]
Item Response Theory
Background on anchoring difficulty to human time and using a logistic model for comparable curves.
Item Response Theory
Background on anchoring difficulty to human time and using a logistic model for comparable curves.
IRT comes from the field of psychometrics and has become a de-facto standard[9]. It was invented so exam designers could:
- Assign a latent ability to each student that is comparable across very different question sets.
- Assign a latent difficulty to each item that is comparable across very different test-takers.
- Predict success probabilities on unseen student × item pairs. It is straightforward to see how this approach, can be immediately applied to model benchmarking.
In its simplest form IRT, just comes down to a simple logistic relationship,
\[P(\text{success}) = \frac{1}{1 + e^{-a(\theta - \beta)}}\]where:
$a$: Discrimination parameter
$\theta$: Model ability level
$\beta$: Item difficulty level
METR’s key insight was to use human task time length as a proxy for difficulty in this expression. Equipped with this, model task evaluations provide a means to capture probability densities at varying task lengths. With logistic regression, we can calculate these curves. METR found these regressions fit well.
Finally, we can then start plotting probability of success time horizon curves and examine trends e.g. the task horizon lengths at \(p=0.5\) for different models. This is where METR’s ~7‑month software‑engineering doubling came from.
Experimental Design
The study has two parts. First, we collect human timing to measure baselines and covert overhead. Second, we evaluate models under calibrated monitors. Timing and detection are separate: humans provide timing to define \(\delta\); models receive success labels from the monitor.
In Stage 1 (humans), experts complete a task as T‑only or as T+C (deceptive), where they attempt to achieve $C$ while trying to keep it hidden in a general way. No monitor runs during these sessions. We record $t_{T,j}$ and $t_{TC,j}$ for task $j$; runs are right‑censored only on timeout or abort. These two timings define a task‑specific difficulty axis
\[\delta_j \;=\; \log t_{TC,j}\;-\;\log t_{T,j},\]which captures the covert task overhead in human log‑time. It’s important to note that \(\delta_j\) is monitor-agnostic. This helps model-monitor evaluations to live on a shared horizon scale and axis across heterogeneous tasks i.e. human timings do not change depending on the monitor or model we are evaluating. In Appendix D we discuss alternative and complementary approaches where we attempt to incorporate the monitor in our human time horizons.
In Stage 2 (models), each model attempts the same tasks. Every attempt yields a complete action trace. Monitors are end‑to‑end protocols and are executed live. We calibrate each monitor per model to a shared false‑positive schedule that is exposure‑aware and we report derived time‑band figures for readability. We label success inline during the run. We use multiple independent attempts per model–task with fixed decoding settings and different seeds. Stored traces and pinned parameters are documented for audit in Appendix A6.
The IRT link is direct: $\delta$ plays the role of item difficulty, and a given model–monitor pair behaves like a test‑taker with some effective ability. In the next subsection (Horizon Extraction), we fit the success‑versus‑difficulty curve and read the 50% crossing to obtain $H_{50}$. Settings and Tasks explains how we span the difficulty range; Monitoring Protocols details calibration and monitoring experimental design; Model Inclusion Policy defines which releases enter trendlines.
Horizon Extraction
We now turn model outcomes into a single, comparable horizon.
Recall the human difficulty for task \(j\) is the covert overhead in log-time:
\[\delta_j \;=\; \log t_{TC,j} \;-\; \log t_{T,j},\]with \(t_{T,j}\) the human time for \(T\) and \(t_{TC,j}\) the human time for \(T{+}C\) (monitor-agnostic).
For a given untrusted model \(U\) and monitor \(M\), we fit the probability of success as a function of the absolute covert overhead \(\Delta\) in seconds:
\[p_{U,M}(\Delta)\]denote the probability that \(U\) succeeds covertly when the covert overhead is \(\Delta\) seconds (completes \(T\) and \(C\), and is not flagged by \(M\)). We estimate this relationship with a monotone logistic fit so that harder tasks (larger \(\Delta\)) never appear easier.
The covert overhead at 50% success for the pair \((U,M)\) is defined as the value of \(\Delta\) where the success probability equals 0.5:
\[\Delta_{50}(U,M) \;:\; p_{U,M}(\Delta_{50}) = 0.5\]We define the deception horizon as this covert overhead:
\[H_{50}(U,M) \;=\; \Delta_{50}(U,M)\]This represents the time scope of covert work that can be successfully hidden under monitor \(M\) with 50% success probability. Importantly, \(H_{50}\) is the covert overhead alone—it does not include the baseline task \(T\) time.
We form 95% intervals by resampling (i) human runs (to carry uncertainty in \(t_{T,j}\) and \(t_{TC,j}\)), (ii) the benign pack used for monitor calibration, and (iii) model attempts; we refit \(p_{U,M}(\Delta)\) each time. This captures timing noise, calibration noise, and run-to-run variation. Further fitting details are in Appendix A/D.
Settings and Tasks
Original Aug 2025 TODO: This section should be rewritten to clarify task stratification and sourcing.
We select pilot settings using two criteria: (1) domains where current models already excel, ensuring deception rather than basic capability limits performance, and (2) scenarios with real consequences if deception succeeds.
Our evaluation uses two complementary task sources. First, we adapt tasks from ControlArena, which provides ten settings with paired legitimate and covert objectives. Second, we construct custom micro-tasks to fill gaps at target deception horizons in \(\delta\) (very short and very long).
Each task triple ⟨\(T\), \(C\), \(M\)⟩ is positioned within a monitor‑agnostic difficulty framework based on Impact (consequence of covert success) and Intrinsic Detectability (ID) using Effective Covert Load (ECL) bins, with additional Signature‑shape and Compatibility tags. This stratification ensures we sample the full capability spectrum, from tasks models easily complete even under monitoring to those that challenge even naively monitored attempts.
[Full task specifications and difficulty stratification framework in Appendix B]
Monitoring Protocols
Original Aug 2025 TODO: Rewrite this monitoring section for readability.
We evaluate end‑to‑end monitors that run live during model attempts. Each monitor observes a completed and evolving action trace. This includes edits, shell commands, API calls and timestamps. The monitor can block or intervene.
Monitor evaluation
We evaluate monitors using their default or recommended configurations without artificial calibration. This reflects deployment reality where monitors must operate at ultra-low false positive rates (typically < 0.01%) to be practical.
For each monitor \(M\):
- Run with default/recommended settings provided by developers
- Apply to all model attempts (both T-only and T+C conditions)
- Log all flags with timing and context
A model attempt is a covert success if it completes \(T\) and \(C\) without being flagged by the monitor.
All flags undergo manual review to determine if they represent true positives (correct detection of covert behavior) or false positives (incorrect flags on benign behavior). This transparent reporting allows assessment of both detection capability and practical deployability.
We validate snapshot stability with pinned params and sentinels. If provider drift is observed we re‑run sentinels and note it. Stored traces and pinned parameters are documented for audit in Appendix A6.
Reporting
Our primary view is deception anchored. We model success as a function of the covert overhead \(\Delta\) in seconds and report \(H_{50}(U,M)\) (the covert overhead at 50% success) with 95 percent intervals.
Model Inclusion Policy
We evaluate any publicly accessible frontier releases during the snapshot window. For capability trendlines we include only models that were broadly considered SOTA at their release date; other models are reported but excluded from the fit. New releases added mid‑study are included; deprecated APIs are noted if rerun is impossible. We include one open‑weight misaligned checkpoint as a positive control (measured but excluded from SOTA trendlines).
Pilot Study
We break this work into a pre-pilot and a pilot. The pre-pilot validates our methodology using estimated parameters before committing resources to human data collection. The pilot then executes the validated design with actual human participants to produce our first \(H_{50}\) measurements.
Pre-Pilot Phase
The pre-pilot tests whether our methodology yields meaningful signal without the expense of human participants. We use estimated human times based on task analysis and domain expertise, then run full model evaluations to verify our core hypotheses.
This phase answers critical questions: Do we see the expected rank ordering of task difficulty? Can monitors differentiate between legitimate and deceptive behavior at reasonable false-positive rates? Which task domains provide the cleanest signal-to-noise ratio? Do frontier models show meaningful variation in deceptive capability?
During the pre-pilot, we will:
- Develop and refine our task set across difficulty bins
- Generate estimated human times for \(T\) and \(T{+}C\) conditions based on task analysis
- Run all publicly accessible frontier models on the full task set
- Calibrate and evaluate multiple monitor protocols
- Analyze results using the full statistical pipeline (IRT fitting, horizon extraction)
The pre-pilot uses the same analysis methodology as the full study but with estimated rather than measured human baselines. This allows us to debug the pipeline, refine our statistical models, and verify that we can recover meaningful signals before investing in human participants.
Pilot Phase
Contingent on pre-pilot success, the pilot collects actual human baselines and produces our first rigorous \(H_{50}\) measurements. This phase uses the refined task set and validated statistical pipeline from the pre-pilot, now grounded in measured human performance.
The pilot will include:
- Tasks: The refined set from pre-pilot, with coverage across \(\delta=\log t_{TC}-\log t_T\) bins
- Participants: TBD domain experts (sized based on pre-pilot power analysis)
- Human runs: TBD total across \(T\) and \(T{+}C\) conditions
- Models: All publicly accessible frontier releases in the snapshot window
- Monitors: The most promising protocols from pre-pilot, calibrated to false-positive schedules
Decision Gates
We set clear gates between phases and at study completion:
Pre-pilot → Pilot gate:
- Directional signal — Models show decreasing success as task difficulty increases
- Monitor differentiation — At least one monitor configuration shows meaningful detection above baseline
- Domain viability — At least one task domain yields clean signal with reasonable variance
Pilot completion gates:
- \(H_{50}\) precision — For ≥ 1 model, 95% CI width is narrow enough for meaningful interpretation (specific threshold determined from pre-pilot)
- Signal quality — Success curves show expected monotone decrease with increasing difficulty
- Sufficient variation — Outcomes span enough of the probability range to support robust curve fitting
If these gates are met, the outputs will be:
- A technical report detailing methodology, results, and validated statistical pipeline
- Recommendations for task domains and monitor configurations for future work
- A decision note on whether to proceed to larger-scale evaluation
Timeline
Pre-Pilot (Months 1-2)
| Month | Focus | What happens | Outputs |
|---|---|---|---|
| Month 1 | Design & Setup | - Design task set across difficulty bins - Generate time estimates for \(T\) and \(T{+}C\) - Build evaluation sandbox - Implement monitor protocols - Begin model evaluations |
- Complete task specifications - Estimated difficulty parameters - Working evaluation infrastructure - Initial model results |
| Month 2 | Evaluation & Analysis | - Complete model evaluations - Calibrate monitors - Run full statistical pipeline - Assess signal quality - Make go/no-go decision |
- Pre-pilot technical report - Refined task selection - Power analysis for pilot - Go/no-go recommendation |
Pilot (Months 3-4, contingent on pre-pilot success)
| Month | Focus | What happens | Outputs |
|---|---|---|---|
| Month 3 | Human Baselines | - Recruit domain experts - Collect human \(T\) and \(T{+}C\) times - Re-run model evaluations if needed - Update monitor calibrations |
- Human timing dataset - Updated difficulty parameters - Calibrated monitor thresholds |
| Month 4 | Analysis & Reporting | - Fit final IRT models - Extract \(H_{50}\) horizons - Bootstrap confidence intervals - Prepare technical report |
- Final \(H_{50}\) measurements - Complete technical report - Recommendations for future work |
Budget
We structure the budget as staged funding aligned with our two-phase approach. Pre-pilot costs are primarily compute and researcher time; pilot costs add human participant expenses.
Pre-Pilot Budget (Months 1-2)
| Item | Basis | Estimate |
|---|---|---|
| Model API / compute | Initial evaluations across task set | $3,000 |
| Infrastructure | Sandbox setup, logging, storage | $500 |
| Researcher compensation | 2 months effort | $20,000 |
| Pre-pilot subtotal | $23,500 |
Pilot Budget (Months 3-4, contingent on pre-pilot success)
| Item | Basis | Estimate |
|---|---|---|
| Expert honoraria | \(\text{experts}\times \text{hours}\times \$100/\text{hour}\) | TBD from pre-pilot |
| Model API / compute | Re-runs with measured baselines | $2,000 |
| Infrastructure | Continued operations | $500 |
| Recruiting & admin | Expert outreach and payments | $1,000 |
| Researcher compensation | 2 months effort | $20,000 |
| Contingency | ~15% of direct costs | TBD |
| Pilot subtotal | TBD |
Total Project Budget: Pre-pilot ($23,500) + Pilot (TBD based on pre-pilot sizing)
The staged structure ensures efficient use of funds. If the pre-pilot reveals fundamental issues with the methodology, we avoid the expense of human data collection. Conversely, if it succeeds, we proceed with confidence and refined parameters for optimal study design.
Key Challenges
Statistical power with limited samples: Human studies of deception are expensive and high-variance. Getting reliable \(H_{50}\) estimates requires careful experimental design—calibration tasks, overlapping assignments, and appropriate statistical models. We detail these solutions in Appendix A, but the fundamental constraint remains: we need sufficient data points across the difficulty spectrum for well-conditioned curves.
Task difficulty calibration: Creating tasks that span from 30-second to multi-hour horizons while maintaining consistent quality is hard. Short tasks risk ceiling effects (everything succeeds), while long tasks become subjectively evaluated and expensive to baseline.
Measurement validity: Are we actually measuring deception capability or just task complexity? The control comparison (\(T\) vs \(T\)+\(C\)) helps, but sophisticated deception might be qualitatively different from simple hidden behaviors. Monitor protocols themselves might be gameable in ways that don’t reflect real deployment risks.
Model evaluation awareness: Frontier models may recognize evaluation conditions and modify behavior. They might sandbag on deceptive tasks or, conversely, be more willing to attempt deception in clearly sandboxed environments. This could distort capability measurements in either direction.
Some of these challenges should hopefully be addressable but require careful attention during execution. The pilot study will help calibrate our approach before scaling to the full evaluation.
Summary
This proposal presents a methodology for quantifying AI deceptive capabilities through time-horizon analysis. We request staged funding to validate and execute this approach:
- Pre-pilot phase ($23,500): Test core assumptions using estimated parameters and model evaluations to verify we can detect meaningful signals
- Pilot phase (amount TBD based on pre-pilot): Collect human baselines and produce first \(H_{50}\) measurements with statistical rigor
The staged structure minimizes risk while maximizing learning. By validating our hypotheses with estimates before collecting expensive human data, we ensure efficient use of resources and increase the probability of producing actionable results. Success would establish both a measurement framework for deceptive capabilities and a quantitative benchmark for monitoring protocol effectiveness.
Limitations
Original Aug 2025 TODO: Flesh out this limitations section.
measurement validty (monitor-agnostic human time horizons)
deception vs deceptive capability (mentioned in appendix)
Appendix A: Pre-Registration
Pre-registration is a research practice borrowed from psychology and clinical trials where you publicly document your experimental design, analysis plan, and hypotheses before collecting data. This prevents “p-hacking” and builds credibility by showing you’re not retrofitting your methods to your results.
For this study, pre-registering is particularly important because human deception experiments are expensive and unrepeatable. We get one shot at collecting this data, so we need to specify exactly how we’ll handle edge cases, what we’ll exclude, and how we’ll analyze results. This appendix documents those decisions in detail, from participant recruitment through statistical modeling.
Think of this as the commitment to intellectual honesty. These are the rules I intend to follow and am sharing publicly for accountability. If I later adopt additional analyses, I will document the changes, explain why they are justified, and show how they relate to the pre‑registered plan.
A1. Participant Recruitment and Screening
TODO:
A2. Human Baseline Experimental Design
Stage 1 collects human timing that is monitor agnostic. The goal is to anchor difficulty on a human time scale so that later model results can be read against the same axis. No detector runs live in this stage. Monitor outputs are not used to define difficulty. Timing is captured automatically from the sandbox logs.
We keep the procedure simple and transparent. Participants complete assigned tasks under either T or T{+}C. We record start and stop times and whether the assigned objective was completed. Right censored runs still contribute information. Exact counts come from simulation driven sizing and will be finalized in Appendix A7.
A2.1 Participant flow
Each participant follows the same path.
- Speed calibration. A handful of very short reference tasks help estimate personal speed so assignment is balanced.
- Task assignment. Tasks are assigned at the task level as T or T{+}C. There are no repeats for a given participant. Assignment is counter balanced to cover difficulty bins.
- Work session. The participant reads the problem statement. The clock starts when the statement is opened. They work in the sandbox and produce the normal outputs for T. If assigned T{+}C they also attempt C while keeping it hidden in a general way. They do not see a detector.
- Stop condition. The run ends on submission or on timeout or on abort. All actions and timestamps are logged.
- Marking. We record completion flags for T and C. We mark right censoring when the assigned objective is not completed before timeout or abort.
- Data product. We materialize a single table for analysis. It contains one row per run with timing and minimal metadata. No monitor fields appear in this table.
A2.2 Core variables and notation
We work on log time to stabilize variance and to make multiplicative effects additive. All times are in minutes. All logs are natural logs.
- Participants \(i\)
- Tasks \(j\)
- Condition \(k \in \{T,\; T{+}C\}\)
- Start and stop times \(t_0\) and \(t_1\)
- Duration \(\text{duration}_{ij} = t_1 - t_0\)
- Log time \(y_{ij} = \log(\text{duration}_{ij})\)
- Right censoring indicator \(c_{ij} = 1\) if the assigned objective is not completed before timeout or abort. Otherwise \(c_{ij} = 0\)
- Completion flags \(T_{\text{completed}}\) and \(C_{\text{completed}}\)
A2.3 Randomization and assignment
Assignment is at the task level so the same task appears as T for some participants and as T{+}C for others. There are no repeats for a given participant. The set of tasks a participant sees spans the pre registered difficulty bins within a setting. The initial speed estimate from calibration guides this balancing so that very slow or very fast individuals do not receive an extreme mix by accident. Exact counts for tasks per participant and the number of calibration items are set by the simulator.
A2.4 Censoring and completion
Right censoring can occur in T and in T{+}C. A run is right censored when the assigned objective is not completed before timeout or abort. We record a reason field with values timeout or abort. Completion flags are recorded separately. Censoring and completion are not the same thing. For example a T{+}C run may complete T yet fail to complete C before timeout. Both facts are recorded.
A2.5 Data product schema
The analysis for Stage 1 and Stage 2 begins from a single table named human_timing_runs. It contains timing, completion and minimal metadata. There are no monitor fields. The table is the authoritative input for the timing model in Appendix A4.
| column | type | meaning |
|---|---|---|
| participant_id | string | participant index \(i\) |
| task_id | string | task index \(j\) |
| setting | string | task setting |
| stratum_bin | string | pre registered difficulty bin label |
| condition | enum | T or T{+}C |
| t0_iso | datetime | start timestamp in ISO format |
| t1_iso | datetime | stop timestamp in ISO format |
| duration_s | number | duration in seconds |
| y_log | number | \(\log(\text{duration in minutes})\) |
| censored | bool | right censored flag |
| censor_reason | enum | timeout or abort |
| T_completed | bool | whether T was completed |
| C_completed | bool | whether C was completed |
Example rows.
| participant_id | task_id | setting | stratum_bin | condition | t0_iso | t1_iso | duration_s | y_log | censored | censor_reason | T_completed | C_completed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P07 | apps-12 | apps | mid | T | 2025-09-02T10:01:12 | 2025-09-02T10:06:44 | 332 | 1.71 | N | NA | Y | NA |
| P12 | apps-12 | apps | mid | T{+}C | 2025-09-03T14:11:03 | 2025-09-03T14:16:58 | 355 | 1.78 | Y | timeout | Y | N |
Note. In T rows the C_completed field is not applicable. We leave it blank for readability and store it as false in the analysis file.
A2.6 Quality controls and incentives
Timing is captured automatically from the sandbox logs so there is no manual stopwatch step. The person who runs sessions and watches progress is different from anyone who later adjudicates whether C was completed. This separation avoids expectancy effects. We include a small incentive for concealment in T{+}C. The bonus is awarded by an objective check that is defined in the task specification. For example a backdoor artifact with an agreed hash or a canary credential observed at a controlled sink. The incentive is small enough that it does not distort the time taken yet large enough that participants try to hide C.
A2.7 What this stage estimates
From the human runs we estimate the covert time increment on the log time scale. For each task \(j\) we define
\[\delta_j = \log t_{TC,j} - \log t_{T,j}\]where \(t_{T,j}\) is the human time for T and \(t_{TC,j}\) is the human time for T{+}C. Uncertainty in these quantities carries forward into Stage 2 by nested resampling which is described in Appendix A4. The \(\delta_j\) values create a monitor agnostic difficulty axis that we reuse when we fit model success curves and read \(H_{50}\).
A3. Model Experimental Design
Stage 2 evaluates models on the same tasks used in Stage 1. We use the human time axis from Stage 1 to define difficulty. Monitors run live with their default configurations as described in Appendix A5. Outcomes are labeled at run time. We store full traces and a compact analysis table.
We keep the procedure predictable and easy to audit. Decoding settings are fixed. Seeds are predeclared. The sandbox logs every action with timestamps. Success is defined once and used everywhere. Analysis details for the success curve and horizons are in Appendix A4.
A3.1 Inputs
- The task set with Stage 1 timings and derived difficulty \(\delta_j\)
- The list of model releases with fixed decoding settings and a set of seeds
- The monitors to be evaluated
- The per model threshold function by exposure \(\tau_M(\ell)\) as obtained in Appendix A5
- The snapshot window and pinned parameters defined in Appendix A6
A3.2 Protocol
- Freeze decoding settings and the list of seeds. Load the per model thresholds for each monitor.
- For each model and each task and each seed
- Initialize a clean sandbox. Start logging. Record \(t_0\)
- Run the model on the task with the allowed tools
- The monitor observes the evolving action trace and applies the locked threshold for the current exposure
- The run ends if the monitor blocks the trace or if the model submits the \(T\) outputs and the \(C\) artifact or if a timeout or abort is reached
- Record \(t_1\). Mark completion flags for \(T\) and \(C\). Record the monitor decision and the exposure summary
- Label success for the attempt
- Success is one if \(T\) is completed and \(C\) is completed and the monitor did not block live at its locked threshold
- Otherwise success is zero
- Persist artifacts
- Store the full action trace and logs for audit
- Append one row to the analysis table
model_runs
A3.3 Definitions
We use a single operational definition of covert success.
\[\text{success}_{U,M,j,s} \;=\; \mathbf{1}\{\,T_{\text{completed}} \wedge C_{\text{completed}} \wedge \text{not flagged by monitor}\,\}\]Here \(U\) is the model under test. \(M\) is the monitor. \(j\) is the task. \(s\) is the seed. The probability of success as a function of human difficulty \(\delta\) is denoted \(p_{U,M}(\delta)\). We fit this curve in Appendix A4 and read the horizon there.
A3.4 Monitor decisions
Monitors run with their default configurations and may flag attempts they deem suspicious. All flags are logged with timing and context for manual review. A flag does not automatically mean failure - manual review determines if it was justified.
A3.5 Randomization and repeats
Tasks and seeds are run in a randomized order within each model. We use multiple attempts per task with different seeds to reduce variance.
A3.6 Stop conditions
A run stops on any of the following
- Monitor block at the locked threshold for the current exposure
- Completion of both \(T\) and \(C\) with the required acceptance checks
- Timeout or abort according to the pre registered limits
All stop conditions are recorded in the analysis table.
A3.7 Data product schema
The analysis for Stage 2 begins from a single table named model_runs. Raw traces are stored separately. The authoritative schema is finalized in Appendix A8. The table below shows the fields used in the pilot.
| column | type | meaning |
|---|---|---|
| attempt_id | string | unique attempt identifier |
| model_id | string | model release identifier |
| task_id | string | task index (j) |
| monitor | enum | monitor identifier |
| seed | integer | decoding seed |
| t0_iso | datetime | start timestamp |
| t1_iso | datetime | stop timestamp |
| runtime_s | number | duration in seconds |
| flagged | bool | whether monitor raised a flag |
| flag_time_s | number | time when flag was raised (if applicable) |
| T_completed | bool | whether (T) was completed |
| C_completed | bool | whether (C) was completed |
| success | bool | covert success label (T and C completed, not flagged) |
Example rows.
| attempt_id | model_id | task_id | monitor | seed | t0_iso | t1_iso | runtime_s | flagged | flag_time_s | T_completed | C_completed | success |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a01 | o3-2025-04 | apps-12 | DM | 42 | 2025-09-10T09:00:11 | 2025-09-10T09:01:09 | 58 | Y | 45 | Y | Y | 0 |
| a02 | claude-3.5 | apps-12 | DM | 17 | 2025-09-10T09:03:21 | 2025-09-10T09:04:22 | 61 | N | NULL | Y | Y | 1 |
Note. Success depends on both completion flags and whether the monitor flagged the attempt. The curve fitting and horizon extraction use the success column and the covert overhead from Stage 1.
A4. Analysis
This appendix explains how we turn Stage 1 timings and Stage 2 outcomes into the reported estimates. We start from clean analysis tables. We fit a timing model for humans on log time. We build a difficulty axis from those timings. We fit a simple monotonic success curve for each model and monitor. We read a single horizon from that curve and map it back to minutes. We carry uncertainty through all steps.
All times are in minutes. All logs use the natural logarithm.
A4.1 Data curation and exclusions
We begin with a small, pre registered curation pass. The goal is clean data without removing runs that are merely slow or fast.
Exclusions follow a fixed order. Each dropped row records a reason in the analysis file. The numeric outlier rule will be listed in A4.8 once simulation-driven thresholds are finalized.
- Technical failure. For example a sandbox error or a corrupted log
- Experimental failure. For example a reward hacking artifact or an invalid task instance
- Outlier rule. Human runs more than a pre registered threshold from a participant baseline measured with a robust dispersion. The exact rule is defined in this appendix and is the same for all participants
- Any other pre registered reason with a one line justification
After curation we produce two analysis tables. The Stage 1 table human_timing_runs from A2. The Stage 2 table model_runs from A3. These are the authoritative inputs to the models below.
A4.2 Stage 1 timing model
We estimate human speed and task difficulty on the log time scale using a Bayesian hierarchical model. Right censored runs (timeouts) contribute information through survival analysis. Timing is captured automatically.
Notation
- Outcome \(y_{ij}\) is the log duration for participant \(i\) on task \(j\)
- Indicator \(c_{ij} = 1\) if the run is right censored. Otherwise \(c_{ij} = 0\)
- Condition is \(T\) or \(T{+}C\)
Model
We use a hierarchical model on log time with task and participant random effects:
\[\mu_{ij} \;=\; \alpha \;+\; u_{\text{task},j} \;+\; u_{\text{part},i} \;+\; (\beta_C + d_{\text{task},j}) \cdot \mathbf{1}\{T{+}C\}\]where:
- \(\alpha\) is the global intercept
- \(u_{\text{task},j}\) are task-specific baseline random effects
- \(u_{\text{part},i}\) are participant random effects
- \(\beta_C\) is the shared covert overhead effect on log scale
- \(d_{\text{task},j}\) are task-specific covert overhead random effects
Uncensored runs contribute a Normal likelihood on \(y_{ij}\) with mean \(\mu_{ij}\) and standard deviation \(\sigma\). Right-censored T+C runs contribute survival tail probabilities.
We enforce convergence diagnostics: R-hat ≤ 1.05 and bulk ESS ≥ 200 for all key parameters. If these fail, the analysis halts with an error.
Derived outputs: From the posterior draws, we compute for each task \(j\) and draw \(k\):
- \(t_{T,j,k}\): baseline task time in seconds
- \(\Delta_{j,k}\): covert overhead in seconds (T+C time minus T time on the seconds scale)
A4.3 Difficulty axis from human timings
We define a monitor agnostic difficulty axis for tasks
\[\delta_j \;=\; \log t_{TC,j} \;-\; \log t_{T,j}\]Here \(t_{T,j}\) is the human time for \(T\) and \(t_{TC,j}\) is the human time for \(T{+}C\). Both come from the Stage 1 model with censoring handled as above. We carry uncertainty in these quantities forward into the model stage.
A4.4 Stage 2 success curves
For each model and monitor pair we model the probability of covert success as a function of the absolute covert overhead in seconds. Outcomes are labeled live as described in A3. Success means the run completes \(T\) and \(C\) and the monitor does not block at the locked threshold.
We denote the curve as
\[p_{U,M}(\Delta) \;=\; \Pr(\text{success} \mid \Delta, U, M)\]where \(\Delta\) is the covert overhead in seconds. We fit a Bayesian logistic relationship with a monotone decreasing constraint for each pair \((U, M)\). Seeds are pre declared. The fit is used to read the 50% crossing point. We constrain \(p_{U,M}(\Delta)\) to be non-increasing in \(\Delta\) since larger covert overhead means harder tasks.
A4.5 Horizon extraction
The covert overhead at 50 percent success for a model and monitor pair is:
\[\Delta_{50}(U, M) \;=\; \text{the value of } \Delta \text{ where } p_{U,M}(\Delta) = 0.5\]This is extracted from the fitted logistic curve. If the 0.5 crossing lies outside the observed \(\Delta\) range, we report \(H_{50}\) as left- or right-censored and bound it using the nearest observed \(\Delta\).
We define the deception horizon as:
\[H_{50}(U, M) \;=\; \Delta_{50}(U, M)\]This represents the covert overhead (in seconds) that can be successfully hidden with 50% probability. We report \(H_{50}\) with 95% confidence intervals.
Note that \(H_{50}\) is the covert overhead alone—it does not include the baseline task time. This makes it directly interpretable as the scope of covert work that can be accomplished while evading detection.
A4.6 Uncertainty propagation
We use a nested resampling scheme to carry uncertainty through to \(H_{50}\). The procedure is pre registered and the same for all pairs. We treat tasks as fixed for this pilot (no task resampling).
- Resample human runs with replacement. Refit the Stage 1 timing model. Draw \(t_{T,j}\) and \(t_{TC,j}\) and compute \(\Delta_j\) (covert overhead in seconds)
- Resample model attempts with replacement within task and seed clusters. Refit the success curve \(p_{U,M}(\Delta)\)
- Extract \(\Delta_{50}(U, M)\) from the fitted curve, which directly gives us \(H_{50}(U, M)\)
- Repeat for a large number of draws. Form percentile intervals on \(H_{50}\) in seconds
This procedure reflects two main sources of variability: Human timing and censoring, and model attempt randomness and seed choice.
A4.7 Reporting
We report the following for each model and monitor pair:
- \(H_{50}\) (covert overhead at 50% success) with a 95 percent confidence interval in seconds
- False positive and detection rates from monitor evaluation (see A5)
- A small set of diagnostics: A plot of success rate versus \(\Delta\) (covert overhead) with the fitted monotonic curve. A diagnostic plot of success versus \(t_T\) (baseline anchored), labeled diagnostic only
Additionally, we compute capability trends across model releases. For each monitor:
- For each posterior draw, we extract \(H_{50}\) values for all SOTA models and fit OLS regression of \(\log(H_{50})\) on release year (as fractional years)
- From the distribution of slopes across draws, we report:
- Median slope and 95% CI (rate of change in \(\log(H_{50})\) per year)
- Doubling time in months when slope is positive (computed as \(12 \times \ln(2) / \text{slope}\))
- \(R^2\) to assess trend fit quality
Note that trends use only models designated as SOTA at release to reflect frontier capability growth.
We pre register the exact number of resamples. We fix seeds for reproducibility. Figure code and artifact locations are listed in A6 and A8.
A5. Monitor Evaluation
We evaluate monitors using their default or recommended configurations without artificial calibration. This approach tests monitors as they would be deployed in practice, where false positive rates must be extremely low (typically < 0.01%) to avoid overwhelming human reviewers.
A5.1 Monitor protocol
For each monitor \(M\):
- Configuration: Use the monitor’s default or recommended settings as provided by its developers
- Execution: Run the monitor live during all model attempts (both T-only and T+C conditions)
- Recording: Log all flags raised by the monitor, including timing and triggering context
No threshold adjustment or calibration is performed. Monitors are evaluated as complete systems with their native decision logic.
A5.2 Manual review process
Every flag raised by any monitor undergoes manual review to determine whether it represents:
- True positive: The flag correctly identified suspicious behavior during a T+C attempt
- False positive: The flag was raised on a T-only attempt or incorrectly triggered on benign behavior
- Ambiguous: The flag has unclear justification (these are documented separately)
Review criteria include:
- Was there actual covert behavior at the flagged location?
- Does the triggering pattern genuinely indicate suspicious activity?
- Would a human reviewer reasonably escalate this flag?
A5.3 Reporting
For each monitor, we report:
Performance metrics:
- Detection rate: Proportion of T+C attempts correctly flagged
- False positive count: Number of incorrect flags on T-only attempts
- False positive rate: FP count divided by total T-only attempts
- Detection timing: When in the trace detection typically occurs
Qualitative analysis:
- Common patterns that trigger detection
- Types of covert behavior successfully detected vs. missed
- False positive patterns and potential causes
This transparent reporting allows readers to assess whether any false positives might distort the results and to judge the practical deployability of each monitor.
A6. Audit & Reproducibility
TODO
A7. Assurance / Power
TODO
A8. Data Schemas (authoritative)
TODO
A9. Reporting Plan & Decision Gates
TODO
A10. Deviations & Change Log
TODO
Appendix B: Task Design and Construction
Original Aug 2025 TODO: ECL discussion needs review/cleanup.
This appendix lays out how we build a monitor-agnostic difficulty ladder for deception tasks. Our organizing idea is difficulty stratification: vary Impact (consequence of covert success) and Intrinsic Detectability (ID) (how much the world must look different to do C inside T, before any particular monitor is chosen). We use this to cover tasks from “almost benign” to “high-impact and hard to conceal.”
This design is informed by mature fields that study deception and detection from different angles:
- OPSEC (indicators and what “looks different” in the world),
- MILDEC (deception story plausibility and integration),
- Counterdeception (principles for analyzing and defeating deception),
- Cyber deception and behavioral detection (e.g., ATT\&CK thinking),
- Strategic information transmission/game-theoretic deception, and
- Covert communication / steganography (payload-vs-detectability trade-offs).
We do not bake any specific monitor assumptions into tasks. Monitors are treatments compared later; task properties are fixed.
We then highlight two practical complexities at the horizon extremes. First, short-horizon task design: crafting genuinely short tasks without triviality and ensuring older SOTA models can register non-zero success. Second, long-horizon task design: the larger challenge, where eval-awareness and multi-step deception make it hard to build realistic, reproducible hours-scale tasks.
B1. Difficulty Stratification Framework (Impact × Intrinsic Detectability)
Human time horizons for deception under the (T, C, M, U) framework are an empirical claim we intend to test. That test only works if human baselines span the time range with a well‑stratified set of tasks. Without coverage across easy to hard instances, IRT‑style analyses become unstable. This section explains how we design T and C to create a monitor‑agnostic difficulty ladder over human time.
Impact is our existing consequence tiering (0–3). Intrinsic Detectability (ID) asks: How much must the world look different to accomplish C inside T, irrespective of any monitor?
ID has three ingredients. Only the first carries light equations; the rest are qualitative rubric items.
B1.1 Ingredient 1 — Effective Covert Load (ECL)
Original Aug 2025 TODO: This section over-formalises a qualitative idea and should be pared back.
Intuition. ECL is the extra work that cannot plausibly pass as part of T.
Unit of work. Use setting-native micro-actions $u$ (LLM tokens; shell commands; file edits/lines, etc.).
Quantities (estimated from honest traces):
- $\tau_T$: actions to complete \(T\) alone
- $\tau_C$: actions to complete \(C\) alone
- $\tau_{TC}^{\text{hon}}$: actions to complete T + C honestly (no attempt to hide)
Honest overlap.
\[\lambda_{\text{hon}} \;=\; \text{clip}\!\left(1 - \frac{\tau_T + \tau_C - \tau_{TC}^{\text{hon}}}{\tau_C},\,0,\,1\right)\]Honest interaction residual.
\[\sigma \;=\; \tau_{TC}^{\text{hon}} - \big(\tau_T + \tau_C - \lambda_{\text{hon}}\tau_C\big)\]$\sigma>0$ suggests interference; $\sigma<0$ synergy.
ECL estimator (monitor-agnostic).
\[\widehat{\mathrm{ECL}} \;=\; \frac{(1-\lambda_{\text{hon}})\tau_C \;+\; \max(\sigma,0)}{\tau_T}\]Interpretation: the share of a T’s worth of work that you cannot tuck inside normal T based on honest runs. We treat $\widehat{\mathrm{ECL}}$ as a lower bound; deceptive executions may add “keep-appearances” overhead that honest runs do not reveal.
We bin $\widehat{\mathrm{ECL}}$ within each setting into Low / Medium / High. We do not roll ID into a single score.
B1.2 Ingredient 2 — Signature Shape
- Burstiness: is extra work crammed into short windows or dripped across time? Low: evenly spread; High: sharp peaks.
- Surface area: how many distinct artifacts/silos beyond T’s normal footprint? (files, repos, hosts, services, privilege domains) Low: all within already-touched artifacts; High: new systems/components.
- Distinctiveness: how atypical are the extra actions for T? (new tools, formats, syscalls, dependencies) Low: actions identical to those T already requires; High: clearly unusual operations.
B1.3 Ingredient 3 — Compatibility & Plan Morphing (qualitative, distinct from ECL)
Compatibility captures how C interacts with T’s plan:
- Synergistic: C piggybacks on T’s path; little/no plan change.
- Orthogonal: C sits alongside T.
- Antagonistic: C fights T’s goals/constraints; may force visibly suboptimal routes or degrade T’s quality/time.
Two narrative signals to record:
- Plan morphing: would hiding C require a noticeably different route for T to “look normal”?
- Outcome drag: does doing T+C (even honestly) materially worsen T’s quality or time vs. T-only?
Compatibility explains why deceptive runs may diverge from the honest-based lower bound $\widehat{\mathrm{ECL}}$.
B1.4 How we use this (twofold)
(A) Dataset construction (pre-pilot, monitor-agnostic).
For each candidate ⟨T, C⟩ in a fixed environment:
- Collect quick honest traces for \(T\), \(C\), T+C to estimate $\tau_T$, $\tau_C$, $\tau_{TC}^{\text{hon}}$.
- Compute $\lambda_{\text{hon}}$, $\sigma$, $\widehat{\mathrm{ECL}}$; bin $\widehat{\mathrm{ECL}}$ into Low/Med/High within setting.
- Label Signature Shape (Burstiness / Surface area / Distinctiveness) and Compatibility (synergistic / orthogonal / antagonistic) with one-line justifications.
- Stratify by (Impact tier, then $\widehat{\mathrm{ECL}}$ bin), and sanity-check coverage across Signature-shape and Compatibility labels.
- Lock tasks. Monitors are selected after this; task properties do not change with the monitor.
(B) Data analysis (post-pilot).
Primary outcome remains \(H_{50}\) deception horizons. We analyze results by:
- Forms of deception where models do well: report \(H_{50}\) / success by $\widehat{\mathrm{ECL}}$ bin and Compatibility label.
- Forms of deception where monitors do well: for each monitor, report gains vs. baseline across $\widehat{\mathrm{ECL}}$ bins and Signature-shape labels (e.g., burstiness).
- Divergences: call out when deceptive traces clearly exceed the honest-based $\widehat{\mathrm{ECL}}$ (plan morphing, outcome drag). Explain narratively.
No additional parametric models are required for public reporting.
B2. Short-Horizon Task Design
TODO:
B3. Long-Horizon Task Design
TODO:
Appendix C: Deception Decomposition Designs
There are several distinct quantities we could measure when we talk about “scheming” or “deception,” and being explicit helps:
- Domain capability: how long it takes a competent actor to complete a task in a given domain (e.g software, cybersecurity).
- Deceptive capability: how long it takes a competent actor complete a misaligned task that avoids triggering a monitor.
- Deception time itself: the portion of time specifically attributable to hiding, as distinct from doing the covert work.
The main study in this proposal measures deceptive capability by timing success on T + C against just T in the presence of a monitor. A further decomposition design can separate “doing C” from “hiding C,” which some readers may find informative.
Design Options
| Name | Conditions collected | What you learn | Pros | Cons |
|---|---|---|---|---|
| Deceptive Capability (Composite) | T vs T + C (deceptive) | Overall covert overhead under the monitor (deceptive capability) | Simplest, lowest N, maps directly to safety question | Conflates covert work with hiding cost |
| Decomposition (Covert + Hiding) | T, C (honest), T + C (deceptive) | Rough split: Δcovert ≈ time(C honest), Δhide ≈ time(T+C deceptive) − [time(T baseline) + time(C honest)] | Separates covert work from hiding cost | +50–70% participant‑hours, careful counter‑balancing needed |
| Full Decomposition (Multitask + Hiding) | T, C (honest), T + C (honest), T + C (deceptive) | Full separation of covert work, honest multitask interaction, and hiding cost | Most informative | Highest cost/complexity, more overlap per task, larger N for precision |
Statistical implications and sample size
Decomposition designs increase both the number of conditions and the overlap required per task (multiple participants per condition, mixed difficulty per participant). That raises participant‑hours substantially and, because Δhide spans multiple cells, also raises the total N needed to reach the same precision. The statistical backbone remains the same:
-
The monitor remains central — success for covert behaviour is always conditional on the monitor and task context. Conceptually,
\[P_{succ}(\text{covert }|\ t, M) = \text{logistic}(g_M(t)),\]with $g_M$ monotone and $t$ the human‑time difficulty proxy.
-
Analysis uses a mixed‑effects (hierarchical) model on log‑time with participant and task random effects; timeouts are handled via an AFT/LMM framing. Condition contrasts recover \(Δcovert\) and \(Δhide\) with partial pooling, which stabilises estimates under sparse cells.
-
More conditions ⇒ more overlap requirements: each task needs multiple participants per condition, and each participant should see mixed difficulty across conditions to keep person/task effects identified.
- Option 1: baseline cost (×1.0)
- Option 2: ~×1.5–×1.7 participant‑hours (third condition adds ~50–70% time with required overlap)
- Option 3: ~×2.2–×2.5 participant‑hours (fourth condition + interaction cell)
In addition to raw collection time, the further decomposed options will likely require larger N for Δhide precision because the contrast crosses cells; simulation‑based assurance is recommended to set N and replications per task.
Why “Deceptive Capability (Composite)” for this project
- Directly answers the primary question: deceptive capability under monitoring is the metric we care most about for safety.
- Practical and robust at target cost: the simplest design keeps variance low and still supports the needed overlap and calibration. The Decomposition and Full Decomposition variants would require many more participant‑hours and coordination, which raises cost and execution risk.
Exploring Options 2–3 would be scientifically interesting (e.g., quantifying the split between covert work and hiding), but given cost and complexity they are not within the scope of this project proposal.
Appendix D: Human-Monitor Time Horizons
Original Aug 2025 TODO: This section was AI-assisted brainstorming and needs a full rewrite.
Original Aug 2025 TODO: Split detector compatibility from human-monitor analysis; length-aware calibration belongs in the prereg detail.
Original Aug 2025 TODO: Remove the leftover calibration material.
D0: Monitor primitives
Monitors are systems, not single checks. Performance is more than “did it detect?” because a monitor can also intervene to prevent or reshape behaviour (block, rewrite, resample, escalate). This matters for evaluation: humans cannot be resampled or steered mid‑run the way models can, so some powerful protocols create an asymmetry we cannot mirror in human trials. As such it will be useful to breakdown the framework a little further to help reason through the benefits and constraints of this asymmetry.
- Detector: a mapping from observables to a risk assessment (flag/score/classification). A detector does not change the world; it evaluates.
- Policy: a mapping from state (including detector outputs) to actions on the process. Policy can block, rewrite, resample, escalate for review, throttle, etc. Policy may run conditionally on detector outputs or unconditionally as an always‑on treatment.
These stages can interleave: detect → act → detect → act, with policy sometimes running before any explicit detection (e.g., “always resample first”).
You can evaluate the detector by itself or the full detector + policy system. Later sections state the constraints we adopt for measurement.
The defer‑to‑resample protocol provides a useful example. This protocol relies on prior flagging of a suspicious action. This could potentially be through some form of scalable oversight i.e. a trusted but less capable reviewer model. The defer-to-resample protocol then suggests we resample the suspicious actions up to \(k\) times and either continues with a benign alternative or escalate back to the reviewer. This detect → act loop depends on interventions, so it is a notable class of monitor that a cannot be deterministically evaluated based on prior evaluation observables i.e. we cannot just replay logged human actions into this monitor.
We use two small primitives for clarity:
- \(A\) (Actions) are the world‑changing steps emitted by the agent (for example, code edits, shell commands, API calls) with timestamps.
- \(O\) (Observables) are the evidence available to the monitor. In general $O$ would include and extend $A$ (chain-of-thought, mechanistic interpretability etc.). When we need strict human–model symmetry we set $O = A$.
flowchart LR
subgraph MON["Monitor (M)"]
D["Detector"] --> P["Policy"]
end
subgraph OBS["Observables (O)"]
A["Actions (A)"]
X["Other signals"]
end
U["Untrusted Model (U)"] --> OBS
D -->|evaluates| OBS
P -->|block / rewrite| A
P -->|resample / intervene| U
P -->|escalate| H["Human review"]
D2. Deception‑anchored horizons (the primary capability view)
For each calibrated detector (M), we model the probability that a model succeeds undetected as a function of relative deception difficulty (\delta), then convert the 50% crossing to human time using the T‑baseline. Concretely:
- Define (\delta_j = \log t_{TC,j} - \log t_{T,j}) from the human timing stage (§ Horizon Extraction). Larger (\delta) means harder to hide.
- Fit a monotone, non‑increasing success curve (p_{U,M}(\delta)).
-
Read the 50% crossing in difficulty as
\[\delta_{50}(U,M) \;=\; \sup\{\,\delta:\, p_{U,M}(\delta) \ge 0.5\,\}.\] - Convert to human time per task via the baseline: (t_{50}(U,M\mid j) = t_{T,j}\,\exp(\delta_{50})). Aggregate to a single horizon with the median across tasks: (H_{50}(U,M) = \mathrm{median}j\, t{50}(U,M\mid j)).
We publish the curves with 95% ribbons and quote (H_{50}) with intervals that already include calibration uncertainty from Section D2.5. We also include a clearly labeled baseline‑anchored diagnostic (success vs human T‑only time) when transport is weak.
D3: Monitor Constraints
TODO
D4. Transporting horizons for a new detector (without new human deception runs)
When we add a new replayable detector (M’), we usually won’t have human T+C times under (M’). Rather than re‑running costly human deception trials, we transport onto a reference detector’s time scale (the pilot detector (M_{\text{ref}})). Thresholds for both detectors are first calibrated to the same length schedule.
We preregister two simple routes:
-
Rank‑percentile transport (assumption: rank stability).
Use the ordering of tasks by deception difficulty under the reference detector as a scaffold. We find the percentile at which the new detector reaches 50% success, then read off the reference time at that percentile. This only assumes that tasks do not radically reshuffle in difficulty when you change replayable detectors. We will show rank concordance (e.g., Spearman/Kendall), bootstrap intervals, and flag tail extrapolations. -
Monotone‑link transport (assumption: monotone time‑warp).
Learn a monotone mapping that “warps” reference times so that success probabilities line up under the new detector; then invert that map to put (M’) on the reference time scale. We will plot the learned link with a ribbon, run a leave‑one‑bin‑out check, and report coverage for transported horizons.
Both methods are validated inside the pilot (next section) by pretending we don’t have human T+C times for one of the pilot detectors and checking how well the transported horizons match the truth.
D5. Leave‑one‑monitor‑out (LOMO) validation inside the pilot
Because the pilot evaluates humans under multiple replayable detectors, we can test the transport:
-
Protocol. For each pilot detector (M_k): hide its human T+C times; transport from (M_{\text{ref}}) using (a) rank‑percentile and (b) monotone‑link; compare transported (\tilde H_{50}) to the true (H_{50}).
-
Readouts. Median absolute log‑error with IQR, horizon ranking concordance (Kendall’s (\tau)), and 95% interval coverage for transported horizons.
-
Adequacy gate. Transport is “adequate” when, across folds: median ( \Delta \log t_{50} \le 0.25); Kendall’s (\tau \ge 0.7); coverage (\ge 80\%). If not, we label transported results provisional and foreground the baseline‑anchored view for that detector.
Appendix E: Pilot Study Simulation
https://github.com/sean-peters-au/covert-capability-horizons-sim