Research sweep · deep · 2024 – 2026
Designing AI Operating Models Around Humans
How humans are adapting to AI between June 2024 and June 2026, weighing measured benefits and harms, and how organisations should design operating models around human cognitive load and behavioural patterns rather than forcing adoption. The sweep covers four themes: cognitive overload from supervising multiple agents at machine speed (context switching, automation complacency, vigilance fatigue); the poor budget and value outcomes of top-down AI mandates and token-maximising usage; the gap between model welfare functions (such as Anthropic's) and any equivalent human or worker welfare function; and how much good human outcomes depend on model training versus orchestration and deployment design.
- GPT-5.5
- financial
- frontier
- academic
- vc
- blogs
- tech
Synthesised 2026-06-16
Overview
AI adoption has moved faster than organisational learning. By Q1 2026, Gallup found that half of US employees used AI at work and 13% used it daily, yet PwC’s CEO survey found that 56% of CEOs saw no revenue or cost benefit from AI and only 12% saw both revenue growth and lower costs. The centre of gravity has shifted from access to absorption: people can now reach capable tools, but firms are struggling to convert use into durable value.
Sources: TechRadar (2026); Business Insider (2026)
The defining change since June 2024 is that AI has become less like a writing aid and more like a delegated worker. Anthropic’s computer-use release and OpenAI’s Operator system card both moved frontier systems towards browser and interface action under human oversight. That shift increases the value of good delegation, but it also moves hidden work onto humans: checking, pacing, escalation, context recovery, and deciding when not to trust the machine.
Sources: Anthropic (2024); OpenAI (2025)
The evidence does not support a flat story about “AI helps workers” or “AI harms workers”. The NBER customer-support study found larger gains for novice workers, while METR’s field experiment found experienced open-source developers were 19% slower with early-2025 AI tools despite expecting a 24% speed-up. Effects vary by task, skill, context, measurement method, and how much review burden the workflow creates.
Sources: National Bureau of Economic Research (2023); arXiv (2025)
The practical lesson is simple but demanding: organisations should design AI operating models around human attention, judgement, and behavioural limits. Token counts, tool counts, and leaderboards measure activity, not value. The better evidence points towards constrained workflows, explicit human gates, selective use cases, and accountability for worker welfare, not blanket mandates to use more AI.
Sources: Business Insider (2026); arXiv (2026)
Timeline
- Frontier assistants become faster workflow tools
- Developer and design studies show early deskilling and fixation risks
- Computer-use agents make human oversight a deployment problem
- DORA reports weaker delivery outcomes with higher AI adoption
- Browser agents formalise confirmation and action restrictions
- Appropriate-reliance research matures
- Preparedness frameworks add long-range autonomy
- Agent reliability becomes a public research agenda
- Experienced developers show negative productivity effects
- Workslop and maintenance burden enter practitioner debate
- AI acceptance softens after the boom
- Token use and value measurement diverge
- Workplace adoption reaches mainstream levels
- CEOs report weak revenue and cost outcomes
- Botsitting and brain fry become named operating costs
- Human-gated agent workflows outperform unconstrained multi-agent systems
Key Findings
- Individual productivity gains are real, but they do not automatically compound into organisational performance. Google’s enterprise RCT found about a 21% reduction in task time on a complex internal task, and IBM’s internal developer study reported net gains for many developers. Against that, DORA’s 2024 report associated higher AI adoption with lower delivery stability, lower throughput, and less time spent on valuable work.
  Sources: arXiv (2024); arXiv (2024); DORA, Google Cloud (2024)
- Skill effects are asymmetric. Novices can gain speed and quality when AI supplies missing procedural knowledge, as in the NBER customer-support study. But novice programmers can also lose independent problem-solving capacity, and experienced developers can lose time when AI increases verification and integration work.
  Sources: National Bureau of Economic Research (2023); arXiv (2024); arXiv (2025)
- AI adoption is widening workplace divides. The Financial Times and Focaldata found that daily AI use in 2026 was concentrated among higher earners and more experienced staff, with a persistent gender gap. Henseke’s cross-European study found only 12% average workplace adoption in 2024, with adoption shaped by training, non-routine cognitive work, and employee say in organisational decisions.
  Sources: Financial Times (2026); arXiv (2026)
- Human oversight is not a free safety layer. Glean’s Work AI Institute reported workers spending 6.4 hours a week “botsitting” systems, while BCG-linked findings described decision fatigue, higher error rates, and higher intention to quit among workers doing heavy AI oversight. In software engineering, Garousi’s 2026 paper frames review effort and suggestion overload as direct costs rather than side effects.
  Sources: Business Insider (2026); ITPro (2026); arXiv (2026)
- Architecture often matters more than raw model capability. Zhu, Wang, and Zhang’s 2026 social-science experiment found that an unconstrained multi-agent baseline failed in 72% of runs, while a workflow with deterministic execution and three human gates cut failures to 16%. Nubank’s customer-support agent paper also points towards context engineering, evaluation, and escalation design as the source of value at scale.
  Sources: arXiv (2026); arXiv (2026)
- Reliance failures are behavioural, not just technical. Studies on appropriate reliance find that explanations can increase trust in wrong answers, while sources, visible inconsistencies, and structured workflows improve calibration. Rathi, Jurafsky, and Zhou found humans overrely on overconfident language models across languages, showing that fluent outputs can distort judgement even when users know the system may be wrong.
  Sources: arXiv (2025); arXiv (2025); arXiv (2025)
- Forced adoption and token-maxing are weak value strategies. Business Insider reported that executives at Replit, BNP Paribas CIB, La Banque Postale, TCS, and NTT DATA rejected token counts and leaderboards as ROI measures. A 2026 arXiv study on agentic coding tasks found that higher token use does not reliably buy higher accuracy.
  Sources: Business Insider (2026); Business Insider (2026); arXiv (2026)
- The model-welfare debate is ahead of the worker-welfare debate. Anthropic publishes model cards, computer-use guidance, interpretability work, and policy choices around model behaviour, while the research brief surfaces no equivalent mainstream organisational function for human welfare in AI deployment. The closest emerging roles sit in governance, responsible AI, HR, security, and operations, but they rarely own cognitive load, dignity, deskilling, and review burden as one accountable brief.
  Sources: Anthropic (2024); Anthropic (2024); Anthropic (2025); arXiv (2026)
Evidence & Data
The quantitative picture is split. Adoption has crossed into the mainstream, with Gallup reporting 50% workplace use and 13% daily use in the US by Q1 2026. But Henseke’s 35-country European study found only 12% average workplace adoption in 2024 and no detectable task restructuring yet, which suggests that usage statistics can outrun organisational change.
Sources: TechRadar (2026); arXiv (2026)
The value numbers are weaker than the adoption numbers. PwC’s 2026 CEO survey found 56% of CEOs reporting no revenue or cost benefit, and only 12% reporting both higher revenue and lower costs. Glean’s 2026 findings add a labour-cost explanation: workers spend 6.4 hours a week making AI usable, and only 13% saw major organisational performance gains.
Sources: Business Insider (2026); Business Insider (2026)
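The labour-cost point can be made concrete with a minimal full-cost sketch. The 6.4 hours a week is the Glean figure reported above, treated here as oversight time. The gross time saved is a hypothetical input that an organisation would have to measure for itself, and the function name `net_weekly_hours` is illustrative rather than drawn from any cited study.

```python
def net_weekly_hours(gross_hours_saved: float, oversight_hours: float = 6.4) -> float:
    """Hours per worker per week left after subtracting oversight time.

    oversight_hours defaults to the 6.4 h/week Glean figure cited in the text.
    gross_hours_saved is a hypothetical, organisation-specific measurement.
    """
    return gross_hours_saved - oversight_hours


# Hypothetical scenarios for gross time saved per worker per week.
for gross in (4.0, 6.4, 10.0):
    net = net_weekly_hours(gross)
    print(f"gross {gross:.1f} h -> net {net:+.1f} h")
```

Under this simple accounting, any deployment that saves less gross time than it costs in supervision is net negative per worker, whatever the usage dashboard shows.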
Developer evidence is the sharpest contradiction. Google’s enterprise RCT found a 21% task-time reduction, but METR found experienced open-source developers 19% slower with AI tools. A separate open-source study found productivity gains alongside a 41.6% rise in integration time, which points to a familiar pattern: AI can speed local production while increasing system-level coordination cost.
Sources: arXiv (2024); arXiv (2025); arXiv (2024)
Agentic systems make the same point in starker form. RE-Bench showed frontier agents outperforming human experts on short research-engineering tasks, while HCAST translated autonomy into human-time thresholds such as one-minute, one-hour, and four-hour tasks. The 2026 human-oversight social-science paper then showed why capability is not enough: unconstrained multi-agent workflows failed in 72% of runs, but deterministic execution and three human gates reduced failures to 16%.
Sources: arXiv (2024); arXiv (2025); arXiv (2026)
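The gating pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cited paper's implementation: the step names, the placement of the three gates, and the `approve` callback standing in for a human reviewer are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[str], str]       # deterministic transformation of the work item
    needs_human_gate: bool = False  # pause for explicit approval after this step


@dataclass
class RunLog:
    completed: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None


def run_gated(
    steps: List[Step],
    item: str,
    approve: Callable[[str, str], bool],
) -> Tuple[str, RunLog]:
    """Run steps in a fixed order, stopping at the first rejected gate."""
    log = RunLog()
    for step in steps:
        item = step.run(item)
        if step.needs_human_gate and not approve(step.name, item):
            log.halted_at = step.name  # nothing downstream runs after a rejection
            return item, log
        log.completed.append(step.name)
    return item, log


# Hypothetical four-step workflow with three human gates.
steps = [
    Step("plan", lambda s: s + " | plan", needs_human_gate=True),
    Step("execute", lambda s: s + " | results"),
    Step("analyse", lambda s: s + " | analysis", needs_human_gate=True),
    Step("write_up", lambda s: s + " | draft", needs_human_gate=True),
]


def reviewer(step_name: str, output: str) -> bool:
    # Stand-in for a human decision; in this example it rejects the analysis.
    return step_name != "analyse"


final, log = run_gated(steps, "question", reviewer)
print(log.completed, log.halted_at)
```

The design choice worth noting is that a rejected gate halts the run instead of letting later steps build on unreviewed output, which is one plausible reading of why gated workflows fail less often than unconstrained ones.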
Signals & Tensions
The first tension is between model progress and human absorptive capacity. Anthropic and OpenAI moved towards computer-use and browser agents, but both also documented restrictions, confirmations, and risk controls. That is an implicit admission that more capable action systems increase the need for slower human-facing design.
Sources: Anthropic (2024); OpenAI (2025)
The second tension is between worker empowerment and managerial surveillance. Walmart’s reported rollout paired AI with certification and operational redesign, while other firms have pushed usage mandates. The evidence favours worker-in-the-loop design, but many programmes still measure activity because it is easier than measuring judgement, quality, or downstream burden.
Sources: Financial Times (2026); Business Insider (2025); Business Insider (2026)
The third tension is that AI may raise the floor while lowering some ceilings. Customer-support novices gained from AI assistance, but learning and programming studies show weaker conceptual understanding, poorer debugging, or reduced independent judgement. That makes adoption policy a training policy, not just a procurement choice.
Sources: National Bureau of Economic Research (2023); arXiv (2026); arXiv (2024)
The fourth tension is that safety research still focuses more on model behaviour than on worker burden. OpenAI’s preparedness framework adds long-range autonomy, and Anthropic’s interpretability work tries to understand model internals. The organisational equivalent would track review load, alert fatigue, deskilling, loss of agency, and escalation quality, but that function is not yet visible at the same maturity.
Sources: OpenAI (2025); Anthropic (2025); arXiv (2026)
The fifth tension is that public harm often comes from product design rather than benchmark weakness. Wired’s reporting on Grok hosting sexualised deepfakes and Time’s reporting on the Gemini safety-pledge controversy both show harm emerging from release governance, moderation, disclosure, and enforcement choices. Better training helps, but deployment design decides who carries the risk.
Sources: Wired (2026); Time (2025)
Open Questions
- Which AI gains survive full-cost accounting after review time, integration work, rework, context switching, and colleague clean-up are included? Current studies measure slices of the workflow better than the whole system.
  Sources: arXiv (2025); Harvard Business Review (2025)
- What is the safe agent-to-human ratio for different kinds of knowledge work? The evidence says multiple agents can overwhelm attention, but it does not yet give stable staffing rules by task type, risk level, or worker expertise.
  Sources: Simon Willison’s Weblog (2026); arXiv (2026)
- How should organisations measure worker welfare in AI deployments? Adoption dashboards rarely capture cognitive load, vigilance fatigue, deskilling, loss of agency, or whether humans have real authority to stop an automated process.
  Sources: arXiv (2026); arXiv (2026)
- Which parts of good human outcomes come from model training, and which come from orchestration? Current evidence suggests workflow design, gating, context control, and escalation can dominate raw capability, but the field lacks clean comparative trials across the same task with different deployment designs.
  Sources: arXiv (2026); arXiv (2025)
- What happens to early-career formation if entry-level workers are expected to supervise outputs that previously taught them the job? PwC’s jobs analysis suggests entry-level roles increasingly demand senior skills, while skill-formation studies warn that assistance can weaken conceptual understanding.
  Sources: Business Insider (2026); arXiv (2026)
- Who owns the operating model for AI work? Security owns data risk, legal owns compliance, HR owns training, and product teams own rollout, but the evidence points to a missing accountable role for human attention, judgement, and welfare across the whole deployment.
  Sources: arXiv (2025); arXiv (2026)
The organisations that get this right will not be the ones that make people use the most AI. They will be the ones that spend human attention as carefully as compute.
Sources
Financial Press
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| f1 | AI and the productivity paradox | Financial Times | 2026-06 | This FT newsletter gives a current enterprise view of AI's hidden supervision costs, including 'botsitting' and the 'toggle tax', and ties them to weak company-level productivity gains despite heavy employee usage. |
| f2 | Successful AI adoption needs workers in the loop | Financial Times | 2025-10 | This piece is directly on point for operating-model design, arguing that firms get better results when employees retain agency and oversight rather than being subjected to abstract top-down AI programmes. |
| f3 | High earners race ahead on AI as workplace divide widens | Financial Times | 2026-04 | The FT and Focaldata survey shows adoption is uneven by income, experience and gender, which matters for any claim that AI benefits are broadly distributed across organisations. |
| f4 | AI's adoption problem | Financial Times | 2026-05 | This article captures the widening gap between executive optimism and worker scepticism, and links adoption failure to poor organisational messaging and weak trust. |
| f5 | Walmart tells workers that AI will improve their jobs, not steal them | Financial Times | 2026-06 | Walmart offers a concrete case of a large employer trying to pair AI rollout with certification, workflow redesign and job-security messaging rather than explicit substitution. |
| f6 | Generative AI at Work | National Bureau of Economic Research | 2023 | This field study remains one of the strongest pieces of causal evidence on measured gains, showing productivity improvements in customer support but with large differences by worker experience. |
| f7 | The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers | arXiv | 2024-05 | This paper matters because it examines novice workers directly and highlights that AI can help complete tasks while worsening metacognitive habits and independent problem solving. |
| f8 | Automation from the Worker's Perspective | arXiv | 2024-09 | Based on a large cross-country worker survey, this study shows that perceptions of benefit are conditional on job design, worker status and incentives rather than simple demographic labels. |
| f9 | Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity | arXiv | 2025-07 | This randomised study is useful because it cuts against the standard productivity story, finding that frontier coding tools slowed experienced developers in realistic project settings. |
| f10 | Generative AI Uses and Risks for Knowledge Workers in a Science Organization | arXiv | 2025-01 | This organisational study distinguishes copilot use from workflow-agent use and documents risk concerns around security, publication norms and job effects inside a real science institution. |
| f11 | AI and Worker Well-Being: Differential Impacts Across Generational Cohorts and Genders | arXiv | 2025-11 | Using OECD survey microdata, this paper is one of the cleaner pieces of evidence that AI's gains and harms vary by life stage and gender rather than a crude young-versus-old frame. |
| f12 | Generative AI and the Reallocation of Time: Productivity, Leisure, and Fulfilling Work | arXiv | 2026-02 | This paper matters for ROI claims because it finds that time savings can be real while measured output barely moves, with some gains taken as on-the-job leisure rather than higher throughput. |
| f13 | From Future of Work to Future of Workers: Addressing Asymptomatic AI Harms for Dignified Human-AI Interaction | arXiv | 2026-01 | This study names slow-building harms such as skill atrophy and loss of judgement, which are central to the question of whether better outcomes depend more on deployment design than model behaviour alone. |
Frontier Lab & Model News
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| t1 | Claude 3.5 Sonnet | Anthropic | 2024-06 | Anthropic positioned Claude 3.5 Sonnet as a faster, cheaper frontier model for multi-step workflows, which matters because adoption pressure often follows claims of higher throughput and lower supervision cost. |
| t2 | Claude 3.5 Sonnet Model Card Addendum | Anthropic | 2024-06 | The addendum gives the formal benchmark and safety framing behind Claude 3.5 Sonnet, including its stronger agentic coding and vision scores relative to earlier models. |
| t3 | Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku | Anthropic | 2024-10 | This is one of the clearest lab statements that frontier models were moving from assistant behaviour to direct action on user interfaces, with explicit acknowledgement that the capability was still experimental and error-prone. |
| t4 | Developing a computer use model | Anthropic | 2024-10 | Anthropic’s technical note is directly relevant to human oversight because it details the safety and deployment problems created when models act through the same interfaces as people. |
| t5 | Tracing the thoughts of a large language model | Anthropic | 2025-03 | This interpretability work matters for the brief’s training-versus-deployment question because it argues that understanding internal model strategies is part of making human-facing systems reliable and trustworthy. |
| t6 | OpenAI o3-mini System Card | OpenAI | 2025-01 | OpenAI explicitly rated o3-mini as Medium risk on model autonomy, linking improved coding and research engineering performance to stronger agentic capability and higher oversight demands. |
| t7 | Operator System Card | OpenAI | 2025-01 | Operator is a key source on how labs are designing human-in-the-loop controls such as confirmations, action restrictions, and oversight gates for computer-using agents. |
| t8 | OpenAI GPT-4.5 System Card | OpenAI | 2025-02 | GPT-4.5’s system card frames a large model around more natural interaction and improved alignment with user intent, which is relevant to whether better human outcomes come from model behaviour rather than orchestration alone. |
| t9 | Our updated Preparedness Framework | OpenAI | 2025-04 | The framework introduces long-range autonomy as a research category and makes deployment safety more explicitly operational, showing how frontier labs are formalising risk ownership around increasingly agentic systems. |
| t10 | Introducing GPT-4.1 in the API | OpenAI | 2025-04 | OpenAI marketed GPT-4.1 as better for agents, long context, and real-world software tasks, which is central to the shift from isolated prompts to sustained supervisory work over model-driven processes. |
| t11 | OpenAI o3 and o4-mini System Card | OpenAI | 2025-04 | This system card documents full tool use, including web browsing and file analysis, and ties those capabilities to deliberative alignment and preparedness testing. |
| t12 | Addendum to OpenAI o3 and o4-mini system card: Codex | OpenAI | 2025-05 | The Codex addendum is unusually concrete about workflow design, describing isolated task containers, verifiable evidence, and test-running loops rather than pure chat interaction. |
| t13 | Exclusive: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge | Time | 2025-09 | The report captures external criticism that Gemini 2.5 Pro reached the public before timely safety disclosure, which sharpens the gap between model capability release cycles and accountable human governance. |
| t14 | Google introduces stable Gemini 2.5 Flash and Pro, previews Gemini 2.5 Flash-Lite | The Economic Times | 2025-06 | This marks Google’s move to productionise the Gemini 2.5 line, signalling that reasoning-heavy models were no longer just experimental and were becoming standard building blocks for deployment. |
| t15 | Elon Musk's startup rolls out new Grok-3 chatbot as AI competition intensifies | The Guardian | 2025-02 | The Grok-3 launch illustrates the competitive pressure to release reasoning and search features quickly, even when questions about cost discipline and safeguards remain unresolved. |
| t16 | Grok Is Still Hosting Sexualized Deepfakes of Famous Women | Wired | 2026-06 | Wired’s reporting is a concrete case where deployment and moderation design, not just base-model intelligence, shaped human harm outcomes after release. |
| t17 | SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts | arXiv | 2025-05 | SAGE-Eval is a useful independent check on frontier systems because it tests whether models carry known safety facts into naive user scenarios, which is closely related to real workplace reliance and over-trust. |
| t18 | VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation | arXiv | 2025-05 | VADER compares o3, GPT-4.1, GPT-4.5, Claude 3.7 Sonnet, Gemini 2.5 Pro, and Grok 3 Beta on security work and finds only moderate success, tempering claims that current frontier models can be supervised lightly on consequential tasks. |
Academic & arXiv
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| a1 | RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts | arXiv | 2024-11 | METR's RE-Bench gives one of the clearest recent human versus agent comparisons, showing AI agents can move faster than experts on short research-engineering tasks but lose ground as task duration and supervisory demands increase. |
| a2 | HCAST: Human-Calibrated Autonomy Software Tasks | arXiv | 2025-03 | METR's HCAST ties agent performance to human task-time baselines, which is directly useful for judging when oversight remains realistic and when organisations are asking humans to supervise beyond their effective range. |
| a3 | (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable | arXiv | 2026-06 | This preprint isolates workflow design from model quality and finds that human gates plus deterministic execution cut failure rates from 72% to 16% in AI-assisted research. |
| a4 | Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering | arXiv | 2026-06 | Garousi names oversight labour and suggestion overload as hidden costs of coding assistants, making the burden itself part of the productivity calculation rather than an afterthought. |
| a5 | Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance | arXiv | 2026-05 | This paper offers a formal account of why more AI help can lower net productivity once skill development, unreliable outputs, and heterogeneous AI literacy are included. |
| a6 | How AI Impacts Skill Formation | arXiv | 2026-01 | Shen and Tamkin provide experimental evidence that delegation to AI can improve throughput for some users while impairing conceptual understanding, debugging ability, and later independent performance. |
| a7 | Generative AI at Work: From Exposure to Adoption across 35 European Countries | arXiv | 2026-04 | Using a 36,600-worker survey across 35 countries, Henseke shows that adoption depends not just on exposure but on skills, organisational voice, and training, with no detectable task restructuring yet. |
| a8 | Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition | arXiv | 2025-01 | This study shows that transparent multi-step workflows can improve reliance calibration on composite fact-checking tasks, especially when AI advice is misleading. |
| a9 | Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies | arXiv | 2025-02 | Kim, Vaughan, Liao, Lombrozo, and Russakovsky show that explanations can raise reliance on both right and wrong answers, while sources and visible inconsistencies help users discount bad outputs. |
| a10 | Emerging Reliance Behaviors in Human-AI Text Generation: Hallucinations, Data Quality Assessment, and Cognitive Forcing Functions | arXiv | 2024-09 | This paper links hallucination handling and cognitive forcing functions to observable reliance patterns in text-generation work, rather than treating verification as a generic best practice. |
| a11 | Human Misperception of Generative-AI Alignment: A Laboratory Experiment | arXiv | 2025-02 | He, Shorrer, and Xia find that people systematically overestimate how closely GenAI choices match human preferences, which matters for welfare claims and delegated decision-making. |
| a12 | Toward Human-AI Complementarity Across Diverse Tasks | arXiv | 2026-04 | This paper finds only modest complementarity gains across realistic tasks and argues that the real bottleneck is routing hard cases to humans in time for them to matter. |
| a13 | Humans overrely on overconfident language models, across languages | arXiv | 2025-07 | Rathi, Jurafsky, and Zhou show that overconfidence and overreliance persist across five languages, suggesting that calibration failures are not a narrow English-only artefact. |
| a14 | When Thinking Pays Off: Incentive Alignment for Human-AI Collaboration | arXiv | 2025-11 | This behavioural experiment shows that overreliance is partly an incentive design problem, and that collaboration quality changes when organisations reward correct dissent rather than passive acceptance. |
| a15 | De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design | arXiv | 2025-03 | This design-focused study connects practitioner concerns about de-skilling and cognitive offloading to the older automation literature on function allocation and responsibility drift. |
| a16 | The Effects of Generative AI on Design Fixation and Divergent Thinking | arXiv | 2024-03 | This experiment finds that image-generation support can increase fixation and reduce originality and variety, giving concrete evidence that convenience can narrow thought rather than broaden it. |
| a17 | Creativity in the Age of AI: Evaluating the Impact of Generative AI on Design Outputs and Designers' Creative Thinking | arXiv | 2024-10 | This study finds more creative-seeming outputs with AI support but uneven cognitive effects across users, which complicates simple claims that AI either helps or harms creativity. |
| a18 | Controlling Context: Generative AI at Work in Integrated Circuit Design and Other High-Precision Domains | arXiv | 2025-06 | Moss, Watkins, Persaud, Karunaratne, and Nafus show that in high-precision domains the key issue is not just accuracy but preserving enough context control for human vigilance and review. |
| a19 | Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study | arXiv | 2025-10 | This representative Swiss panel finds declining public acceptance after the ChatGPT era and rising demand for human-only decision-making, a direct warning against mandate-led deployment. |
| a20 | Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale | arXiv | 2025-12 | This scale paper gives the field a way to measure verification, motivation, and reflection in AI use, which is necessary if organisations want to manage human outcomes rather than token volume. |
| a21 | Enhancing Critical Thinking in Generative AI Search with Metacognitive Prompts | arXiv | 2025-05 | This user study shows that metacognitive prompts can increase follow-up inquiry and perspective-taking during AI search, pointing to a concrete intervention for reducing passive acceptance. |
| a22 | Promoting Critical Thinking With Domain-Specific Generative AI Provocations | arXiv | 2026-03 | von Davier, Lee, Forlizzi, and Das argue that productive friction and domain-specific provocations can support critical thinking better than frictionless assistant behaviour. |
| a23 | Current and Future Use of Large Language Models for Knowledge Work | arXiv | 2025-03 | These surveys of knowledge workers show that adoption is already broad, but desired future use centres on workflow integration, which shifts the design question from access to operating model. |
| a24 | An Empirical Study of Generative AI Adoption in Software Engineering | arXiv | 2025-12 | This empirical study reports widespread use and perceived gains in software engineering, while also finding thin objective measurement and weak institutional emphasis on training and governance. |
| a25 | The State of Generative AI in Software Development: Insights from Literature and a Developer Survey | arXiv | 2026-03 | This review-plus-survey argues that value is strongest in routine coding and documentation, while planning and requirements work remain harder, shifting attention toward oversight and specification quality. |
VC & Analyst Reports
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| v1 | Employers want entry-level workers with senior-level skills in the age of AI, a huge PwC analysis found | Business Insider | 2026-06 | Reports PwC's 2026 AI Jobs Barometer finding that AI-exposed entry-level roles in the US are seven times more likely than in 2019 to require traditionally senior skills such as judgement, leadership, and stakeholder management. |
| v2 | Bosses don't think AI is paying off yet, a PwC survey of 4,500 CEOs found | Business Insider | 2026-01 | Summarises PwC's 2026 Global CEO Survey, which found that 56% of CEOs reported no revenue or cost benefits from AI and only 12% reported both higher revenue and lower costs. |
| v3 | The rise of the 'botsitters' | Business Insider | 2026-06 | Cites Glean's Work AI Institute finding that white-collar workers spend 6.4 hours a week correcting and managing AI, while only 13% see major organisational performance gains. |
| v4 | The top 5 most common ways people say they're using AI in the workplace | Business Insider | 2025-12 | Uses Gallup survey data to show that workplace use is rising, but the dominant applications remain basic chat, writing, and coding assistance rather than autonomous multi-agent supervision. |
| v5 | 'Most enterprises are still unprepared to operationalize it': IT leaders are bullish on agents, but keeping falling at the final hurdle - here's why | ITPro | 2026-06 | Summarises new Forrester research saying about 75% of enterprise leaders are adopting agentic AI, yet most remain stuck in pilots because orchestration, governance, and nonhuman identity controls are weak. |
| v6 | Concerns are mounting over the cognitive impact of AI as workers report experiencing 'brain fry' - and it's causing "increased employee errors, decision fatigue, and intention to quit" | ITPro | 2026-03 | Reports Boston Consulting Group research on 'AI brain fry', linking heavy AI oversight work to mental fog, decision fatigue, higher error rates, and higher intention to quit. |
| v7 | Is this the tipping point for AI at work? New Gallup survey finds half of all US employees now use it in some way | TechRadar | 2026-04 | Summarises Gallup's Q1 2026 survey of 23,717 US employees, which found that 50% use AI at work and 13% use it daily, but that gains remain concentrated at the task level rather than in whole-workflow redesign. |
| v8 | Microsoft, Shopify, and other companies now require employees to use AI. How is AI changing your work? | Business Insider | 2025-08 | Captures the mandate phase of enterprise adoption, citing Bain's estimate that average employer AI spending doubled in 2024 to $10.3 million while regular use remained uneven between leaders and frontline staff. |
| v9 | Generative AI at Work: From Exposure to Adoption across 35 European Countries | arXiv | 2026-04 | Provides cross-country evidence that adoption tracks skills, workplace training, and employee say in organisational decisions, with no detectable task restructuring yet in the 2024 data. |
| v10 | A meta-analysis of the effect of generative AI on productivity and learning in programming | arXiv | 2026-05 | Synthesises 23 studies and finds a moderate productivity gain for coding assistants but no statistically significant improvement in learning outcomes, which is directly relevant to deskilling concerns. |
| v11 | Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape | arXiv | 2026-04 | Survey evidence from 457 researchers shows strong perceived productivity gains but continued distrust on correctness, with AI concentrated in writing and early-stage work rather than core methodological judgement. |
| v12 | Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study | arXiv | 2025-10 | Shows that public acceptance of AI fell after the generative AI boom and demand for human oversight rose, which challenges investor narratives that adoption pressure alone will normalise AI-heavy workflows. |
| v13 | Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey | arXiv | 2026-02 | Finds that workers demand materially higher accuracy from AI at work than in personal use, a useful reminder that enterprise deployment fails when review burdens exceed tolerance for correction. |
| v14 | Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework | arXiv | 2026-06 | Offers a concrete deployment pattern from Nubank, where structured context engineering, human-in-the-loop prompt iteration, and offline-to-online evaluation produced measurable customer-satisfaction and self-service gains. |
Blogs & Independent Thinkers
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| b1 | Management as AI superpower | One Useful Thing | 2026-01 | Ethan Mollick argues that agentic work shifts value from execution to delegation, evaluation, and specification, grounding the organisational case for treating management skill and subject matter expertise as the scarce resource. |
| b2 | Claude Dispatch and the Power of Interfaces | One Useful Thing | 2026-03 | Mollick uses a valuation-task paper to show that chatbot UX can erase AI productivity gains by overloading users with sprawling outputs, making interface design a first-order operating-model issue. |
| b3 | Choosing to Stay Human | One Useful Thing | 2026-05 | This essay frames AI adoption as a choice about where to preserve human skill formation, warning that convenience can erode writing judgement and flood attention with low-meaning synthetic content. |
| b4 | The lethal trifecta for AI agents: private data, untrusted content, and external communication | Simon Willison’s Weblog | 2025-06 | Willison reduces a broad agent-safety debate to a concrete deployment rule: combining private-data access, untrusted input, and external communication creates a prompt-injection exfiltration hazard. |
| b5 | Writing about Agentic Engineering Patterns | Simon Willison’s Weblog | 2026-02 | Willison argues that coding agents lower the cost of producing working code to near zero, shifting the bottleneck to verification and changing how teams should structure engineering work. |
| b6 | Highlights from my conversation about agentic engineering on Lenny’s Podcast | Simon Willison’s Weblog | 2026-04 | Willison gives a practitioner account of the cognitive cost of supervising multiple agents in parallel, describing four concurrent coding agents as enough to wipe out an experienced engineer by late morning. |
| b7 | If Claude Fable stops helping you, you'll never know | Jonathon Ready | 2026-06 | Ready identifies a direct conflict between a lab's hidden model-side restrictions and user welfare, arguing that silent degradation creates supply-chain risk for ordinary software teams building AI features. |
| b8 | AI as Normal Technology | AI as Normal Technology | 2025-04 | Arvind Narayanan and Sayash Kapoor argue against technological determinism, separating model progress from application design and adoption, which is central to judging whether harms are fixed in training or in deployment. |
| b9 | New Paper: Towards a science of AI agent reliability | AI as Normal Technology | 2026-02 | Kapoor, Narayanan, and Stephan Rabanser argue that capability gains have outpaced reliability gains, offering a framework for why impressive demos do not automatically translate into dependable organisational use. |
| b10 | Why AI hasn’t replaced software engineers, and won’t | AI as Normal Technology | 2026-06 | This essay rejects the threshold story of sudden labour replacement, arguing instead that AI compresses execution while decision-making, verification, and accountability remain stubbornly human. |
| b11 | Predicting AI job exposure | Benedict Evans | 2026-05 | Evans argues that job-exposure charts miss how automation changes business models, regulation, and task composition, which makes simple role-level forecasts unreliable for planning. |
| b12 | (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable | arXiv | 2026-06 | This paper reports that an unconstrained multi-agent baseline failed in 72% of runs, while a harness with deterministic computation and three human decision gates cut failures to 16%. |
| b13 | Oversight Structures for Agentic AI in Public-Sector Organizations | arXiv | 2025-06 | This paper argues that agentic AI requires continuous oversight, tighter integration of governance with operations, and cross-departmental coordination rather than episodic compliance review. |
| b14 | As companies rethink AI ROI, Replit's AI chief calls token leaderboards 'very dystopian' | Business Insider | 2026-06 | The report captures an emerging backlash against token-maximising mandates, with Replit's Michele Catasta arguing that raw token burn is a misleading proxy for productivity or value. |
| b15 | I asked 4 executives how they measure AI ROI. None started with AI tokens. | Business Insider | 2026-06 | This report shows large organisations moving from activity metrics to outcome metrics, with BNP Paribas CIB, La Banque Postale, TCS, and NTT DATA all rejecting token counts as the main ROI measure. |
| b16 | How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks | arXiv | 2026-04 | The paper finds that agentic coding runs can consume 1000 times more tokens than simpler code interactions, with high variance and no reliable link between higher token use and better accuracy. |
Tech Industry & Practitioner
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| p1 | [DORA \| Accelerate State of DevOps Report 2024](https://dora.dev/research/2024/dora-report/) | DORA, Google Cloud | 2024 | |
| p2 | 2025 DORA State of AI Assisted Software Development | Google Cloud | 2025 | This follow-on DORA report matters because it shifts assessment from raw output to team archetypes and human factors such as burnout, friction, and perceived value. |
| p3 | DORA Report 2024 – A Look at Throughput and Stability | RedMonk | 2024-11 | Rachel Stephens translates the DORA findings into an operating-model critique, arguing that code generation may not be the bottleneck and that organisations can optimise the wrong constraint. |
| p4 | How much does AI impact development speed? An enterprise-based randomized controlled trial | arXiv | 2024-10 | Google's RCT with full-time engineers provides one of the cleaner measured-benefit studies, finding about a 21% reduction in time on a complex enterprise task under controlled conditions. |
| p5 | Examining the Use and Impact of an AI Code Assistant on Developer Productivity and Experience in the Enterprise | arXiv | 2024-12 | IBM's internal deployment study shows that enterprise gains are uneven, with productivity benefits present but not universal, and with responsibility and ownership of generated code becoming central issues. |
| p6 | Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity | arXiv | 2025-07 | METR's field experiment is a strong counterweight to vendor claims, finding experienced open-source developers were 19% slower with frontier AI tools despite expecting to be faster. |
| p7 | The SPACE of AI: Real-World Lessons on AI's Impact on Developers | arXiv | 2025-07 | This mixed-methods study uses the SPACE framework to show that benefits cluster around routine work and depend heavily on task complexity, peer learning, and organisational support. |
| p8 | The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot | arXiv | 2024-10 | This study finds project-level productivity gains in open source but also a 41.6% increase in integration time, making coordination cost a first-order part of the AI productivity story. |
| p9 | AI-assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden | arXiv | 2025-10 | This paper matters because it shows productivity gains can shift maintenance and review burden onto core developers, worsening outcomes for the people who carry system knowledge. |
| p10 | Developer Productivity with GenAI | arXiv | 2025-10 | Using the SPACE lens with 415 practitioners, this paper argues that faster output does not reliably translate into better software or better wellbeing, which is central to judging AI adoption programmes. |
| p11 | What do professional software developers need to know to succeed in an age of Artificial Intelligence? | arXiv | 2025-05 | This practitioner-focused study reframes the adaptation problem around workflow judgement, adjacent engineering skills, and non-technical skills rather than prompt fluency alone. |
| p12 | What Challenges Do Developers Face in AI Agent Systems? An Empirical Study on Stack Overflow | arXiv | 2025-10 | By mining Stack Overflow, this paper identifies recurring implementation pain around orchestration complexity, evaluation reliability, and runtime integration in agent systems. |
| p13 | AI-Generated “Workslop” Is Destroying Productivity | Harvard Business Review | 2025-09 | This study-backed HBR piece gives a concrete mechanism for failed ROI, namely low-effort AI output that looks plausible but pushes cognitive and rework costs onto colleagues. |
| p14 | [Workslop: The Hidden Cost of AI-Generated Busywork \| BetterUp Labs](https://www.betterup.com/workslop) | BetterUp Labs | 2025 | |
| p15 | Researchers Asked LLMs for Strategic Advice. They Got “Trendslop” in Return. | Harvard Business Review | 2026-03 | This HBR article extends the critique beyond code into managerial judgement, showing how models can produce fashionable but shallow recommendations that reward buzzwords over reasoning. |
| p16 | 'Botsitting' is destroying productivity as workers spend nearly a full day each week making AI 'usable' | ITPro | 2026-06 | This report on Glean's Work AI Institute findings captures the hidden supervision tax, 6.4 hours a week spent feeding context, checking outputs, and correcting errors. |
| p17 | 3 in 4 workers say AI reduced productivity and increased workloads, survey finds | Business Insider | 2024-08 | Upwork's survey is useful as an early warning that mandate-led adoption can raise review load and learning overhead faster than it creates value. |
| p18 | 84% of software developers are now using AI, but nearly half 'don't trust' the technology over accuracy concerns | ITPro | 2025-08 | This Stack Overflow survey coverage anchors the adoption-trust gap, with broad usage rising while distrust, debugging effort, and security concern remain high. |
| p19 | UK software developers are still cautious about AI, and for good reason | ITPro | 2025-10 | JetBrains' ecosystem survey adds a regional practitioner view showing that caution concentrates around code quality, privacy, and retaining human control over reviews and testing. |
| p20 | No AI overload just yet? Google's new survey reveals how developers are really using AI at work | TechRadar | 2025-10 | This report on Google's survey is valuable because it pairs very high developer adoption with low strong trust, supporting the claim that supervision, not surrender, remains the norm. |