Research sweep · deep · 2024 – 2026

Designing AI Operating Models Around Humans

This sweep examines how humans are adapting to AI between June 2024 and June 2026, weighing measured benefits against harms, and how organisations should design operating models around human cognitive load and behavioural patterns rather than forcing adoption. It covers four themes: cognitive overload from supervising multiple agents at machine speed (context switching, automation complacency, vigilance fatigue); the poor budget and value outcomes of top-down AI mandates and token-maximising usage; the gap between model welfare functions (such as Anthropic's) and any equivalent human or worker welfare function; and how much good human outcomes depend on model training versus orchestration and deployment design.

  • GPT-5.5
  • financial
  • frontier
  • academic
  • vc
  • blogs
  • tech

Synthesised 2026-06-16

Overview

AI adoption has moved faster than organisational learning. By Q1 2026, Gallup found that half of US employees used AI at work and 13% used it daily, yet PwC’s CEO survey found that 56% of CEOs saw no revenue or cost benefit from AI and only 12% saw both revenue growth and lower costs. The centre of gravity has shifted from access to absorption: people can now reach capable tools, but firms are struggling to convert use into durable value.
Sources: TechRadar (2026); Business Insider (2026)

The defining change since June 2024 is that AI has become less like a writing aid and more like a delegated worker. Anthropic’s computer-use release and OpenAI’s Operator system card both moved frontier systems towards browser and interface action under human oversight. That shift increases the value of good delegation, but it also moves hidden work onto humans: checking, pacing, escalation, context recovery, and deciding when not to trust the machine.
Sources: Anthropic (2024); OpenAI (2025)

The evidence does not support a flat story about “AI helps workers” or “AI harms workers”. The NBER customer-support study found larger gains for novice workers, while METR’s field experiment found experienced open-source developers were 19% slower with early-2025 AI tools despite expecting a 24% speed-up. Effects vary by task, skill, context, measurement method, and how much review burden the workflow creates.
Sources: National Bureau of Economic Research (2023); arXiv (2025)

The practical lesson is simple but demanding: organisations should design AI operating models around human attention, judgement, and behavioural limits. Token counts, tool counts, and leaderboards measure activity, not value. The better evidence points towards constrained workflows, explicit human gates, selective use cases, and accountability for worker welfare, not blanket mandates to use more AI.
Sources: Business Insider (2026); arXiv (2026)

Timeline

Key milestones, June 2024 to June 2026
Q2 2024
  • Frontier assistants become faster workflow tools
Q3 2024
  • Developer and design studies show early deskilling and fixation risks
Q4 2024
  • Computer-use agents make human oversight a deployment problem
  • DORA reports weaker delivery outcomes with higher AI adoption
Q1 2025
  • Browser agents formalise confirmation and action restrictions
  • Appropriate-reliance research matures
Q2 2025
  • Preparedness frameworks add long-range autonomy
  • Agent reliability becomes a public research agenda
Q3 2025
  • Experienced developers show negative productivity effects
  • Workslop and maintenance burden enter practitioner debate
Q4 2025
  • AI acceptance softens after the boom
  • Token use and value measurement diverge
Q1 2026
  • Workplace adoption reaches mainstream levels
  • CEOs report weak revenue and cost outcomes
Q2 2026
  • Botsitting and brain fry become named operating costs
  • Human-gated agent workflows outperform unconstrained multi-agent systems

Key Findings

  1. Individual productivity gains are real, but they do not automatically compound into organisational performance. Google’s enterprise RCT found about a 21% reduction in task time on a complex internal task, and IBM’s internal developer study reported net gains for many developers. Against that, DORA’s 2024 report associated higher AI adoption with lower delivery stability, lower throughput, and less time spent on valuable work.
    Sources: arXiv (2024); arXiv (2024); DORA, Google Cloud (2024)

  2. Skill effects are asymmetric. Novices can gain speed and quality when AI supplies missing procedural knowledge, as in the NBER customer-support study. But novice programmers can also lose independent problem-solving capacity, and experienced developers can lose time when AI increases verification and integration work.
    Sources: National Bureau of Economic Research (2023); arXiv (2024); arXiv (2025)

  3. AI adoption is widening workplace divides. The Financial Times and Focaldata found that daily AI use in 2026 was concentrated among higher earners and more experienced staff, with a persistent gender gap. Henseke’s cross-European study found only 12% average workplace adoption in 2024, with adoption shaped by training, non-routine cognitive work, and employee say in organisational decisions.
    Sources: Financial Times (2026); arXiv (2026)

  4. Human oversight is not a free safety layer. Glean’s Work AI Institute reported workers spending 6.4 hours a week “botsitting” systems, while BCG-linked findings described decision fatigue, higher error rates, and higher intention to quit among workers doing heavy AI oversight. In software engineering, Garousi’s 2026 paper frames review effort and suggestion overload as direct costs rather than side effects.
    Sources: Business Insider (2026); ITPro (2026); arXiv (2026)

  5. Architecture often matters more than raw model capability. Zhu, Wang, and Zhang’s 2026 social-science experiment found that an unconstrained multi-agent baseline failed in 72% of runs, while a workflow with deterministic execution and three human gates cut failures to 16%. Nubank’s customer-support agent paper also points towards context engineering, evaluation, and escalation design as the source of value at scale.
    Sources: arXiv (2026); arXiv (2026)

  6. Reliance failures are behavioural, not just technical. Studies on appropriate reliance find that explanations can increase trust in wrong answers, while sources, visible inconsistencies, and structured workflows improve calibration. Rathi, Jurafsky, and Zhou found humans overrely on overconfident language models across languages, showing that fluent outputs can distort judgement even when users know the system may be wrong.
    Sources: arXiv (2025); arXiv (2025); arXiv (2025)

  7. Forced adoption and token-maxing are weak value strategies. Business Insider reported that executives at Replit, BNP Paribas CIB, La Banque Postale, TCS, and NTT DATA rejected token counts and leaderboards as ROI measures. A 2026 arXiv study on agentic coding tasks found that higher token use does not reliably buy higher accuracy.
    Sources: Business Insider (2026); Business Insider (2026); arXiv (2026)

  8. The model-welfare debate is ahead of the worker-welfare debate. Anthropic publishes model cards, computer-use guidance, interpretability work, and policy choices around model behaviour, while the research brief surfaces no equivalent mainstream organisational function for human welfare in AI deployment. The closest emerging roles sit in governance, responsible AI, HR, security, and operations, but they rarely own cognitive load, dignity, deskilling, and review burden as one accountable brief.
    Sources: Anthropic (2024); Anthropic (2024); Anthropic (2025); arXiv (2026)

Evidence & Data

The quantitative picture is split. Adoption has crossed into the mainstream, with Gallup reporting 50% workplace use and 13% daily use in the US by Q1 2026. But Henseke’s 35-country European study found only 12% average workplace adoption in 2024 and no detectable task restructuring yet, which suggests that usage statistics can outrun organisational change.
Sources: TechRadar (2026); arXiv (2026)

The value numbers are weaker than the adoption numbers. PwC’s 2026 CEO survey found 56% of CEOs reporting no revenue or cost benefit, and only 12% reporting both higher revenue and lower costs. Glean’s 2026 findings add a labour-cost explanation: white-collar workers spend 6.4 hours a week correcting and managing AI, and only 13% see major organisational performance gains.
Sources: Business Insider (2026); Business Insider (2026)

Developer evidence is the sharpest contradiction. Google’s enterprise RCT found a 21% task-time reduction, but METR found experienced open-source developers 19% slower with AI tools. A separate open-source study found productivity gains alongside a 41.6% rise in integration time, which points to a familiar pattern: AI can speed local production while increasing system-level coordination cost.
Sources: arXiv (2024); arXiv (2025); arXiv (2024)
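The pattern in the developer evidence can be framed as a simple full-cost calculation: gross time saved on local production, minus the oversight and coordination time the tool creates. The sketch below is illustrative only. The class name, field names, and every number in the example are hypothetical and are not taken from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyAICost:
    """Hypothetical per-worker weekly ledger, all values in hours."""
    gross_hours_saved: float      # time saved producing first drafts or code
    review_hours: float           # checking AI output
    integration_hours: float      # merging output into the wider system
    rework_hours: float           # fixing output that passed review
    context_switch_hours: float   # recovering context between tools or agents
    cleanup_hours: float          # colleagues correcting downstream problems

    def overhead_hours(self) -> float:
        """Total human time the tool adds back into the workflow."""
        return (self.review_hours + self.integration_hours + self.rework_hours
                + self.context_switch_hours + self.cleanup_hours)

    def net_hours_saved(self) -> float:
        """Gross savings minus overhead; negative means a net loss."""
        return self.gross_hours_saved - self.overhead_hours()


# Illustrative numbers only (assumed, not measured).
example = WeeklyAICost(
    gross_hours_saved=8.0,
    review_hours=3.0,
    integration_hours=1.5,
    rework_hours=1.0,
    context_switch_hours=0.5,
    cleanup_hours=0.5,
)
print(example.overhead_hours(), example.net_hours_saved())
```

With these assumed inputs, most of the gross saving is absorbed by overhead. A study that measures only the first field would report a large gain, while one that measures the whole ledger would report a small one.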

Agentic systems make the same point in starker form. RE-Bench showed frontier agents outperforming human experts on short research-engineering tasks but losing ground as task duration grew, while HCAST translated autonomy into human-time thresholds such as one-minute, one-hour, and four-hour tasks. The 2026 human-oversight social-science paper then showed why capability is not enough: unconstrained multi-agent workflows failed in 72% of runs, but deterministic execution and three human gates reduced failures to 16%.
Sources: arXiv (2024); arXiv (2025); arXiv (2026)
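To make "deterministic execution with human gates" concrete, here is a minimal sketch of the general pattern. It is not the design used in the cited paper: the `Step`, `RunLog`, and `run_gated` names, the pipeline stages, and the reviewer rule are all assumptions chosen for illustration. Steps run in a fixed order, and the run halts whenever a human reviewer declines to approve a gated step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]


@dataclass
class Step:
    name: str
    run: Callable[[State], State]   # deterministic transformation of the state
    human_gate: bool = False        # require approval before continuing?


@dataclass
class RunLog:
    completed: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None


def run_gated(steps: List[Step], state: State,
              approve: Callable[[str, State], bool]) -> Tuple[State, RunLog]:
    """Run steps in fixed order; stop at the first gate a human rejects."""
    log = RunLog()
    for step in steps:
        state = step.run(dict(state))  # pass a copy so earlier state is not mutated
        if step.human_gate and not approve(step.name, state):
            log.halted_at = step.name
            return state, log
        log.completed.append(step.name)
    return state, log


# Hypothetical pipeline with three human gates.
steps = [
    Step("plan", lambda s: {**s, "plan": "compare groups A and B"}, human_gate=True),
    Step("load_data", lambda s: {**s, "rows": 120}),
    Step("analyse", lambda s: {**s, "estimate": 0.4}, human_gate=True),
    Step("write_up", lambda s: {**s, "draft": "draft text"}, human_gate=True),
]


def reviewer(step_name: str, state: State) -> bool:
    # Stand-in for a human decision: reject analyses built on too few rows.
    if step_name == "analyse":
        return int(state.get("rows", 0)) >= 100
    return True


final_state, log = run_gated(steps, {}, reviewer)
print(log.completed, log.halted_at)
```

The design choice worth noting is that the gate sits in the control flow rather than in a prompt: a rejected gate stops execution outright, so human authority does not depend on the agent choosing to defer.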

Signals & Tensions

The first tension is between model progress and human absorptive capacity. Anthropic and OpenAI moved towards computer-use and browser agents, but both also documented restrictions, confirmations, and risk controls. That is an implicit admission that more capable action systems increase the need for slower human-facing design.
Sources: Anthropic (2024); OpenAI (2025)

The second tension is between worker empowerment and managerial surveillance. Walmart’s reported rollout paired AI with certification and operational redesign, while other firms have pushed usage mandates. The evidence favours worker-in-the-loop design, but many programmes still measure activity because it is easier than measuring judgement, quality, or downstream burden.
Sources: Financial Times (2026); Business Insider (2025); Business Insider (2026)

The third tension is that AI may raise the floor while lowering some ceilings. Customer-support novices gained from AI assistance, but learning and programming studies show weaker conceptual understanding, poorer debugging, or reduced independent judgement. That makes adoption policy a training policy, not just a procurement choice.
Sources: National Bureau of Economic Research (2023); arXiv (2026); arXiv (2024)

The fourth tension is that safety research still focuses more on model behaviour than on worker burden. OpenAI’s preparedness framework adds long-range autonomy, and Anthropic’s interpretability work tries to understand model internals. The organisational equivalent would track review load, alert fatigue, deskilling, loss of agency, and escalation quality, but that function is not yet visible at the same maturity.
Sources: OpenAI (2025); Anthropic (2025); arXiv (2026)

The fifth tension is that public harm often comes from product design rather than benchmark weakness. Wired’s reporting on Grok hosting sexualised deepfakes and Time’s reporting on the Gemini safety-pledge controversy both show harm emerging from release governance, moderation, disclosure, and enforcement choices. Better training helps, but deployment design decides who carries the risk.
Sources: Wired (2026); Time (2025)

Open Questions

  1. Which AI gains survive full-cost accounting after review time, integration work, rework, context switching, and colleague clean-up are included? Current studies measure slices of the workflow better than the whole system.
    Sources: arXiv (2025); Harvard Business Review (2025)

  2. What is the safe agent-to-human ratio for different kinds of knowledge work? The evidence says multiple agents can overwhelm attention, but it does not yet give stable staffing rules by task type, risk level, or worker expertise.
    Sources: Simon Willison’s Weblog (2026); arXiv (2026)

  3. How should organisations measure worker welfare in AI deployments? Adoption dashboards rarely capture cognitive load, vigilance fatigue, deskilling, loss of agency, or whether humans have real authority to stop an automated process.
    Sources: arXiv (2026); arXiv (2026)

  4. Which parts of good human outcomes come from model training, and which come from orchestration? Current evidence suggests workflow design, gating, context control, and escalation can dominate raw capability, but the field lacks clean comparative trials across the same task with different deployment designs.
    Sources: arXiv (2026); arXiv (2025)

  5. What happens to early-career formation if entry-level workers are expected to supervise outputs that previously taught them the job? PwC’s jobs analysis suggests entry-level roles increasingly demand senior skills, while skill-formation studies warn that assistance can weaken conceptual understanding.
    Sources: Business Insider (2026); arXiv (2026)

  6. Who owns the operating model for AI work? Security owns data risk, legal owns compliance, HR owns training, and product teams own rollout, but the evidence points to a missing accountable role for human attention, judgement, and welfare across the whole deployment.
    Sources: arXiv (2025); arXiv (2026)

The organisations that get this right will not be the ones that make people use the most AI. They will be the ones that spend human attention as carefully as compute.




Sources



Financial Press

ID Title Outlet Date Significance
f1 AI and the productivity paradox Financial Times 2026-06 This FT newsletter gives a current enterprise view of AI's hidden supervision costs, including 'botsitting' and the 'toggle tax', and ties them to weak company-level productivity gains despite heavy employee usage.
f2 Successful AI adoption needs workers in the loop Financial Times 2025-10 This piece is directly on point for operating-model design, arguing that firms get better results when employees retain agency and oversight rather than being subjected to abstract top-down AI programmes.
f3 High earners race ahead on AI as workplace divide widens Financial Times 2026-04 The FT and Focaldata survey shows adoption is uneven by income, experience and gender, which matters for any claim that AI benefits are broadly distributed across organisations.
f4 AI's adoption problem Financial Times 2026-05 This article captures the widening gap between executive optimism and worker scepticism, and links adoption failure to poor organisational messaging and weak trust.
f5 Walmart tells workers that AI will improve their jobs, not steal them Financial Times 2026-06 Walmart offers a concrete case of a large employer trying to pair AI rollout with certification, workflow redesign and job-security messaging rather than explicit substitution.
f6 Generative AI at Work National Bureau of Economic Research 2023 This field study remains one of the strongest pieces of causal evidence on measured gains, showing productivity improvements in customer support but with large differences by worker experience.
f7 The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers arXiv 2024-05 This paper matters because it examines novice workers directly and highlights that AI can help complete tasks while worsening metacognitive habits and independent problem solving.
f8 Automation from the Worker's Perspective arXiv 2024-09 Based on a large cross-country worker survey, this study shows that perceptions of benefit are conditional on job design, worker status and incentives rather than simple demographic labels.
f9 Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity arXiv 2025-07 This randomised study is useful because it cuts against the standard productivity story, finding that frontier coding tools slowed experienced developers in realistic project settings.
f10 Generative AI Uses and Risks for Knowledge Workers in a Science Organization arXiv 2025-01 This organisational study distinguishes copilot use from workflow-agent use and documents risk concerns around security, publication norms and job effects inside a real science institution.
f11 AI and Worker Well-Being: Differential Impacts Across Generational Cohorts and Genders arXiv 2025-11 Using OECD survey microdata, this paper is one of the cleaner pieces of evidence that AI's gains and harms vary by life stage and gender rather than a crude young-versus-old frame.
f12 Generative AI and the Reallocation of Time: Productivity, Leisure, and Fulfilling Work arXiv 2026-02 This paper matters for ROI claims because it finds that time savings can be real while measured output barely moves, with some gains taken as on-the-job leisure rather than higher throughput.
f13 From Future of Work to Future of Workers: Addressing Asymptomatic AI Harms for Dignified Human-AI Interaction arXiv 2026-01 This study names slow-building harms such as skill atrophy and loss of judgement, which are central to the question of whether better outcomes depend more on deployment design than model behaviour alone.

Frontier Lab & Model News

ID Title Outlet Date Significance
t1 Claude 3.5 Sonnet Anthropic 2024-06 Anthropic positioned Claude 3.5 Sonnet as a faster, cheaper frontier model for multi-step workflows, which matters because adoption pressure often follows claims of higher throughput and lower supervision cost.
t2 Claude 3.5 Sonnet Model Card Addendum Anthropic 2024-06 The addendum gives the formal benchmark and safety framing behind Claude 3.5 Sonnet, including its stronger agentic coding and vision scores relative to earlier models.
t3 Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku Anthropic 2024-10 This is one of the clearest lab statements that frontier models were moving from assistant behaviour to direct action on user interfaces, with explicit acknowledgement that the capability was still experimental and error-prone.
t4 Developing a computer use model Anthropic 2024-10 Anthropic’s technical note is directly relevant to human oversight because it details the safety and deployment problems created when models act through the same interfaces as people.
t5 Tracing the thoughts of a large language model Anthropic 2025-03 This interpretability work matters for the brief’s training-versus-deployment question because it argues that understanding internal model strategies is part of making human-facing systems reliable and trustworthy.
t6 OpenAI o3-mini System Card OpenAI 2025-01 OpenAI explicitly rated o3-mini as Medium risk on model autonomy, linking improved coding and research engineering performance to stronger agentic capability and higher oversight demands.
t7 Operator System Card OpenAI 2025-01 Operator is a key source on how labs are designing human-in-the-loop controls such as confirmations, action restrictions, and oversight gates for computer-using agents.
t8 OpenAI GPT-4.5 System Card OpenAI 2025-02 GPT-4.5’s system card frames a large model around more natural interaction and improved alignment with user intent, which is relevant to whether better human outcomes come from model behaviour rather than orchestration alone.
t9 Our updated Preparedness Framework OpenAI 2025-04 The framework introduces long-range autonomy as a research category and makes deployment safety more explicitly operational, showing how frontier labs are formalising risk ownership around increasingly agentic systems.
t10 Introducing GPT-4.1 in the API OpenAI 2025-04 OpenAI marketed GPT-4.1 as better for agents, long context, and real-world software tasks, which is central to the shift from isolated prompts to sustained supervisory work over model-driven processes.
t11 OpenAI o3 and o4-mini System Card OpenAI 2025-04 This system card documents full tool use, including web browsing and file analysis, and ties those capabilities to deliberative alignment and preparedness testing.
t12 Addendum to OpenAI o3 and o4-mini system card: Codex OpenAI 2025-05 The Codex addendum is unusually concrete about workflow design, describing isolated task containers, verifiable evidence, and test-running loops rather than pure chat interaction.
t13 Exclusive: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge Time 2025-09 The report captures external criticism that Gemini 2.5 Pro reached the public before timely safety disclosure, which sharpens the gap between model capability release cycles and accountable human governance.
t14 Google introduces stable Gemini 2.5 Flash and Pro, previews Gemini 2.5 Flash-Lite The Economic Times 2025-06 This marks Google’s move to productionise the Gemini 2.5 line, signalling that reasoning-heavy models were no longer just experimental and were becoming standard building blocks for deployment.
t15 Elon Musk's startup rolls out new Grok-3 chatbot as AI competition intensifies The Guardian 2025-02 The Grok-3 launch illustrates the competitive pressure to release reasoning and search features quickly, even when questions about cost discipline and safeguards remain unresolved.
t16 Grok Is Still Hosting Sexualized Deepfakes of Famous Women Wired 2026-06 Wired’s reporting is a concrete case where deployment and moderation design, not just base-model intelligence, shaped human harm outcomes after release.
t17 SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts arXiv 2025-05 SAGE-Eval is a useful independent check on frontier systems because it tests whether models carry known safety facts into naive user scenarios, which is closely related to real workplace reliance and over-trust.
t18 VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation arXiv 2025-05 VADER compares o3, GPT-4.1, GPT-4.5, Claude 3.7 Sonnet, Gemini 2.5 Pro, and Grok 3 Beta on security work and finds only moderate success, tempering claims that current frontier models can be supervised lightly on consequential tasks.

Academic & arXiv

ID Title Outlet Date Significance
a1 RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts arXiv 2024-11 METR's RE-Bench gives one of the clearest recent human versus agent comparisons, showing AI agents can move faster than experts on short research-engineering tasks but lose ground as task duration and supervisory demands increase.
a2 HCAST: Human-Calibrated Autonomy Software Tasks arXiv 2025-03 METR's HCAST ties agent performance to human task-time baselines, which is directly useful for judging when oversight remains realistic and when organisations are asking humans to supervise beyond their effective range.
a3 (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable arXiv 2026-06 This preprint isolates workflow design from model quality and finds that human gates plus deterministic execution cut failure rates from 72% to 16% in AI-assisted research.
a4 Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering arXiv 2026-06 Garousi names oversight labour and suggestion overload as hidden costs of coding assistants, making the burden itself part of the productivity calculation rather than an afterthought.
a5 Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance arXiv 2026-05 This paper offers a formal account of why more AI help can lower net productivity once skill development, unreliable outputs, and heterogeneous AI literacy are included.
a6 How AI Impacts Skill Formation arXiv 2026-01 Shen and Tamkin provide experimental evidence that delegation to AI can improve throughput for some users while impairing conceptual understanding, debugging ability, and later independent performance.
a7 Generative AI at Work: From Exposure to Adoption across 35 European Countries arXiv 2026-04 Using a 36,600-worker survey across 35 countries, Henseke shows that adoption depends not just on exposure but on skills, organisational voice, and training, with no detectable task restructuring yet.
a8 Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition arXiv 2025-01 This study shows that transparent multi-step workflows can improve reliance calibration on composite fact-checking tasks, especially when AI advice is misleading.
a9 Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies arXiv 2025-02 Kim, Vaughan, Liao, Lombrozo, and Russakovsky show that explanations can raise reliance on both right and wrong answers, while sources and visible inconsistencies help users discount bad outputs.
a10 Emerging Reliance Behaviors in Human-AI Text Generation: Hallucinations, Data Quality Assessment, and Cognitive Forcing Functions arXiv 2024-09 This paper links hallucination handling and cognitive forcing functions to observable reliance patterns in text-generation work, rather than treating verification as a generic best practice.
a11 Human Misperception of Generative-AI Alignment: A Laboratory Experiment arXiv 2025-02 He, Shorrer, and Xia find that people systematically overestimate how closely GenAI choices match human preferences, which matters for welfare claims and delegated decision-making.
a12 Toward Human-AI Complementarity Across Diverse Tasks arXiv 2026-04 This paper finds only modest complementarity gains across realistic tasks and argues that the real bottleneck is routing hard cases to humans in time for them to matter.
a13 Humans overrely on overconfident language models, across languages arXiv 2025-07 Rathi, Jurafsky, and Zhou show that overconfidence and overreliance persist across five languages, suggesting that calibration failures are not a narrow English-only artefact.
a14 When Thinking Pays Off: Incentive Alignment for Human-AI Collaboration arXiv 2025-11 This behavioural experiment shows that overreliance is partly an incentive design problem, and that collaboration quality changes when organisations reward correct dissent rather than passive acceptance.
a15 De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design arXiv 2025-03 This design-focused study connects practitioner concerns about de-skilling and cognitive offloading to the older automation literature on function allocation and responsibility drift.
a16 The Effects of Generative AI on Design Fixation and Divergent Thinking arXiv 2024-03 This experiment finds that image-generation support can increase fixation and reduce originality and variety, giving concrete evidence that convenience can narrow thought rather than broaden it.
a17 Creativity in the Age of AI: Evaluating the Impact of Generative AI on Design Outputs and Designers' Creative Thinking arXiv 2024-10 This study finds more creative-seeming outputs with AI support but uneven cognitive effects across users, which complicates simple claims that AI either helps or harms creativity.
a18 Controlling Context: Generative AI at Work in Integrated Circuit Design and Other High-Precision Domains arXiv 2025-06 Moss, Watkins, Persaud, Karunaratne, and Nafus show that in high-precision domains the key issue is not just accuracy but preserving enough context control for human vigilance and review.
a19 Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study arXiv 2025-10 This representative Swiss panel finds declining public acceptance after the ChatGPT era and rising demand for human-only decision-making, a direct warning against mandate-led deployment.
a20 Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale arXiv 2025-12 This scale paper gives the field a way to measure verification, motivation, and reflection in AI use, which is necessary if organisations want to manage human outcomes rather than token volume.
a21 Enhancing Critical Thinking in Generative AI Search with Metacognitive Prompts arXiv 2025-05 This user study shows that metacognitive prompts can increase follow-up inquiry and perspective-taking during AI search, pointing to a concrete intervention for reducing passive acceptance.
a22 Promoting Critical Thinking With Domain-Specific Generative AI Provocations arXiv 2026-03 von Davier, Lee, Forlizzi, and Das argue that productive friction and domain-specific provocations can support critical thinking better than frictionless assistant behaviour.
a23 Current and Future Use of Large Language Models for Knowledge Work arXiv 2025-03 These surveys of knowledge workers show that adoption is already broad, but desired future use centres on workflow integration, which shifts the design question from access to operating model.
a24 An Empirical Study of Generative AI Adoption in Software Engineering arXiv 2025-12 This empirical study reports widespread use and perceived gains in software engineering, while also finding thin objective measurement and weak institutional emphasis on training and governance.
a25 The State of Generative AI in Software Development: Insights from Literature and a Developer Survey arXiv 2026-03 This review-plus-survey argues that value is strongest in routine coding and documentation, while planning and requirements work remain harder, shifting attention toward oversight and specification quality.

VC & Analyst Reports

ID Title Outlet Date Significance
v1 Employers want entry-level workers with senior-level skills in the age of AI, a huge PwC analysis found Business Insider 2026-06 Reports PwC's 2026 AI Jobs Barometer finding that AI-exposed entry-level roles in the US are seven times more likely than in 2019 to require traditionally senior skills such as judgement, leadership, and stakeholder management.
v2 Bosses don't think AI is paying off yet, a PwC survey of 4,500 CEOs found Business Insider 2026-01 Summarises PwC's 2026 Global CEO Survey, which found that 56% of CEOs reported no revenue or cost benefits from AI and only 12% reported both higher revenue and lower costs.
v3 The rise of the 'botsitters' Business Insider 2026-06 Cites Glean's Work AI Institute finding that white-collar workers spend 6.4 hours a week correcting and managing AI, while only 13% see major organisational performance gains.
v4 The top 5 most common ways people say they're using AI in the workplace Business Insider 2025-12 Uses Gallup survey data to show that workplace use is rising, but the dominant applications remain basic chat, writing, and coding assistance rather than autonomous multi-agent supervision.
v5 'Most enterprises are still unprepared to operationalize it': IT leaders are bullish on agents, but keeping falling at the final hurdle - here's why ITPro 2026-06 Summarises new Forrester research saying about 75% of enterprise leaders are adopting agentic AI, yet most remain stuck in pilots because orchestration, governance, and nonhuman identity controls are weak.
v6 Concerns are mounting over the cognitive impact of AI as workers report experiencing 'brain fry' - and it's causing "increased employee errors, decision fatigue, and intention to quit" ITPro 2026-03 Reports Boston Consulting Group research on 'AI brain fry', linking heavy AI oversight work to mental fog, decision fatigue, higher error rates, and higher intention to quit.
v7 Is this the tipping point for AI at work? New Gallup survey finds half of all US employees now use it in some way TechRadar 2026-04 Summarises Gallup's Q1 2026 survey of 23,717 US employees, which found that 50% use AI at work and 13% use it daily, but task-level gains still exceed whole-workflow redesign.
v8 Microsoft, Shopify, and other companies now require employees to use AI. How is AI changing your work? Business Insider 2025-08 Captures the mandate phase of enterprise adoption, citing Bain's estimate that average employer AI spending doubled in 2024 to $10.3 million while regular use remained uneven between leaders and frontline staff.
v9 Generative AI at Work: From Exposure to Adoption across 35 European Countries arXiv 2026-04 Provides cross-country evidence that adoption tracks skills, workplace training, and employee say in organisational decisions, with no detectable task restructuring yet in the 2024 data.
v10 A meta-analysis of the effect of generative AI on productivity and learning in programming arXiv 2026-05 Synthesises 23 studies and finds a moderate productivity gain for coding assistants but no statistically significant improvement in learning outcomes, which is directly relevant to deskilling concerns.
v11 Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape arXiv 2026-04 Survey evidence from 457 researchers shows strong perceived productivity gains but continued distrust on correctness, with AI concentrated in writing and early-stage work rather than core methodological judgement.
v12 Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study arXiv 2025-10 Shows that public acceptance of AI fell after the generative AI boom and demand for human oversight rose, which challenges investor narratives that adoption pressure alone will normalise AI-heavy workflows.
v13 Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey arXiv 2026-02 Finds that workers demand materially higher accuracy from AI at work than in personal use, a useful reminder that enterprise deployment fails when review burdens exceed tolerance for correction.
v14 Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework arXiv 2026-06 Offers a concrete deployment pattern from Nubank, where structured context engineering, human-in-the-loop prompt iteration, and offline-to-online evaluation produced measurable customer-satisfaction and self-service gains.

Blogs & Independent Thinkers

ID Title Outlet Date Significance
b1 Management as AI superpower One Useful Thing 2026-01 Ethan Mollick argues that agentic work shifts value from execution to delegation, evaluation, and specification, grounding the organisational case for treating management skill and subject matter expertise as the scarce resource.
b2 Claude Dispatch and the Power of Interfaces One Useful Thing 2026-03 Mollick uses a valuation-task paper to show that chatbot UX can erase AI productivity gains by overloading users with sprawling outputs, making interface design a first-order operating-model issue.
b3 Choosing to Stay Human One Useful Thing 2026-05 This essay frames AI adoption as a choice about where to preserve human skill formation, warning that convenience can erode writing judgement and flood attention with low-meaning synthetic content.
b4 The lethal trifecta for AI agents: private data, untrusted content, and external communication Simon Willison’s Weblog 2025-06 Willison reduces a broad agent-safety debate to a concrete deployment rule: combining private-data access, untrusted input, and external communication creates a prompt-injection exfiltration hazard.
b5 Writing about Agentic Engineering Patterns Simon Willison’s Weblog 2026-02 Willison argues that coding agents lower the cost of producing working code to near zero, shifting the bottleneck to verification and changing how teams should structure engineering work.
b6 Highlights from my conversation about agentic engineering on Lenny’s Podcast Simon Willison’s Weblog 2026-04 Willison gives a practitioner account of the cognitive cost of supervising multiple agents in parallel, describing four concurrent coding agents as enough to wipe out an experienced engineer by late morning.
b7 If Claude Fable stops helping you, you'll never know Jonathon Ready 2026-06 Ready identifies a direct conflict between a lab's hidden model-side restrictions and user welfare, arguing that silent degradation creates supply-chain risk for ordinary software teams building AI features.
b8 AI as Normal Technology AI as Normal Technology 2025-04 Arvind Narayanan and Sayash Kapoor argue against technological determinism, separating model progress from application design and adoption, which is central to judging whether harms are fixed in training or in deployment.
b9 New Paper: Towards a science of AI agent reliability AI as Normal Technology 2026-02 Kapoor, Narayanan, and Stephan Rabanser argue that capability gains have outpaced reliability gains, offering a framework for why impressive demos do not automatically translate into dependable organisational use.
b10 Why AI hasn’t replaced software engineers, and won’t AI as Normal Technology 2026-06 This essay rejects the threshold story of sudden labour replacement, arguing instead that AI compresses execution while decision-making, verification, and accountability remain stubbornly human.
b11 Predicting AI job exposure Benedict Evans 2026-05 Evans argues that job-exposure charts miss how automation changes business models, regulation, and task composition, which makes simple role-level forecasts unreliable for planning.
b12 (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable arXiv 2026-06 This paper reports that an unconstrained multi-agent baseline failed in 72% of runs, while a harness with deterministic computation and three human decision gates cut failures to 16%.
b13 Oversight Structures for Agentic AI in Public-Sector Organizations arXiv 2025-06 This paper argues that agentic AI requires continuous oversight, tighter integration of governance with operations, and cross-departmental coordination rather than episodic compliance review.
b14 As companies rethink AI ROI, Replit's AI chief calls token leaderboards 'very dystopian' Business Insider 2026-06 The report captures an emerging backlash against token-maximising mandates, with Replit's Michele Catasta arguing that raw token burn is a misleading proxy for productivity or value.
b15 I asked 4 executives how they measure AI ROI. None started with AI tokens. Business Insider 2026-06 This report shows large organisations moving from activity metrics to outcome metrics, with BNP Paribas CIB, La Banque Postale, TCS, and NTT DATA all rejecting token counts as the main ROI measure.
b16 How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks arXiv 2026-04 The paper finds that agentic coding runs can consume 1000 times more tokens than simpler code interactions, with high variance and no reliable link between higher token use and better accuracy.

Tech Industry & Practitioner

ID Title Outlet Date Significance
p1 [DORA Accelerate State of DevOps Report 2024](https://dora.dev/research/2024/dora-report/) DORA, Google Cloud 2024
p2 2025 DORA State of AI Assisted Software Development Google Cloud 2025 This follow-on DORA report matters because it shifts assessment from raw output to team archetypes and human factors such as burnout, friction, and perceived value.
p3 DORA Report 2024 – A Look at Throughput and Stability RedMonk 2024-11 Rachel Stephens translates the DORA findings into an operating-model critique, arguing that code generation may not be the bottleneck and that organisations can optimise the wrong constraint.
p4 How much does AI impact development speed? An enterprise-based randomized controlled trial arXiv 2024-10 Google's RCT with full-time engineers provides one of the cleaner measured-benefit studies, finding about a 21% reduction in time on a complex enterprise task under controlled conditions.
p5 Examining the Use and Impact of an AI Code Assistant on Developer Productivity and Experience in the Enterprise arXiv 2024-12 IBM's internal deployment study shows that enterprise gains are uneven, with productivity benefits present but not universal, and with responsibility and ownership of generated code becoming central issues.
p6 Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity arXiv 2025-07 METR's field experiment is a strong counterweight to vendor claims, finding experienced open-source developers were 19% slower with frontier AI tools despite expecting to be faster.
p7 The SPACE of AI: Real-World Lessons on AI's Impact on Developers arXiv 2025-07 This mixed-methods study uses the SPACE framework to show that benefits cluster around routine work and depend heavily on task complexity, peer learning, and organisational support.
p8 The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot arXiv 2024-10 This study finds project-level productivity gains in open source but also a 41.6% increase in integration time, making coordination cost a first-order part of the AI productivity story.
p9 AI-assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden arXiv 2025-10 This paper matters because it shows productivity gains can shift maintenance and review burden onto core developers, worsening outcomes for the people who carry system knowledge.
p10 Developer Productivity with GenAI arXiv 2025-10 Using the SPACE lens with 415 practitioners, this paper argues that faster output does not reliably translate into better software or better wellbeing, which is central to judging AI adoption programmes.
p11 What do professional software developers need to know to succeed in an age of Artificial Intelligence? arXiv 2025-05 This practitioner-focused study reframes the adaptation problem around workflow judgement, adjacent engineering skills, and non-technical skills rather than prompt fluency alone.
p12 What Challenges Do Developers Face in AI Agent Systems? An Empirical Study on Stack Overflow arXiv 2025-10 By mining Stack Overflow, this paper identifies recurring implementation pain around orchestration complexity, evaluation reliability, and runtime integration in agent systems.
p13 AI-Generated “Workslop” Is Destroying Productivity Harvard Business Review 2025-09 This study-backed HBR piece gives a concrete mechanism for failed ROI, namely low-effort AI output that looks plausible but pushes cognitive and rework costs onto colleagues.
p14 [Workslop: The Hidden Cost of AI-Generated Busywork](https://www.betterup.com/workslop) BetterUp Labs 2025
p15 Researchers Asked LLMs for Strategic Advice. They Got “Trendslop” in Return. Harvard Business Review 2026-03 This HBR article extends the critique beyond code into managerial judgement, showing how models can produce fashionable but shallow recommendations that reward buzzwords over reasoning.
p16 'Botsitting' is destroying productivity as workers spend nearly a full day each week making AI 'usable' ITPro 2026-06 This report on Glean's Work AI Institute findings captures the hidden supervision tax, 6.4 hours a week spent feeding context, checking outputs, and correcting errors.
p17 3 in 4 workers say AI reduced productivity and increased workloads, survey finds Business Insider 2024-08 Upwork's survey is useful as an early warning that mandate-led adoption can raise review load and learning overhead faster than it creates value.
p18 84% of software developers are now using AI, but nearly half 'don't trust' the technology over accuracy concerns ITPro 2025-08 This Stack Overflow survey coverage anchors the adoption-trust gap, with broad usage rising while distrust, debugging effort, and security concern remain high.
p19 UK software developers are still cautious about AI, and for good reason ITPro 2025-10 JetBrains' ecosystem survey adds a regional practitioner view showing that caution concentrates around code quality, privacy, and retaining human control over reviews and testing.
p20 No AI overload just yet? Google's new survey reveals how developers are really using AI at work TechRadar 2025-10 This report on Google's survey is valuable because it pairs very high developer adoption with low strong trust, supporting the claim that supervision, not surrender, remains the norm.
