Research sweep · deep · 2024 – 2026

Designing AI Operating Models Around Humans

How humans are adapting to AI between June 2024 and June 2026, weighing measured benefits and harms, and how organisations should design operating models around human cognitive load and behavioural patterns rather than forcing adoption. The sweep covers four themes: cognitive overload from supervising multiple agents at machine speed (context switching, automation complacency, vigilance fatigue); the poor budget and value outcomes of top-down AI mandates and token-maximising usage; the gap between model welfare functions (such as Anthropic's) and any equivalent human or worker welfare function; and how much good human outcomes depend on model training versus orchestration and deployment design.

GPT-5.5 · financial · frontier · academic · vc · blogs · tech

Synthesised 2026-06-16

Overview

AI adoption has moved faster than organisational learning. By Q1 2026, Gallup found that half of US employees used AI at work and 13% used it daily, yet PwC’s CEO survey found that 56% of CEOs saw no revenue or cost benefit from AI and only 12% saw both revenue growth and lower costs. The centre of gravity has shifted from access to absorption: people can now reach capable tools, but firms are struggling to convert use into durable value.
Sources: TechRadar (2026) (↗); Business Insider (2026) (↗)

The defining change since June 2024 is that AI has become less like a writing aid and more like a delegated worker. Anthropic’s computer-use release and OpenAI’s Operator system card both moved frontier systems towards browser and interface action under human oversight. That shift increases the value of good delegation, but it also moves hidden work onto humans: checking, pacing, escalation, context recovery, and deciding when not to trust the machine.
Sources: Anthropic (2024) (↗); OpenAI (2025) (↗)

The evidence does not support a flat story about “AI helps workers” or “AI harms workers”. The NBER customer-support study found larger gains for novice workers, while METR’s field experiment found experienced open-source developers were 19% slower with early-2025 AI tools despite expecting a 24% speed-up. Effects vary by task, skill, context, measurement method, and how much review burden the workflow creates.
Sources: National Bureau of Economic Research (2023) (↗); arXiv (2025) (↗)

The practical lesson is simple but demanding: organisations should design AI operating models around human attention, judgement, and behavioural limits. Token counts, tool counts, and leaderboards measure activity, not value. The better evidence points towards constrained workflows, explicit human gates, selective use cases, and accountability for worker welfare, not blanket mandates to use more AI.
Sources: Business Insider (2026) (↗); arXiv (2026) (↗)

Timeline

Key milestones, June 2024 to June 2026

Q2 2024

Frontier assistants become faster workflow tools

Q3 2024

Developer and design studies show early deskilling and fixation risks

Q4 2024

Computer-use agents make human oversight a deployment problem
DORA reports weaker delivery outcomes with higher AI adoption

Q1 2025

Browser agents formalise confirmation and action restrictions
Appropriate-reliance research matures

Q2 2025

Preparedness frameworks add long-range autonomy
Agent reliability becomes a public research agenda

Q3 2025

Experienced developers show negative productivity effects
Workslop and maintenance burden enter practitioner debate

Q4 2025

AI acceptance softens after the boom
Token use and value measurement diverge

Q1 2026

Workplace adoption reaches mainstream levels
CEOs report weak revenue and cost outcomes

Q2 2026

Botsitting and brain fry become named operating costs
Human-gated agent workflows outperform unconstrained multi-agent systems

Key Findings

Individual productivity gains are real, but they do not automatically compound into organisational performance. Google’s enterprise RCT found about a 21% reduction in time spent on a complex internal task, and IBM’s internal developer study reported net gains for many developers. Against that, DORA’s 2024 report associated higher AI adoption with lower delivery stability, lower throughput, and less time spent on valuable work.
Sources: arXiv (2024) (↗); arXiv (2024) (↗); DORA, Google Cloud (2024) (↗)
Skill effects are asymmetric. Novices can gain speed and quality when AI supplies missing procedural knowledge, as in the NBER customer-support study. But novice programmers can also lose independent problem-solving capacity, and experienced developers can lose time when AI increases verification and integration work.
Sources: National Bureau of Economic Research (2023) (↗); arXiv (2024) (↗); arXiv (2025) (↗)
AI adoption is widening workplace divides. The Financial Times and Focaldata found that daily AI use in 2026 was concentrated among higher earners and more experienced staff, with a persistent gender gap. Henseke’s cross-European study found only 12% average workplace adoption in 2024, with adoption shaped by training, non-routine cognitive work, and employee say in organisational decisions.
Sources: Financial Times (2026) (↗); arXiv (2026) (↗)
Human oversight is not a free safety layer. Glean’s Work AI Institute reported workers spending 6.4 hours a week “botsitting”, that is, correcting and managing AI systems, while Boston Consulting Group research described decision fatigue, higher error rates, and higher intention to quit among workers doing heavy AI oversight. In software engineering, Garousi’s 2026 paper frames review effort and suggestion overload as direct costs rather than side effects.
Sources: Business Insider (2026) (↗); ITPro (2026) (↗); arXiv (2026) (↗)
Architecture often matters more than raw model capability. Zhu, Wang, and Zhang’s 2026 social-science experiment found that an unconstrained multi-agent baseline failed in 72% of runs, while a workflow with deterministic execution and three human gates cut failures to 16%. Nubank’s customer-support agent paper also points towards context engineering, evaluation, and escalation design as the source of value at scale.
Sources: arXiv (2026) (↗); arXiv (2026) (↗)
Reliance failures are behavioural, not just technical. Studies on appropriate reliance find that explanations can increase trust in wrong answers, while sources, visible inconsistencies, and structured workflows improve calibration. Rathi, Jurafsky, and Zhou found humans overrely on overconfident language models across languages, showing that fluent outputs can distort judgement even when users know the system may be wrong.
Sources: arXiv (2025) (↗); arXiv (2025) (↗); arXiv (2025) (↗)
Forced adoption and token-maxing are weak value strategies. Business Insider reported that executives at Replit, BNP Paribas CIB, La Banque Postale, TCS, and NTT DATA rejected token counts and leaderboards as ROI measures. A 2026 arXiv study on agentic coding tasks found that higher token use does not reliably buy higher accuracy.
Sources: Business Insider (2026) (↗); Business Insider (2026) (↗); arXiv (2026) (↗)
The model-welfare debate is ahead of the worker-welfare debate. Anthropic publishes model cards, computer-use guidance, interpretability work, and policy choices around model behaviour, while the research brief surfaces no equivalent mainstream organisational function for human welfare in AI deployment. The closest emerging roles sit in governance, responsible AI, HR, security, and operations, but they rarely own cognitive load, dignity, deskilling, and review burden as one accountable brief.
Sources: Anthropic (2024) (↗); Anthropic (2024) (↗); Anthropic (2025) (↗); arXiv (2026) (↗)

Evidence & Data

The quantitative picture is split. Adoption has crossed into the mainstream, with Gallup reporting 50% workplace use and 13% daily use in the US by Q1 2026. But Henseke’s 35-country European study found only 12% average workplace adoption in 2024 and no detectable task restructuring yet, which suggests that usage statistics can outrun organisational change.
Sources: TechRadar (2026) (↗); arXiv (2026) (↗)

The value numbers are weaker than the adoption numbers. PwC’s 2026 CEO survey found 56% of CEOs reporting no revenue or cost benefit, and only 12% reporting both higher revenue and lower costs. Glean’s 2026 findings add a labour-cost explanation: white-collar workers spend 6.4 hours a week correcting and managing AI, and only 13% see major organisational performance gains.
Sources: Business Insider (2026) (↗); Business Insider (2026) (↗)

Developer evidence is the sharpest contradiction. Google’s enterprise RCT found a 21% task-time reduction, but METR found experienced open-source developers 19% slower with AI tools. A separate open-source study found productivity gains alongside a 41.6% rise in integration time, which points to a familiar pattern: AI can speed local production while increasing system-level coordination cost.
Sources: arXiv (2024) (↗); arXiv (2025) (↗); arXiv (2024) (↗)
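
The pattern of faster local production alongside higher coordination cost can be made concrete with a small break-even calculation. This is an illustrative sketch, not a result from any of the cited studies. It assumes baseline task time is normalised to 1, that a share p of that time is production work which AI speeds up by a fraction s, and that the remaining share 1 - p is integration work which grows by a fraction r. The symbols p, s, and r are introduced here for illustration only.

$$
T_{\text{new}} = p\,(1 - s) + (1 - p)\,(1 + r),
\qquad
T_{\text{new}} < 1 \iff s > \frac{(1 - p)\,r}{p}
$$

Taking r = 0.416 (the integration-time rise reported above) and an assumed production share of p = 0.7, the production speed-up must exceed roughly 17.8% before total time falls. If production is only half of the work (p = 0.5), the required speed-up rises to 41.6%. The smaller the share of work that AI actually accelerates, the larger the local gain has to be to pay for the added integration.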

Agentic systems make the same point in starker form. RE-Bench showed frontier agents outperforming human experts on short research-engineering tasks, while HCAST translated autonomy into human-time thresholds such as one-minute, one-hour, and four-hour tasks. The 2026 human-oversight social-science paper then showed why capability is not enough: unconstrained multi-agent workflows failed in 72% of runs, but deterministic execution and three human gates reduced failures to 16%.
Sources: arXiv (2024) (↗); arXiv (2025) (↗); arXiv (2026) (↗)
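
The contrast between unconstrained multi-agent runs and a workflow with deterministic execution and human gates can be sketched as a control-flow pattern. The Python below is a minimal illustration under stated assumptions, not the cited paper's implementation: the step names, the placement of the three gates, and the `approve` callback are all hypothetical. The idea it shows is that steps run in a fixed order and execution halts whenever a human reviewer declines to approve a gated step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


@dataclass(frozen=True)
class Step:
    name: str
    run: Callable[[State], State]  # deterministic transformation of the state
    human_gate: bool = False       # if True, a human must approve the result


class GateRejected(Exception):
    """Raised when a human reviewer declines to approve a gated step."""


def run_workflow(
    steps: List[Step],
    state: State,
    approve: Callable[[str, State], bool],
) -> Tuple[State, List[str]]:
    """Run steps in a fixed order, pausing for approval at each gated step."""
    completed: List[str] = []
    for step in steps:
        # Pass a copy so a step cannot mutate the state it was given.
        state = step.run(dict(state))
        completed.append(step.name)
        if step.human_gate and not approve(step.name, state):
            raise GateRejected(f"stopped at gate after step '{step.name}'")
    return state, completed


# Hypothetical four-step pipeline with three human gates.
PIPELINE = [
    Step("draft_plan", lambda s: {**s, "plan": "compare groups A and B"}, human_gate=True),
    Step("prepare_data", lambda s: {**s, "rows": 120}),
    Step("run_analysis", lambda s: {**s, "result": "effect estimated"}, human_gate=True),
    Step("write_report", lambda s: {**s, "report": "draft v1"}, human_gate=True),
]

if __name__ == "__main__":
    # A reviewer who approves everything lets the pipeline finish.
    final_state, done = run_workflow(PIPELINE, {}, approve=lambda name, state: True)
    print(done)
```

In a real deployment the `approve` callback would block on a person's decision rather than return immediately, which is where the attention cost discussed elsewhere in this summary is incurred. Keeping the number of gates small and placing them at consequential steps is a design choice, not something this sketch settles.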

Signals & Tensions

The first tension is between model progress and human absorptive capacity. Anthropic and OpenAI moved towards computer-use and browser agents, but both also documented restrictions, confirmations, and risk controls. That is an implicit admission that more capable action systems increase the need for slower human-facing design.
Sources: Anthropic (2024) (↗); OpenAI (2025) (↗)

The second tension is between worker empowerment and managerial surveillance. Walmart’s reported rollout paired AI with certification and operational redesign, while other firms have pushed usage mandates. The evidence favours worker-in-the-loop design, but many programmes still measure activity because it is easier than measuring judgement, quality, or downstream burden.
Sources: Financial Times (2026) (↗); Business Insider (2025) (↗); Business Insider (2026) (↗)

The third tension is that AI may raise the floor while lowering some ceilings. Customer-support novices gained from AI assistance, but learning and programming studies show weaker conceptual understanding, poorer debugging, or reduced independent judgement. That makes adoption policy a training policy, not just a procurement choice.
Sources: National Bureau of Economic Research (2023) (↗); arXiv (2026) (↗); arXiv (2024) (↗)

The fourth tension is that safety research still focuses more on model behaviour than on worker burden. OpenAI’s preparedness framework adds long-range autonomy, and Anthropic’s interpretability work tries to understand model internals. The organisational equivalent would track review load, alert fatigue, deskilling, loss of agency, and escalation quality, but that function is not yet visible at the same maturity.
Sources: OpenAI (2025) (↗); Anthropic (2025) (↗); arXiv (2026) (↗)

The fifth tension is that public harm often comes from product design rather than benchmark weakness. Wired’s reporting on Grok hosting sexualised deepfakes and Time’s reporting on the Gemini safety-pledge controversy both show harm emerging from release governance, moderation, disclosure, and enforcement choices. Better training helps, but deployment design decides who carries the risk.
Sources: Wired (2026) (↗); Time (2025) (↗)

Open Questions

Which AI gains survive full-cost accounting after review time, integration work, rework, context switching, and colleague clean-up are included? Current studies measure slices of the workflow better than the whole system.
Sources: arXiv (2025) (↗); Harvard Business Review (2025) (↗)
What is the safe agent-to-human ratio for different kinds of knowledge work? The evidence says multiple agents can overwhelm attention, but it does not yet give stable staffing rules by task type, risk level, or worker expertise.
Sources: Simon Willison’s Weblog (2026) (↗); arXiv (2026) (↗)
How should organisations measure worker welfare in AI deployments? Adoption dashboards rarely capture cognitive load, vigilance fatigue, deskilling, loss of agency, or whether humans have real authority to stop an automated process.
Sources: arXiv (2026) (↗); arXiv (2026) (↗)
Which parts of good human outcomes come from model training, and which come from orchestration? Current evidence suggests workflow design, gating, context control, and escalation can dominate raw capability, but the field lacks clean comparative trials across the same task with different deployment designs.
Sources: arXiv (2026) (↗); arXiv (2025) (↗)
What happens to early-career formation if entry-level workers are expected to supervise outputs that previously taught them the job? PwC’s jobs analysis suggests entry-level roles increasingly demand senior skills, while skill-formation studies warn that assistance can weaken conceptual understanding.
Sources: Business Insider (2026) (↗); arXiv (2026) (↗)
Who owns the operating model for AI work? Security owns data risk, legal owns compliance, HR owns training, and product teams own rollout, but the evidence points to a missing accountable role for human attention, judgement, and welfare across the whole deployment.
Sources: arXiv (2025) (↗); arXiv (2026) (↗)
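
The first question above, on full-cost accounting, can be illustrated with a small ledger. This is a hedged sketch rather than a validated measurement method: the `WeeklyLedger` class and its field names are hypothetical, the cost categories are taken from the question itself (review, integration, rework, context switching, colleague clean-up), and the example figures are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class WeeklyLedger:
    """Hours per worker per week; every value is an input the organisation must measure."""

    gross_hours_saved: float
    review_hours: float = 0.0
    integration_hours: float = 0.0
    rework_hours: float = 0.0
    context_switch_hours: float = 0.0
    colleague_cleanup_hours: float = 0.0

    def overhead_hours(self) -> float:
        """Total hidden work created by AI use."""
        return (
            self.review_hours
            + self.integration_hours
            + self.rework_hours
            + self.context_switch_hours
            + self.colleague_cleanup_hours
        )

    def net_hours_saved(self) -> float:
        """Gross saving minus overhead; negative means AI use cost time overall."""
        return self.gross_hours_saved - self.overhead_hours()


if __name__ == "__main__":
    # Invented example figures, not data from any cited study.
    ledger = WeeklyLedger(
        gross_hours_saved=5.0,
        review_hours=2.0,
        integration_hours=1.5,
        rework_hours=1.0,
        context_switch_hours=0.5,
        colleague_cleanup_hours=0.5,
    )
    print(ledger.overhead_hours(), ledger.net_hours_saved())
```

With these invented inputs the overhead is 5.5 hours against 5.0 hours saved, so the net is negative even though the headline saving looks healthy. The hard part in practice is measuring the overhead fields at all, which is the gap the question points to.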

The organisations that get this right will not be the ones that make people use the most AI. They will be the ones that spend human attention as carefully as compute.

Sources

Financial Press

ID	Title	Outlet	Date	Significance
f1	AI and the productivity paradox	Financial Times	2026-06	This FT newsletter gives a current enterprise view of AI's hidden supervision costs, including 'botsitting' and the 'toggle tax', and ties them to weak company-level productivity gains despite heavy employee usage.
f2	Successful AI adoption needs workers in the loop	Financial Times	2025-10	This piece is directly on point for operating-model design, arguing that firms get better results when employees retain agency and oversight rather than being subjected to abstract top-down AI programmes.
f3	High earners race ahead on AI as workplace divide widens	Financial Times	2026-04	The FT and Focaldata survey shows adoption is uneven by income, experience and gender, which matters for any claim that AI benefits are broadly distributed across organisations.
f4	AI's adoption problem	Financial Times	2026-05	This article captures the widening gap between executive optimism and worker scepticism, and links adoption failure to poor organisational messaging and weak trust.
f5	Walmart tells workers that AI will improve their jobs, not steal them	Financial Times	2026-06	Walmart offers a concrete case of a large employer trying to pair AI rollout with certification, workflow redesign and job-security messaging rather than explicit substitution.
f6	Generative AI at Work	National Bureau of Economic Research	2023	This field study remains one of the strongest pieces of causal evidence on measured gains, showing productivity improvements in customer support but with large differences by worker experience.
f7	The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers	arXiv	2024-05	This paper matters because it examines novice workers directly and highlights that AI can help complete tasks while worsening metacognitive habits and independent problem solving.
f8	Automation from the Worker's Perspective	arXiv	2024-09	Based on a large cross-country worker survey, this study shows that perceptions of benefit are conditional on job design, worker status and incentives rather than simple demographic labels.
f9	Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity	arXiv	2025-07	This randomised study is useful because it cuts against the standard productivity story, finding that frontier coding tools slowed experienced developers in realistic project settings.
f10	Generative AI Uses and Risks for Knowledge Workers in a Science Organization	arXiv	2025-01	This organisational study distinguishes copilot use from workflow-agent use and documents risk concerns around security, publication norms and job effects inside a real science institution.
f11	AI and Worker Well-Being: Differential Impacts Across Generational Cohorts and Genders	arXiv	2025-11	Using OECD survey microdata, this paper is one of the cleaner pieces of evidence that AI's gains and harms vary by life stage and gender rather than a crude young-versus-old frame.
f12	Generative AI and the Reallocation of Time: Productivity, Leisure, and Fulfilling Work	arXiv	2026-02	This paper matters for ROI claims because it finds that time savings can be real while measured output barely moves, with some gains taken as on-the-job leisure rather than higher throughput.
f13	From Future of Work to Future of Workers: Addressing Asymptomatic AI Harms for Dignified Human-AI Interaction	arXiv	2026-01	This study names slow-building harms such as skill atrophy and loss of judgement, which are central to the question of whether better outcomes depend more on deployment design than model behaviour alone.

Frontier Lab & Model News

ID	Title	Outlet	Date	Significance
t1	Claude 3.5 Sonnet	Anthropic	2024-06	Anthropic positioned Claude 3.5 Sonnet as a faster, cheaper frontier model for multi-step workflows, which matters because adoption pressure often follows claims of higher throughput and lower supervision cost.
t2	Claude 3.5 Sonnet Model Card Addendum	Anthropic	2024-06	The addendum gives the formal benchmark and safety framing behind Claude 3.5 Sonnet, including its stronger agentic coding and vision scores relative to earlier models.
t3	Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku	Anthropic	2024-10	This is one of the clearest lab statements that frontier models were moving from assistant behaviour to direct action on user interfaces, with explicit acknowledgement that the capability was still experimental and error-prone.
t4	Developing a computer use model	Anthropic	2024-10	Anthropic’s technical note is directly relevant to human oversight because it details the safety and deployment problems created when models act through the same interfaces as people.
t5	Tracing the thoughts of a large language model	Anthropic	2025-03	This interpretability work matters for the brief’s training-versus-deployment question because it argues that understanding internal model strategies is part of making human-facing systems reliable and trustworthy.
t6	OpenAI o3-mini System Card	OpenAI	2025-01	OpenAI explicitly rated o3-mini as Medium risk on model autonomy, linking improved coding and research engineering performance to stronger agentic capability and higher oversight demands.
t7	Operator System Card	OpenAI	2025-01	Operator is a key source on how labs are designing human-in-the-loop controls such as confirmations, action restrictions, and oversight gates for computer-using agents.
t8	OpenAI GPT-4.5 System Card	OpenAI	2025-02	GPT-4.5’s system card frames a large model around more natural interaction and improved alignment with user intent, which is relevant to whether better human outcomes come from model behaviour rather than orchestration alone.
t9	Our updated Preparedness Framework	OpenAI	2025-04	The framework introduces long-range autonomy as a research category and makes deployment safety more explicitly operational, showing how frontier labs are formalising risk ownership around increasingly agentic systems.
t10	Introducing GPT-4.1 in the API	OpenAI	2025-04	OpenAI marketed GPT-4.1 as better for agents, long context, and real-world software tasks, which is central to the shift from isolated prompts to sustained supervisory work over model-driven processes.
t11	OpenAI o3 and o4-mini System Card	OpenAI	2025-04	This system card documents full tool use, including web browsing and file analysis, and ties those capabilities to deliberative alignment and preparedness testing.
t12	Addendum to OpenAI o3 and o4-mini system card: Codex	OpenAI	2025-05	The Codex addendum is unusually concrete about workflow design, describing isolated task containers, verifiable evidence, and test-running loops rather than pure chat interaction.
t13	Exclusive: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge	Time	2025-09	The report captures external criticism that Gemini 2.5 Pro reached the public before timely safety disclosure, which sharpens the gap between model capability release cycles and accountable human governance.
t14	Google introduces stable Gemini 2.5 Flash and Pro, previews Gemini 2.5 Flash-Lite	The Economic Times	2025-06	This marks Google’s move to productionise the Gemini 2.5 line, signalling that reasoning-heavy models were no longer just experimental and were becoming standard building blocks for deployment.
t15	Elon Musk's startup rolls out new Grok-3 chatbot as AI competition intensifies	The Guardian	2025-02	The Grok-3 launch illustrates the competitive pressure to release reasoning and search features quickly, even when questions about cost discipline and safeguards remain unresolved.
t16	Grok Is Still Hosting Sexualized Deepfakes of Famous Women	Wired	2026-06	Wired’s reporting is a concrete case where deployment and moderation design, not just base-model intelligence, shaped human harm outcomes after release.
t17	SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts	arXiv	2025-05	SAGE-Eval is a useful independent check on frontier systems because it tests whether models carry known safety facts into naive user scenarios, which is closely related to real workplace reliance and over-trust.
t18	VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation	arXiv	2025-05	VADER compares o3, GPT-4.1, GPT-4.5, Claude 3.7 Sonnet, Gemini 2.5 Pro, and Grok 3 Beta on security work and finds only moderate success, tempering claims that current frontier models can be supervised lightly on consequential tasks.

Academic & arXiv

ID	Title	Outlet	Date	Significance
a1	RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts	arXiv	2024-11	METR's RE-Bench gives one of the clearest recent human versus agent comparisons, showing AI agents can move faster than experts on short research-engineering tasks but lose ground as task duration and supervisory demands increase.
a2	HCAST: Human-Calibrated Autonomy Software Tasks	arXiv	2025-03	METR's HCAST ties agent performance to human task-time baselines, which is directly useful for judging when oversight remains realistic and when organisations are asking humans to supervise beyond their effective range.
a3	(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable	arXiv	2026-06	This preprint isolates workflow design from model quality and finds that human gates plus deterministic execution cut failure rates from 72% to 16% in AI-assisted research.
a4	Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering	arXiv	2026-06	Garousi names oversight labour and suggestion overload as hidden costs of coding assistants, making the burden itself part of the productivity calculation rather than an afterthought.
a5	Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance	arXiv	2026-05	This paper offers a formal account of why more AI help can lower net productivity once skill development, unreliable outputs, and heterogeneous AI literacy are included.
a6	How AI Impacts Skill Formation	arXiv	2026-01	Shen and Tamkin provide experimental evidence that delegation to AI can improve throughput for some users while impairing conceptual understanding, debugging ability, and later independent performance.
a7	Generative AI at Work: From Exposure to Adoption across 35 European Countries	arXiv	2026-04	Using a 36,600-worker survey across 35 countries, Henseke shows that adoption depends not just on exposure but on skills, organisational voice, and training, with no detectable task restructuring yet.
a8	Fine-Grained Appropriate Reliance: Human-AI Collaboration with a Multi-Step Transparent Decision Workflow for Complex Task Decomposition	arXiv	2025-01	This study shows that transparent multi-step workflows can improve reliance calibration on composite fact-checking tasks, especially when AI advice is misleading.
a9	Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies	arXiv	2025-02	Kim, Vaughan, Liao, Lombrozo, and Russakovsky show that explanations can raise reliance on both right and wrong answers, while sources and visible inconsistencies help users discount bad outputs.
a10	Emerging Reliance Behaviors in Human-AI Text Generation: Hallucinations, Data Quality Assessment, and Cognitive Forcing Functions	arXiv	2024-09	This paper links hallucination handling and cognitive forcing functions to observable reliance patterns in text-generation work, rather than treating verification as a generic best practice.
a11	Human Misperception of Generative-AI Alignment: A Laboratory Experiment	arXiv	2025-02	He, Shorrer, and Xia find that people systematically overestimate how closely GenAI choices match human preferences, which matters for welfare claims and delegated decision-making.
a12	Toward Human-AI Complementarity Across Diverse Tasks	arXiv	2026-04	This paper finds only modest complementarity gains across realistic tasks and argues that the real bottleneck is routing hard cases to humans in time for them to matter.
a13	Humans overrely on overconfident language models, across languages	arXiv	2025-07	Rathi, Jurafsky, and Zhou show that overconfidence and overreliance persist across five languages, suggesting that calibration failures are not a narrow English-only artefact.
a14	When Thinking Pays Off: Incentive Alignment for Human-AI Collaboration	arXiv	2025-11	This behavioural experiment shows that overreliance is partly an incentive design problem, and that collaboration quality changes when organisations reward correct dissent rather than passive acceptance.
a15	De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design	arXiv	2025-03	This design-focused study connects practitioner concerns about de-skilling and cognitive offloading to the older automation literature on function allocation and responsibility drift.
a16	The Effects of Generative AI on Design Fixation and Divergent Thinking	arXiv	2024-03	This experiment finds that image-generation support can increase fixation and reduce originality and variety, giving concrete evidence that convenience can narrow thought rather than broaden it.
a17	Creativity in the Age of AI: Evaluating the Impact of Generative AI on Design Outputs and Designers' Creative Thinking	arXiv	2024-10	This study finds more creative-seeming outputs with AI support but uneven cognitive effects across users, which complicates simple claims that AI either helps or harms creativity.
a18	Controlling Context: Generative AI at Work in Integrated Circuit Design and Other High-Precision Domains	arXiv	2025-06	Moss, Watkins, Persaud, Karunaratne, and Nafus show that in high-precision domains the key issue is not just accuracy but preserving enough context control for human vigilance and review.
a19	Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study	arXiv	2025-10	This representative Swiss panel finds declining public acceptance after the ChatGPT era and rising demand for human-only decision-making, a direct warning against mandate-led deployment.
a20	Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale	arXiv	2025-12	This scale paper gives the field a way to measure verification, motivation, and reflection in AI use, which is necessary if organisations want to manage human outcomes rather than token volume.
a21	Enhancing Critical Thinking in Generative AI Search with Metacognitive Prompts	arXiv	2025-05	This user study shows that metacognitive prompts can increase follow-up inquiry and perspective-taking during AI search, pointing to a concrete intervention for reducing passive acceptance.
a22	Promoting Critical Thinking With Domain-Specific Generative AI Provocations	arXiv	2026-03	von Davier, Lee, Forlizzi, and Das argue that productive friction and domain-specific provocations can support critical thinking better than frictionless assistant behaviour.
a23	Current and Future Use of Large Language Models for Knowledge Work	arXiv	2025-03	These surveys of knowledge workers show that adoption is already broad, but desired future use centres on workflow integration, which shifts the design question from access to operating model.
a24	An Empirical Study of Generative AI Adoption in Software Engineering	arXiv	2025-12	This empirical study reports widespread use and perceived gains in software engineering, while also finding thin objective measurement and weak institutional emphasis on training and governance.
a25	The State of Generative AI in Software Development: Insights from Literature and a Developer Survey	arXiv	2026-03	This review-plus-survey argues that value is strongest in routine coding and documentation, while planning and requirements work remain harder, shifting attention toward oversight and specification quality.

VC & Analyst Reports

ID	Title	Outlet	Date	Significance
v1	Employers want entry-level workers with senior-level skills in the age of AI, a huge PwC analysis found	Business Insider	2026-06	Reports PwC's 2026 AI Jobs Barometer finding that AI-exposed entry-level roles in the US are seven times more likely than in 2019 to require traditionally senior skills such as judgement, leadership, and stakeholder management.
v2	Bosses don't think AI is paying off yet, a PwC survey of 4,500 CEOs found	Business Insider	2026-01	Summarises PwC's 2026 Global CEO Survey, which found that 56% of CEOs reported no revenue or cost benefits from AI and only 12% reported both higher revenue and lower costs.
v3	The rise of the 'botsitters'	Business Insider	2026-06	Cites Glean's Work AI Institute finding that white-collar workers spend 6.4 hours a week correcting and managing AI, while only 13% see major organisational performance gains.
v4	The top 5 most common ways people say they're using AI in the workplace	Business Insider	2025-12	Uses Gallup survey data to show that workplace use is rising, but the dominant applications remain basic chat, writing, and coding assistance rather than autonomous multi-agent supervision.
v5	'Most enterprises are still unprepared to operationalize it': IT leaders are bullish on agents, but keeping falling at the final hurdle - here's why	ITPro	2026-06	Summarises new Forrester research saying about 75% of enterprise leaders are adopting agentic AI, yet most remain stuck in pilots because orchestration, governance, and nonhuman identity controls are weak.
v6	Concerns are mounting over the cognitive impact of AI as workers report experiencing 'brain fry' - and it's causing "increased employee errors, decision fatigue, and intention to quit"	ITPro	2026-03	Reports Boston Consulting Group research on 'AI brain fry', linking heavy AI oversight work to mental fog, decision fatigue, higher error rates, and higher intention to quit.
v7	Is this the tipping point for AI at work? New Gallup survey finds half of all US employees now use it in some way	TechRadar	2026-04	Summarises Gallup's Q1 2026 survey of 23,717 US employees, which found that 50% use AI at work and 13% use it daily, but task-level gains still exceed whole-workflow redesign.
v8	Microsoft, Shopify, and other companies now require employees to use AI. How is AI changing your work?	Business Insider	2025-08	Captures the mandate phase of enterprise adoption, citing Bain's estimate that average employer AI spending doubled in 2024 to $10.3 million while regular use remained uneven between leaders and frontline staff.
v9	Generative AI at Work: From Exposure to Adoption across 35 European Countries	arXiv	2026-04	Provides cross-country evidence that adoption tracks skills, workplace training, and employee say in organisational decisions, with no detectable task restructuring yet in the 2024 data.
v10	A meta-analysis of the effect of generative AI on productivity and learning in programming	arXiv	2026-05	Synthesises 23 studies and finds a moderate productivity gain for coding assistants but no statistically significant improvement in learning outcomes, which is directly relevant to deskilling concerns.
v11	Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape	arXiv	2026-04	Survey evidence from 457 researchers shows strong perceived productivity gains but continued distrust on correctness, with AI concentrated in writing and early-stage work rather than core methodological judgement.
v12	Reduced AI Acceptance After the Generative AI Boom: Evidence From a Two-Wave Survey Study	arXiv	2025-10	Shows that public acceptance of AI fell after the generative AI boom and demand for human oversight rose, which challenges investor narratives that adoption pressure alone will normalise AI-heavy workflows.
v13	Accuracy Standards for AI at Work vs. Personal Life: Evidence from an Online Survey	arXiv	2026-02	Finds that workers demand materially higher accuracy from AI at work than in personal use, a useful reminder that enterprise deployment fails when review burdens exceed tolerance for correction.
v14	Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework	arXiv	2026-06	Offers a concrete deployment pattern from Nubank, where structured context engineering, human-in-the-loop prompt iteration, and offline-to-online evaluation produced measurable customer-satisfaction and self-service gains.

Blogs & Independent Thinkers

ID	Title	Outlet	Date	Significance
b1	Management as AI superpower	One Useful Thing	2026-01	Ethan Mollick argues that agentic work shifts value from execution to delegation, evaluation, and specification, grounding the organisational case for treating management skill and subject matter expertise as the scarce resource.
b2	Claude Dispatch and the Power of Interfaces	One Useful Thing	2026-03	Mollick uses a valuation-task paper to show that chatbot UX can erase AI productivity gains by overloading users with sprawling outputs, making interface design a first-order operating-model issue.
b3	Choosing to Stay Human	One Useful Thing	2026-05	This essay frames AI adoption as a choice about where to preserve human skill formation, warning that convenience can erode writing judgement and flood attention with low-meaning synthetic content.
b4	The lethal trifecta for AI agents: private data, untrusted content, and external communication	Simon Willison’s Weblog	2025-06	Willison reduces a broad agent-safety debate to a concrete deployment rule: combining private-data access, untrusted input, and external communication creates a prompt-injection exfiltration hazard.
b5	Writing about Agentic Engineering Patterns	Simon Willison’s Weblog	2026-02	Willison argues that coding agents lower the cost of producing working code to near zero, shifting the bottleneck to verification and changing how teams should structure engineering work.
b6	Highlights from my conversation about agentic engineering on Lenny’s Podcast	Simon Willison’s Weblog	2026-04	Willison gives a practitioner account of the cognitive cost of supervising multiple agents in parallel, describing four concurrent coding agents as enough to wipe out an experienced engineer by late morning.
b7	If Claude Fable stops helping you, you'll never know	Jonathon Ready	2026-06	Ready identifies a direct conflict between a lab's hidden model-side restrictions and user welfare, arguing that silent degradation creates supply-chain risk for ordinary software teams building AI features.
b8	AI as Normal Technology	AI as Normal Technology	2025-04	Arvind Narayanan and Sayash Kapoor argue against technological determinism, separating model progress from application design and adoption, which is central to judging whether harms are fixed in training or in deployment.
b9	New Paper: Towards a science of AI agent reliability	AI as Normal Technology	2026-02	Kapoor, Narayanan, and Stephan Rabanser argue that capability gains have outpaced reliability gains, offering a framework for why impressive demos do not automatically translate into dependable organisational use.
b10	Why AI hasn’t replaced software engineers, and won’t	AI as Normal Technology	2026-06	This essay rejects the threshold story of sudden labour replacement, arguing instead that AI compresses execution while decision-making, verification, and accountability remain stubbornly human.
b11	Predicting AI job exposure	Benedict Evans	2026-05	Evans argues that job-exposure charts miss how automation changes business models, regulation, and task composition, which makes simple role-level forecasts unreliable for planning.
b12	(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable	arXiv	2026-06	This paper reports that an unconstrained multi-agent baseline failed in 72% of runs, while a harness with deterministic computation and three human decision gates cut failures to 16%.
b13	Oversight Structures for Agentic AI in Public-Sector Organizations	arXiv	2025-06	This paper argues that agentic AI requires continuous oversight, tighter integration of governance with operations, and cross-departmental coordination rather than episodic compliance review.
b14	As companies rethink AI ROI, Replit's AI chief calls token leaderboards 'very dystopian'	Business Insider	2026-06	The report captures an emerging backlash against token-maximising mandates, with Replit's Michele Catasta arguing that raw token burn is a misleading proxy for productivity or value.
b15	I asked 4 executives how they measure AI ROI. None started with AI tokens.	Business Insider	2026-06	This report shows large organisations moving from activity metrics to outcome metrics, with BNP Paribas CIB, La Banque Postale, TCS, and NTT DATA all rejecting token counts as the main ROI measure.
b16	How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks	arXiv	2026-04	The paper finds that agentic coding runs can consume 1000 times more tokens than simpler code interactions, with high variance and no reliable link between higher token use and better accuracy.

Tech Industry & Practitioner

ID	Title	Outlet	Date	Significance
p1	DORA – Accelerate State of DevOps Report 2024 (https://dora.dev/research/2024/dora-report/)	DORA, Google Cloud	2024
p2	2025 DORA State of AI Assisted Software Development	Google Cloud	2025	This follow-on DORA report matters because it shifts assessment from raw output to team archetypes and human factors such as burnout, friction, and perceived value.
p3	DORA Report 2024 – A Look at Throughput and Stability	RedMonk	2024-11	Rachel Stephens translates the DORA findings into an operating-model critique, arguing that code generation may not be the bottleneck and that organisations can optimise the wrong constraint.
p4	How much does AI impact development speed? An enterprise-based randomized controlled trial	arXiv	2024-10	Google's RCT with full-time engineers provides one of the cleaner measured-benefit studies, finding about a 21% reduction in time on a complex enterprise task under controlled conditions.
p5	Examining the Use and Impact of an AI Code Assistant on Developer Productivity and Experience in the Enterprise	arXiv	2024-12	IBM's internal deployment study shows that enterprise gains are uneven, with productivity benefits present but not universal, and with responsibility and ownership of generated code becoming central issues.
p6	Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity	arXiv	2025-07	METR's field experiment is a strong counterweight to vendor claims, finding experienced open-source developers were 19% slower with early-2025 AI tools despite expecting a 24% speed-up.
p7	The SPACE of AI: Real-World Lessons on AI's Impact on Developers	arXiv	2025-07	This mixed-methods study uses the SPACE framework to show that benefits cluster around routine work and depend heavily on task complexity, peer learning, and organisational support.
p8	The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot	arXiv	2024-10	This study finds project-level productivity gains in open source but also a 41.6% increase in integration time, making coordination cost a first-order part of the AI productivity story.
p9	AI-assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden	arXiv	2025-10	This paper matters because it shows productivity gains can shift maintenance and review burden onto core developers, worsening outcomes for the people who carry system knowledge.
p10	Developer Productivity with GenAI	arXiv	2025-10	Using the SPACE lens with 415 practitioners, this paper argues that faster output does not reliably translate into better software or better wellbeing, which is central to judging AI adoption programmes.
p11	What do professional software developers need to know to succeed in an age of Artificial Intelligence?	arXiv	2025-05	This practitioner-focused study reframes the adaptation problem around workflow judgement, adjacent engineering skills, and non-technical skills rather than prompt fluency alone.
p12	What Challenges Do Developers Face in AI Agent Systems? An Empirical Study on Stack Overflow	arXiv	2025-10	By mining Stack Overflow, this paper identifies recurring implementation pain around orchestration complexity, evaluation reliability, and runtime integration in agent systems.
p13	AI-Generated “Workslop” Is Destroying Productivity	Harvard Business Review	2025-09	This study-backed HBR piece gives a concrete mechanism for failed ROI, namely low-effort AI output that looks plausible but pushes cognitive and rework costs onto colleagues.
p14	Workslop: The Hidden Cost of AI-Generated Busywork (https://www.betterup.com/workslop)	BetterUp Labs	2025
p15	Researchers Asked LLMs for Strategic Advice. They Got “Trendslop” in Return.	Harvard Business Review	2026-03	This HBR article extends the critique beyond code into managerial judgement, showing how models can produce fashionable but shallow recommendations that reward buzzwords over reasoning.
p16	'Botsitting' is destroying productivity as workers spend nearly a full day each week making AI 'usable'	ITPro	2026-06	This report on Glean's Work AI Institute findings captures the hidden supervision tax: 6.4 hours a week spent feeding context, checking outputs, and correcting errors.
p17	3 in 4 workers say AI reduced productivity and increased workloads, survey finds	Business Insider	2024-08	Upwork's survey is useful as an early warning that mandate-led adoption can raise review load and learning overhead faster than it creates value.
p18	84% of software developers are now using AI, but nearly half 'don't trust' the technology over accuracy concerns	ITPro	2025-08	This Stack Overflow survey coverage anchors the adoption-trust gap, with broad usage rising while distrust, debugging effort, and security concern remain high.
p19	UK software developers are still cautious about AI, and for good reason	ITPro	2025-10	JetBrains' ecosystem survey adds a regional practitioner view showing that caution concentrates around code quality, privacy, and retaining human control over reviews and testing.
p20	No AI overload just yet? Google's new survey reveals how developers are really using AI at work	TechRadar	2025-10	This report on Google's survey is valuable because it pairs very high developer adoption with low strong trust, supporting the claim that supervision, not surrender, remains the norm.