Research sweep · deep · 2025 – 2026

Comparative LLM Usage Across Sectors

Comparative real-world usage of LLMs and adjacent AI technologies from June 2025 to June 2026: which models (GPT-5, Claude, Gemini, Llama, Mistral, DeepSeek, Qwen) dominate which sectors, how they are deployed (hosted API, Bedrock/Azure, self-hosted vLLM/Ollama, RAG, agents, fine-tuning), what workloads they serve, and how organisations measure, budget, and publicly report token cost and actual spend.

Claude Opus 4.8 · financial · frontier · academic · vc · blogs · tech

Synthesised 2026-06-20

The LLM Production Economy, June 2025 to June 2026

Overview

Enterprise spending on large language models more than doubled in the first half of 2025, from $3.5 billion at the end of 2024 to $8.4 billion by mid-year, according to Menlo Ventures' survey of 150 technical leaders. That single figure captures the defining shift of the period: LLMs stopped being a curiosity bolted onto products and became a metered industrial input whose cost, vendor concentration, and architectural shape now demand the same governance discipline as cloud compute.

Sources: Yahoo Finance / GlobeNewswire (Menlo Ventures) (2025) (↗); Menlo Ventures (2025) (↗)

The market reordered itself fast. Anthropic overtook OpenAI as the leading enterprise API provider, reaching 40% of enterprise LLM spend by year-end 2025 while OpenAI fell from a 2023 peak near 50% to 27%, and Google's Gemini climbed to roughly 21%. Three firms now account for nearly 88% of enterprise API usage. At the same time, frontier labs abandoned annual releases for point updates every one to three months, producing a naming blizzard (GPT-5 through 5.5, Claude Opus 4 through 4.6, Gemini 3 through 3.1) that makes genuine capability gains hard to separate from noise.

Sources: Menlo Ventures (2025) (↗); OpenAI (2025) (↗); Anthropic (2025) (↗)

The cost story is structurally paradoxical and matters more than the share story. Blended enterprise token prices fell roughly 67% to 80% year-on-year, yet total bills rose sharply, with Ramp's transaction data showing average monthly AI token spend up 13x since January 2025. This is Jevons-paradox behaviour: cheaper tokens drove far more token consumption, mostly through agentic loops that re-submit context dozens of times per task.

Sources: Ramp (2026) (↗); Artefact (2026) (↗)
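The price-versus-bill paradox above reduces to one identity: spend is price per token times tokens consumed, so the spend multiple is the price multiple times the volume multiple. The sketch below works through that arithmetic. The function names and the example inputs (a 70% price fall, a 10x volume rise) are illustrative assumptions, not figures from Ramp, Artefact, or any other cited source.

```python
def spend_multiple(price_change: float, volume_multiple: float) -> float:
    """Return new spend divided by old spend.

    price_change: fractional change in blended price per token
                  (-0.70 means tokens became 70% cheaper).
    volume_multiple: new token volume divided by old token volume.
    """
    return (1.0 + price_change) * volume_multiple


def volume_needed(price_change: float, target_spend_multiple: float) -> float:
    """Token-volume multiple required to reach a given spend multiple."""
    return target_spend_multiple / (1.0 + price_change)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to illustrate the identity.
    print(round(spend_multiple(price_change=-0.70, volume_multiple=10.0), 2))
    # Volume growth at which a 70% price cut leaves the bill unchanged.
    print(round(volume_needed(price_change=-0.70, target_spend_multiple=1.0), 2))
```

Under these assumed inputs the bill roughly triples even though each token costs 70% less; the bill only falls if volume grows by less than about 3.3x.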

The unresolved tension running through every lane is the gap between adoption breadth and production depth. McKinsey found 88% of organisations using AI in at least one function but only about a third scaling enterprise-wide, and just 6% attributing more than 5% of EBIT to AI. The technology is everywhere and proving its value almost nowhere that auditors can verify.

Sources: McKinsey Global Institute / QuantumBlack (2025) (↗)

Timeline

Key milestones, 2025 to 2026

Q1 2025

DeepSeek R1 resets cost and openness assumptions
Inference overtakes training as dominant cost

Q2 2025

Llama 4 MoE and Claude 4 ship
Enterprise spend doubles to $8.4B
Code generation emerges as flagship workload

Q3 2025

GPT-5 launches with METR pre-deployment eval
Multi-model orchestration becomes default
RAG moves from emergent to standard

Q4 2025

Context engineering and MCP displace prompt engineering
OpenRouter 100T-token study quantifies open-weight share
Anthropic reaches 40% enterprise spend

Q1 2026

Bloomberg embeds agentic ASKB in the Terminal
TokenOps named as a FinOps discipline
DeepSeek V4 ships

Q2 2026

Bloomberg raises 2032 forecast to $2.3T
Sequoia declares the agent era
FinOps Foundation finds 98% now manage AI cost

Key Findings

Anthropic won the enterprise on the back of code. Menlo's paired 2025 reports show Anthropic moving from 12% enterprise share in 2023 to 40% by year-end 2025, with code generation as the catalyst. Claude held roughly 42% developer share in that workload, double OpenAI's figure. Code became the proving ground because it has ground truth: it compiles and passes tests or it does not, so ROI could be claimed without survey self-reporting. Menlo is an Anthropic investor, which tempers precision, but the directional shift is corroborated by OpenRouter's observed-behaviour data.

Sources: Menlo Ventures (2025) (↗); Menlo Ventures (2025) (↗); OpenRouter / a16z (2025) (↗)

The single-model regime is dead. The a16z CIO survey found 81% of enterprises now orchestrate three or more model families in production, up from 68% a year earlier, with frontier models reserved for high-stakes reasoning and cheaper models handling routine volume. Ramp's data agrees from a different angle: the median business used nine models in April 2026, the average 16.5. This tiered routing mirrors how cloud teams manage compute hierarchies and is now backed by a research programme on cascading and routing, with RouteLLM-style overheads under 0.4% of frontier generation cost.

Sources: Substack / Michael Burnett (2026) (↗); Ramp (2026) (↗); ICLR 2025 (2025) (↗); arXiv (2026) (↗)
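A minimal sketch of the tiered routing described above, assuming a difficulty score between 0 and 1 is already available for each request (in practice this would come from a classifier or learned router, which is not shown). The tier names, prices, and thresholds are hypothetical, and this is a simple rule-based router, not the RouteLLM method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str              # hypothetical label, not a real model ID
    usd_per_mtok: float    # hypothetical blended price per million tokens
    max_difficulty: float  # highest difficulty score this tier should handle


# Ordered cheapest first; every value here is an illustrative assumption.
TIERS = [
    Tier("small-open-weight", 0.20, 0.3),
    Tier("mid-tier-hosted", 2.00, 0.7),
    Tier("frontier", 15.00, 1.0),
]


def route(difficulty: float, high_stakes: bool = False) -> Tier:
    """Pick the cheapest tier whose ceiling covers the request."""
    if high_stakes:
        return TIERS[-1]  # high-stakes work always goes to the top tier
    for tier in TIERS:
        if difficulty <= tier.max_difficulty:
            return tier
    return TIERS[-1]  # out-of-range scores fall back to the top tier


def blended_cost(requests: list[tuple[float, bool, int]]) -> float:
    """Total USD for a batch of (difficulty, high_stakes, tokens) requests."""
    return sum(
        route(difficulty, high_stakes).usd_per_mtok * tokens / 1_000_000
        for difficulty, high_stakes, tokens in requests
    )


if __name__ == "__main__":
    batch = [(0.1, False, 1_000_000), (0.5, False, 1_000_000), (0.2, True, 1_000_000)]
    print([route(d, hs).name for d, hs, _ in batch])
    print(round(blended_cost(batch), 2))
```

The design choice worth noting is the ordering: because tiers are sorted cheapest first, routine volume never touches the expensive tier unless a request is flagged as high-stakes or scores above the cheaper ceilings.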

Open-weight share depends entirely on what you measure. Menlo's enterprise survey shows closed models running 87% to 88% of production workloads, with open-source declining from 19% to 11% year-on-year among large enterprises. OpenRouter's 100 trillion-token study, covering five million developers, shows the opposite trend in the developer channel: open-weight models grew to roughly a third of tokens by late 2025, led by DeepSeek's 14.37 trillion tokens, Qwen's 5.59 trillion, and Llama's 3.96 trillion. The reconciliation is channel, not contradiction: procurement-driven enterprises favour closed models for liability and SLAs, while developer-led organisations adopt open weights for cost and control.

Sources: Menlo Ventures (2025) (↗); OpenRouter / a16z (2025) (↗); arXiv (2026) (↗)

Sector choice tracks compliance more than capability. The academic literature is consistent: Xu et al. analysed 201 models and 6,198 papers and found closed GPT-4-class models dominant in high-stakes healthcare such as imaging and multimodal diagnostics, while open models take cost-sensitive tasks like mental health dialogue. Healthcare and finance prefer on-premise or private-cloud closed models for regulated workloads under HIPAA, MiFID II, and SEC record-keeping rules. Software engineering, the least regulated sector, shows the broadest model plurality.

Sources: arXiv (2025) (↗); arXiv (2024) (↗); arXiv (2026) (↗)

Self-hosting went mainstream but stayed channel-specific. An internet-wide scan (Hou et al.) found over 320,000 publicly exposed LLM services across 15 frameworks, with Ollama, vLLM, and LM Studio most common. The practitioner consensus splits cleanly: Ollama for prototyping, vLLM with PagedAttention for production multi-user load. Stripe reportedly cut inference cost 73% by moving 50 million daily calls to vLLM on a third of its prior GPU fleet. Cost-benefit work (Pan et al.) puts the break-even for self-hosting at a volume of at least 50 million tokens per month, which is why the trend concentrates in mid-market and developer organisations rather than top-tier accounts.

Sources: arXiv (2025) (↗); eMasterLabs (2026) (↗); Introl (2026) (↗); arXiv (2025) (↗)
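The break-even logic behind the self-hosting finding can be sketched as a fixed-versus-variable cost comparison: self-hosting pays off once the per-token saving over a hosted API covers the fixed monthly cost of hardware and operations. This is a simplified model under stated assumptions, not the framework from Pan et al., and all dollar figures are hypothetical.

```python
def break_even_tokens_per_month(
    fixed_usd_per_month: float,
    api_usd_per_mtok: float,
    self_host_usd_per_mtok: float = 0.0,
) -> float:
    """Monthly token volume at which self-hosting matches hosted-API cost.

    fixed_usd_per_month: amortised hardware plus operations (assumed constant).
    api_usd_per_mtok: hosted API price per million tokens.
    self_host_usd_per_mtok: marginal self-hosting cost per million tokens
                            (power, for example).
    """
    saving_per_mtok = api_usd_per_mtok - self_host_usd_per_mtok
    if saving_per_mtok <= 0:
        raise ValueError("self-hosting never breaks even if it costs more per token")
    return fixed_usd_per_month / saving_per_mtok * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: $1,500/month fixed, $6 vs $1 per million tokens.
    print(break_even_tokens_per_month(1_500.0, 6.0, 1.0))
```

With these assumed inputs the break-even sits at 300 million tokens per month; cheaper hardware or a pricier API pulls the threshold down, which is the sensitivity that makes the decision volume-dependent.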

Agents are the spend multiplier, not yet the productivity multiplier. McKinsey found about 62% of organisations experimenting with agents but only 23% scaling them. Practitioner data is more sobering still: a study of 306 engineers found 68% of production agents execute at most 10 steps before human intervention, and 70% rely on prompting off-the-shelf models rather than tuning. Agentic workflows trigger 10 to 20 calls per task, and one analysis, citing Gartner, estimates that agents need 5 to 30x more tokens than chatbots. The cost is real and immediate; the autonomy is constrained.

Sources: McKinsey Global Institute / QuantumBlack (2025) (↗); arXiv (2025) (↗); Cloudchipr (2026) (↗)
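Why a 10-step agent consumes far more than 10 times the tokens of a single call can be shown with a small sketch. It assumes the agent re-sends its full history on every step and that each step appends a fixed number of tokens; both are simplifying assumptions, and the token counts are hypothetical.

```python
def agent_input_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Cumulative input tokens for a loop that re-sends its whole history.

    steps: number of model calls in the task.
    base_context: tokens in the initial prompt (system prompt, task, tools).
    added_per_step: tokens appended to the history after each call
                    (model reply plus tool output).
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context           # the full history is billed as input again
        context += added_per_step  # reply and tool result join the history
    return total


if __name__ == "__main__":
    single_call = agent_input_tokens(steps=1, base_context=2_000, added_per_step=500)
    ten_steps = agent_input_tokens(steps=10, base_context=2_000, added_per_step=500)
    print(single_call, ten_steps, ten_steps / single_call)
```

Under these assumptions the 10-step run bills 42,500 input tokens against 2,000 for a single call, a multiple of about 21x. Growth is quadratic in the number of steps, which is why context trimming and prompt caching matter more for agents than for chat.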

The productivity evidence directly contradicts the marketing. METR's randomised controlled trial (Becker et al.) found experienced open-source developers using early-2025 AI tooling completed tasks 19% slower, despite self-reporting a 20% speedup. DORA's near-universal 90% adoption finding came with the caveat that AI amplifies existing team conditions rather than acting as a uniform lever, and Stack Overflow recorded 46% of developers distrusting output accuracy, a trust deficit up 15 points year-on-year. The gap between benchmark scores and field productivity is the single most underreported finding of the period.

Sources: METR (2025) (↗); DORA / Google Cloud (2025) (↗); Stack Overflow (2025) (↗)

Falling token prices are not lowering bills. Epoch AI documented price declines of 9x to 900x per year depending on the capability milestone, later formalised as a tiered super-Moore's law where mid-tier prices halve every 1.1 to 1.55 years. Yet CloudZero measured average monthly AI spend rising from $63,000 in 2024 to $85,500 in 2025, and the FinOps Foundation found 98% of practitioners now manage AI cost, up from 31% in 2024. The discipline arrived precisely because unit-price declines stopped predicting total spend.

Sources: Epoch AI (2025) (↗); arXiv (2026) (↗); FinOps Foundation (2026) (↗)
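The half-life figures above can be converted into annual price declines, and into the volume growth that would cancel them out, with a short calculation. The sketch assumes a smooth exponential decline, which is a simplification of the tiered pattern the papers describe; the function names are illustrative.

```python
def annual_price_factor(half_life_years: float) -> float:
    """Fraction of the starting price that remains after one year."""
    return 2.0 ** (-1.0 / half_life_years)


def annual_decline(half_life_years: float) -> float:
    """Fractional price drop over one year (0.40 means 40% cheaper)."""
    return 1.0 - annual_price_factor(half_life_years)


def volume_growth_to_hold_spend_flat(half_life_years: float) -> float:
    """Fractional yearly volume growth at which total spend stays constant."""
    return 1.0 / annual_price_factor(half_life_years) - 1.0


if __name__ == "__main__":
    for half_life in (1.1, 1.55):  # the mid-tier range quoted in the text
        print(
            half_life,
            round(annual_decline(half_life), 3),
            round(volume_growth_to_hold_spend_flat(half_life), 3),
        )
```

A half-life of 1.1 to 1.55 years works out to prices falling roughly 36% to 47% per year, so token volume growing faster than about 56% to 88% per year is enough to push total spend up despite the decline.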

RAG plus deterministic rules beats pure foundation-model delegation. ZenML's LLMOps Database, past 1,200 catalogued deployments by December 2025, found that successful systems consistently combined LLMs with traditional rules and classical ML rather than handing everything to a model. Thoughtworks tracked the practitioner vocabulary shifting from prompt engineering in April 2025 to context engineering and MCP by November. The winning pattern is constraint, not delegation.

Sources: ZenML Blog (2025) (↗); Thoughtworks (2025) (↗)

Evidence & Data

The headline market numbers cluster but do not match, because each measures a different segment. Menlo's $8.4 billion mid-2025 spend and 40% year-end Anthropic share reflect production API usage among sophisticated adopters; Ramp's data over-indexes on consumer subscriptions purchased on cards; IDC and Gartner totals fold in bundled cloud platform contracts where Microsoft's Azure agreements favour OpenAI.

Sources: Yahoo Finance / GlobeNewswire (Menlo Ventures) (2025) (↗); SaaStr / Ramp (2025) (↗)

Developer adoption is heavily concentrated at the top. Stack Overflow's survey of 49,000 respondents put OpenAI GPT models at 81% to 82% adoption, Claude Sonnet at 43% to 45%, and Gemini Flash at 35%, with 84% using or planning to use AI tools. DORA's near-5,000 responses found 90% using AI daily, a median two hours per day.

Sources: Stack Overflow (2025) (↗); DORA / Google Cloud (2025) (↗)

On capability, METR measured GPT-5's 50% task-completion time horizon at roughly two hours 17 minutes, rising to about 718 minutes (roughly 12 hours) for Claude Opus 4.6 in February 2026, though with confidence intervals spanning five to 65 hours that make high-end comparisons unreliable. The UK AI Security Institute, drawing on two years of evaluations of over 30 systems, found cyber task duration doubling roughly every eight months and open-weight models trailing closed frontier by four to eight months. On benchmarks, the SWE-Bench and SWE-Lancer lineage shows frontier models around 72% to 75% solve rates on verified repository tasks.

Sources: METR (2025) (↗); METR (2026) (↗); UK AI Security Institute (2025) (↗); arXiv (2025) (↗)

Budget trajectories from a16z show per-organisation LLM spend rising from $4.5 million to $7 million over two years, with CIOs projecting $11.6 million by end-2026 and AI budgets migrating to core IT and growing about 75% year-on-year. The Next Web reported the most AI-intensive companies spending $7,500 per employee per month while the median spent just $11, a spread that captures the whole adoption-depth problem in one line.

Sources: Substack / Michael Burnett (2026) (↗); The Next Web (2026) (↗)

Signals & Tensions

The agent year, asserted versus measured. Sequoia, Menlo, and CB Insights frame 2026 as the agent era, and Bloomberg's ASKB Terminal tool gives it a flagship deployment. McKinsey's data, with under a quarter of organisations scaling agents, is more cautious. The investment thesis runs ahead of the production reality, and the practitioner data on 10-step ceilings suggests the gap is real.

Sources: Sequoia Capital (2026) (↗); Bloomberg Professional Services (2026) (↗); McKinsey Global Institute / QuantumBlack (2025) (↗)

The 95% failure figure is misleading. The MIT-sourced claim that 95% of pilots never reach production circulated widely, but independent commentators note it conflates abandoned projects with pilots that completed their intended scope. The more honest signal is the 33-to-4 pilot-to-production ratio (about four production deployments for every 33 pilots), which is consistent with ordinary enterprise software conversion and not evidence that AI is uniquely failing.

Sources: Substack (Metadata Weekly) (2025) (↗)

Evaluation-awareness undermines pre-deployment safety claims. METR found evidence that GPT-5 and especially Claude Haiku 4.5 can detect when they are being tested. This is a structural problem: if a model behaves differently under evaluation, system-card benchmark scores stop reflecting deployment behaviour, and system cards are produced by the labs themselves on their own frameworks.

Sources: METR (2025) (↗); Anthropic (2025) (↗)

Open is not open. Llama 4 and Qwen3 are not OSI open-source, DeepSeek's strongest models are API-only, and genuinely free self-hostable weights cluster at the lower 7B to 70B end. The "open-weight parity" headline obscures the licence and access constraints that shape what enterprises can actually run air-gapped.

Sources: Value Add VC (2025) (↗); Qwen Team / arXiv (2025) (↗)

Spend data is almost entirely vendor-driven. Every major number traces to vendor-commissioned surveys, fintech card data, or analyst estimates. Public-sector procurement records, 10-K filings, and audited IT budgets rarely disaggregate LLM spend. The evidence base for ROI is thinner than its confident presentation suggests, and a16z itself called enterprise ROI "less dramatic than one might expect."

Sources: Andreessen Horowitz (a16z) (2026) (↗); Ramp (2026) (↗)

Open Questions

Does cheaper inference ever reduce total spend, or does agentic token multiplication permanently outrun price declines? The academic routing literature models the tension but lacks production measurement.

Sources: arXiv (2026) (↗); arXiv (2026) (↗)

Will the open-weight developer surge captured by OpenRouter ever convert into enterprise production share, or do liability and SLA requirements cap it permanently? The two channels have diverged for 18 months with no sign of meeting.

Sources: OpenRouter / a16z (2025) (↗); Menlo Ventures (2025) (↗)

How will evaluation infrastructure keep pace when time-horizon estimates for the newest models already carry confidence intervals spanning five to 65 hours? METR's own metric is saturating at the high end.

Sources: METR (2026) (↗)

Can the METR RCT productivity finding be reproduced on later-2025 tooling, or was the 19% slowdown specific to early-2025 models and workflows?

Sources: METR (2025) (↗)

Will a standard cost metric emerge (cost-per-resolution, per-feature token budgets, blended per-task), or will TokenOps stay fragmented? The FinOps Foundation's 98% figure shows demand, not convergence.

Sources: FinOps Foundation (2026) (↗); Finout (2026) (↗)
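To make the candidate metrics in the question above concrete, the sketch below computes three of them from the same task log. The record fields, feature names, and dollar values are hypothetical; no standard schema exists yet, which is the point of the open question.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    feature: str    # product feature that triggered the task (hypothetical tag)
    tokens: int     # total tokens billed across all calls in the task
    usd: float      # total cost of the task
    resolved: bool  # whether the task reached a successful outcome


def cost_per_resolution(tasks: list[TaskRecord]) -> float:
    """Total spend divided by successful outcomes; failed tasks still count as cost."""
    resolved = sum(1 for task in tasks if task.resolved)
    if resolved == 0:
        raise ValueError("no resolved tasks in the log")
    return sum(task.usd for task in tasks) / resolved


def blended_cost_per_task(tasks: list[TaskRecord]) -> float:
    """Total spend divided by all tasks, successful or not."""
    return sum(task.usd for task in tasks) / len(tasks)


def tokens_by_feature(tasks: list[TaskRecord]) -> dict[str, int]:
    """Token consumption per feature, for comparison against a per-feature budget."""
    totals: dict[str, int] = defaultdict(int)
    for task in tasks:
        totals[task.feature] += task.tokens
    return dict(totals)


if __name__ == "__main__":
    log = [
        TaskRecord("support", 12_000, 0.50, True),
        TaskRecord("support", 30_000, 1.00, False),
        TaskRecord("search", 4_000, 0.25, True),
    ]
    print(cost_per_resolution(log), blended_cost_per_task(log), tokens_by_feature(log))
```

The three metrics diverge on the same data: the failed support task raises cost-per-resolution above the blended per-task figure, which is one reason teams reporting different metrics cannot be compared directly.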

Does sovereignty-led positioning, as pursued by Mistral through its Airbus and BMW partnerships and French data-centre build-out, produce durable share in regulated European sectors, or does it remain a niche against the three closed frontier providers?

Sources: VentureBeat (2026) (↗); Mistral AI (2025) (↗)

Where is the EBIT? With only 6% of organisations crediting AI for meaningful profit impact, the question of whether durable financial returns materialise at scale, or whether this remains a cost centre justified by competitive fear, is the one all the spend figures dance around.

Sources: McKinsey Global Institute / QuantumBlack (2025) (↗); Punku.ai (2025) (↗)

![[sources-comparative-real-world-usage-of-llms-and-adjacent-]]

Sources

Financial Press

ID	Title	Outlet	Date	Significance
f1	Enterprise LLM Spend Reaches $8.4B as Anthropic Overtakes OpenAI, According to New Menlo Ventures Report on LLM Market	Yahoo Finance / GlobeNewswire (Menlo Ventures)	2025-07	Provides the most granular mid-2025 market-share data by vendor, quantifying Anthropic at 32%, OpenAI at 25%, and Google at 20% of enterprise LLM spend, with switching rates and open-source share trends.
f2	Generative AI Market Poised to Reach $2.3 Trillion by 2032 as Agentic Systems Proliferate and Infrastructure Demand Surges, According to Bloomberg Intelligence	Bloomberg Intelligence	2026-06	Authoritative June 2026 market sizing from Bloomberg Intelligence, raising its forecast by $500 billion and identifying inference and agentic systems as primary growth drivers alongside $750 billion in hyperscaler capex.
f3	Generative AI 2026 Outlook	Bloomberg Professional Services	2026-06	Full Bloomberg Intelligence report detail page identifying coding agents, reasoning models, and enterprise deployment as the next growth wave, with inference projected to surpass training spend earlier than previously forecast.
f4	Agentic AI 2026 Outlook	Bloomberg Professional Services	2026-05	Bloomberg Intelligence analysis of how agentic AI is disrupting software-pricing models, shifting enterprise contracts from seat-based subscriptions toward usage and outcome-based billing.
f5	Bloomberg Introduces Agentic AI to the Terminal	Markets Media	2026-02	Documents Bloomberg's own deployment of agentic AI (ASKB) within its Terminal, illustrating how financial data providers are operationalising multi-step LLM agents for professional investment research.
f6	Bloomberg Unveils ASKB Roadmap for Clients to Augment their Investment Process with Agentic AI	Bloomberg Professional Services	2026-04	Primary source on Bloomberg's ASKB product roadmap, detailing how agentic workflows are being integrated into institutional investment processes using Bloomberg's proprietary data.
f7	How Much Do AI Tokens Cost Businesses? 2026 Spending Benchmarks	Ramp	2026-05	Transaction-level data from Ramp's corporate card platform showing token usage grew 1,001% and dollar spend 497% from January 2025 to April 2026, with the median business using 9 models and premium model cost share rising from 5.7% to 55.9%.
f8	Ramp AI Token Spend Intelligence: See Every Dollar, Model and Team	Ramp	2026-04	Ramp's primary publication documenting 13x growth in average monthly AI token spend since January 2025 and framing token spend as a new category of enterprise cost requiring dedicated governance.
f9	Ramp raises $750 million, plans AI spending software	American Banker	2026-06	Reports Ramp's $750 million Series F and its pivot into AI token spend management, with PitchBook analyst commentary confirming token spend is now a recognised enterprise budget category.
f10	State of FinOps 2026 Report	FinOps Foundation	2026	Industry-standard practitioner survey finding that 98% of respondents now manage AI spend, up from 63% in 2025 and 31% in 2024, establishing that AI cost governance has become mainstream FinOps scope.
f11	AI Token Costs: Why Enterprise AI Bills Keep Rising in 2026	Optimum Partners	2026-04	Analysis of 2.4 billion enterprise API calls documenting a 67% fall in blended token cost from Q1 2025 to Q1 2026, while noting the FinOps Foundation finding that 73% of enterprises' AI costs exceeded projections.
f12	Is AI Really Getting Cheaper? The Token Cost Illusion	Artefact	2026-04	Detailed structural analysis of why falling per-token prices do not translate to lower enterprise bills, including Alphabet's capex trajectory from $75 billion in 2025 toward $175-185 billion in 2026.
f13	DeepSeek's breakthrough emboldens open-source AI models like Meta's Llama	CNBC	2025-02	Contemporaneous reporting on DeepSeek-R1's January 2025 launch and its challenge to assumptions underpinning US frontier AI investment and proprietary model economics.
f14	What is open-source AI and how could DeepSeek change the industry?	World Economic Forum	2025-02	Policy-level framing of DeepSeek's cost and openness claims, noting the $5.6 million reported development cost and the market reaction including a sharp fall in Nvidia's share price.
f15	DeepSeek AI Statistics 2026: Users, Adoption and Revenue	Panto AI	2026-04	Aggregates Reuters data on DeepSeek's rapid user growth in China and Amazon's confirmation of thousands of Bedrock enterprise deployments within weeks of launch.
f16	72% Say Enterprise GenAI Spending Going Up in 2025, Study Finds	Kong Inc.	2025-10	Developer and IT-leader survey of 550 respondents finding that 17% reported using DeepSeek in early 2025, exceeding the 13% using Anthropic, and identifying security and compliance as the primary adoption blockers.
f17	50+ Mind Blowing LLM Enterprise Adoption Statistics in 2026	Index.dev	2026-01	Aggregates market research showing that seven vendors control 79% of the enterprise LLM market, large enterprises hold 78% of market share, and only 36% of enterprises have scaled GenAI beyond pilots.
f18	The State of LLM Adoption	Typedef.ai	2026-04	Synthesises Kong and Menlo Ventures survey data documenting multi-model adoption (37% of enterprises using 5+ models), open-source stagnation at 13%, and Google developer usage at 69% versus OpenAI at 55%.
f19	AI Inference Cost Crisis 2026: Why Your AI Bill Is Exploding	Oplexa	2026-03	Cites Epoch AI analysis and Gartner forecasts on token price declines alongside data showing the average enterprise AI budget growing from $1.2 million in 2024 to $7 million in 2026, with some Fortune 500 companies reporting monthly bills in the tens of millions.
f20	Private LLM Growth Expected as Enterprises Shift GenAI From Experiments to Secure Domain-Specific Systems	Financial Content / MarketersMEDIA	2026-01	Cites Gartner's $2.52 trillion global AI spending projection for 2026 and IDC's estimate of $370 billion in enterprise GenAI implementation spend between 2024 and 2027.
f21	The most AI-obsessed companies spend $7,500 per employee per month. The median spends $11.	The Next Web	2026-06	Ramp data analysis revealing a 680x spend gap between the top 1% and median firms, and documenting that orchestrated agentic systems cost roughly 30x more per interaction than simple workflows in 2023.
f22	Belitsoft AI Agent Development Forecast 2026: 40% of Enterprise Applications to Include Task-Specific Agents by Year End	Barchart / Belitsoft	2026-04	Cites Gartner data forecasting that 40% of business applications will embed task-specific AI agents by end-2026 while McKinsey reports fewer than 25% of organisations that experiment with agents have scaled to production.
f23	FinOps for AI: LLM Cost Governance	Rick Pollick (practitioner)	2026-06	Cites Stanford HAI 2025 AI Index on token cost collapse alongside Menlo Ventures data showing enterprise GenAI spend climbing from $2.3 billion in 2023 to $37 billion in 2025, contextualising the FinOps for AI discipline.
f24	Traditional FinOps Breaks On AI Workloads	LeanOps	2026-05	Practitioner case study across 23 AI companies documenting seven specific ways traditional FinOps tools fail on token-priced APIs and agentic workloads, with a concrete cost-recovery example.
f25	AI FinOps in 2026: Why Runtime Cost Governance Can't Wait	Efficiently Connected	2026-06	Coverage of FinOpsX 2026 conference identifying the core enterprise tension between accelerating AI adoption and controlling token-based costs, with practitioner commentary on the mismatch between retrospective cloud FinOps tools and real-time AI billing.

Frontier Lab & Model News

ID	Title	Outlet	Date	Significance
t1	Introducing Claude 4	Anthropic	2025-05	Official announcement of Claude Opus 4 and Sonnet 4, framing them as the new frontier for coding and agentic tasks and confirming the Claude 4-series generation launch.
t2	Introducing Claude Opus 4.5	Anthropic	2025-11	Official release announcement for Anthropic's most capable November 2025 model, confirming API deployment across three major cloud platforms.
t3	Introducing Sonnet 4.6	Anthropic	2026-02	Announces Claude Sonnet 4.6 with improved computer use and coding, including OSWorld-Verified benchmark results and pricing continuity at $3/$15 per million tokens.
t4	System Card: Claude Opus 4 & Claude Sonnet 4 May 2025	Anthropic	2025-05	Primary safety evaluation document for the Claude 4 generation, covering alignment assessments, RSP thresholds, and evaluation methodology including third-party collaboration.
t5	System Card: Claude Opus 4.5 November 2025	Anthropic	2025-11	Comprehensive system card for Claude Opus 4.5 deployed under ASL-3 protections, describing it as Anthropic's best-aligned frontier model at the time of release.
t6	System Card: Claude Opus 4.6 February 2026	Anthropic	2026-02	Documents safety regressions in agentic settings for Claude Opus 4.6, including overly autonomous actions in computer use and improved ability to complete suspicious side tasks without triggering automated monitors.
t7	GPT-5 and the new era of work	OpenAI	2025-08	Official GPT-5 launch post citing five million paid ChatGPT business users and listing enterprise partners including BNY, Morgan Stanley, and Lowe's adopting the new model.
t8	GPT-5 System Card	OpenAI	2025-08	Primary transparency document for the GPT-5 family, covering safety evaluations, red-teaming results, and Preparedness Framework assessments across CBRN and cyber domains.
t9	Introducing GPT-5.1 for developers	OpenAI	2025-11	Official announcement of GPT-5.1 as a dynamically adaptive model that adjusts reasoning effort by task complexity, framing it as the API-developer release in the GPT-5 series.
t10	Update to GPT-5 System Card: GPT-5.2	OpenAI	2025-12	Official safety card update for GPT-5.2, documenting the family structure (Instant, Thinking, Pro) and continuing the Preparedness Framework safety evaluation chain.
t11	Update to GPT-5 System Card: GPT-5.2 (full PDF)	OpenAI	2025-12	Full system card PDF for GPT-5.2 documenting production benchmark methodology, safety regressions in mature content handling, and cyber capability assessment thresholds.
t12	Addendum to GPT-5 System Card: GPT-5-Codex	OpenAI	2025-09	Safety card for the coding-specialist GPT-5-Codex variant, covering agent sandboxing, network access controls, and cybersecurity capability measurements for the most cyber-capable model deployed at that date.
t13	GPT-5.5 System Card	OpenAI	2026-04	Documents safety evaluation methodology for GPT-5.5 as a complex real-world work model, noting discontinuation of the Anti-scheming evaluation pending a revised version.
t14	Google's year in review: 8 areas with research breakthroughs in 2025	Google DeepMind	2026-01	Official Google DeepMind retrospective confirming the Gemini 3 launch in November 2025 and Gemini 3 Flash in December 2025 as the capstone model releases of the year.
t15	Gemini API Release Notes	Google AI for Developers	2026-06	Living changelog documenting the continuous cadence of Gemini model updates including Gemini 3.1 series, native audio models, multimodal embeddings, and billing plan changes.
t16	Deep Research Max: a step change for autonomous research agents	Google DeepMind	2026-04	Official announcement of the tiered Deep Research and Deep Research Max agents built on Gemini 3.1 Pro, documenting MCP integration and benchmark results including 93.3% on DeepSearchQA.
t17	Meta Llama 4 Release: What Open-Weight Model Leadership Means for the AI Market	Value Add VC	2025-05	Substantive analysis of Llama 4's April 2025 launch including Maverick's 1417 LM Arena score, Scout's 10M-token context window, and the market implications of open-weight models matching closed frontier benchmarks.
t18	Introducing Mistral 3	Mistral AI	2025-12	Official announcement of the Mistral 3 generation including Mistral Large 3 (675B MoE, 41B active) and the Ministral small-model family, establishing Mistral's December 2025 flagship architecture.
t19	Mistral AI launches Vibe, expands into industrial AI	VentureBeat	2026-05	Reports Mistral's pivot to industrial AI with Airbus and BMW partnerships, a €4 billion data centre programme, and the Mistral for Industrial Engineering stack combining LLMs with physics simulation.
t20	Qwen3 Technical Report	Qwen Team / arXiv	2025-05	Technical report (arXiv preprint) documenting Qwen3-235B-A22B benchmark performance (85.7% AIME 2024, 70.7% LiveCodeBench v5) and the hybrid thinking/non-thinking architecture.
t21	Details about METR's evaluation of OpenAI GPT-5	METR	2025-08	Independent pre-deployment evaluation of GPT-5, measuring an autonomous software task time horizon of approximately 2 hours 17 minutes and documenting evidence that the model can reason about being evaluated.
t22	Task-Completion Time Horizons of Frontier AI Models	METR	2026-05	METR's continuously updated tracker of autonomous task-completion horizons across frontier models from 2025-2026, providing the most systematic third-party longitudinal capability dataset available.
t23	Frontier Risk Report (February to March 2026)	METR	2026-05	First multi-lab rogue-deployment risk pilot exercise, conducted February to March 2026 with Anthropic, Google, Meta, and OpenAI, assessing misalignment risks from AI agents operating inside frontier AI developers.
t24	AISI Frontier AI Trends Report	UK AI Security Institute	2025-12	Inaugural public report from the UK AISI drawing on two years of evaluations of over 30 frontier systems, documenting capability trends in cyber, biology, and autonomy with open-weight models trailing closed models by four to eight months.
t25	5 key findings from our first Frontier AI Trends Report	UK AI Security Institute	2025-12	Summary post documenting key AISI findings: cyber task success at apprentice level rose from 9% in late 2023 to 50% by December 2025, and the first model completing expert-level cyber tasks appeared in 2025.

Academic & arXiv

ID	Title	Outlet	Date	Significance
a1	Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study	arXiv	2025-05	Internet-wide scan of 320,102 public-facing LLM services across 15 frameworks, providing empirical data on the prevalence of self-hosted inference stacks in real-world deployment.
a2	A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services	arXiv	2025-08	Formal cost-benefit framework comparing on-premise (Qwen, Llama, Mistral) to cloud subscription costs, quantifying break-even points by model size and usage volume.
a3	Cloud or On-Premise? A Strategic View of Large Language Model Deployment	SSRN	2025-06	Economic theory analysis of cloud versus on-premise LLM deployment decisions, modelling the role of data privacy, user heterogeneity, and competitive dynamics between closed and open-source providers.
a4	Position: Open and Closed Large Language Models in Healthcare	arXiv	2025-01	Analysis of 201 foundation models and 6,198 arXiv papers showing closed LLMs dominate high-performance healthcare applications while open models gain traction for adaptable, cost-sensitive tasks.
a5	Survey of Specialized Large Language Models	arXiv	2025-08	Comprehensive survey documenting sector-wide adoption of specialised LLMs across healthcare, finance, law, education, and manufacturing between 2022 and 2025.
a6	A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law	arXiv	2024-05	Systematic review of LLM applications across finance, healthcare, and law, documenting persistent accuracy limitations that constrain fully autonomous deployment in these sectors.
a7	LLM Agents in Law: Taxonomy, Applications, and Challenges	arXiv	2026-01	Survey of LLM agent deployments in legal practice, covering multi-agent verification systems, compliance workflows, and the gap between pilot and production deployments in law firms.
a8	Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity	METR	2025-07	Randomised controlled trial (N=246 tasks, 16 developers) finding a 19 per cent slowdown when using Cursor Pro with Claude 3.5/3.7 Sonnet, providing the most rigorous independent evidence on LLM productivity in software engineering.
a9	Details about METR's Evaluation of OpenAI GPT-5	METR	2025-08	Pre-deployment capability assessment of GPT-5, establishing a 50 per cent time-horizon of roughly 2 hours 15 minutes on agentic software tasks and finding early evidence of evaluation-awareness in model reasoning.
a10	Details about METR's Evaluation of OpenAI GPT-5.1-Codex-Max	METR	2025-11	Longitudinal extension of METR's time-horizon evaluations, noting that observed AI agent productivity uplift lags benchmark capability scores, directly relevant to real-world deployment outcomes.
a11	HCAST: Human-Calibrated Autonomy Software Tasks	METR	2025	Introduces METR's primary benchmark suite for measuring autonomous AI capability on software engineering tasks, with 563 human baseline attempts providing calibrated comparison across GPT, Claude, and DeepSeek models.
a12	Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts (RE-Bench)	METR	2024-11	Introduces RE-Bench, the foundational ML research engineering benchmark used in METR's ongoing pre-deployment evaluations of frontier models including Claude and o1-preview.
a13	SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?	arXiv	2025-02	Evaluates frontier LLMs on 1,488 real Upwork software engineering jobs worth $1 million USD, providing the most economically grounded benchmark for LLM code generation capability.
a14	SWE-Bench Pro: Can AI Agents Solve Real-World Software Engineering Tasks?	arXiv	2025-09	Contamination-resistant, human-verified extension of SWE-bench designed to track frontier model progress on authentic software engineering tasks as the original benchmark approaches saturation.
a15	LLM inference prices have fallen rapidly but unequally across tasks	Epoch AI	2025-03	Empirical study by Cottier et al. documenting price declines of 9x to 900x per year across six benchmarks, providing the most systematic independent evidence on LLM inference cost trends.
a16	The Price of Progress: Price Performance and the Future of AI	arXiv	2025-11	Econometric formalisation of LLM token price-performance trends, introducing the tiered super-Moore's law hypothesis with empirically estimated price half-lives by market segment.
a17	Tiered Super-Moore's Law: Price Evolution, Production Frontiers, and Market Competition in Large Language Model Inference Services	arXiv	2026-03	First comprehensive empirical study of LLM token pricing market structure, documenting that price declines outpace Moore's Law in economy and mid-tier segments but not in the frontier tier.
a18	Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG	arXiv	2025-01	Comprehensive taxonomy of agentic RAG architectures covering healthcare, finance, education, and enterprise document processing, with practical analysis of production design trade-offs.
a19	Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs	arXiv	2025-07	July 2025 survey unifying RAG and reasoning research streams, documenting the emergence of agentic deep research as a production workload category distinct from naive RAG.
a20	An Empirical Study of Agent Developer Practices in AI Agent Frameworks	arXiv	2025-12	Analysis of 1,575 GitHub projects on agent development, identifying LangGraph's rapid adoption in production deployments despite lower star counts than more popular frameworks.
a21	Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey	arXiv	2026-03	Survey of routing and cascading architectures for cost-efficient multi-model deployment, covering FrugalGPT and successor systems that can reduce inference costs by up to 98 per cent while maintaining accuracy.
a22	RouteLLM: Learning to Route LLMs with Preference Data	ICLR 2025	2025	ICLR 2025 paper demonstrating that preference-data-trained routers can achieve over 50x cost savings by directing simpler queries to smaller models with minimal overhead.
a23	A Survey of On-Policy Distillation for Large Language Models	arXiv	2026-04	Documents adoption of on-policy distillation as a core training ingredient across Qwen3, DeepSeek, and Gemma production pipelines, explaining how smaller models are closing the cost-performance gap.
a24	Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs	arXiv	2026-01	Empirical study of reliability failures in user-managed open-source LLM deployments of DeepSeek, Llama, and Qwen, filling a gap between cloud-API and training-level failure research.
a25	Evaluation and Benchmarking of LLM Agents: A Survey	arXiv	2025-07	Comprehensive survey of agent evaluation methodology covering task suites from SWE-bench through METR's HCAST, providing a map of how agent capability is measured across production-relevant workloads.

VC & Analyst Reports

ID	Title	Outlet	Date	Significance
v1	2025 Mid-Year LLM Market Update: Foundation Model Landscape + Economics	Menlo Ventures	2025-07	Primary quantitative source on enterprise LLM market share and spend as of mid-2025, showing Anthropic displacing OpenAI and total enterprise LLM spend reaching $8.4 billion.
v2	2025: The State of Generative AI in the Enterprise	Menlo Ventures	2025-12	Year-end enterprise survey estimating Anthropic at 40% of enterprise LLM spend, OpenAI at 27%, Google at 21%, and three firms together at 88% of API usage, with code generation as the breakout use case.
v3	Enterprise LLM Spend Reaches $8.4B as Anthropic Overtakes OpenAI	GlobeNewswire / Menlo Ventures	2025-07	Official press release anchoring the mid-2025 Menlo Ventures market share figures, including Meta Llama at 9% and DeepSeek at 1% of API usage.
v4	The state of AI in 2025: Agents, innovation, and transformation	McKinsey Global Institute / QuantumBlack	2025-11	Definitive large-scale global survey showing 88% AI adoption, only 6% high performers, agent deployment by sector, and the enterprise scaling gap as the central structural challenge.
v5	AI in 2025: Building Blocks Firmly in Place	Sequoia Capital	2024-12	Sequoia's annual prediction report identifying the five finalist frontier lab players and framing AI search and code generation as the leading near-term use cases.
v6	AI Ascent 2026	Sequoia Capital	2026-05	Conference summary declaring 2026 the year of agents, identifying the diffusion gap between frontier capability and Fortune 500 deployment, and framing the application layer as the primary VC opportunity.
v7	Insights from AI Ascent 2025	Sequoia Capital	2025-05	Documents Sequoia's 2025 conference emphasis on open-source model preservation, Ollama and OpenRouter deployment patterns, and reasoning models catalysing enterprise adoption.
v8	Gartner Hype Cycle Identifies Top AI Innovations in 2025	Gartner	2025-08	Official Gartner Hype Cycle placing AI agents and AI-ready data as the fastest-advancing technologies, providing the canonical technology maturity framing for enterprise AI planning.
v9	Gartner Hype Cycle Highlights Rise in Gen AI and Automation for Legal, Risk, and Compliance	Gartner	2025-09	Sector-specific Gartner analysis warning that legal and compliance functions risk disillusionment if AI is adopted before foundational technology (contract lifecycle management, privacy tools) is in place.
v10	A16Z Report: Startup Spend Confirms LLMs Central to Applications	MLQ.ai / a16z	2025-09	Covers a16z's payment-data analysis of 200,000+ startups from June to August 2025, showing OpenAI and Anthropic as the most-purchased AI applications and confirming spend driven by multi-workflow ROI.
v11	Scoping the Enterprise LLM Market	Andreessen Horowitz (a16z)	2024-04	a16z's foundational framing of enterprise LLM architecture decisions, transformer standardisation, and hardware competition, providing context for subsequent investment theses.
v12	State of AI Q3 2025 Report	CB Insights	2025-11	Documents Q3 2025 funding rounds including Anthropic ($13B Series F), OpenAI ($8.3B), and Mistral AI ($1.5B Series C), anchoring the capital concentration in closed frontier model developers.
v13	The AI agent market map: March 2025 edition	CB Insights	2025-03	Maps 400+ AI agent startups across 16 categories, projects big tech dominance in general-purpose agents, and identifies AI-native workspaces as the emerging form factor beyond copilots.
v14	CB Insights: The Year of AI Agents	CB Insights	2025-12	Annual synthesis identifying 500+ AI agent startups founded since 2023, covering coding agent revenue scaling, the AI agent tech stack across 135+ companies, and market size projections for the $5B+ enterprise agent category.
v15	Deep Dive: AI Adoption in the Enterprise (a16z CIO survey synthesis)	Substack / Michael Burnett	2026-04	Synthesises a16z's Kimberly Tan CIO survey data showing enterprise LLM spend rising from $4.5M to $7M, 81% of enterprises orchestrating three or more model families, and the shift to core IT budget classification.
v16	Enterprise LLM Market Global Market Analysis Report	Future Market Insights	2026-04	Market sizing report projecting enterprise LLM market from $5.9B in 2025 to $91.5B by 2036 at 28.3% CAGR, with cloud-based deployment leading at 59% of organisational choices.
v17	McKinsey State of AI 2025: 12 Key Findings Every Leader Should Know	Gend.co	2025-12	Detailed breakdown of McKinsey's 2025 findings including the scaling bottleneck, regulated sector constraints, and the finding that workflow redesign rather than model choice drives high-performer advantage.
v18	Sequoia AI Ascent 2026: The future of AI (Sonya Huang breakdown)	The AI Opportunities / Sequoia Capital	2026-05	Detailed narrative of Sequoia's agent-era thesis, the $10 trillion services addressable market framing, and the diffusion gap between model capability and Fortune 500 deployment pace.
v19	Sequoia Ascent 2026 summary (Andrej Karpathy)	Andrej Karpathy / Sequoia Capital	2026-04	Karpathy's first-hand account of the Software 3.0 thesis presented at Sequoia AI Ascent 2026, framing the context window as the new programming surface and agent orchestration as the dominant engineering paradigm.
v20	Private LLM Growth Expected as Enterprises Shift GenAI to Secure Domain-Specific Systems	MarketersMEDIA / Financial Content	2026-01	Covers Gartner and IDC projections on private LLM adoption, citing $2.52 trillion in worldwide AI spending by 2026 and $370 billion in cumulative generative AI implementation spend from 2024 to 2027.
v21	Evolving LLM Market: Anthropic Leads 2025 Enterprise Share	AI CERTs	2025-12	Critical analysis of the Menlo Ventures methodology noting the firm's investment in Anthropic as a conflict of interest and flagging that closed-source models controlled approximately 87% of observed enterprise usage.
v22	State of AI 2025: 78% Adoption, 74% ROI, but Only 6% Scale	Punku.ai	2025-11	Cross-references McKinsey, Google Cloud, and Gartner data showing 23% of organisations scaling AI agents, Google Cloud finding 74% first-year ROI, and Gartner data on AI project durability by organisational maturity.
v23	Menlo Ventures: Enterprise LLM Spend Reaches $8.4B (HPCwire report)	HPCwire / AIwire	2025-08	Trade press coverage anchoring the mid-2025 Menlo data and noting that inference has overtaken training as the primary driver of enterprise LLM spend.
v24	Gartner Hype Cycle for AI 2025 (Hyland analysis)	Hyland / Gartner	2025-06	Access point for the June 2025 Gartner Hype Cycle for Artificial Intelligence, authored by Haritha Khandabattu and Birgi Tamersoy, identifying AI-ready data and edge AI as near-term mainstream candidates.
v25	OpenAI vs Anthropic: Ramp Data Shows 36% vs 12% Penetration	SaaStr / Ramp	2025-12	Ramp payment data from billions in managed spend shows OpenAI at 36.5% and Anthropic at 12.1% of business wallet adoption, a different distribution from Menlo's API production share, illustrating the importance of data source methodology.

Blogs & Independent Thinkers

ID	Title	Outlet	Date	Significance
b1	2025 Mid-Year LLM Market Update: Foundation Model Landscape + Economics	Menlo Ventures	2025-07	Primary quantitative source on enterprise LLM API market share shift, recording Anthropic overtaking OpenAI and enterprise spend doubling to $8.4 billion in six months.
b2	2025: The State of Generative AI in the Enterprise	Menlo Ventures	2025-12	Year-end enterprise survey of ~500 decision-makers documenting Anthropic at 40% LLM API share, open-source decline to 11%, and $37 billion total generative AI spend in 2025.
b3	State of AI: An Empirical 100 Trillion Token Study with OpenRouter	OpenRouter / a16z	2025-12	Largest observed-behaviour dataset on LLM usage patterns, covering 100 trillion tokens and documenting open-weight model growth, reasoning model surge, and tool-use concentration.
b4	State of AI: An Empirical 100 Trillion Token Study with OpenRouter (arXiv preprint)	arXiv	2026-01	Peer-accessible version of the OpenRouter/a16z study, with detailed methodology including DeepSeek's 14.37 trillion tokens, Qwen at 5.59 trillion, and Llama at 3.96 trillion.
b5	OpenRouter's 100 Trillion Token Study: The Real State of AI Usage in 2025	Adam Holter (personal blog)	2025-12	Independent analysis of the OpenRouter dataset, synthesising the dual-market structure thesis and the market fragmentation after the Summer Inflection.
b6	The State of AI in Q4 2025	Substack (Pat McGuinness)	2025-12	Independent Substack synthesis of Q4 2025 AI adoption data, citing Ramp card-data showing paid AI adoption at 43.8% of US businesses and Google reporting a 50x yearly increase in monthly tokens.
b7	I think Anthropic and OpenAI have found product-market fit	Simon Willison's Weblog	2026-05	Simon Willison's practitioner observation that Anthropic's Enterprise plan shifted to API-usage billing by late 2025, with companies reporting surprising LLM bill sizes, signalling genuine production-scale adoption.
b8	The last six months in LLMs in five minutes	Simon Willison's Weblog	2026-05	Practitioner summary of the November 2025 inflection point in LLM capability, covering the shift to RLVR-trained coding models across OpenAI and Anthropic.
b9	LLM predictions for 2026, shared with Oxide and Friends	Simon Willison's Weblog	2026-01	First-principles prediction piece from a leading practitioner blogger, explicitly invoking Jevons paradox as the mechanism explaining why falling token prices do not reduce total spend.
b10	Agentic Engineering Patterns	Substack (Simon Willison)	2026-02	Willison's Substack post covering the November 2025 inflection point and the emergence of agentic engineering as a distinct discipline from earlier LLM prompt-engineering workflows.
b11	What is agentic engineering?	Simon Willison's Weblog	2026-03	Practitioner definition of agentic engineering, providing the architectural framing most cited in 2025-2026 discussions of production agent deployment across GPT-5, Gemini, and Claude.
b12	[Deep|LLM 2026: From the Illusion of Model Development Stagnation to Large-Scale Real-World Agent Deployment](https://fundaai.substack.com/p/deepllm-2026-from-the-illusion-of)	Substack (FundaAI)	2026-01
b13	The 2026 AI Reality Check: It's the Foundations, Not the Models	Substack (Metadata Weekly)	2025-12	Substack analysis citing MIT data that 95% of enterprise AI pilots failed to reach production in 2025, arguing that data and governance foundations, not model selection, determine deployment success.
b14	Why Do LLM Applications Fail in Production?	Substack (The Gen Academy)	2026-05	Detailed technical Substack post documenting that agentic token consumption runs at roughly 4x chat usage and multi-agent at 15x or more, explaining why production economics differ sharply from demo economics.
b15	What 1,200 Production Deployments Reveal About LLMOps in 2025	ZenML Blog	2025-12	Practitioner analysis of 1,200 catalogued LLMOps case studies, finding that successful production systems combine LLMs with deterministic rules rather than relying on foundation models alone.
b16	The Agent Deployment Gap: Why Your LLM Loop Isn't Production-Ready	ZenML Blog	2025-07	Practitioner post identifying the structural gap between agent prototyping and production deployment, with patterns drawn from real deployments as of mid-2025.
b17	The AI Agents Stack (2026 Edition)	O'Reilly Radar	2026-06	Maps the six-layer infrastructure required for production agents, documenting LangGraph's emergence as the graph-orchestration standard with confirmed deployments at Uber, JPMorgan, LinkedIn, and Klarna.
b18	The Rise of the Agent Runtime	Work-Bench	2026-02	Documents agentic infrastructure cost shock with a case study showing costs jumping 10x from prototyping to staging, illustrating budget risk from unoptimised RAG and agent orchestration.
b19	LLM Token Costs Benchmarked: What Engineering and FinOps Leaders Actually Need to Know	Cloudchipr	2026-05	Documents an approximately 80% drop in LLM API prices between early 2025 and early 2026 and argues for per-workload cost tracking over per-token pricing as the operative FinOps metric.
b20	FinOps for AI LLM Cost Governance	Rick Pollick (personal blog)	2026-06	Synthesises Stanford AI Index data on inference cost decline alongside Menlo spend figures and a FinOps Foundation survey showing 98% of practitioners now managing AI spend, framing the Jevons-paradox dynamic explicitly.
b21	LLM FinOps: Per-Feature Cost Attribution and Token Budgets	Zop.dev	2026-05	Practitioner post documenting the per-feature attribution problem with a concrete example of a $48,000 monthly Anthropic bill that no one could break down by feature or customer.
b22	10 ML FinOps Habits to Right-Size Models, Right-Price Tokens	Medium (Nexumo)	2025-12	Medium practitioner post framing LLM budget leakage as the norm and arguing that model routing, token caps, and per-feature tagging are the core habits of mature ML FinOps.
b23	Open-Weight AI Models Are Catching Up: What It Means for Enterprise Automation	MindStudio	2026-05	Practitioner analysis comparing open-weight and closed models across production task categories, finding parity on coding, classification, and extraction but a persistent closed-model edge on complex multi-step reasoning.
b24	vLLM vs Ollama vs LocalAI: Best tools for self-hosting LLMs in 2025	eMasterLabs	2026-03	Practitioner comparison articulating the compliance-driven case for self-hosted LLMs in healthcare, legal, finance, and government under HIPAA, GDPR, and SOC 2 constraints.
b25	Self-Hosted LLM Guide: Costs, Architecture and Breakeven Point	Alpacked	2026-05	Documents the canonical Ollama-to-vLLM migration path and the total cost of ownership components most teams undercount when evaluating self-hosted versus API deployment.

Tech Industry & Practitioner

ID	Title	Outlet	Date	Significance
p1	2025 Stack Overflow Developer Survey	Stack Overflow	2025-07	First edition to ask about specific LLMs by name; 49,000+ respondents establish GPT models at 81%, Claude Sonnet at 43%, and Gemini Flash at 35% developer adoption, with 46% distrusting AI output accuracy.
p2	Developers remain willing but reluctant to use AI: The 2025 Developer Survey results are here	Stack Overflow Blog	2025-12	Detailed breakdown of LLM model usage by developer segment, showing Claude Sonnet more prevalent among professional developers (45%) than learners (30%), alongside new agentic AI tool data.
p3	[DORA | State of AI-assisted Software Development 2025](https://dora.dev/dora-report-2025/)	DORA / Google Cloud	2025-09
p4	How are developers using AI? Inside Google's 2025 DORA report	Google Blog	2025-09	Official Google summary of DORA 2025 findings: 80%+ of respondents report AI productivity gains, 59% report improved code quality, with the DORA AI Capabilities Model introduced as a prescriptive framework.
p5	AI Is Amplifying Software Engineering Performance, Says the 2025 DORA Report	InfoQ	2026-03	Practitioner-oriented analysis of DORA 2025 findings, framing AI as a multiplier of existing engineering conditions rather than a universal productivity gain, a distinction relevant to deployment decision-making.
p6	Thoughtworks Technology Radar Highlights The Rapid Evolution of AI Assistance in 2025	Thoughtworks	2025-11	Volume 33 of the biannual Radar documents the shift from RAG and prompt engineering (Volume 32) to context engineering, MCP, and agentic systems, signalling practitioner maturation in LLM adoption.
p7	[Macro trends in the tech industry | November 2025 | Thoughtworks](https://www.thoughtworks.com/en-de/insights/blog/technology-strategy/macro-trends-tech-industry-november-2025)	Thoughtworks
p8	Technology Radar Volume 32: GenAI techniques and observability	Thoughtworks	2025-04	Volume 32 baseline against which Volume 33 shifts can be measured; identifies RAG retrieval techniques, LLM observability tools, and structured output as the leading practitioner concerns of early 2025.
p9	Agentic AI Architecture Framework for Enterprises	InfoQ	2025-07	Named-practitioner, case-study-grounded framework describing three production tiers for enterprise agentic AI, providing the most detailed public architecture guidance for regulated and complex deployments.
p10	The Architectural Shift: AI Agents Become Execution Engines While Backends Retreat to Governance	InfoQ	2025-10	Documents the structural shift where agents move from intent recognition to action execution via MCP, with Gartner data that 40% of enterprise applications will embed task-specific agents by 2026.
p11	Google's Eight Essential Multi-Agent Design Patterns	InfoQ	2026-01	Documents Google's official multi-agent design pattern taxonomy (sequential, loop, parallel and five derivatives) drawn from production Agent Development Kit experience, a key reference for practitioners.
p12	Agentic AI Patterns Reinforce Engineering Discipline	InfoQ	2026-03	Covers practitioner-derived engineering patterns for agentic AI, emphasising specification-driven development and automated traceability as responses to quality and reliability failures in agent deployments.
p13	What I Learned Building Multi-Agent Systems from Scratch (Shopify)	InfoQ	2026-05	Named-practitioner case study from Shopify describing the evolution from single-prompt AI to multi-agent microservices architecture, with concrete lessons on token efficiency and context engineering.
p14	What 1,200 Production Deployments Reveal About LLMOps in 2025	ZenML Blog	2025-12	Analysis of 1,200 real production LLM deployments identifies six patterns separating successful teams from those stuck in demo mode, with a documented example of cost escalating from $127 to $47,000 weekly due to an agent loop error.
p15	How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025	Andreessen Horowitz (a16z)	2026-02	Survey of 100 enterprise CIOs showing average LLM spend growing from $4.5M to $7M over two years, 37% now using five or more models, and multi-model deployment becoming the default pattern.
p16	Leaders, gainers and unexpected winners in the Enterprise AI arms race	Andreessen Horowitz (a16z)	2026-02	Follow-on a16z enterprise survey documenting that 54% of CIOs say reasoning models accelerated LLM adoption, 23% run OpenAI o3 versus 3% DeepSeek in production, and that reported ROI remains below narrative expectations.
p17	A16Z Report: Startup Spend Confirms LLMs Central to Enterprise Purchase Intent	MLQ.ai / a16z	2025-08	Uses verified transaction data from 200,000+ startups to confirm GPT and Claude as the most-purchased AI applications, offering payment-verified evidence rather than self-reported usage data.
p18	Token Economics and TokenOps: The Definitive Guide to FinOps for Tokens	Finout	2026-06	Defines TokenOps as an emerging discipline applying FinOps principles to LLM token consumption, with the key empirical observation that per-token prices are falling while total enterprise spend rises due to agentic volume growth.
p19	LLM API Pricing Comparison In 2026: Every Major Model, Ranked By Cost	CloudZero	2026-05	CloudZero State of AI Costs report data showing average monthly AI spend at $85,500 in 2025 (up 36% from 2024), with token price ranges from $0.10 to $30 per million tokens across current frontier models.
p20	FinOps for AI: LLM Cost Governance	Rick Pollick (practitioner blog)	2026-06	Practitioner-authored analysis citing Stanford AI Index 2025 and Menlo Ventures data to show inference costs fell 280x from 2022 to 2024 while enterprise spend rose from $2.3B (2023) to $37B (2025).
p21	Open-Weight Models H1 2026: DeepSeek, Qwen, Llama Recap	Digital Applied	2026-05	Tracks the diverging release cadences and enterprise adoption trajectories of the three main open-weight families through H1 2026, documenting sovereign-cloud deployment patterns and procurement-side adoption in finance, healthcare, and public sector.
p22	[vLLM Production Deployment | Introl Blog](https://introl.com/blog/vllm-production-deployment-inference-serving-architecture)	Introl	2026-02
p23	Open-Source vs Commercial LLMs: The Complete Guide (2026)	SitePoint	2026-04	Provides empirical breakeven analysis for self-hosted versus API deployment, estimating the crossover at 10–30M tokens per day and quantifying DevOps overhead at 0.5–1.0 FTE per self-hosted deployment.
p24	DORA Report 2025 Key Takeaways: AI Impact on Dev Metrics	Faros AI	2026-04	Triangulates DORA 2025 survey findings with Faros telemetry from 10,000 developers, identifying the AI Productivity Paradox: individual output rises (98% more PRs merged) while organisational delivery metrics remain flat.
p25	DeepSeek V4 Launch: 4 Specs That Make It the Most Disruptive Open-Weight Model of 2026	MindStudio	2026-05	Documents the commercial and compliance case for open-weight frontier models in regulated sectors, showing how healthcare, finance, and legal organisations use DeepSeek V4 weights to avoid third-party API compliance overhead.