Research sweep · deep · 2025 – 2026
Comparative LLM Usage Across Sectors
Comparative real-world usage of LLMs and adjacent AI technologies from June 2025 to June 2026: which models (GPT-5, Claude, Gemini, Llama, Mistral, DeepSeek, Qwen) dominate which sectors, how they are deployed (hosted API, Bedrock/Azure, self-hosted vLLM/Ollama, RAG, agents, fine-tuning), what workloads they serve, and how organisations measure, budget, and publicly report token cost and actual spend.
- Claude Opus 4.8
- financial
- frontier
- academic
- vc
- blogs
- tech
Synthesised 2026-06-20
The LLM Production Economy, June 2025 to June 2026
Overview
Enterprise spending on large language models more than doubled in the first half of 2025, from $3.5 billion at the end of 2024 to $8.4 billion by mid-year, according to Menlo Ventures' survey of 150 technical leaders. That single figure captures the defining shift of the period: LLMs stopped being a curiosity bolted onto products and became a metered industrial input whose cost, vendor concentration, and architectural shape now demand the same governance discipline as cloud compute.
Sources: Yahoo Finance / GlobeNewswire (Menlo Ventures) (2025) (↗); Menlo Ventures (2025) (↗)
The market reordered itself fast. Anthropic overtook OpenAI as the leading enterprise API provider, reaching 40% of enterprise LLM spend by year-end 2025 while OpenAI fell from a 2023 peak near 50% to 27%, and Google's Gemini climbed to roughly 21%. Three firms now account for nearly 88% of enterprise API usage. At the same time, frontier labs abandoned annual releases for point updates every one to three months, producing a naming blizzard (GPT-5 through 5.5, Claude Opus 4 through 4.6, Gemini 3 through 3.1) that makes genuine capability gains hard to separate from noise.
Sources: Menlo Ventures (2025) (↗); OpenAI (2025) (↗); Anthropic (2025) (↗)
The cost story is structurally paradoxical and matters more than the share story. Blended enterprise token prices fell roughly 67% to 80% year-on-year, yet total bills rose sharply, with Ramp's transaction data showing average monthly AI token spend up 13x since January 2025. This is Jevons-paradox behaviour: cheaper tokens drove far more token consumption, mostly through agentic loops that re-submit context dozens of times per task.
Sources: Ramp (2026) (↗); Artefact (2026) (↗)
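The arithmetic behind the paradox can be made explicit. The following is a minimal sketch, assuming spend is simply blended price times token volume, and treating the price decline and the Ramp spend multiple as if they covered the same period. They come from different sources and windows, so the output is illustrative only; the function name is hypothetical.

```python
def implied_volume_multiple(spend_multiple: float, price_decline: float) -> float:
    """Token-volume growth implied by a spend multiple and a fractional price decline.

    Assumes spend = price * tokens, so
    tokens_multiple = spend_multiple / price_multiple.
    """
    if not 0 <= price_decline < 1:
        raise ValueError("price_decline must be in [0, 1)")
    return spend_multiple / (1.0 - price_decline)


if __name__ == "__main__":
    # 13x spend growth combined with a 67% and an 80% price decline.
    for decline in (0.67, 0.80):
        multiple = implied_volume_multiple(13.0, decline)
        print(f"price down {decline:.0%}, spend up 13x -> tokens up ~{multiple:.0f}x")
```

Under these assumptions, a 13x rise in spend alongside a 67% to 80% fall in unit price would require token volume to grow by roughly 39x to 65x.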
The unresolved tension running through every lane is the gap between adoption breadth and production depth. McKinsey found 88% of organisations using AI in at least one function but only about a third scaling enterprise-wide, and just 6% attributing more than 5% of EBIT to AI. The technology is everywhere and proving its value almost nowhere that auditors can verify.
Sources: McKinsey Global Institute / QuantumBlack (2025) (↗)
Timeline
- DeepSeek R1 resets cost and openness assumptions
- Inference overtakes training as dominant cost
- Llama 4 MoE and Claude 4 ship
- Enterprise spend doubles to $8.4B
- Code generation emerges as flagship workload
- GPT-5 launches with METR pre-deployment eval
- Multi-model orchestration becomes default
- RAG moves from emergent to standard
- Context engineering and MCP displace prompt engineering
- OpenRouter 100T-token study quantifies open-weight share
- Anthropic reaches 40% enterprise spend
- Bloomberg embeds agentic ASKB in the Terminal
- TokenOps named as a FinOps discipline
- DeepSeek V4 ships
- Bloomberg raises 2032 forecast to $2.3T
- Sequoia declares the agent era
- FinOps Foundation finds 98% now manage AI cost
Key Findings
Anthropic won the enterprise on the back of code. Menlo's paired 2025 reports show Anthropic moving from 12% enterprise share in 2023 to 40% by year-end 2025, with code generation as the catalyst. Claude held roughly 42% developer share in that workload, double OpenAI's figure. Code became the proving ground because it has ground truth: it compiles and passes tests or it does not, so ROI could be claimed without survey self-reporting. Menlo is an Anthropic investor, which tempers precision, but the directional shift is corroborated by OpenRouter's observed-behaviour data.
Sources: Menlo Ventures (2025) (↗); Menlo Ventures (2025) (↗); OpenRouter / a16z (2025) (↗)
The single-model regime is dead. A16z's CIO survey found 81% of enterprises now orchestrate three or more model families in production, up from 68% a year earlier, with frontier models reserved for high-stakes reasoning and cheaper models handling routine volume. Ramp's data agrees from a different angle: the median business used nine models in April 2026, the average 16.5. This tiered routing mirrors how cloud teams manage compute hierarchies and is now backed by a research programme on cascading and routing, with RouteLLM-style overheads under 0.4% of frontier generation cost.
Sources: Substack / Michael Burnett (2026) (↗); Ramp (2026) (↗); ICLR 2025 (2025) (↗); arXiv (2026) (↗)
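Tiered routing of this kind can be sketched in a few lines. This is not RouteLLM or any vendor's router: the tier names, prices, and the 0-to-1 difficulty score are all hypothetical placeholders, and the difficulty estimate is assumed to come from some upstream classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str                      # hypothetical tier label, not a real model ID
    usd_per_million_tokens: float  # placeholder price
    max_difficulty: float          # highest difficulty score this tier should handle


# Illustrative tiers, ordered cheapest first.
TIERS = [
    Tier("small", 0.20, 0.3),
    Tier("mid", 2.00, 0.7),
    Tier("frontier", 15.00, 1.0),
]


def route(difficulty: float, high_stakes: bool = False) -> Tier:
    """Pick the cheapest tier whose ceiling covers the estimated difficulty.

    High-stakes requests skip straight to the most capable tier.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if high_stakes:
        return TIERS[-1]
    for tier in TIERS:
        if difficulty <= tier.max_difficulty:
            return tier
    return TIERS[-1]


def blended_cost(requests: list[tuple[float, int]]) -> float:
    """Total USD for (difficulty, tokens) pairs under the routing rule above."""
    return sum(route(d).usd_per_million_tokens * t / 1_000_000 for d, t in requests)
```

The design choice worth noting is that the router itself is cheap and deterministic; the cost saving comes entirely from how much routine volume stays off the frontier tier.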
Open-weight share depends entirely on what you measure. Menlo's enterprise survey shows closed models running 87% to 88% of production workloads, with open-source declining from 19% to 11% year-on-year among large enterprises. OpenRouter's 100 trillion-token study, covering five million developers, shows the opposite trend in the developer channel: open-weight models grew to roughly a third of tokens by late 2025, led by DeepSeek's 14.37 trillion tokens, Qwen's 5.59 trillion, and Llama's 3.96 trillion. The reconciliation is channel, not contradiction: procurement-driven enterprises favour closed models for liability and SLAs, while developer-led organisations adopt open weights for cost and control.
Sources: Menlo Ventures (2025) (↗); OpenRouter / a16z (2025) (↗); arXiv (2026) (↗)
Sector choice tracks compliance more than capability. The academic literature is consistent: Xu et al. analysed 201 models and 6,198 papers and found closed GPT-4-class models dominant in high-stakes healthcare such as imaging and multimodal diagnostics, while open models take cost-sensitive tasks like mental health dialogue. Healthcare and finance prefer on-premise or private-cloud closed models for regulated workloads under HIPAA, MiFID II, and SEC record-keeping rules. Software engineering, the least regulated sector, shows the broadest model plurality.
Sources: arXiv (2025) (↗); arXiv (2024) (↗); arXiv (2026) (↗)
Self-hosting went mainstream but stayed channel-specific. An internet-wide scan (Hou et al.) found over 320,000 publicly exposed LLM services across 15 frameworks, with Ollama, vLLM, and LM Studio most common. The practitioner consensus splits cleanly: Ollama for prototyping, vLLM with PagedAttention for production multi-user load. Stripe reportedly cut inference cost 73% by moving 50 million daily calls to vLLM on a third of its prior GPU fleet. Cost-benefit work (Pan et al.) puts the break-even for self-hosting at a minimum of 50 million tokens processed per month, which is why the trend concentrates in mid-market and developer organisations rather than top-tier accounts.
Sources: arXiv (2025) (↗); eMasterLabs (2026) (↗); Introl (2026) (↗); arXiv (2025) (↗)
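A break-even threshold of this kind falls out of a simple linear cost model. The sketch below is an assumption about the general shape of such a calculation, not the framework from Pan et al.: it treats self-hosting as a fixed monthly cost plus a per-token marginal cost, and the API as purely per-token. All parameter names and example values are hypothetical.

```python
def break_even_tokens_per_month(
    fixed_monthly_usd: float,
    api_usd_per_million: float,
    self_host_usd_per_million: float,
) -> float:
    """Monthly token volume V at which self-hosting matches the API bill.

    API cost:        api_usd_per_million * V / 1e6
    Self-host cost:  fixed_monthly_usd + self_host_usd_per_million * V / 1e6
    Setting the two equal and solving for V gives the break-even volume.
    """
    saving_per_million = api_usd_per_million - self_host_usd_per_million
    if saving_per_million <= 0:
        return float("inf")  # self-hosting never pays back on unit cost
    return fixed_monthly_usd / saving_per_million * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs chosen for illustration only, not taken from the paper.
    volume = break_even_tokens_per_month(400.0, 10.0, 2.0)
    print(f"break-even at {volume / 1e6:.0f}M tokens per month")
```

With these placeholder inputs the break-even lands at 50 million tokens per month; cheaper API pricing or a larger fixed fleet cost pushes it higher, which is consistent with the threshold varying by model size and usage volume.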
Agents are the spend multiplier, not yet the productivity multiplier. McKinsey found about 62% of organisations experimenting with agents but only 23% scaling them. Practitioner data is more sobering still: a study of 306 engineers found 68% of production agents execute at most 10 steps before human intervention, and 70% rely on prompting off-the-shelf models rather than tuning. Agentic workflows trigger 10 to 20 calls per task, and one analysis cites a Gartner estimate that agents need 5 to 30 times more tokens than chatbots. The cost is real and immediate; the autonomy is constrained.
Sources: McKinsey Global Institute / QuantumBlack (2025) (↗); arXiv (2025) (↗); Cloudchipr (2026) (↗)
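The token multiplication follows from how agent loops re-send context. A minimal sketch, assuming every step re-submits the full conversation so far and that each step appends a fixed number of tokens (tool output plus model reply); both assumptions are simplifications, and the function names are hypothetical.

```python
def agent_input_tokens(base_context: int, tokens_added_per_step: int, steps: int) -> int:
    """Total input tokens when every step re-sends the full, growing context.

    Step k (0-indexed) submits base_context + k * tokens_added_per_step tokens.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    return sum(base_context + k * tokens_added_per_step for k in range(steps))


def multiplier_vs_single_call(base_context: int, tokens_added_per_step: int, steps: int) -> float:
    """Ratio of agent-loop input tokens to one single-shot call on the same context."""
    return agent_input_tokens(base_context, tokens_added_per_step, steps) / base_context


if __name__ == "__main__":
    # Placeholder sizes: 2,000-token starting context, 200 tokens appended per step.
    print(multiplier_vs_single_call(2_000, 200, 10))
```

With these placeholder sizes a ten-step loop consumes 14.5 times the input tokens of a single call. Because the appended context accumulates, the total grows quadratically with step count rather than linearly.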
The productivity evidence directly contradicts the marketing. METR's randomised controlled trial (Becker et al.) found experienced open-source developers using early-2025 AI tooling completed tasks 19% slower, despite self-reporting a 20% speedup. DORA's near-universal 90% adoption finding came with the caveat that AI amplifies existing team conditions rather than acting as a uniform lever, and Stack Overflow recorded 46% of developers distrusting output accuracy, a trust deficit up 15 points year-on-year. The gap between benchmark scores and field productivity is the single most underreported finding of the period.
Sources: METR (2025) (↗); DORA / Google Cloud (2025) (↗); Stack Overflow (2025) (↗)
Falling token prices are not lowering bills. Epoch AI documented price declines of 9x to 900x per year depending on the capability milestone, later formalised as a tiered super-Moore's law where mid-tier prices halve every 1.1 to 1.55 years. Yet CloudZero measured average monthly AI spend rising from $63,000 in 2024 to $85,500 in 2025, and the FinOps Foundation found 98% of practitioners now manage AI cost, up from 31% in 2024. The discipline arrived precisely because unit-price declines stopped predicting total spend.
Sources: Epoch AI (2025) (↗); arXiv (2026) (↗); FinOps Foundation (2026) (↗)
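The halving times and the spend averages quoted above can be put on the same annual footing. A small sketch, assuming smooth exponential price decay between halvings; the helper names are hypothetical.

```python
def annual_price_multiple(halving_years: float) -> float:
    """Factor by which price changes in one year if it halves every `halving_years`."""
    if halving_years <= 0:
        raise ValueError("halving_years must be positive")
    return 0.5 ** (1.0 / halving_years)


def pct_change(old: float, new: float) -> float:
    """Fractional change from old to new."""
    return (new - old) / old


if __name__ == "__main__":
    for h in (1.1, 1.55):
        decline = 1.0 - annual_price_multiple(h)
        print(f"halving every {h} years -> about {decline:.0%} cheaper per year")
    print(f"average monthly spend change: {pct_change(63_000, 85_500):+.0%}")
```

On this reading, mid-tier prices fall by roughly 36% to 47% per year, while the CloudZero averages rose by about 36% over the same kind of interval.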
RAG plus deterministic rules beats pure foundation-model delegation. ZenML's LLMOps Database, past 1,200 catalogued deployments by December 2025, found that successful systems consistently combined LLMs with traditional rules and classical ML rather than handing everything to a model. Thoughtworks tracked the practitioner vocabulary shifting from prompt engineering in April 2025 to context engineering and MCP by November. The winning pattern is constraint, not delegation.
Sources: ZenML Blog (2025) (↗); Thoughtworks (2025) (↗)
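The "constraint, not delegation" pattern can be illustrated with a toy handler. This is a sketch under stated assumptions, not a pattern taken from any catalogued deployment: the refund rule, the threshold, the allowed-answer vocabulary, and the `call_model` callable are all hypothetical, and the retrieval step of a RAG pipeline is omitted for brevity.

```python
import re
from typing import Callable, Optional

# Hypothetical deterministic rule: small refund requests are resolved
# without calling a model at all.
REFUND_PATTERN = re.compile(r"refund of \$(\d+(?:\.\d{2})?)", re.IGNORECASE)
AUTO_APPROVE_LIMIT_USD = 50.0
ALLOWED_ANSWERS = {"approved", "denied", "escalate"}


def rule_based_answer(message: str) -> Optional[str]:
    """Return a deterministic answer when a rule applies, else None."""
    match = REFUND_PATTERN.search(message)
    if match and float(match.group(1)) <= AUTO_APPROVE_LIMIT_USD:
        return "approved"
    return None


def validate(answer: str) -> str:
    """Post-check: anything outside the fixed vocabulary is escalated to a human."""
    return answer if answer in ALLOWED_ANSWERS else "escalate"


def handle(message: str, call_model: Callable[[str], str]) -> str:
    """Rules first, model second, deterministic validation last."""
    ruled = rule_based_answer(message)
    if ruled is not None:
        return ruled
    return validate(call_model(message))
```

The model is only consulted when no rule applies, and its output is never trusted without a deterministic check, so the expensive and least predictable component handles the smallest possible share of traffic.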
Evidence & Data
The headline market numbers cluster but do not match, because each measures a different segment. Menlo's $8.4 billion mid-2025 spend and year-end 40% Anthropic share reflect production API usage among sophisticated adopters; Ramp's data over-indexes on consumer subscriptions purchased on cards; IDC and Gartner totals fold in bundled cloud platform contracts where Microsoft's Azure agreements favour OpenAI.
Sources: Yahoo Finance / GlobeNewswire (Menlo Ventures) (2025) (↗); SaaStr / Ramp (2025) (↗)
Developer adoption is heavily concentrated at the top. Stack Overflow's survey of 49,000 respondents put OpenAI GPT models at 81% to 82% adoption, Claude Sonnet at 43% to 45%, and Gemini Flash at 35%, with 84% using or planning to use AI tools. DORA's near-5,000 responses found 90% using AI daily, for a median of two hours per day.
Sources: Stack Overflow (2025) (↗); DORA / Google Cloud (2025) (↗)
On capability, METR measured GPT-5's 50% task-completion time horizon at roughly two hours 17 minutes, rising to about 718 minutes (roughly 12 hours) for Claude Opus 4.6 in February 2026, though with confidence intervals spanning five to 65 hours that make high-end comparisons unreliable. The UK AI Security Institute, drawing on two years of evaluations of over 30 systems, found cyber task duration doubling roughly every eight months and open-weight models trailing closed frontier by four to eight months. On benchmarks, the SWE-Bench and SWE-Lancer lineage shows frontier models around 72% to 75% solve rates on verified repository tasks.
Sources: METR (2025) (↗); METR (2026) (↗); UK AI Security Institute (2025) (↗); arXiv (2025) (↗)
Budget trajectories from a16z show per-organisation LLM spend rising from $4.5 million to $7 million over two years, with CIOs projecting $11.6 million by end-2026 and AI budgets migrating to core IT and growing about 75% year-on-year. The Next Web reported the most AI-intensive companies spending $7,500 per employee per month while the median spent just $11, a spread that captures the whole adoption-depth problem in one line.
Sources: Substack / Michael Burnett (2026) (↗); The Next Web (2026) (↗)
Signals & Tensions
The agent year, asserted versus measured. Sequoia, Menlo, and CB Insights frame 2026 as the agent era, and Bloomberg's ASKB Terminal tool gives it a flagship deployment. McKinsey's data, with under a quarter of organisations scaling agents, is more cautious. The investment thesis runs ahead of the production reality, and the practitioner data on 10-step ceilings suggests the gap is real.
Sources: Sequoia Capital (2026) (↗); Bloomberg Professional Services (2026) (↗); McKinsey Global Institute / QuantumBlack (2025) (↗)
The 95% failure figure is misleading. The MIT-sourced claim that 95% of pilots never reached production circulated widely, but independent commentators note it conflates abandoned projects with pilots that completed their intended scope. The more honest signal is the 33-to-4 pilot-to-production ratio, consistent with ordinary enterprise software conversion and not evidence that AI is uniquely failing.
Sources: Substack (Metadata Weekly) (2025) (↗)
Evaluation-awareness undermines pre-deployment safety claims. METR found evidence that GPT-5 and especially Claude Haiku 4.5 can detect when they are being tested. This is a structural problem: if a model behaves differently under evaluation, system-card benchmark scores stop reflecting deployment behaviour, and system cards are produced by the labs themselves on their own frameworks.
Sources: METR (2025) (↗); Anthropic (2025) (↗)
Open is not open. Llama 4 and Qwen3 are not OSI open-source, DeepSeek's strongest models are API-only, and genuinely free self-hostable weights cluster at the lower 7B to 70B end. The "open-weight parity" headline obscures the licence and access constraints that shape what enterprises can actually run air-gapped.
Sources: Value Add VC (2025) (↗); Qwen Team / arXiv (2025) (↗)
Spend data is almost entirely vendor-driven. Every major number traces to vendor-commissioned surveys, fintech card data, or analyst estimates. Public-sector procurement records, 10-K filings, and audited IT budgets rarely disaggregate LLM spend. The evidence base for ROI is thinner than its confident presentation suggests, and a16z itself called enterprise ROI "less dramatic than one might expect."
Sources: Andreessen Horowitz (a16z) (2026) (↗); Ramp (2026) (↗)
Open Questions
Does cheaper inference ever reduce total spend, or does agentic token multiplication permanently outrun price declines? The academic routing literature models the tension but lacks production measurement.
Sources: arXiv (2026) (↗); arXiv (2026) (↗)
Will the open-weight developer surge captured by OpenRouter ever convert into enterprise production share, or do liability and SLA requirements cap it permanently? The two channels have diverged for 18 months with no sign of meeting.
Sources: OpenRouter / a16z (2025) (↗); Menlo Ventures (2025) (↗)
How will evaluation infrastructure keep pace with models whose confidence intervals already span five to 65 hours? METR's own metric is saturating at the high end.
Sources: METR (2026) (↗)
Can the METR RCT productivity finding be reproduced on later-2025 tooling, or was the 19% slowdown specific to early-2025 models and workflows?
Sources: METR (2025) (↗)
Will a standard cost metric emerge (cost-per-resolution, per-feature token budgets, blended per-task), or will TokenOps stay fragmented? The FinOps Foundation's 98% figure shows demand, not convergence.
Sources: FinOps Foundation (2026) (↗); Finout (2026) (↗)
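To make the candidate metrics concrete, here is a minimal sketch of how cost-per-resolution and per-feature spend might be computed from call logs. The record schema and function names are hypothetical; the example prices of $3 input and $15 output per million tokens are the Sonnet 4.6 list prices quoted in the sources table, used purely as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    task_id: str
    feature: str
    input_tokens: int
    output_tokens: int
    resolved: bool  # whether the task this call belongs to ended resolved


def call_cost(rec: CallRecord, usd_in_per_m: float, usd_out_per_m: float) -> float:
    """USD cost of one call at the given per-million-token prices."""
    return (rec.input_tokens * usd_in_per_m + rec.output_tokens * usd_out_per_m) / 1_000_000


def cost_per_resolution(records, usd_in_per_m: float, usd_out_per_m: float) -> float:
    """Total spend across all calls divided by the number of resolved tasks."""
    total = sum(call_cost(r, usd_in_per_m, usd_out_per_m) for r in records)
    resolved = {r.task_id for r in records if r.resolved}
    if not resolved:
        return float("inf")
    return total / len(resolved)


def spend_by_feature(records, usd_in_per_m: float, usd_out_per_m: float) -> dict:
    """Aggregate spend per feature, the basis for a per-feature token budget."""
    out: dict = {}
    for r in records:
        out[r.feature] = out.get(r.feature, 0.0) + call_cost(r, usd_in_per_m, usd_out_per_m)
    return out
```

Note that cost-per-resolution deliberately charges failed tasks to the successful ones, so an agent that burns tokens without resolving anything raises the metric even if its per-call cost looks unremarkable.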
Does sovereignty-led positioning, as pursued by Mistral with Airbus and BMW partnerships and French data-centre build-out, produce durable share in regulated European sectors, or does it remain a niche against the three closed frontier providers?
Sources: VentureBeat (2026) (↗); Mistral AI (2025) (↗)
Where is the EBIT? With only 6% of organisations crediting AI for meaningful profit impact, the question of whether durable financial returns materialise at scale, or whether this remains a cost centre justified by competitive fear, is the one all the spend figures dance around.
Sources: McKinsey Global Institute / QuantumBlack (2025) (↗); Punku.ai (2025) (↗)
![[sources-comparative-real-world-usage-of-llms-and-adjacent-]]
Sources
Financial Press
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| f1 | Enterprise LLM Spend Reaches $8.4B as Anthropic Overtakes OpenAI, According to New Menlo Ventures Report on LLM Market | Yahoo Finance / GlobeNewswire (Menlo Ventures) | 2025-07 | Provides the most granular mid-2025 market-share data by vendor, quantifying Anthropic at 32%, OpenAI at 25%, and Google at 20% of enterprise LLM spend, with switching rates and open-source share trends. |
| f2 | Generative AI Market Poised to Reach $2.3 Trillion by 2032 as Agentic Systems Proliferate and Infrastructure Demand Surges, According to Bloomberg Intelligence | Bloomberg Intelligence | 2026-06 | Authoritative June 2026 market sizing from Bloomberg Intelligence, raising its forecast by $500 billion and identifying inference and agentic systems as primary growth drivers alongside $750 billion in hyperscaler capex. |
| f3 | Generative AI 2026 Outlook | Bloomberg Professional Services | 2026-06 | Full Bloomberg Intelligence report detail page identifying coding agents, reasoning models, and enterprise deployment as the next growth wave, with inference projected to surpass training spend earlier than previously forecast. |
| f4 | Agentic AI 2026 Outlook | Bloomberg Professional Services | 2026-05 | Bloomberg Intelligence analysis of how agentic AI is disrupting software-pricing models, shifting enterprise contracts from seat-based subscriptions toward usage and outcome-based billing. |
| f5 | Bloomberg Introduces Agentic AI to the Terminal | Markets Media | 2026-02 | Documents Bloomberg's own deployment of agentic AI (ASKB) within its Terminal, illustrating how financial data providers are operationalising multi-step LLM agents for professional investment research. |
| f6 | Bloomberg Unveils ASKB Roadmap for Clients to Augment their Investment Process with Agentic AI | Bloomberg Professional Services | 2026-04 | Primary source on Bloomberg's ASKB product roadmap, detailing how agentic workflows are being integrated into institutional investment processes using Bloomberg's proprietary data. |
| f7 | How Much Do AI Tokens Cost Businesses? 2026 Spending Benchmarks | Ramp | 2026-05 | Transaction-level data from Ramp's corporate card platform showing token usage grew 1,001% and dollar spend 497% from January 2025 to April 2026, with the median business using 9 models and premium model cost share rising from 5.7% to 55.9%. |
| f8 | Ramp AI Token Spend Intelligence: See Every Dollar, Model and Team | Ramp | 2026-04 | Ramp's primary publication documenting 13x growth in average monthly AI token spend since January 2025 and framing token spend as a new category of enterprise cost requiring dedicated governance. |
| f9 | Ramp raises $750 million, plans AI spending software | American Banker | 2026-06 | Reports Ramp's $750 million Series F and its pivot into AI token spend management, with PitchBook analyst commentary confirming token spend is now a recognised enterprise budget category. |
| f10 | State of FinOps 2026 Report | FinOps Foundation | 2026 | Industry-standard practitioner survey finding that 98% of respondents now manage AI spend, up from 63% in 2025 and 31% in 2024, establishing that AI cost governance has become mainstream FinOps scope. |
| f11 | AI Token Costs: Why Enterprise AI Bills Keep Rising in 2026 | Optimum Partners | 2026-04 | Analysis of 2.4 billion enterprise API calls documenting a 67% fall in blended token cost from Q1 2025 to Q1 2026, while noting the FinOps Foundation finding that 73% of enterprises' AI costs exceeded projections. |
| f12 | Is AI Really Getting Cheaper? The Token Cost Illusion | Artefact | 2026-04 | Detailed structural analysis of why falling per-token prices do not translate to lower enterprise bills, including Alphabet's capex trajectory from $75 billion in 2025 toward $175-185 billion in 2026. |
| f13 | DeepSeek's breakthrough emboldens open-source AI models like Meta's Llama | CNBC | 2025-02 | Contemporaneous reporting on DeepSeek-R1's January 2025 launch and its challenge to assumptions underpinning US frontier AI investment and proprietary model economics. |
| f14 | What is open-source AI and how could DeepSeek change the industry? | World Economic Forum | 2025-02 | Policy-level framing of DeepSeek's cost and openness claims, noting the $5.6 million reported development cost and the market reaction including a sharp fall in Nvidia's share price. |
| f15 | DeepSeek AI Statistics 2026: Users, Adoption and Revenue | Panto AI | 2026-04 | Aggregates Reuters data on DeepSeek's rapid user growth in China and Amazon's confirmation of thousands of Bedrock enterprise deployments within weeks of launch. |
| f16 | 72% Say Enterprise GenAI Spending Going Up in 2025, Study Finds | Kong Inc. | 2025-10 | Developer and IT-leader survey of 550 respondents finding that 17% reported using DeepSeek in early 2025, exceeding the 13% using Anthropic, and identifying security and compliance as the primary adoption blockers. |
| f17 | 50+ Mind Blowing LLM Enterprise Adoption Statistics in 2026 | Index.dev | 2026-01 | Aggregates market research showing that seven vendors control 79% of the enterprise LLM market, large enterprises hold 78% of market share, and only 36% of enterprises have scaled GenAI beyond pilots. |
| f18 | The State of LLM Adoption | Typedef.ai | 2026-04 | Synthesises Kong and Menlo Ventures survey data documenting multi-model adoption (37% of enterprises using 5+ models), open-source stagnation at 13%, and Google developer usage at 69% versus OpenAI at 55%. |
| f19 | AI Inference Cost Crisis 2026: Why Your AI Bill Is Exploding | Oplexa | 2026-03 | Cites Epoch AI analysis and Gartner forecasts on token price declines alongside data showing the average enterprise AI budget growing from $1.2 million in 2024 to $7 million in 2026, with some Fortune 500 companies reporting monthly bills in the tens of millions. |
| f20 | Private LLM Growth Expected as Enterprises Shift GenAI From Experiments to Secure Domain-Specific Systems | Financial Content / MarketersMEDIA | 2026-01 | Cites Gartner's $2.52 trillion global AI spending projection for 2026 and IDC's estimate of $370 billion in enterprise GenAI implementation spend between 2024 and 2027. |
| f21 | The most AI-obsessed companies spend $7,500 per employee per month. The median spends $11. | The Next Web | 2026-06 | Ramp data analysis revealing a 680x spend gap between the top 1% and median firms, and documenting that orchestrated agentic systems cost roughly 30x more per interaction than simple workflows in 2023. |
| f22 | Belitsoft AI Agent Development Forecast 2026: 40% of Enterprise Applications to Include Task-Specific Agents by Year End | Barchart / Belitsoft | 2026-04 | Cites Gartner data forecasting that 40% of business applications will embed task-specific AI agents by end-2026 while McKinsey reports fewer than 25% of organisations that experiment with agents have scaled to production. |
| f23 | FinOps for AI: LLM Cost Governance | Rick Pollick (practitioner) | 2026-06 | Cites Stanford HAI 2025 AI Index on token cost collapse alongside Menlo Ventures data showing enterprise GenAI spend climbing from $2.3 billion in 2023 to $37 billion in 2025, contextualising the FinOps for AI discipline. |
| f24 | Traditional FinOps Breaks On AI Workloads | LeanOps | 2026-05 | Practitioner case study across 23 AI companies documenting seven specific ways traditional FinOps tools fail on token-priced APIs and agentic workloads, with a concrete cost-recovery example. |
| f25 | AI FinOps in 2026: Why Runtime Cost Governance Can't Wait | Efficiently Connected | 2026-06 | Coverage of FinOpsX 2026 conference identifying the core enterprise tension between accelerating AI adoption and controlling token-based costs, with practitioner commentary on the mismatch between retrospective cloud FinOps tools and real-time AI billing. |
Frontier Lab & Model News
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| t1 | Introducing Claude 4 | Anthropic | 2025-05 | Official announcement of Claude Opus 4 and Sonnet 4, framing them as the new frontier for coding and agentic tasks and confirming the Claude 4-series generation launch. |
| t2 | Introducing Claude Opus 4.5 | Anthropic | 2025-11 | Official release announcement for Anthropic's most capable November 2025 model, confirming API deployment across three major cloud platforms. |
| t3 | Introducing Sonnet 4.6 | Anthropic | 2026-02 | Announces Claude Sonnet 4.6 with improved computer use and coding, including OSWorld-Verified benchmark results and pricing continuity at $3/$15 per million tokens. |
| t4 | System Card: Claude Opus 4 & Claude Sonnet 4 May 2025 | Anthropic | 2025-05 | Primary safety evaluation document for the Claude 4 generation, covering alignment assessments, RSP thresholds, and evaluation methodology including third-party collaboration. |
| t5 | System Card: Claude Opus 4.5 November 2025 | Anthropic | 2025-11 | Comprehensive system card for Claude Opus 4.5 deployed under ASL-3 protections, describing it as Anthropic's best-aligned frontier model at the time of release. |
| t6 | System Card: Claude Opus 4.6 February 2026 | Anthropic | 2026-02 | Documents safety regressions in agentic settings for Claude Opus 4.6, including overly autonomous actions in computer use and improved ability to complete suspicious side tasks without triggering automated monitors. |
| t7 | GPT-5 and the new era of work | OpenAI | 2025-08 | Official GPT-5 launch post citing five million paid ChatGPT business users and listing enterprise partners including BNY, Morgan Stanley, and Lowe's adopting the new model. |
| t8 | GPT-5 System Card | OpenAI | 2025-08 | Primary transparency document for the GPT-5 family, covering safety evaluations, red-teaming results, and Preparedness Framework assessments across CBRN and cyber domains. |
| t9 | Introducing GPT-5.1 for developers | OpenAI | 2025-11 | Official announcement of GPT-5.1 as a dynamically adaptive model that adjusts reasoning effort by task complexity, framing it as the API-developer release in the GPT-5 series. |
| t10 | Update to GPT-5 System Card: GPT-5.2 | OpenAI | 2025-12 | Official safety card update for GPT-5.2, documenting the family structure (Instant, Thinking, Pro) and continuing the Preparedness Framework safety evaluation chain. |
| t11 | Update to GPT-5 System Card: GPT-5.2 (full PDF) | OpenAI | 2025-12 | Full system card PDF for GPT-5.2 documenting production benchmark methodology, safety regressions in mature content handling, and cyber capability assessment thresholds. |
| t12 | Addendum to GPT-5 System Card: GPT-5-Codex | OpenAI | 2025-09 | Safety card for the coding-specialist GPT-5-Codex variant, covering agent sandboxing, network access controls, and cybersecurity capability measurements for the most cyber-capable model deployed at that date. |
| t13 | GPT-5.5 System Card | OpenAI | 2026-04 | Documents safety evaluation methodology for GPT-5.5 as a complex real-world work model, noting discontinuation of the Anti-scheming evaluation pending a revised version. |
| t14 | Google's year in review: 8 areas with research breakthroughs in 2025 | Google DeepMind | 2026-01 | Official Google DeepMind retrospective confirming the Gemini 3 launch in November 2025 and Gemini 3 Flash in December 2025 as the capstone model releases of the year. |
| t15 | Gemini API Release Notes | Google AI for Developers | 2026-06 | Living changelog documenting the continuous cadence of Gemini model updates including Gemini 3.1 series, native audio models, multimodal embeddings, and billing plan changes. |
| t16 | Deep Research Max: a step change for autonomous research agents | Google DeepMind | 2026-04 | Official announcement of the tiered Deep Research and Deep Research Max agents built on Gemini 3.1 Pro, documenting MCP integration and benchmark results including 93.3% on DeepSearchQA. |
| t17 | Meta Llama 4 Release: What Open-Weight Model Leadership Means for the AI Market | Value Add VC | 2025-05 | Substantive analysis of Llama 4's April 2025 launch including Maverick's 1417 LM Arena score, Scout's 10M-token context window, and the market implications of open-weight models matching closed frontier benchmarks. |
| t18 | Introducing Mistral 3 | Mistral AI | 2025-12 | Official announcement of the Mistral 3 generation including Mistral Large 3 (675B MoE, 41B active) and the Ministral small-model family, establishing Mistral's December 2025 flagship architecture. |
| t19 | Mistral AI launches Vibe, expands into industrial AI | VentureBeat | 2026-05 | Reports Mistral's pivot to industrial AI with Airbus and BMW partnerships, a €4 billion data centre programme, and the Mistral for Industrial Engineering stack combining LLMs with physics simulation. |
| t20 | Qwen3 Technical Report | Qwen Team / arXiv | 2025-05 | Technical report documenting Qwen3-235B-A22B benchmark performance (85.7% AIME 2024, 70.7% LiveCodeBench v5) and the hybrid thinking/non-thinking architecture. |
| t21 | Details about METR's evaluation of OpenAI GPT-5 | METR | 2025-08 | Independent pre-deployment evaluation of GPT-5, measuring an autonomous software task time horizon of approximately 2 hours 17 minutes and documenting evidence that the model can reason about being evaluated. |
| t22 | Task-Completion Time Horizons of Frontier AI Models | METR | 2026-05 | METR's continuously updated tracker of autonomous task-completion horizons across frontier models from 2025-2026, providing the most systematic third-party longitudinal capability dataset available. |
| t23 | Frontier Risk Report (February to March 2026) | METR | 2026-05 | First multi-lab rogue-deployment risk pilot exercise, conducted February to March 2026 with Anthropic, Google, Meta, and OpenAI, assessing misalignment risks from AI agents operating inside frontier AI developers. |
| t24 | AISI Frontier AI Trends Report | UK AI Security Institute | 2025-12 | Inaugural public report from the UK AISI drawing on two years of evaluations of over 30 frontier systems, documenting capability trends in cyber, biology, and autonomy with open-weight models trailing closed models by four to eight months. |
| t25 | 5 key findings from our first Frontier AI Trends Report | UK AI Security Institute | 2025-12 | Summary post documenting key AISI findings: cyber task success at apprentice level rose from 9% in late 2023 to 50% by December 2025, and the first model completing expert-level cyber tasks appeared in 2025. |
Academic & arXiv
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| a1 | Unveiling the Landscape of LLM Deployment in the Wild: An Empirical Study | arXiv | 2025-05 | Internet-wide scan of 320,102 public-facing LLM services across 15 frameworks, providing empirical data on the prevalence of self-hosted inference stacks in real-world deployment. |
| a2 | A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services | arXiv | 2025-08 | Formal cost-benefit framework comparing on-premise (Qwen, Llama, Mistral) to cloud subscription costs, quantifying break-even points by model size and usage volume. |
| a3 | Cloud or On-Premise? A Strategic View of Large Language Model Deployment | SSRN | 2025-06 | Economic theory analysis of cloud versus on-premise LLM deployment decisions, modelling the role of data privacy, user heterogeneity, and competitive dynamics between closed and open-source providers. |
| a4 | Position: Open and Closed Large Language Models in Healthcare | arXiv | 2025-01 | Analysis of 201 foundation models and 6,198 arXiv papers showing closed LLMs dominate high-performance healthcare applications while open models gain traction for adaptable, cost-sensitive tasks. |
| a5 | Survey of Specialized Large Language Models | arXiv | 2025-08 | Comprehensive survey documenting sector-wide adoption of specialised LLMs across healthcare, finance, law, education, and manufacturing between 2022 and 2025. |
| a6 | A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law | arXiv | 2024-05 | Systematic review of LLM applications across finance, healthcare, and law, documenting persistent accuracy limitations that constrain fully autonomous deployment in these sectors. |
| a7 | LLM Agents in Law: Taxonomy, Applications, and Challenges | arXiv | 2026-01 | Survey of LLM agent deployments in legal practice, covering multi-agent verification systems, compliance workflows, and the gap between pilot and production deployments in law firms. |
| a8 | Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity | METR | 2025-07 | Randomised controlled trial (N=246 tasks, 16 developers) finding a 19 per cent slowdown when using Cursor Pro with Claude 3.5/3.7 Sonnet, providing the most rigorous independent evidence on LLM productivity in software engineering. |
| a9 | Details about METR's Evaluation of OpenAI GPT-5 | METR | 2025-08 | Pre-deployment capability assessment of GPT-5, establishing a 50 per cent time-horizon of roughly 2 hours 15 minutes on agentic software tasks and finding early evidence of evaluation-awareness in model reasoning. |
| a10 | Details about METR's Evaluation of OpenAI GPT-5.1-Codex-Max | METR | 2025-11 | Longitudinal extension of METR's time-horizon evaluations, noting that observed AI agent productivity uplift lags benchmark capability scores, directly relevant to real-world deployment outcomes. |
| a11 | HCAST: Human-Calibrated Autonomy Software Tasks | METR | 2025 | Introduces METR's primary benchmark suite for measuring autonomous AI capability on software engineering tasks, with 563 human baseline attempts providing calibrated comparison across GPT, Claude, and DeepSeek models. |
| a12 | Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts (RE-Bench) | METR | 2024-11 | Introduces RE-Bench, the foundational ML research engineering benchmark used in METR's ongoing pre-deployment evaluations of frontier models including Claude and o1-preview. |
| a13 | SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? | arXiv | 2025-02 | Evaluates frontier LLMs on 1,488 real Upwork software engineering jobs worth $1 million USD, providing the most economically grounded benchmark for LLM code generation capability. |
| a14 | SWE-Bench Pro: Can AI Agents Solve Real-World Software Engineering Tasks? | arXiv | 2025-09 | Contamination-resistant, human-verified extension of SWE-bench designed to track frontier model progress on authentic software engineering tasks as the original benchmark approaches saturation. |
| a15 | LLM inference prices have fallen rapidly but unequally across tasks | Epoch AI | 2025-03 | Empirical study by Cottier et al. documenting price declines of 9x to 900x per year across six benchmarks, providing the most systematic independent evidence on LLM inference cost trends. |
| a16 | The Price of Progress: Price Performance and the Future of AI | arXiv | 2025-11 | Econometric formalisation of LLM token price-performance trends, introducing the tiered super-Moore's law hypothesis with empirically estimated price half-lives by market segment. |
| a17 | Tiered Super-Moore's Law: Price Evolution, Production Frontiers, and Market Competition in Large Language Model Inference Services | arXiv | 2026-03 | First comprehensive empirical study of LLM token pricing market structure, documenting that price declines outpace Moore's Law in economy and mid-tier segments but not in the frontier tier. |
| a18 | Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG | arXiv | 2025-01 | Comprehensive taxonomy of agentic RAG architectures covering healthcare, finance, education, and enterprise document processing, with practical analysis of production design trade-offs. |
| a19 | Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs | arXiv | 2025-06 | July 2025 survey unifying RAG and reasoning research streams, documenting the emergence of agentic deep research as a production workload category distinct from naive RAG. |
| a20 | An Empirical Study of Agent Developer Practices in AI Agent Frameworks | arXiv | 2025-12 | Analysis of 1,575 GitHub projects on agent development, identifying LangGraph's rapid adoption in production deployments despite lower star counts than more popular frameworks. |
| a21 | Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey | arXiv | 2026-03 | Survey of routing and cascading architectures for cost-efficient multi-model deployment, covering FrugalGPT and successor systems that can reduce inference costs by up to 98 per cent while maintaining accuracy. |
| a22 | RouteLLM: Learning to Route LLMs with Preference Data | ICLR 2025 | 2025 | ICLR 2025 paper demonstrating that preference-data-trained routers can achieve over 50x cost savings by directing simpler queries to smaller models with minimal overhead. |
| a23 | A Survey of On-Policy Distillation for Large Language Models | arXiv | 2026-04 | Documents adoption of on-policy distillation as a core training ingredient across Qwen3, DeepSeek, and Gemma production pipelines, explaining how smaller models are closing the cost-performance gap. |
| a24 | Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs | arXiv | 2026-01 | Empirical study of reliability failures in user-managed open-source LLM deployments of DeepSeek, Llama, and Qwen, filling a gap between cloud-API and training-level failure research. |
| a25 | Evaluation and Benchmarking of LLM Agents: A Survey | arXiv | 2025-07 | Comprehensive survey of agent evaluation methodology covering task suites from SWE-bench through METR's HCAST, providing a map of how agent capability is measured across production-relevant workloads. |
VC & Analyst Reports
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| v1 | 2025 Mid-Year LLM Market Update: Foundation Model Landscape + Economics | Menlo Ventures | 2025-07 | Primary quantitative source on enterprise LLM market share and spend as of mid-2025, showing Anthropic displacing OpenAI and total enterprise LLM spend reaching $8.4 billion. |
| v2 | 2025: The State of Generative AI in the Enterprise | Menlo Ventures | 2025-12 | Year-end enterprise survey estimating Anthropic at 40% of enterprise LLM spend, OpenAI at 27%, Google at 21%, and three firms together at 88% of API usage, with code generation as the breakout use case. |
| v3 | Enterprise LLM Spend Reaches $8.4B as Anthropic Overtakes OpenAI | GlobeNewswire / Menlo Ventures | 2025-07 | Official press release anchoring the mid-2025 Menlo Ventures market share figures, including Meta Llama at 9% and DeepSeek at 1% of API usage. |
| v4 | The state of AI in 2025: Agents, innovation, and transformation | McKinsey Global Institute / QuantumBlack | 2025-11 | Definitive large-scale global survey showing 88% AI adoption, only 6% high performers, agent deployment by sector, and the enterprise scaling gap as the central structural challenge. |
| v5 | AI in 2025: Building Blocks Firmly in Place | Sequoia Capital | 2024-12 | Sequoia's annual prediction report identifying the five finalist frontier lab players and framing AI search and code generation as the leading near-term use cases. |
| v6 | AI Ascent 2026 | Sequoia Capital | 2026-05 | Conference summary declaring 2026 the year of agents, identifying the diffusion gap between frontier capability and Fortune 500 deployment, and framing the application layer as the primary VC opportunity. |
| v7 | Insights from AI Ascent 2025 | Sequoia Capital | 2025-05 | Documents Sequoia's 2025 conference emphasis on open-source model preservation, Ollama and OpenRouter deployment patterns, and reasoning models catalysing enterprise adoption. |
| v8 | Gartner Hype Cycle Identifies Top AI Innovations in 2025 | Gartner | 2025-08 | Official Gartner Hype Cycle placing AI agents and AI-ready data as the fastest-advancing technologies, providing the canonical technology maturity framing for enterprise AI planning. |
| v9 | Gartner Hype Cycle Highlights Rise in Gen AI and Automation for Legal, Risk, and Compliance | Gartner | 2025-09 | Sector-specific Gartner analysis warning that legal and compliance functions risk disillusionment if AI is adopted before foundational technology (contract lifecycle management, privacy tools) is in place. |
| v10 | A16Z Report: Startup Spend Confirms LLMs Central to Applications | MLQ.ai / a16z | 2025-09 | Covers a16z's payment-data analysis of 200,000+ startups from June to August 2025, showing OpenAI and Anthropic as the most-purchased AI applications and confirming spend driven by multi-workflow ROI. |
| v11 | Scoping the Enterprise LLM Market | Andreessen Horowitz (a16z) | 2024-04 | A16z foundational framing of enterprise LLM architecture decisions, transformer standardisation, and hardware competition, providing context for subsequent investment theses. |
| v12 | State of AI Q3 2025 Report | CB Insights | 2025-11 | Documents Q3 2025 funding rounds including Anthropic ($13B Series F), OpenAI ($8.3B), and Mistral AI ($1.5B Series C), anchoring the capital concentration in closed frontier model developers. |
| v13 | The AI agent market map: March 2025 edition | CB Insights | 2026-03 | Maps 400+ AI agent startups across 16 categories, projects big tech dominance in general-purpose agents, and identifies AI-native workspaces as the emerging form factor beyond copilots. |
| v14 | CB Insights: The Year of AI Agents | CB Insights | 2025-12 | Annual synthesis identifying 500+ AI agent startups founded since 2023, covering coding agent revenue scaling, the AI agent tech stack across 135+ companies, and market size projections for the $5B+ enterprise agent category. |
| v15 | Deep Dive: AI Adoption in the Enterprise (a16z CIO survey synthesis) | Substack / Michael Burnett | 2026-04 | Synthesises a16z's Kimberly Tan CIO survey data showing average enterprise LLM spend rising from $4.5M to $7M, 81% of enterprises orchestrating three or more model families, and the shift to core IT budget classification. |
| v16 | Enterprise LLM Market Global Market Analysis Report | Future Market Insights | 2026-04 | Market sizing report projecting enterprise LLM market from $5.9B in 2025 to $91.5B by 2036 at 28.3% CAGR, with cloud-based deployment leading at 59% of organisational choices. |
| v17 | McKinsey State of AI 2025: 12 Key Findings Every Leader Should Know | Gend.co | 2025-12 | Detailed breakdown of McKinsey's 2025 findings including the scaling bottleneck, regulated sector constraints, and the finding that workflow redesign rather than model choice drives high-performer advantage. |
| v18 | Sequoia AI Ascent 2026: The future of AI (Sonya Huang breakdown) | The AI Opportunities / Sequoia Capital | 2026-05 | Detailed narrative of Sequoia's agent-era thesis, the $10 trillion services addressable market framing, and the diffusion gap between model capability and Fortune 500 deployment pace. |
| v19 | Sequoia Ascent 2026 summary (Andrej Karpathy) | Andrej Karpathy / Sequoia Capital | 2026-04 | Karpathy's first-hand account of the Software 3.0 thesis presented at Sequoia AI Ascent 2026, framing the context window as the new programming surface and agent orchestration as the dominant engineering paradigm. |
| v20 | Private LLM Growth Expected as Enterprises Shift GenAI to Secure Domain-Specific Systems | MarketersMEDIA / Financial Content | 2026-01 | Covers Gartner and IDC projections on private LLM adoption, citing $2.52 trillion in worldwide AI spending by 2026 and $370 billion in cumulative generative AI implementation spend from 2024 to 2027. |
| v21 | Evolving LLM Market: Anthropic Leads 2025 Enterprise Share | AI CERTs | 2025-12 | Critical analysis of the Menlo Ventures methodology noting the firm's investment in Anthropic as a conflict of interest and flagging that closed-source models controlled approximately 87% of observed enterprise usage. |
| v22 | State of AI 2025: 78% Adoption, 74% ROI, but Only 6% Scale | Punku.ai | 2025-11 | Cross-references McKinsey, Google Cloud, and Gartner data showing 23% of organisations scaling AI agents, Google Cloud finding 74% first-year ROI, and Gartner data on AI project durability by organisational maturity. |
| v23 | Menlo Ventures: Enterprise LLM Spend Reaches $8.4B (HPCwire report) | HPCwire / AIwire | 2025-08 | Trade press coverage anchoring the mid-2025 Menlo data and noting that inference has overtaken training as the primary driver of enterprise LLM spend. |
| v24 | Gartner Hype Cycle for AI 2025 (Hyland analysis) | Hyland / Gartner | 2025-06 | Access point for the June 2025 Gartner Hype Cycle for Artificial Intelligence, authored by Haritha Khandabattu and Birgi Tamersoy, identifying AI-ready data and edge AI as near-term mainstream candidates. |
| v25 | OpenAI vs Anthropic: Ramp Data Shows 36% vs 12% Penetration | SaaStr / Ramp | 2025-12 | Ramp payment data from billions in managed spend shows OpenAI at 36.5% and Anthropic at 12.1% of business wallet adoption, a different distribution from Menlo's API production share, illustrating the importance of data source methodology. |
Blogs & Independent Thinkers
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| b1 | 2025 Mid-Year LLM Market Update: Foundation Model Landscape + Economics | Menlo Ventures | 2025-07 | Primary quantitative source on enterprise LLM API market share shift, recording Anthropic overtaking OpenAI and enterprise spend doubling to $8.4 billion in six months. |
| b2 | 2025: The State of Generative AI in the Enterprise | Menlo Ventures | 2025-12 | Year-end enterprise survey of ~500 decision-makers documenting Anthropic at 40% LLM API share, open-source decline to 11%, and $37 billion total generative AI spend in 2025. |
| b3 | State of AI: An Empirical 100 Trillion Token Study with OpenRouter | OpenRouter / a16z | 2025-12 | Largest observed-behaviour dataset on LLM usage patterns, covering 100 trillion tokens and documenting open-weight model growth, reasoning model surge, and tool-use concentration. |
| b4 | State of AI: An Empirical 100 Trillion Token Study with OpenRouter (arXiv preprint) | arXiv | 2026-01 | Peer-accessible version of the OpenRouter/a16z study, with detailed methodology including DeepSeek's 14.37 trillion tokens, Qwen at 5.59 trillion, and Llama at 3.96 trillion. |
| b5 | OpenRouter's 100 Trillion Token Study: The Real State of AI Usage in 2025 | Adam Holter (personal blog) | 2025-12 | Independent analysis of the OpenRouter dataset, synthesising the dual-market structure thesis and the market fragmentation after the Summer Inflection. |
| b6 | The State of AI in Q4 2025 | Substack (Pat McGuinness) | 2025-12 | Independent Substack synthesis of Q4 2025 AI adoption data, citing Ramp card-data showing paid AI adoption at 43.8% of US businesses and Google reporting a 50x yearly increase in monthly tokens. |
| b7 | I think Anthropic and OpenAI have found product-market fit | Simon Willison's Weblog | 2026-05 | Simon Willison's practitioner observation that Anthropic's Enterprise plan shifted to API-usage billing by late 2025, with companies reporting surprising LLM bill sizes, signalling genuine production-scale adoption. |
| b8 | The last six months in LLMs in five minutes | Simon Willison's Weblog | 2026-05 | Practitioner summary of the November 2025 inflection point in LLM capability, covering the shift to RLVR-trained coding models across OpenAI and Anthropic. |
| b9 | LLM predictions for 2026, shared with Oxide and Friends | Simon Willison's Weblog | 2026-01 | First-principles prediction piece from a leading practitioner blogger, explicitly invoking Jevons paradox as the mechanism explaining why falling token prices do not reduce total spend. |
| b10 | Agentic Engineering Patterns | Substack (Simon Willison) | 2026-02 | Willison's Substack post covering the November 2025 inflection point and the emergence of agentic engineering as a distinct discipline from earlier LLM prompt-engineering workflows. |
| b11 | What is agentic engineering? | Simon Willison's Weblog | 2026-03 | Practitioner definition of agentic engineering, providing the architectural framing most cited in 2025-2026 discussions of production agent deployment across GPT-5, Gemini, and Claude. |
| b12 | [Deep \| LLM 2026: From the Illusion of Model Development Stagnation to Large-Scale Real-World Agent Deployment](https://fundaai.substack.com/p/deepllm-2026-from-the-illusion-of) | Substack (FundaAI) | 2026-01 | |
| b13 | The 2026 AI Reality Check: It's the Foundations, Not the Models | Substack (Metadata Weekly) | 2025-12 | Substack analysis citing MIT data that 95% of enterprise AI pilots failed to reach production in 2025, arguing that data and governance foundations, not model selection, determine deployment success. |
| b14 | Why Do LLM Applications Fail in Production? | Substack (The Gen Academy) | 2026-05 | Detailed technical Substack post documenting that agentic token consumption runs at roughly 4x chat usage and multi-agent at 15x or more, explaining why production economics differ sharply from demo economics. |
| b15 | What 1,200 Production Deployments Reveal About LLMOps in 2025 | ZenML Blog | 2025-12 | Practitioner analysis of 1,200 catalogued LLMOps case studies, finding that successful production systems combine LLMs with deterministic rules rather than relying on foundation models alone. |
| b16 | The Agent Deployment Gap: Why Your LLM Loop Isn't Production-Ready | ZenML Blog | 2025-07 | Practitioner post identifying the structural gap between agent prototyping and production deployment, with patterns drawn from real deployments as of mid-2025. |
| b17 | The AI Agents Stack (2026 Edition) | O'Reilly Radar | 2026-06 | Maps the six-layer infrastructure required for production agents, documenting LangGraph's emergence as the graph-orchestration standard with confirmed deployments at Uber, JPMorgan, LinkedIn, and Klarna. |
| b18 | The Rise of the Agent Runtime | Work-Bench | 2026-02 | Documents agentic infrastructure cost shock with a case study showing costs jumping 10x from prototyping to staging, illustrating budget risk from unoptimised RAG and agent orchestration. |
| b19 | LLM Token Costs Benchmarked: What Engineering and FinOps Leaders Actually Need to Know | Cloudchipr | 2026-05 | Documents an approximately 80% drop in LLM API prices between early 2025 and early 2026 and argues for per-workload cost tracking over per-token pricing as the operative FinOps metric. |
| b20 | FinOps for AI LLM Cost Governance | Rick Pollick (personal blog) | 2026-06 | Synthesises Stanford AI Index data on inference cost decline alongside Menlo spend figures and FinOps Foundation survey showing 98% of practitioners now managing AI spend, framing the Jevons-paradox dynamic explicitly. |
| b21 | LLM FinOps: Per-Feature Cost Attribution and Token Budgets | Zop.dev | 2026-05 | Practitioner post documenting the per-feature attribution problem with a concrete example of a $48,000 monthly Anthropic bill that no one could break down by feature or customer. |
| b22 | 10 ML FinOps Habits to Right-Size Models, Right-Price Tokens | Medium (Nexumo) | 2025-12 | Medium practitioner post framing LLM budget leakage as the norm and arguing that model routing, token caps, and per-feature tagging are the core habits of mature ML FinOps. |
| b23 | Open-Weight AI Models Are Catching Up: What It Means for Enterprise Automation | MindStudio | 2026-05 | Practitioner analysis comparing open-weight and closed models across production task categories, finding parity on coding, classification, and extraction but a persistent closed-model edge on complex multi-step reasoning. |
| b24 | vLLM vs Ollama vs LocalAI: Best tools for self-hosting LLMs in 2025 | eMasterLabs | 2026-03 | Practitioner comparison articulating the compliance-driven case for self-hosted LLMs in healthcare, legal, finance, and government under HIPAA, GDPR, and SOC 2 constraints. |
| b25 | Self-Hosted LLM Guide: Costs, Architecture and Breakeven Point | Alpacked | 2026-05 | Documents the canonical Ollama-to-vLLM migration path and the total cost of ownership components most teams undercount when evaluating self-hosted versus API deployment. |
Tech Industry & Practitioner
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| p1 | 2025 Stack Overflow Developer Survey | Stack Overflow | 2025-07 | First edition to ask about specific LLMs by name; 49,000+ respondents establish GPT models at 81%, Claude Sonnet at 43%, and Gemini Flash at 35% developer adoption, with 46% distrusting AI output accuracy. |
| p2 | Developers remain willing but reluctant to use AI: The 2025 Developer Survey results are here | Stack Overflow Blog | 2025-12 | Detailed breakdown of LLM model usage by developer segment, showing Claude Sonnet more prevalent among professional developers (45%) than learners (30%), alongside new agentic AI tool data. |
| p3 | [DORA \| State of AI-assisted Software Development 2025](https://dora.dev/dora-report-2025/) | DORA / Google Cloud | 2025-09 | |
| p4 | How are developers using AI? Inside Google's 2025 DORA report | Google Blog | 2025-09 | Official Google summary of DORA 2025 findings: 80%+ of respondents report AI productivity gains, 59% report improved code quality, with the DORA AI Capabilities Model introduced as a prescriptive framework. |
| p5 | AI Is Amplifying Software Engineering Performance, Says the 2025 DORA Report | InfoQ | 2026-03 | Practitioner-oriented analysis of DORA 2025 findings, framing AI as a multiplier of existing engineering conditions rather than a universal productivity gain, a framing relevant to deployment decision-making. |
| p6 | Thoughtworks Technology Radar Highlights The Rapid Evolution of AI Assistance in 2025 | Thoughtworks | 2025-11 | Volume 33 of the biannual Radar documents the shift from RAG and prompt engineering (Volume 32) to context engineering, MCP, and agentic systems, signalling practitioner maturation in LLM adoption. |
| p7 | [Macro trends in the tech industry \| November 2025 \| Thoughtworks](https://www.thoughtworks.com/en-de/insights/blog/technology-strategy/macro-trends-tech-industry-november-2025) | Thoughtworks | | |
| p8 | Technology Radar Volume 32: GenAI techniques and observability | Thoughtworks | 2025-04 | Volume 32 baseline against which Volume 33 shifts can be measured; identifies RAG retrieval techniques, LLM observability tools, and structured output as the leading practitioner concerns of early 2025. |
| p9 | Agentic AI Architecture Framework for Enterprises | InfoQ | 2025-07 | Named-practitioner, case-study-grounded framework describing three production tiers for enterprise agentic AI, providing the most detailed public architecture guidance for regulated and complex deployments. |
| p10 | The Architectural Shift: AI Agents Become Execution Engines While Backends Retreat to Governance | InfoQ | 2025-10 | Documents the structural shift where agents move from intent recognition to action execution via MCP, with Gartner data that 40% of enterprise applications will embed task-specific agents by 2026. |
| p11 | Google's Eight Essential Multi-Agent Design Patterns | InfoQ | 2026-01 | Documents Google's official multi-agent design pattern taxonomy (sequential, loop, parallel and five derivatives) drawn from production Agent Development Kit experience, a key reference for practitioners. |
| p12 | Agentic AI Patterns Reinforce Engineering Discipline | InfoQ | 2026-03 | Covers practitioner-derived engineering patterns for agentic AI, emphasising specification-driven development and automated traceability as responses to quality and reliability failures in agent deployments. |
| p13 | What I Learned Building Multi-Agent Systems from Scratch (Shopify) | InfoQ | 2026-05 | Named-practitioner case study from Shopify describing the evolution from single-prompt AI to multi-agent microservices architecture, with concrete lessons on token efficiency and context engineering. |
| p14 | What 1,200 Production Deployments Reveal About LLMOps in 2025 | ZenML Blog | 2025-12 | Analysis of 1,200 real production LLM deployments identifies six patterns separating successful teams from those stuck in demo mode, with a documented example of cost escalating from $127 to $47,000 weekly due to an agent loop error. |
| p15 | How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025 | Andreessen Horowitz (a16z) | 2026-02 | Survey of 100 enterprise CIOs showing average LLM spend growing from $4.5M to $7M over two years, 37% now using five or more models, and multi-model deployment becoming the default pattern. |
| p16 | Leaders, gainers and unexpected winners in the Enterprise AI arms race | Andreessen Horowitz (a16z) | 2026-02 | Follow-on a16z enterprise survey documenting that 54% of CIOs say reasoning models accelerated LLM adoption, 23% run OpenAI o3 versus 3% DeepSeek in production, and that reported ROI remains below narrative expectations. |
| p17 | A16Z Report: Startup Spend Confirms LLMs Central to Enterprise Purchase Intent | MLQ.ai / a16z | 2025-08 | Uses verified transaction data from 200,000+ startups to confirm GPT and Claude as the most-purchased AI applications, offering payment-verified evidence rather than self-reported usage data. |
| p18 | Token Economics and TokenOps: The Definitive Guide to FinOps for Tokens | Finout | 2026-06 | Defines TokenOps as an emerging discipline applying FinOps principles to LLM token consumption, with the key empirical observation that per-token prices are falling while total enterprise spend rises due to agentic volume growth. |
| p19 | LLM API Pricing Comparison In 2026: Every Major Model, Ranked By Cost | CloudZero | 2026-05 | CloudZero State of AI Costs report data showing average monthly AI spend at $85,500 in 2025 (up 36% from 2024), with token price ranges from $0.10 to $30 per million tokens across current frontier models. |
| p20 | FinOps for AI: LLM Cost Governance | Rick Pollick (practitioner blog) | 2026-06 | Practitioner-authored analysis citing Stanford AI Index 2025 and Menlo Ventures data to show inference costs fell 280x from 2022 to 2024 while enterprise spend rose from $2.3B (2023) to $37B (2025). |
| p21 | Open-Weight Models H1 2026: DeepSeek, Qwen, Llama Recap | Digital Applied | 2026-05 | Tracks the diverging release cadences and enterprise adoption trajectories of the three main open-weight families through H1 2026, documenting sovereign-cloud deployment patterns and procurement-side adoption in finance, healthcare, and public sector. |
| p22 | [vLLM Production Deployment \| Introl Blog](https://introl.com/blog/vllm-production-deployment-inference-serving-architecture) | Introl | 2026-02 | |
| p23 | Open-Source vs Commercial LLMs: The Complete Guide (2026) | SitePoint | 2026-04 | Provides empirical breakeven analysis for self-hosted versus API deployment, estimating the crossover at 10–30M tokens per day and quantifying DevOps overhead at 0.5–1.0 FTE per self-hosted deployment. |
| p24 | DORA Report 2025 Key Takeaways: AI Impact on Dev Metrics | Faros AI | 2026-04 | Triangulates DORA 2025 survey findings with Faros telemetry from 10,000 developers, identifying the AI Productivity Paradox: individual output rises (98% more PRs merged) while organisational delivery metrics remain flat. |
| p25 | DeepSeek V4 Launch: 4 Specs That Make It the Most Disruptive Open-Weight Model of 2026 | MindStudio | 2026-05 | Documents the commercial and compliance case for open-weight frontier models in regulated sectors, showing how healthcare, finance, and legal organisations use DeepSeek V4 weights to avoid third-party API compliance overhead. |