Choosing a GenAI Deployment Strategy

Integrating Generative AI into our business processes is one of the most important imperatives of 2024. As we have covered previously, there are reasons to approach AI at a more fundamental level than individual use cases, and the state of AI is on the cusp of transitioning from an engineering phase to a design phase.

Nonetheless, it seems timely to summarize some of the key areas, approaches and options available today, when putting theory into practice.

The tech stack for getting LLMs or extractive models working reliably with a corpus of documents and business domains is only a small part of the overall solution space. In approaching the challenge of building or integrating GenAI into our applications, we consider the following:

  1. Where will AI be used: what is the size of the business, the deployment approach, and the dev resources available?
  2. How will AI be used: foundation LLM, smaller model, or COTS point solution?
  3. What AI considerations apply: regulatory & legal compliance, governance, safety, technical risks?
  4. Why should we choose a particular AI deployment model and architecture over another?

This article focuses specifically on Generative AI using LLMs; there are related uses of AI for images, video and audio production which follow similar principles, but slightly different model technologies.

1. Where will AI be used?

The target environment into which AI will be deployed can broadly be summarized into the following three options:

  1. Building AI into a business application stack, often cross-department in an Enterprise. Usually this is undertaken by the office of the CIO or CTO in a larger organization, and involves significant transformation of underlying platform services and functions.
  2. AI-enabling specific functions within a particular silo or functional area of the business. This can be conducted within an Enterprise as an entry point for AI, or perhaps in a smaller mid-market or SMB organization where the scale of the work is smaller.
  3. Enabling the use of AI for individual team members, often for personal productivity use cases. Nearly all businesses will have some employees who have used AI in applications like ChatGPT, MS Office, Windows, or Gmail – whether the company has policies or restrictions in place or not.

2. How will AI be deployed?

When choosing how to use AI, there are broadly the following options depending on build / buy / lease / reuse preferences, complexity of use case, regulatory constraints, and dev team capability and capacity:

  1. Building with a foundation LLM, perhaps in conjunction with a specific knowledge base of information or corpus of documents, using components that can reduce hallucinations (e.g. RAG architectures or grounded generation, often with a vector database and/or graph database). These models are usually hosted by a hyperscaler or major model provider (OpenAI, MS Azure, AWS Bedrock, Google Vertex AI, etc.) and accessed via APIs. The application deployment architecture is often hybrid, for example connecting local databases, functional flows, and serverless components that are hosted in our own private cloud (or even on-prem) or via a 3rd party AI cloud provider in a BYOC (Bring Your Own Cloud) environment.
  2. Use a smaller, often Open Source / Open Access, language model. This includes models such as Mistral / Mixtral, Meta’s LLaMA, TII Falcon, Google Gemma and others. These models can practically be self-hosted, or accessed via 3rd party AI cloud providers (as well as via the hyperscalers themselves).
  3. Use an AI-enabled point solution as a standalone product. These are usually COTS SaaS products that solve specific problems, and are generally easy to integrate and deploy, without requiring bespoke development efforts.
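
The RAG pattern from option 1 above can be sketched in a few lines. Everything here is an illustrative stand-in: the toy corpus, the bag-of-words "embedding", and the prompt format are placeholders for a real embedding model, a vector database, and the hosted LLM call that a production system would use.

```python
import math
from collections import Counter

# Toy corpus standing in for a business document store; in practice these
# documents would live in a vector or graph database.
CORPUS = {
    "doc1": "refund policy: refunds are issued within 30 days of purchase",
    "doc2": "shipping policy: orders ship within 2 business days",
    "doc3": "privacy policy: personal data is never shared with third parties",
}

def _bow(text):
    """Bag-of-words vector; a crude stand-in for an embedding model."""
    return Counter(text.lower().split())

def _cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, k=1):
    """Return the IDs of the k documents most similar to the query."""
    q = _bow(query)
    ranked = sorted(CORPUS.items(),
                    key=lambda kv: _cosine(q, _bow(kv[1])),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def build_prompt(query):
    """Grounded-generation prompt: retrieved context plus the user question.
    The resulting string would be sent to the LLM, constraining its answer
    to the retrieved documents and so reducing hallucinations."""
    context = "\n".join(CORPUS[d] for d in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The essential design point is that retrieval happens before generation: the model only ever sees context that was fetched from the corpus, which is what makes the output auditable against source documents.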

3. What AI considerations apply?

The above considerations focus on where and how AI can be used. When deciding what will be used, further considerations apply:

  1. What are the regulatory and legal compliance constraints that might apply? For example, if working in a highly regulated domain (e.g. federal government) or dealing with sensitive personal information (e.g. medtech, healthcare, financial services) then data residence / sovereignty may be important, as well as risk of leakage of IPR into future training sets. This may be a reason to choose a smaller, self-hosted or VPC deployed model where there is more control over the model, rather than using OpenAI natively. (MS Azure can offer more assurances re. data protection, but still with potential downsides as a hyperscaler, not least may be cost)
  2. What are the governance and safety obligations for your use case? Incorporating version management, guardrails, governance and observability into the actions of the AI is always important – and may be essential.
  3. How unique is your business domain and language used in your content? If particularly bespoke, then fine-tuning or training a smaller model may be required to achieve the performance needed, rather than using a foundation model + RAG (or other approaches). One side effect benefit may be that once a solution has been put into production, you may have a fine-tuned model that can be commercially licensed to others (or released as an open source model for the benefit of the wider community).
  4. What is the risk of temporal degradation or drift for your model? LLMs are not a ‘fire and forget’ solution; once in production, regular testing and validation, and potentially regular fine-tuning (or model substitution), may be needed. This is generally less of an issue when using the larger foundation models or point solutions, where the burden of model maintenance lies with the provider. But testing and verification are always needed, and it is possible that performance (accuracy / safety) may degrade over time.
  5. What are the OpEx and resource (energy, carbon footprint, water consumption etc.) targets? Smaller models, particularly quantised ones, can run on much lower-spec GPU rigs (and in some cases can perform well enough on CPU-only infrastructure). Sending large quantities of tokens to a foundation LLM hosted by a hyperscaler is usually slow and expensive, particularly at high transaction rates. There are usually more options for resource optimisation when a smaller LLM is used, particularly on a private AI cloud provider.
  6. What policies are needed to ensure access to AI is sufficiently controlled to safeguard end users, staff, and the core business? Companies may have policies which change over time, and being able to have multiple versions of an AI toolchain, for compliance with past, current and future policies, may be needed.
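
The regular testing and validation mentioned in point 4 can be as simple as a fixed "golden set" of prompts re-run against the model on a schedule, alerting when accuracy drops below a threshold. The sketch below is illustrative: `call_model` is a hypothetical placeholder for whichever hosted or self-hosted LLM endpoint is actually in use, and the pass criteria and threshold are assumptions to adapt to the use case.

```python
# Golden set: fixed prompts with expected substrings in the answer.
GOLDEN_SET = [
    {"prompt": "What is 2 + 2?", "must_contain": "4"},
    {"prompt": "What is the capital of France?", "must_contain": "Paris"},
]

def call_model(prompt):
    """Placeholder for the production LLM call (API or self-hosted).
    Canned answers here let the harness run standalone."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "What is the capital of France?": "Paris is the capital of France.",
    }
    return canned.get(prompt, "")

def run_eval(golden_set, threshold=0.95):
    """Score the model against the golden set.

    Returns (accuracy, healthy); a False `healthy` flag would trigger an
    alert, a fine-tuning cycle, or a model substitution."""
    passed = sum(1 for case in golden_set
                 if case["must_contain"] in call_model(case["prompt"]))
    accuracy = passed / len(golden_set)
    return accuracy, accuracy >= threshold
```

Run on a schedule (and after any model or prompt change), this gives an early-warning signal for drift without requiring a full MLOps platform on day one.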

4. Why should we choose a particular AI deployment model?

With the above in mind, the reasons why we would choose a particular AI technology and deployment option can be varied. The above considerations would generate a large number of permutations, but we can summarize a few ‘serving suggestions’ as follows:

| Situation | Dev team & CapEx budget | Governance | OpEx budget | IPR / data sensitivity | Specificity of business domain | Example deployment architecture |
| --- | --- | --- | --- | --- | --- | --- |
| Enterprise | Large | Heavy | Large | High | General | MS Azure VPC foundation LLM + integrated governance, guardrails, policy, version management |
| Mid-market | Medium | Light | Medium | Low | Narrow | Small LLM on VPC or private cloud |
| Personal productivity | Low | Low | Low | Medium | General | Vetted point solutions with consent of IT team |


This is just scratching the surface of the rationale behind the where, what, how, and why of choosing a specific GenAI deployment model. For guidance on the GenAI application stack tool options available, feel free to get in touch.