1. Introduction
AI products are systems that exist at the intersection of human behavior, data constraints, model performance, governance, and business value. Building them well requires analytical discipline, architectural fluency, and a capacity to translate ambiguity into a sequence of testable decisions.
This document outlines, at a very high level, the end-to-end method I use whenever I am presented with a vague business problem and tasked with creating a solution, whether it involves Agentic AI, LLMs, classical ML, analytics, or traditional software. It reflects years of work across enterprises with varying levels of AI maturity, startups, and small- and medium-sized businesses. It's designed to show not only what I do, but how I think.
2. Phase One: Clarify and Frame the Problem
Before all else, I ask: What is the exact business problem to be solved? Framing the business problem and need clearly and accurately is the most critical part of the entire AI product development process. If this framing is wrong, the whole project will be off-base.
The reason is simple: every subsequent decision flows from the problem frame. If you misdiagnose the problem, you'll build the wrong solution, and no amount of technical excellence downstream will fix it. You can have a perfectly architected RAG pipeline, flawless evals, and robust monitoring, but if you're solving the wrong problem, it's all wasted effort.
2.1 Establish the Real Goal, Not the Stated Request
Every engagement begins with separating symptoms from causes. Stakeholders often present solutions disguised as problems ("We need an AI chatbot"), so I guide the conversation toward:
- What decision or behavior actually needs to change
- Who is affected, and why it matters now
- What failure looks like today
- What constraints define the solution space
I distill this into a crystal-clear, one-sentence problem hypothesis that becomes the anchor for all later decisions.
2.2 Validate Who the User Actually Is
AI is often inserted into workflows without accounting for human variation. I map:
- Primary users: their personas, contexts, and journeys
- Proxy users: their personas, contexts, and journeys
- Affected partners (legal, compliance, operations)
- Hidden constraints (skills, incentives, risk tolerance)
Clarity here prevents significant downstream rework.
2.3 Quantify the Cost of the Problem
The question is always "How expensive is the status quo?" I measure:
- Time lost
- Error rates
- Operational cost
- Revenue leakage
- Compliance risk
- Customer experience degradation
This creates a value envelope for the solution: what the AI must justify.
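To make that envelope concrete, I often rough out the annualized cost of the status quo in a few lines. A minimal sketch in Python; every figure below is a hypothetical placeholder to be replaced with measured values:

```python
# Rough annualized cost of the status quo; all figures are hypothetical
# placeholders to be replaced with measured values from the engagement.
HOURS_LOST_PER_WEEK = 120        # manual effort across the affected team
LOADED_HOURLY_RATE = 65.0        # fully loaded cost per hour (USD)
ERRORS_PER_MONTH = 40            # defects reaching customers or downstream systems
COST_PER_ERROR = 250.0           # rework, credits, escalation handling
ANNUAL_REVENUE_LEAKAGE = 90_000  # estimated revenue lost to the problem

labor_cost = HOURS_LOST_PER_WEEK * LOADED_HOURLY_RATE * 52
error_cost = ERRORS_PER_MONTH * COST_PER_ERROR * 12
status_quo_cost = labor_cost + error_cost + ANNUAL_REVENUE_LEAKAGE

print(f"Annualized cost of the status quo: ${status_quo_cost:,.0f}")
# This figure is the value envelope: the solution's build and run cost
# must sit comfortably inside it.
```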
3. Phase Two: Diagnose Feasibility
3.1 Data Readiness Assessment
Before AI is even discussed, I evaluate:
- Whether sufficient labeled or unlabeled data exists
- Data accessibility (APIs, lakes, schemas, silos)
- Data quality and semantic consistency
- Gaps requiring data generation or synthetic augmentation
- Privacy and regulatory constraints
This determines whether the solution is feasible with AI, or whether traditional software or process redesign is more appropriate.
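Much of this assessment is interviews and judgment, but the data-quality portion can start with a quick automated profile. A minimal pandas sketch, assuming the candidate dataset fits in memory; the file path and column names in the usage comment are hypothetical:

```python
import pandas as pd

def profile_readiness(df: pd.DataFrame, required_cols: list[str]) -> dict:
    """Quick first-pass data readiness profile: coverage, nulls, duplicates."""
    missing_cols = [c for c in required_cols if c not in df.columns]
    null_rates = df.isna().mean().sort_values(ascending=False)
    return {
        "rows": len(df),
        "missing_required_columns": missing_cols,
        "duplicate_row_rate": float(df.duplicated().mean()),
        "worst_null_rates": null_rates.head(5).to_dict(),
    }

# Usage (path and column names are hypothetical):
# df = pd.read_parquet("claims_2023.parquet")
# print(profile_readiness(df, required_cols=["claim_id", "description", "outcome"]))
```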
3.2 Technical Fit: LLM, Agentic AI, RAG, ML, Low-Code, or Software?
This is highly simplified, but my essential decision tree is:
3.2.1 Foundation LLM vs. LLM + RAG
I evaluate whether the task requires:
- World knowledge (LLM-only)
- Enterprise-specific knowledge (LLM + RAG)
- Factual precision requiring document-grounded retrieval
- Traceability for regulated workflows
RAG is selected when the model must remain tightly anchored to proprietary data with auditable provenance.
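To illustrate what auditable provenance means in practice, here is a deliberately simplified sketch: the retriever is a toy keyword-overlap scorer standing in for a real embedding index, and the key point is that every retrieved chunk carries a source identifier that is threaded into the prompt and logged with the answer.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source_id: str   # document / section identifier used for audit trails
    text: str

def retrieve(query: str, corpus: list[Chunk], k: int = 3) -> list[Chunk]:
    """Toy retriever: rank chunks by keyword overlap with the query.
    A production system would use an embedding index instead."""
    q_tokens = set(query.lower().split())
    scored = [(len(q_tokens & set(c.text.lower().split())), c) for c in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]

def build_grounded_prompt(query: str, chunks: list[Chunk]) -> str:
    """Assemble a prompt that forces the model to answer only from cited context."""
    context = "\n".join(f"[{c.source_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the context below. Cite source IDs in brackets.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

corpus = [
    Chunk("policy-7.2", "Refunds over $500 require manager approval."),
    Chunk("faq-12", "Standard refunds are processed within five business days."),
]
chunks = retrieve("How long do refunds take?", corpus)
prompt = build_grounded_prompt("How long do refunds take?", chunks)
# `prompt` is sent to the LLM; the retrieved source_ids are logged with the
# response so every answer can be traced back to the documents that grounded it.
```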
3.2.2 Agentic AI
I evaluate whether the task requires:
- Multi-step reasoning
- Tool use (APIs, search, calculators)
- Conditional branching
- Autonomous workflows
If so, an agentic architecture (single agent or orchestrated multi-agent) may be the correct pattern. I ensure safety boundaries: maximum thinking depth, tool restrictions, and deterministic checkpoints.
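A minimal sketch of those safety boundaries, assuming hypothetical `call_model`, `call_tool`, and `request_approval` callables injected for the real model, tool layer, and review step: the loop caps reasoning depth, rejects tools outside an explicit allowlist, and routes high-impact actions through a deterministic human checkpoint.

```python
MAX_STEPS = 6                                   # maximum thinking / tool-use depth
ALLOWED_TOOLS = {"search_kb", "get_order", "calculator"}
CHECKPOINT_TOOLS = {"issue_refund"}             # deterministic human-approval gate

def run_agent(task: str, call_model, call_tool, request_approval) -> str:
    """Bounded agent loop. The three callables are placeholders for the
    real model, tool executor, and human-review layer."""
    history = [{"role": "user", "content": task}]
    for step in range(MAX_STEPS):
        action = call_model(history)            # -> {"type": "tool" | "final", ...}
        if action["type"] == "final":
            return action["content"]
        tool = action["tool"]
        if tool not in ALLOWED_TOOLS | CHECKPOINT_TOOLS:
            history.append({"role": "system", "content": f"Tool '{tool}' is not permitted."})
            continue
        if tool in CHECKPOINT_TOOLS and not request_approval(tool, action["args"]):
            return "Stopped: action requires human approval that was not granted."
        result = call_tool(tool, action["args"])
        history.append({"role": "tool", "content": str(result)})
    return "Stopped: maximum reasoning depth reached without a final answer."
```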
3.2.3 Low-Code or No-Code Automation
If the solution is primarily workflow automation, notifications, integrations between SaaS systems, or routing of structured events, then a combination of platforms like n8n, Make, Zapier Enterprise, or Airplane may deliver faster, cheaper value than custom engineering or data science involvement.
3.2.4 Classical ML vs. Software
Classical ML remains the best fit for prediction, ranking, scoring, classification, and anomaly detection. Traditional software is selected when processes are rule-bound and deterministic, or when regulatory tolerance for probabilistic outcomes is low.
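When the task is plain prediction or classification, a scikit-learn baseline is often the quickest "simplest technology" test: if a logistic regression on existing features gets close to the target metric, heavier approaches are probably unnecessary. A minimal sketch, assuming a tabular feature matrix `X` and binary labels `y` are already available:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def baseline_classifier(X, y):
    """Train and report a simple baseline before considering anything heavier."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    return model
```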
I choose the simplest technology that reliably achieves the desired outcome. Complexity is never the goal.
3.3 Build vs. Buy Evaluation
3.3.1 The Decision Logic
I recommend building when the capability:
- Is core to competitive differentiation
- Requires deep customization or domain-specific tuning
- Must integrate tightly with internal systems
- Carries compliance or auditability obligations vendors cannot meet
- Has long-term reuse across multiple lines of business
I recommend buying when:
- Time to value is critical
- The capability is commoditized (summaries, extraction, embeddings)
- Internal engineering or DS capacity is limited
- Vendor SLAs and governance controls meet enterprise needs
- Total cost of ownership is lower when amortized over time
3.3.2 Simplified 5-Criteria Decision Matrix
| Criterion | Build Favors | Buy Favors |
|---|---|---|
| Strategic Differentiation | Proprietary workflows, domain specificity | Commodity capabilities |
| Time to Value | Long runway allowed | Immediate delivery required |
| Integration Complexity | Deep integration with legacy systems | API-level integration sufficient |
| Data Sensitivity & Governance | Strict auditability, on-prem needs | Vendor meets compliance controls |
| Total Cost of Ownership | Reuse across products justifies investment | Vendor amortizes R&D cost |
4. Phase Three: Frame the Solution
4.1 Co-Create a Shared Vision with Engineering and Data Science
Before writing anything formal, I conduct short technical framing sessions to align on:
- What we are building
- What we are not building
- What assumptions we are making
- Where uncertainty still exists
- What early experiments are possible
This reduces friction and creates psychological ownership across teams.
4.2 Architect the System Conceptually
For AI products, the architecture defines the product's behavior. I outline:
- Foundation model options
- Retrieval pipeline and vector database strategy
- Agent orchestration patterns (if applicable)
- Fine-tuning or prompt-engineering requirements
- Integration points with upstream and downstream systems
- Human-in-the-loop pathways
- Guardrails and safety layers (rate limiting, structured output, schema enforcement, deterministic checkpoints)
- Evaluation and monitoring loops
- Logging and traceability requirements
This architecture becomes the conceptual blueprint that anchors the PRD and accelerates feasibility discussion.
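One guardrail worth showing concretely is schema enforcement on model output. A minimal sketch using Pydantic (v2 API), assuming the model has been instructed to return JSON matching the schema; anything that fails validation is rejected or retried rather than passed downstream. The schema fields themselves are illustrative.

```python
from pydantic import BaseModel, Field, ValidationError

class TriageDecision(BaseModel):
    """Contract the LLM's JSON output must satisfy before downstream use."""
    category: str = Field(pattern="^(billing|technical|account|other)$")
    confidence: float = Field(ge=0.0, le=1.0)
    summary: str = Field(max_length=500)

def parse_or_reject(raw_output: str) -> TriageDecision | None:
    """Validate raw model output against the schema; reject (or retry) on failure."""
    try:
        return TriageDecision.model_validate_json(raw_output)
    except ValidationError as err:
        print(f"Schema violation, routing to fallback: {err.error_count()} errors")
        return None
```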
4.3 Decide What "Good" Looks Like
Reliable measurement is paramount to ongoing success. I define both user-centric and system-centric success metrics (a simple release-gate sketch follows this list):
- Accuracy thresholds
- Drift tolerance
- Feedback loop performance
- Latency expectations
- Reliability SLAs
- Reduction in operational effort
- Expected ROI over time
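A minimal sketch of how those definitions become an automated release gate; every threshold below is a hypothetical placeholder that would be agreed with stakeholders during framing:

```python
from dataclasses import dataclass

@dataclass
class SuccessCriteria:
    """Release gate derived from the agreed definition of 'good'.
    All thresholds are hypothetical placeholders."""
    min_accuracy: float = 0.90
    max_p95_latency_ms: float = 2500
    max_cost_per_request_usd: float = 0.03
    min_weekly_active_users: int = 50

    def passes(self, measured: dict) -> bool:
        return (
            measured["accuracy"] >= self.min_accuracy
            and measured["p95_latency_ms"] <= self.max_p95_latency_ms
            and measured["cost_per_request_usd"] <= self.max_cost_per_request_usd
            and measured["weekly_active_users"] >= self.min_weekly_active_users
        )
```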
5. Phase Four: Experiment and De-Risk
5.1 Build Fast, Instrumented Experiments
Before committing to a full PRD, I create 1 to 3 high-leverage experiments:
- A prompt-engineering sandbox
- A retrieval pipeline prototype
- A small labeled test set for early evals
- Synthetic data probes
- Error-mode analysis
- LLM-as-Judge evaluations to scale early testing when human annotation is too slow
These experiments validate the most fragile assumptions early.
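The LLM-as-Judge item above deserves a concrete illustration. A minimal sketch, assuming a hypothetical `judge_llm` callable that returns the judge model's raw text; the rubric and 1-to-5 scale are illustrative, and a slice of judged items is still spot-checked by humans to calibrate the judge.

```python
JUDGE_RUBRIC = """You are grading an assistant's answer.
Question: {question}
Reference notes: {reference}
Answer: {answer}
Score 1-5 for factual accuracy and completeness. Reply with only the number."""

def judge_answers(test_set: list[dict], judge_llm) -> float:
    """Score each (question, reference, answer) item with a judge model and
    return the mean score. `judge_llm` is a placeholder for a real model call."""
    scores = []
    for item in test_set:
        prompt = JUDGE_RUBRIC.format(**item)
        reply = judge_llm(prompt).strip()
        try:
            scores.append(min(5, max(1, int(reply))))
        except ValueError:
            scores.append(1)   # an unparseable judgment counts as a failure
    return sum(scores) / len(scores)
```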
5.2 Evaluate Against Reality, Not Hope
I test for:
- Failure patterns
- Hallucination types
- Robustness across edge cases
- Misalignment with user expectations
- Cost-performance ratios
- Latency risks
If the idea fails here, we pivot, saving months of downstream effort.
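Two of these, cost-performance ratios and latency risks, are cheap to measure early. A minimal sketch that wraps any model call and summarizes latency and estimated spend; `call_fn` is a placeholder returning the response text plus token counts, and the default per-token prices are illustrative, not any vendor's real pricing.

```python
import time
import statistics

def measure(call_fn, prompts, usd_per_1k_input=0.003, usd_per_1k_output=0.015):
    """Run `call_fn` over a prompt set and summarize latency and estimated cost.
    `call_fn` is a placeholder returning (text, input_tokens, output_tokens)."""
    latencies, cost = [], 0.0
    for p in prompts:
        start = time.perf_counter()
        _, tokens_in, tokens_out = call_fn(p)
        latencies.append(time.perf_counter() - start)
        cost += tokens_in / 1000 * usd_per_1k_input + tokens_out / 1000 * usd_per_1k_output
    return {
        "p50_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "est_cost_per_call_usd": cost / len(prompts),
    }
```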
5.3 Present a Feasibility Assessment to Stakeholders
I summarize what is viable, what is risky, what must change, and what the architecture will require to scale. This is the moment when a stakeholder begins to "see" the product.
6. Phase Five: Formalize the PRD
6.1 Write Requirements That Bridge Worlds
A strong AI PRD translates ambiguity into structured intention. My PRDs include, at a very high level (please see actual PRDs at bensweet.ai for more details):
- Business context
- Problem framing
- Product vision
- Goals, success metrics, and KPIs
- User persona and user journeys
- Architecture overview (I create full architecture specs with the engineering team and append them)
- Data and model requirements/considerations
- RAG or agent design (if applicable)
- Acceptance criteria and test cases
- Evaluation methodology
- Monitoring and fallback design
- Nonfunctional requirements (latency, security, scalability)
- Benefits/impacts to non-technical groups in plain English
- Release plan and effort estimates
The PRD becomes a shared reference point rather than a static artifact.
6.2 Sequence Delivery Through Iterative Risk Reduction
I design the roadmap in ascending order of uncertainty:
- Baselines
- Retrieval and data pipelines
- Model selection or tuning
- Interaction design
- Monitoring instrumentation
- Closed-beta testing
- Progressive rollout
This ensures predictable progress even in uncertain domains.
7. Phase Six: Partner for Implementation
7.1 Work as a Translational Layer
I ensure alignment across:
- Engineering
- Data Science
- Architecture
- Security
- Compliance
- Operations
- Executive sponsors
I make decisions transparent and reduce ambiguity that slows execution.
7.2 Maintain a Live Model of System Behavior
Small deviations early become systemic failures later. I:
- Watch real-time logs
- Analyze model errors
- Monitor drift
- Refine prompts or tuning based on empirical feedback
- Ensure continuous alignment with user intent
The solution becomes a living system, one we observe, not merely deploy.
8. Phase Seven: Measure and Scale
8.1 Validate Impact Quantitatively and Qualitatively
I measure:
- Whether users actually adopt the product
- Whether the model behaves reliably under real conditions
- Whether operational time decreases
- Whether error rates recede
- Whether costs remain within target ranges
This is where AI stops being academic and becomes economic.
8.2 Create Feedback Loops for Continuous Improvement
Every AI system experiences data and concept drift (the input distribution and the input-outcome relationship change over time) and behavioral drift (the model's actions change as prompts, models, or tools are updated). To maintain performance, I implement:
- Automated eval sets
- Human-in-the-loop review cycles
- Structured feedback collection
- Regression tests for prompt or model changes
- Telemetry and monitoring dashboards
Continuous tuning sustains real value.
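One item above, regression tests for prompt or model changes, is the easiest to neglect, so here is a minimal pytest-style sketch. It assumes a hypothetical `run_pipeline` entry point and `score_fn` scorer supplied as fixtures, plus a frozen eval set at an illustrative path; any change that drops quality below the recorded baseline fails the build.

```python
import json

BASELINE_SCORE = 0.87                           # score recorded for the shipped prompt/model
EVAL_SET_PATH = "evals/frozen_eval_set.jsonl"   # hypothetical frozen test set

def test_prompt_change_does_not_regress(run_pipeline, score_fn):
    """Re-run the frozen eval set against the candidate prompt/model and
    fail if quality drops below the shipped baseline."""
    with open(EVAL_SET_PATH) as f:
        eval_set = [json.loads(line) for line in f]
    outputs = [run_pipeline(item["input"]) for item in eval_set]
    score = score_fn(outputs, [item["expected"] for item in eval_set])
    assert score >= BASELINE_SCORE, f"Regression: {score:.3f} < baseline {BASELINE_SCORE}"
```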
8.3 Scale Horizontally Across Use Cases
If the product demonstrates stable value, I extend its architecture:
- Additional workflows
- New user groups
- Expanded retrieval corpora
- Additional agents
- Automated training data pipelines
Scaling is planned, not accidental.
9. My Guiding Principles
- Simplicity first, AI second.
- Every model is guilty until proven reliable.
- Architectures must be explainable to non-technical audiences.
- Human users determine whether the model is valuable, not metrics alone.
- AI product management is the craft of turning uncertainty into a sequence of learning loops.
- A PM's job is to make better decisions possible, not to win arguments.
10. Conclusion
This process—diagnose, frame, experiment, design, deliver, measure, and scale—is how I navigate the intersection of human needs, engineering constraints, and model behavior, and how I consistently transform ambiguous business challenges into workable, trustworthy, economically sound AI products.