support@aleksamiskovic.com
Services/AI Integration
Service · AI Integration

AI features that survive production.

RAG pipelines, agent workflows, fine-tuned classifiers, and LLM features inside the products you already ship. Eval harnesses from day one, hallucination guards designed in, cost monitoring wired before launch. Scoped to your roadmap.

The stack.

Anthropic and OpenAI for frontier work, open models on Together or your own GPUs when latency, cost, or data residency demand it. Pinecone or pgvector for vector search, with a reranker on top where retrieval quality earns the extra hop.

Python or Node for the orchestration layer, Postgres for application data, Redis for cache and rate limiting. Evals in Promptfoo or Inspect, observability through LangSmith, Langfuse, or your existing stack. Boring, instrumented, reproducible.

Five phases,
planning to handoff.

01

Discovery & Scoping

AI projects fail in scoping more often than in implementation. We open with a working session that pressure-tests the use case: is the task actually probabilistic, what is the cost of a wrong answer, and is there a deterministic system that solves it for less. We are happy to talk you out of the AI build if the AI build is wrong.

We define the eval set before we write code. What does "good" look like, on what inputs, scored how. Without an eval harness you cannot tell whether the next prompt change improved or degraded the system, and most teams discover this the hard way.

Phase ends with a fixed-fee SOW, an eval plan, a written stance on data residency and provider choice (Anthropic, OpenAI, on-prem), and a cost model for production traffic. We run the math on tokens, latency, and per-conversation cost before the first prompt is written.

Deliverables

  • Signed SOW with success criteria and eval set
  • Provider and data-residency decision (Anthropic, OpenAI, on-prem)
  • Cost model: tokens, latency, per-conversation budget
  • RAG vs fine-tune vs prompt-only decision with reasoning
02

Architecture & Design

We commit to the AI architecture here: RAG with vector search (pgvector or Pinecone), agent workflow with tool use, fine-tuned classifier on top of a small open model, or a thin wrapper around a frontier model. Each path has a different cost, latency, and maintenance profile, and we pick the one that fits your traffic, not the one in last week's launch post.

Design covers the human side: how the user knows the system is uncertain, where the citation goes, how a wrong answer is reported, and what fallback fires when the model times out or refuses. AI features that hide their seams produce trust failures that are very expensive to recover.

Engineering produces a spec covering retrieval pipeline, prompt templates with versioning, the eval harness scaffolding, observability hooks, and the cost-monitoring strategy. Hallucination guards are designed in at this stage, not patched in later.

Deliverables

  • Architecture spec: retrieval, prompt templates, fallback paths
  • Hallucination guards: citations, refusal patterns, confidence display
  • Eval harness scaffolding ready to receive first run
  • Observability and cost-monitoring plan
03

Build

Build runs in regular sprints. Every sprint ends with a deployed preview, a fresh eval-set run, and a written diff showing how the latest change moved each metric: accuracy, refusal rate, hallucination rate on the adversarial subset, p50 and p95 latency, and dollars per thousand requests.

Prompts are versioned in code, never in a vendor's web UI. Every prompt change opens a pull request, runs the full eval set in CI, and is reviewed like any other change. Vibes are not a deployment process.

We instrument from the first commit: token counts per request, retrieval hit rate, tool-call error rate, and a sample of full conversations stored for later review. You see the numbers in a dashboard, not in a sprint demo deck.

Deliverables

  • Versioned prompts, in-repo, peer-reviewed
  • Eval-set run in CI on every prompt or model change
  • Live dashboards: tokens, latency, hit rate, cost
  • Conversation sampling for offline review
04

Testing & QA

Automated evals cover the happy path and the adversarial set: prompt injections, jailbreak attempts, off-topic inputs, language switching, and the inputs your most argumentative users will eventually send. We score on accuracy, faithfulness to retrieved context, refusal correctness, and latency budget.

Manual QA puts the system in front of real reviewers, ideally subject-matter experts from your team. They rate samples blind against the previous version, and we ship only when blind preference is non-trivially in favor of the new build. Self-eval by the model is supplementary, never primary.

Security and safety review covers prompt-injection defense, PII redaction in logs and retrieval, output filtering, rate limiting, and a deterministic fallback for when the model refuses or fails. If the feature touches regulated data, we run the relevant compliance checklist (GDPR, HIPAA, SOC2 scope) and document the controls.

Deliverables

  • Adversarial eval set: injections, jailbreaks, off-topic, multi-lingual
  • Blind subject-matter-expert review pre-launch
  • Prompt-injection and PII-redaction review
  • Deterministic fallback paths and rate limits
05

Launch & Handoff

We ship behind a feature flag and a percentage rollout. First one percent of traffic, watch the dashboards, then ten, then the rest. If accuracy or cost moves the wrong way, the flag flips back without a deploy. AI features are easier to roll out than to roll forward through.

Monitoring is live: per-request cost, p50 and p95 latency, refusal rate, error rate, and a sample of flagged conversations routed to a human review queue. Alerts cover cost spikes, provider outages, and accuracy drift detected against a held-out eval slice.

Operational handoff covers a runbook for monitoring, prompt changes, model upgrades, and provider failover. We continue to own and maintain the prompts, evals, and code, ship updates, and respond to defects. The 30-day warranty covers any defect against spec, and we are reachable for prompt-injection incidents around the clock.

Deliverables

  • Percentage rollout behind feature flag
  • Cost, latency, refusal, and accuracy-drift monitoring
  • Human review queue for flagged conversations
  • Prompt-and-model upgrade runbook, 30-day warranty

Typical timeframe

Custom to scope

Focused feature; longer for full platforms.

Build cost

Free

We don't charge for the build; provider costs passed through.

Monthly maintenance

Subscription

You pay a monthly fee while we keep prompts, evals, and the integration shipping.

Real questions,
answered straight.

Who owns the prompts, evals, and code?

We do, as the agency. The prompts, evals, model weights, and code all sit in our repository, and we maintain them long-term. We do not retain a copy of your data after the engagement ends.

What if we don't know whether to use RAG or fine-tune?

Most teams do not, and the answer is usually RAG plus a small reranker, with fine-tune reserved for narrow classification or style problems. We make the call in discovery based on your data, your latency budget, and how often the underlying knowledge changes. Fine-tune buys you nothing if your data updates weekly.

Can you take over an existing AI build?

Yes. We start with a paid one-week audit covering the prompt corpus, the eval set (if there is one), the retrieval pipeline, observability, and cost. We deliver a written report on what to keep, what to refactor, and where the system is paying for tokens it does not need to spend.

Do you offer maintenance?

Yes, and you want it. Models deprecate, providers change pricing, and the adversarial input space evolves. The retainer covers model upgrades with eval re-runs, prompt iteration, drift monitoring, cost optimisation, and incident response on prompt-injection or PII leaks. AI features are not ship-and-forget.

Ready to ship?

Bring a rough brief or a half-built prototype. Real engineers on the call, no account managers.