I Gave AI Agents Real Jobs and Now I’m Basically a CTO With No Employees

A few months ago, I wrote about building an MVP gym app in 2 weeks with AI. That was cute. What I’ve been doing since then is something else entirely.


A while back, I built a gym tracking app—the whole thing, from idea to working iOS app, in about two weeks. I wrote about it, felt pretty good, installed it on my phone, and started using it at the gym.


And then I actually used it. At the gym. With sweaty hands. Between sets. When I was tired and just wanted to log my reps and move on.


That’s when reality hit.


The app worked. Technically, everything functioned. But it took 19 taps to change the weight by 95 pounds. There was no way to see what I lifted last time. The buttons were too small. I had to scroll past all my completed sets just to find the “Next Exercise” button. And if I accidentally tapped “Finish Workout”? Gone. No confirmation. Session saved. No undo.


The MVP was fine for a demo. But it completely failed the real test: can I actually use this thing when I’m exhausted, dehydrated, and have 60 seconds of rest before my next set?


The answer was no.


So I decided to do what any reasonable person would do. I decided to rebuild the entire thing from scratch, but this time, I wouldn’t just ask AI to write code. I’d build an actual team.


The team I built (none of them are human)

Here’s the thing about building a product properly: you don’t just need a developer. You need different perspectives looking at the same problem. A designer sees things an engineer misses. A domain expert catches things that both of them overlook. A data architect thinks about problems that won’t show up until you have 200 workouts in your history and the app freezes.


So I created four specialized agents in Claude Code:


A UI/UX Designer, who thinks about layouts, tap targets, information hierarchy, and what you can actually read on a phone screen from arm’s length while resting between sets.


A Swift Developer, who handles architecture, SwiftData, state management, and performance. The technical backbone.


A Fitness Expert, someone who actually goes to the gym and knows that a 5-pound increment is a 25% jump on 20-pound lateral raises, and that’s not okay.


A Data Architect, who thinks about schemas, relationships, what data you’re storing but never surfacing, and what happens when you delete an exercise that’s referenced in 50 past workouts.


And then an orchestrator sitting on top — basically the tech lead who coordinates all four, resolves their disagreements, and produces a unified output.


I gave them all a simple task: audit my app. Tell me everything that’s wrong with it. Don’t hold back.


They all ran in parallel. Four agents, simultaneously reading through my entire codebase from four completely different professional perspectives.


What came back was honestly humbling.


When four experts independently agree on your #1 problem

All four agents, working independently, flagged the exact same thing as the most critical issue: no “last session” data during an active workout.


Think about it. You’re at the gym. You’re doing bench press. The app shows you your target: 155 lbs x 6 reps. Great. But what did you actually do last week? Was it 155 x 6? Or 150 x 5? Did you struggle? Should you go heavier today?


The app had no idea. And neither did you.


Every major competitor (Strong, Hevy, JEFIT) shows this data inline. It’s the single most important piece of information during a workout, and my app just… didn’t have it.


The designer flagged it because it was a missing UI element. The fitness expert flagged it because progressive overload is the fundamental principle of strength training, and you literally can’t do it without historical context. The data architect flagged it because the data was already in the database; it was just never queried and surfaced. The Swift developer flagged it because the WorkoutSessionView had no mechanism to fetch previous session data on initialization.
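The data architect’s point is worth dwelling on: the fix is conceptually a single query. Here’s a minimal SwiftData sketch of fetching the most recent set for the current exercise so the session view can render “last time: 155 x 6” inline. The model and property names are hypothetical, not the app’s actual schema.

```swift
import Foundation
import SwiftData

// Hypothetical model -- the real app's schema will differ.
@Model
final class LoggedSet {
    var exerciseName: String
    var weight: Double
    var reps: Int
    var date: Date
    init(exerciseName: String, weight: Double, reps: Int, date: Date) {
        self.exerciseName = exerciseName
        self.weight = weight
        self.reps = reps
        self.date = date
    }
}

// Fetch the most recent logged set for a given exercise.
// The data is already in the store; it just needs to be queried and surfaced.
func lastSet(for exercise: String, in context: ModelContext) throws -> LoggedSet? {
    var descriptor = FetchDescriptor<LoggedSet>(
        predicate: #Predicate { $0.exerciseName == exercise },
        sortBy: [SortDescriptor(\.date, order: .reverse)]
    )
    descriptor.fetchLimit = 1
    return try context.fetch(descriptor).first
}
```

Call this once per exercise when the session view initializes, and the “previous” column has its data.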


Four different reasons. Same conclusion. That’s when you know something is real.


The 19-tap problem (and why I had to research the entire market)

The weight stepper was the other showstopper. My app used +/- buttons that incremented by 5 pounds. Want to change from 155 to 60? That’s 19 taps on the minus button. There was no way to just… tap the number and type a new one.


I knew I needed to fix it, but I didn’t want to just guess at what “better” looks like. So I did what a PM is supposed to do: I researched the market.


I went deep on six apps: Strong, Hevy, JEFIT, FitNotes, GymStreak, and Alpha Progression. Not a surface-level feature comparison. I looked at the exact interaction pattern for logging a single set. How many taps? What input method? How do they display previous performance? How do they handle rest timers? What shows up on the lock screen?


The findings were clear. The gold standard is a columnar table layout (SET, PREVIOUS, WEIGHT, REPS, checkmark) with tap-to-type numeric input. Strong and Hevy both achieve a 1-tap set completion when the pre-filled values match what you want. My app required 1 tap in the best case and 19 taps in the worst.


Strong has a plate calculator that shows a visual diagram of the barbell with plates on each side. Hevy has a Live Activity widget that lets you see your rest timer countdown on the lock screen without even opening the app. FitNotes has configurable per-exercise weight increments of 2.5 kg for barbells and 1 kg for dumbbells.
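A plate calculator, by the way, is just greedy arithmetic: subtract the bar, halve the remainder, and fill from the heaviest plate down. A sketch in Swift; the plate inventory and 45 lb bar are assumptions, not Strong’s actual logic:

```swift
// Greedy plate calculation. Returns the plates for ONE side of the bar,
// or nil if the target can't be built from the available plates.
func platesPerSide(target: Double,
                   bar: Double = 45,
                   available: [Double] = [45, 35, 25, 10, 5, 2.5]) -> [Double]? {
    let perSide = (target - bar) / 2
    guard perSide >= 0 else { return nil }
    var remaining = perSide
    var plates: [Double] = []
    for plate in available {              // heaviest first
        while remaining >= plate {
            plates.append(plate)
            remaining -= plate
        }
    }
    return remaining == 0 ? plates : nil
}

// platesPerSide(target: 155) -> [45.0, 10.0]  (155 - 45 bar = 110, 55 per side)
```

Drawing the visual barbell diagram on top of that list is the hard part; the math underneath is this simple.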


My app had none of this.


But here’s the PM move: I didn’t just make a list of features to copy. I used the research to make decisions about what we’re doing AND what we’re deliberately not doing.


We’re not adding AI-prescribed workout targets (GymStreak does this, but it prescribes non-standard weights like 62.5 lb dumbbells, not helpful). We’re not adding social features (Hevy is social-first; we’re a private training log). We’re not adding 3D exercise demos (GymStreak’s look cool but push the actual logging interface lower on screen, and experienced lifters don’t need them).


Knowing what to say no to is half the job. (MoSCoW to the rescue.)


The real challenge: you can’t just tell AI to “build the app”

This is where most people’s understanding of AI-assisted development breaks down.


You’ve probably seen the tweets. “I built a full-stack app in 30 minutes with AI!” And sure, maybe they did. A demo. That works for the screenshot. That falls apart the second you have real users, real data, and real edge cases.


The actual hard problem isn’t getting AI to write code. It’s getting AI to write the right code, in the right order, without losing track of what it already built, without breaking things that were working, and without trying to be clever in ways that create problems three phases later.


Here’s what I mean. My design spec was 42 tickets across 10 phases. The final WorkoutSessionView.swift file (the screen where you actually log sets) gets touched by 13 different tickets. If I just threw the whole spec at Claude Code and said “build it,” here’s what would happen: (Trust me, I have been there!)


  • By ticket 15, it would have forgotten the architectural decisions from ticket 3
  • It would try to add a feature to a file that another ticket already restructured
  • It would “helpfully” refactor something that a later ticket was planning to change
  • Context window would overflow, and it’d start hallucinating function names

So I had to solve a meta-problem: how do you project-manage an AI that has amnesia between sessions?


The blueprint (or: how to give AI amnesia and still ship)

I ended up building what’s essentially a sprint playbook, but designed specifically for Claude Code’s constraints.


Rule 1: One ticket per prompt. Never ask it to do two things at once. Each ticket is a focused unit of work: modify these specific files, achieve this specific outcome, commit with this specific message.


Rule 2: Anchor prompts. At the start of every phase (or new session), I paste a block of context that tells Claude exactly what phase we’re in, which files to read, what’s already been done, and what the rules are for this phase. It’s like a morning standup briefing for an engineer who just woke up with no memory of yesterday.
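For illustration, an anchor prompt might look like this. Every file name, ticket ID, and phase here is made up; the point is the shape, not the contents:

```
PHASE 3: SetRow redesign
Read first: DesignSpec.md section 4, WorkoutSessionView.swift
Done so far: P1.1-P2.6 (data layer fixed; old SetRow extracted and compiling)
This session: ticket P3.1 ONLY. Build the new columnar SetRow.
Rules: no refactors outside the files listed in P3.1;
       commit as "P3.1: <summary>" when done.
```

Thirty seconds of pasting saves an hour of the AI rediscovering (or contradicting) decisions it made last session.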


Rule 3: Commit after every ticket. This creates a restore point per ticket instead of one at the end. If ticket P2.4 breaks something, I roll back to P2.3, not to the beginning of the entire project.


Rule 4: UAT gates between phases. I manually test every item in the checklist before moving to the next phase. The data foundation has to be solid before I start redesigning the UI that sits on top of it.


Rule 5: Don’t let Claude refactor beyond scope. AI loves to “help” by rewriting things you didn’t ask it to touch. You have to be firm: “Don’t change that. Stick to the current ticket. We’ll address that in Phase X.”


This might sound obvious, but it’s the difference between shipping and spiraling. Most people who struggle with AI-assisted development aren’t struggling with the AI’s capability; they’re struggling with the project management around it.


Two Claudes, one blueprint

Here’s something that might sound absurd. I had two separate Claudes working on this project, and I had to get them aligned.

Claude Code (the engineering one) generated its own implementation blueprint based on reading my actual codebase. It knew every file path, every line number, every function signature. Its blueprint was technically precise.


Claude.ai (the strategy one, the one I’m talking to in the browser) generated a blueprint based on the design documents, safety patterns for AI-assisted development, and process guardrails. Its blueprint had anchor prompts, emergency procedures, and rules for preventing scope creep.


Neither blueprint alone was sufficient.


Claude Code’s plan had exact line numbers but no safety rails. Claude.ai’s plan had a great process, but referenced files generically. So I merged them. Took the engineering precision from one, the project management guardrails from the other, had Claude Code review the merged version, resolved the contradictions, and produced one final document.


I was literally doing cross-functional alignment… between two AIs.


The funny part? They disagreed on things. Claude Code wanted to jump straight into building the new SetRow component. My blueprint said “extract the old one into a separate file first as a safety step, verify it compiles, THEN build the replacement.” Claude Code’s approach was faster. Mine was safer. We went with mine because in a 42-ticket sprint, one bad merge can cascade for days.


That’s a product management decision. Speed vs. safety. And the right call depends entirely on the context of your project.


The conflict resolution table (where PM skills actually matter)

The four specialist agents didn’t always agree. The fitness coach wanted 4 types of personal records. The data architect wanted 7. The designer wanted swipe navigation between exercises. The competitor research showed that apps using swipe (JEFIT) got complaints while apps using vertical scroll (Strong, Hevy) didn’t.


Every one of these disagreements needed a decision. Not a compromise, a decision with a rationale.


We went with 7 PR types instead of 4 because I have lumbar spondylitis. I train conservatively. Weight PRs are rare for me. But rep PRs? Volume PRs? Those happen almost every session. A 7-type system means I get positive feedback nearly every workout. That’s not a technical decision, that’s a product decision informed by understanding my user (myself).


We went with vertical scroll instead of swipe because the competitive data was clear. We went with tap-to-type as the primary input because every top app does it, and JEFIT’s scroll wheel is the most-criticized input method in the entire category.


This is the job. Not writing code. Making decisions and being able to explain why.


42 tickets. 10 phases. A Notion board. Zero code written yet.

As of right now, I have:


  • 7 design documents (4 specialist audits, a competitor analysis, a design spec, and a final decisions doc)
  • 1 implementation blueprint with 42 tickets, exact file paths, dependency maps, and UAT checklists
  • A Notion project board with every ticket tracked — owner, status, phase, priority, risk, dependencies
  • 4 specialized agents configured and ready
  • A git branch strategy defined

And I haven’t written a single line of production code.


Some people might see that as over-planning. I see it as the reason this won’t turn into a mess. The V1 was a 2-week hack. V2 is a proper product build. The difference is that I know exactly what I’m building, exactly what order I’m building it in, exactly what the risks are, and exactly how to recover if something goes wrong.


Phase 1 starts with fixing the data layer: cascade deletes, broken rest time tracking, and the notes system that silently drops per-exercise notes. It’s not glamorous. It’s not visible. But everything else depends on it being right.
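In SwiftData, that kind of integrity work is largely a matter of declaring delete rules explicitly instead of relying on defaults. A hedged sketch with made-up model names; whether deleting an exercise should cascade into its history or preserve it with `.nullify` is exactly the kind of product decision discussed above:

```swift
import SwiftData

// Illustrative models -- not the app's real schema.
@Model
final class Exercise {
    var name: String
    // .cascade removes a deleted exercise's sets along with it;
    // .nullify would keep past sets and just clear the back-reference,
    // preserving workout history at the cost of an orphaned name.
    @Relationship(deleteRule: .cascade, inverse: \LoggedSet.exercise)
    var sets: [LoggedSet] = []
    init(name: String) { self.name = name }
}

@Model
final class LoggedSet {
    var weight: Double
    var reps: Int
    var exercise: Exercise?
    init(weight: Double, reps: Int) {
        self.weight = weight
        self.reps = reps
    }
}
```

The bug class this prevents (deleting an exercise referenced by 50 past workouts and leaving dangling references) is invisible in a demo and catastrophic with real data.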


Then the SetRow redesign. Then navigation. Then the PR system. Then rest timer enhancements. Then Live Activity. Then the plate calculator. Then polish. Then a full regression test where all four agents review the complete codebase one final time.

That’s how you build something that doesn’t fall apart.


What this is really about

I’m not writing this to show off an app. Honestly, it’s a gym tracker. There are hundreds of them.


I’m writing this because the way we build software is changing, and most people are approaching it wrong.


The narrative is “AI replaces developers.” That’s not what I experienced. What I experienced is that AI replaces the mechanical parts of development: writing boilerplate, remembering API signatures, and structuring files. What it absolutely does not replace is the thinking: what to build, in what order, why this approach over that one, when to say no, how to recover when something breaks.

The tools are insanely powerful. Claude Code running four parallel agents that simultaneously audit a codebase from different professional perspectives — that’s genuinely new. A year ago, this wasn’t possible. But the tools are only as good as the person directing them.


You still need someone who knows that 19 taps is too many. That “last session data” is the most important missing feature, not the fanciest one. That vertical scroll beats swipe gestures because the data says so. That you need to fix cascade deletes before you build a Live Activity widget because data integrity is a prerequisite, not a nice-to-have.


That person is the PM. Or the designer. Or the senior engineer. Or honestly, anyone who thinks in terms of users, tradeoffs, and systems, not just features.


The AI is the team. You’re the leader. And the quality of what ships depends entirely on the quality of your leadership.

I’ll be posting updates as I execute the 42 tickets. If you want to follow along, here’s the Notion board where every ticket is tracked in real time.


Previously: Me, Myself, and AI: A Product Manager’s 2-Week Sprint from Idea to iOS App
