Feed | Shreyas Prakash

How I started building softwares with AI agents being non technical

16 minutes read

agentic-engineering

At the start of the year, the CEO of Anthropic had made a prediction that 90% of code in enterprises would be written by AI by September. Now that we have crossed September, we now know that the prediction turned out to be false. As Ethan Mollick mentions, he only seems to have been off by a couple of months (this was recently posted by Boris, the creator of Claude Code) where he mentions 100% of his contributions to Claude Code, written by Claude Code!

Last year 2025, by no doubt has been the year of AI agents. And I was tempted right from the beginning to play with this shiny new toy. And as a “technically curious” person, I want to dive right in. Ended up spending most nights and weekends understanding how to build software with AI agents. It was super fun..

Out of the many side projects that I built this year, here are some memorable ones:

writing platform with proof-of-human authorship, 98% AI usage
voice dictation tool with dictionary support, and speaker diarization, 98% AI usage
personal website, customised to my own bespoke taste, 90% AI usage
dial international phone calls on the web browser, 98% AI usage

Apart from these heavy-duty apps, I also built various micro-tools that served various ad-hoc use cases which include a chrome extension to import X bookmarks as Trello cards, a Windows XP-esque wallpaper with dynamic 3D clouds, an AI chess coach for improving elo score through socratic dialogue, an Obsidian plugin to help me prioritize the worst rough draft of an essay to improve first, and even a Mac-native open source screen recorder.. (all open source, and free for use)

The beginning was quite benign. Early 2025, I felt initially that LLMs could only build toy apps, and were not capable of building anything truly substantial. So I was stuck to creating various one-off prototypes using Lovable and Claude Artefacts to make them. I still wasn’t sure about it’s usage in complex codebases. The only way I was using them was by copy-pasting code snippets to ChatGPT and feeding them back. Then it evolved to simple autocomplete on AI-native IDEs such as Cursor IDE. Then I started using the chat window on the IDEs directly to interact with my codebases. I now run a Ghostty terminal with multiple tabs open with Codex instances.

By this time, mid-2025, the scaling laws were kicking in, and the agents were becoming much more successful in performing longer operations without breaking or hallucinating in between. The recent charts show tasks that take humans up to 5 hours, and plots the evolution of models that can achieve the same goals working independently. 2025 saw some enormous leaps forward here with GPT-5, GPT-5.1 Codex Max and Claude Opus 4.5 able to perform tasks that take humans multiple hours—2024’s best models tapped out at under 30 minutes. With this equipped model capabilities, I was excited to try CLIs and had great success. I hardly look at any code nowadays, not even a code ditor. My current setup looks like this:

It’s all on the terminal with Codex with multiple tasks running on different terminal windows. All I do is, engaging in a socratic dialogue with the models on various aspects: is X more performant than Y? Have you researched on alternative to perform feature Y? Does the API provided by platform Z have any rate limits which need to be considered?…

To some extent, it almost feels like coding has evolved to a higher-level of abstraction, and like how Karpathy sensei mentions in this tweet, “here’s a new programmable layer of abstraction to master (in addition to the usual layers below) involving agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations, and a need to build an all-encompassing mental model for strengths and pitfalls of fundamentally stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering”

Building such projects was giving me an intuitive understanding of how something as non-deterministic as a large language model can fit into a deterministic workflow of building software. I slowly moved from being a “no code” person, and with AI agents I moved to being a “some code” person. I still couldn’t write code and defined myself as a “non-technical rookie” to some extent. But with AI agents, it just changed the game, I was able to steer them towards what I wanted to achieve, and build great software.

How I use LLMs now

Here are some lessons I learnt while just jumping into the “AI waters” using LLMs and agents, and learning how to swim with them (as of Jan 2, 2025, things change really fast TBH):

Model usage

Model usage boils down to economics, with evaluations of tradeoffs between cost and intelligence being done for answering various questions on a frequent basis.. (you wouldn’t really use the most sophisticated model on the leaderboard to figure out how to center a div, for eg.). I now use gpt 5.2-codex-extra-high for complex problems, and gpt5.2-codex-medium for anything else.
In my initial explorations, I used to be very open ended in deciding which framework I should use. I was going with the defaults which codex gave. Especially when this gets subjective on a well-oiled, well contributed library, or framework which we can trust. I’ve arrived at a sensible set of defaults which I’m comfortable to understand, and almost always use them for various apps. For any web-app to be built, I use this starter kit which is basically Ruby on Rails in the backend, with Inertia for using React on the frontend. It does a good job of bringing together best of both worlds: React and Rails together, and also has great component libraries such as shadcn baked in. For anything mobile, I build on Expo, and for one-off frontend prototypes, I build React/vite apps. Over time, I’ve also gained an intuition on the prowess of each of these frameworks, so I can understand what their limitations are. language/framework and ecosystems are important decisions taken,and hard lessons have been learnt.
In terms of model selection, I almost always choose Codex over anything else, even Claude Code. Claude Code has great DX, and other utilities such as hooks, skills, commands etc, but Codex seems to just “get” it without any such charades. I was using Claude Code until I saw the brilliance of Codex from Peter Steinberger in his talk at the previous Claude Code Anonymous meetup in London. I’ve never really touched Claude Opus/Sonnet after that.
Another reason I use Codex is that they’re not as sycophantic as Claude, and pushback whenever necessary. When I make delirious requests.. Codex is like “are you sure you want to do Y, it might break X and Z… here are couple of alternate options a, b and c…”

UI prototyping

For UI prototyping, I almost by default ask the model to come up with multiple variants of the same problem. While building the landing page for Signify, I was exploring kinetic typography for expressing the message through type, so I prompted the LLMs to come up with three variants of the same. If the variant 2 “clicked”, I then asked it to branch further, and come up with 3 additional variants of variant 2, before finalising on the direction. I was diverging to converge, and then again, converging to diverge. All these variants I now store them on the /sandbox route for an app, before I integrate with the actual app..

Another technique for faster UI explorations in a “low fidelity” way is to ask it to generate ASCII diagrams of the UI layouts, and it cooks up something like this, making it easier to iterate on loop.

For the past three projects that I’ve shipped with AI agents I’ve never touched Figma to communicate anythng at all.. despite years of being ingrained in the Figma-way of building prototypes.. Now I just use excalidraw (to draw loose sketches), ascii diagrams (to generate lo-fi mockups) and prototype sandboxes with good design systems (to generate hi-fi mockups)

Workflows

I usually parallelize by running multiple tabs with Codex open. I don’t git worktrees or anything of that sort, but in a way I prevent the models from stepping into each other’s toes by means of atomic commits:

Keep commits atomic: commit only the files you touched and list each path explicitly. For tracked files run git commit -m "<scoped message>" -- path/to/file1 path/to/file2. For brand-new files, use the one-liner git restore --staged :/ && git add "path/to/file1" "path/to/file2" && git commit -m "<scoped message>" -- path/to/file1 path/to/file2

For debugging, I almost always copy+paste the dev/production tail logs to ChatGPT and it solves 99.99% of the problems. I’ve heard some of my friends have a much more advanced workflow where they integrate Sentry to log all the errors (I haven’t personally tried this yet, I wanted to cross this bridge when I have no other escape route)
With AGENTS.md or CLAUDE.md file, I give instructions only on a higher abstract level, as I’ve seen some Twitter folks mentioning that the models almost always bypass the code snippets which are added to the agents file. Think of this as higher level steering instructions. Not too detailed, and not too vague either. I use a variant of this gist for my own purposes.
With context window optimisation, I’ve been recently understanding that there is a great-dumbening of the model especially when the context window is more than 40% of it’s actual limit, and the best approach then is to start a new chat with the agents, instead of adding more to the same chat session.
I don’t do compaction of the chat windows as I view them as lossy. In case the chat is more than it’s 40% limit and if I haven’t been able to get the problem fixed yet, I instruct the agent to write a markdown file with the output of all the revisions, changes and decisions made. I then reference this file to a new chat
No more plan modes. Previously with Claude, I used to build very detailed product specs documents (following Harper’s LLM codegen hero’s journey guide, with Codex now, it’s changed. Instead I just write a “product vision” document. This helped set expectations on the vision I want to build the product towards. This was also 100% written by me without any AI agents help, as this was something I could uniquely contribute. Plan mode was just plain boring, as I was not so excited to create a 50-point to-do lists to build MVPs. I started feeling almost like a mechanical turk blindly pressing “continue continue continue..” ad infinitum without sharp thinking. This process of taking the complete idea as input and then delivering an output was stripping me of my creative process and wasn’t really working well for me. Now, I just start with the boilerplate starter kit and ask questions based on various user stories.. I would say something like “I want users to sign in with Google” and it builds it. then I’m like “I want users to be onboarded on how to use this service” and it builds an onboarding page. one by one, one user story at a time until I build an MVP.
break your app down into what users actually do. “a user can sign up with email and password.” “a user can create a new post.” “a user can see a feed of all posts.” this is the language the ai understands. this is how you communicate clearly.

For product vision drafting:

Ask me one question at a time so we can develop a thorough product vision for this idea. Each question should build on my previous answers, and our end goal is to have a detailed product vision, I can hand off to all of you (agents) to provide a direction of the north star. Let’s do this iteratively and dig into every relevant detail. Remember, only one question at a time.

Here’s the idea: [insert idea here]

No matter whatever code is written, TDD is still a must. LLMs can still make errors. I instruct them to write tests, and I read through the test scenarios to cross check if the user journey logic is intact.
As I’ve now started to build more projects, I have them all neatly organised within a /Projects folder with various projects under them. /Project 1, /Project 2 etc.. If I run into an error which I’ve encountered in a different project that I’ve solved, I reference feature X and it’s implentation from Project 1 into Project 2 and it does it neatly. Over time, as we accumulate exposure to more problems solved by means of LLMs, it almost becomes an art of “compound engineering”, where previous solutions, solve current problems
For more “harder” problems to implement, or for new feature implementations, I break the prompts into three parts. Act one would be to research potential ways to integrate the feature where I ask codex to come up with three directions, from which I pick one. Act two, involves asking Codex how it aims to build it, and the series of steps it would entail. Knowing this helps me steer Codex better. Act three involves executing it’s plan. While Act three is ongoing, I do keep an eye on what it’s doing. If something seems fishy, I either abort the operation, or ask followup questions for it to look closely. This was popularised by Dexter Horthy from Human Layer, and is a nice way to separate (research) (plan) and (execute) into different operations for clarity.
For vibe coding on mobile, I initially attempted to run a Tailscale server, where I install a headless Claude Code CLI on a VPS server with which I can text over phone via SSH.. however this was quitew slow, and I didn’t enjoy the experience as much. For now, I just use Codex web to chat and create PRs.. once I’m back at my desktop, I just code review the PRs and merge them with the codebase…
I’ve also been exploring skills. I recently built a “Wes Kao writing” skill to improve my executive communications at my day job. This was a custom skill fed on all the blog posts written by Wes Kao, and gives much more refined feedback on how I could improve my first drafts in business comms.. I’ve also been using Claude’s frontend-skill for instructing agents with building UI better.. I’ve seen tons of resources (such as this one, but haven’t caught up yet)

Most of these ideas I’ve learnt from Peter Steinberger, Ian Nuttall as well as Tal Raviv / Teresa Torres on Linkedin have also been inspirational to understand how to approach building with AI agents from a product lens. (I recently found Teresa’s “build in public” updates on her recent AI interviewing tool to be quite motivating)

What I haven’t explored yet (but would try soon)

To run pre-commit hooks. The robots are very eager with making elaborate number of changes, and pushing commits. And this behavior does pollute the Github actions with various linting, formatting, and type checking errors. One way to solve this is by means of a pre-commit hook. This can be easily installed via uv tools install pre-commit command and then just build out a nice .pre-commit-config.yaml file..
For instructing Codex on UI improvements such as “add padding”, “fix borders”, “change font” etc.. along with screenshots on where the change needs to happen. I’m planning to try the docs:list script to make these easier. It forces the model to read docs, and also stay up-to-date.
I’m also yet to try agentic coding on mobile. Top on the list is Happy Engineering which has a great UI..
Not really tried any form of MCPs yet, but wanting to try either the Playwright MCP, or Chrome Devtools MCP for browser automations. Screenshotting images of the UI for various tasks seem quite painful at the moment. Not sure if it should be an MCP, or a Skill that should be tasked to do this.
99% of the code in the codebase is written by the agents, but for the remaining 1% I would still need to manually edit the files using an IDE. For this 1%, switching windows across to Cursor IDE seems to be a lot of context switching which I hate. Wanting to try a native terminal editor that’s not having a steep learning curve. Tried neovim but was struggling to learn it well. Planning to try Fresh, which, as it’s name suggests, seems to be quite a fresh take on a terminal editor.
So far, I’ve been purely building apps with Codex CLI, and I suspect this might also evolve or change over time. I’ve been seeing folks doing various experiments with an abstraction layer on top of such Headless CLIs.. Steve Yegge in his recent essay, describes these increasing abstraction levels as a spectrum, and as follows:

Stage 1: Zero or Near-Zero AI: maybe code completions, sometimes ask Chat questions. Stage 2: Coding agent in IDE, permissions turned on. A narrow coding agent in a sidebar asks your permission to run tools. Stage 3: Agent in IDE, YOLO mode: Trust goes up. You turn off permissions, agent gets wider. Stage 4: In IDE, wide agent: Your agent gradually grows to fill the screen. Code is just for diffs. Stage 5: CLI, single agent. YOLO. Diffs scroll by. You may or may not look at them. Stage 6: CLI, multi-agent, YOLO. You regularly use 3 to 5 parallel instances. You are very fast. Stage 7: 10+ agents, hand-managed. You are starting to push the limits of hand-management. Stage 8: Building your own orchestrator. You are on the frontier, automating your workflow.

I haven’t ventured into these stages personally in 2025, and I’m also not sure if things would change in 2026. Stages 7 and 8 are still very controversial, and debate-able right now, and is still not ripe enough for even the early-adopter’s “adoption”. Agent orchestration seems be the hottest word right now in such AI-pilled dev circles and I’m curious how this would unfold..

Wrapping up 2025..

As 2025 ends, and a new year begins, it feels that everything is possible. It’s the “age of the builder” and understanding “how to write code syntax” is no longer the bottleneck. This is also likely going to be one of the most important decades in human history, and even ordinary actions like “putting an essay on the internet” can be extremely high leverage.

Aiming to think hard about what we’re doing, and post more, write more, participate more in 2026!

View original post →

Legible and illegible tasks in organisations

7 minutes read

product

I’ve been debating with my tech lead about the role of qualitative methods, continuous discovery, interviews, observational studies, in generating user insight. In his view: if we already collect vast amounts of data, isn’t it more efficient to analyse what we have rather than spend time talking to users? Quantitative data is more measurable, comparable, and easier to use as evidence. After all, what can one interview reveal that hundreds of tagged events, funnel charts, or aggregated survey scores cannot? An interview in contrast is like a needle-in-the-haystack, does it even matter when huge amounts of data is already available?

To me clearly, this was not true. In a nutshell, the reply I provided was that both “legible” and “illegible” work is important. To explain what this means, and how it applies, I need to take a long-winded detour into the world of James C. Scott, and help you see this through his eyes.

The key idea from his book ‘Seeing like a state’ translated to organisations is as follows: As companies scale, they prioritise legible work over illegible work. This makes it harder to incorporate “illegible” work through product, processes and people. Failing to incorporate the illegible leads to a lower efficiency and throughput for the organisations.

If we were to translate this into three principles:

Modern organizations exert control by maximising “legibility”: by altering the system so that all parts of it can be measured, reported on, and so on.
However, these organizations are dependent on a huge amount of “illegible” work: work that cannot be tracked or planned for, but is nonetheless essential.
Increasing legibility thus often actually lowers efficiency - but the other benefits are high enough that organizations are typically willing to do so regardless.

The other way to put this is that “what gets measured, gets managed”, and when this company prioritises the measurable (more manageable) work, the overall efficiency of the company gets affected.

I’ll highlight some examples from the book to illustrate these points:

Turning a forest into a simple grid of identical trees for easier measurement: government replaced complex topologies of natural forests into perfectly measureable, easily taxeable outputs, that are more straightforward when it comes to “calculating the inventory”. But it destroys the efficiency of the forests as the (soil fungi, insects, microclimates etc) are invisible to the planners, eventually making the overall forest sickly, and low-yield.
The USSR experiments on agriculture: under the centralised Soviet state, agricultural produce was made fully “legible”, everything could be seen, counted, directed, and reported upon. Efficiency again collapsed as paper work mattered more than crop quality, incentives encouraged false data, and not good harvests.
In the 19th/20th centuries, we had Scientific Management under ‘Taylorism’ that led to reducing skilled labour into measurable, timed micro-tasks. Factory workers who had a series of jobs where measured based on how much time each job takes. The modern day counterpart for Taylorist factory management would be the excel timesheets, quite common in consultancy companies where projects are commissioned based on the “time alloted” by the team. For the sake of legibility, output was prioritised over outcome.
Hunter gatherers to early-agrarian societies: anthropological evidence showed that early hunter gatherers were much more well-built, less disease-prone and nourished, compared to the early-agrarian societies. What would have been the case? I was of the impression that progress to agriculture might have lead to more dietary diversity, not the other way round. The answer here also lies in James C. Scott’s key idea around legibility and illegibility. Early farmers could only cultivate 2-4 crops at scale especially carbohydrates such as wheat, barley, rice and potatoes. This made food measurable and controllable for the farmers, or the union of farmers, but dietary diversity collapsed. Again, being “easy to tax”, “easy to store”, “easy to collect”, “easy to standardise” and “easy to distribute” took preference. Being legible came at a biological cost which was ignored.
This applies everywhere, even the introduction of standardised units such as the S.I unit system for measurement, led to a loss in efficiency. Previously, we had local, contextual units that were applicable only to certain regions. For example, we had this unit called the “Morgen” that was defined as the area which one farmer + one ox can plough in a morning. This was replaced by “0.086 hectares”, what do you think is more efficient and applicable for the farmers of that day?

In the world of corporate-speak, organisations and softwares, this is made so visible in terms of program status update calls, bi-weekly delivery syncs, monthly OKRs, CSATs, NPS scores etc. A company that is known for certain “illegible” qualities as a startup, as it grows to be a scale-up sometimes loses them for the sake of becoming more measurable, losing itself and it’s DNA in due course. But once in a while, you might also see exceptions to this rule where organisations consciously make the illegible work more visible:

In my previous product role at Noora Health, a healthcare non-profit, we were running a program that trained patients and caregivers in maternal child care. In this case, the number of “trained patients” was legible work. This was the figure given to the grants, philanthropists and other stakeholders who were financing the program in proportion to the scale of such impact. But you also had “number of trained patients where nurse empathetically listened to the patient, and spoke to them gently” which was very illegible to measure and manage. As a consequence, this didn’t get measured, and therefore, not managed. How can you even measure “care”? But that doesn’t discount the fact that illegible work is not important (it’s important but it’s difficult to measure)..

To ensure that the “illegible” work was still incorporated into the core program while it continued to scale and serve millions of patients and caregivers, the caregiving program had two components: an in-person component where nurses were trained to educate patients and caregivers in health facilities. And a digital component which provided just-in-time information through mobile phones for family caregiving.

If it was all about maximising “number of patients trained”, Noora Health could have very well just revamped the program focussing on JUST the digital component (as it’s easy to scale to 10s of millions this way, why bother to have an in-person component? even..). But they didn’t make that compromise, as there is a lot of “illegible work” when it comes to providing care. The way they have absorbed “illegible work” as a part of their core-program is a beautiful lesson on how it’s possible to also scale while ensuring the illegible work remains uncompromised.

Coming back to the original premise of this essay, where I was debating with my tech lead on why user interviews are important despite tons of quant data… it boils down to the same principle: illegible work should remain uncompromised.

Data can tell you who clicked on what, what they abandoned, how often an error happened etc.. but not “what users wish they could do”, “what they tried but didn’t know how”, “what workarounds were created” etc.. In that sense, data is a lagging indicator. It measures the past, and not what’s possible. What’s possible is not captured in “legible” ways. We have tacit knowledge. Lived experience. Informal behaviors. All these are “illegible”.

Data only shows the legible past. Continuous discovery makes the “illegible” work more visible, even before it becomes data. It’s not to say that this is a good replacement for quants. It merely completes it, it’s another lens we can adopt for truth-seeking, building products closer to what users really want.

View original post →

L2 Fat marker sketches

4 minutes read

design

What’s the best level of detail one should use to communicate an idea?

Let’s take the most simplest, benign format: the humble paper napkin sketch.

It makes the representation of the idea too crude, but on the flip side, it actually welcomes the audience, offering them a safe space to critique the idea. It suggests them to provide constructive feedback without any hesitance.

I would call this as an L1 sketch, as in “Level one fidelity”. This could very well just be a wireframe diagram.

While sketching ideas for storytelling, I spend some time grappling with the right level of abstraction so that I don’t slow things down, or lose the size/shape of the ideas appearing in the corner of our brains, or the tips of our tongues.

And my conclusion here is that not all ideas SHOULD be loose and to be represented as an L1 sketch. Here, it can also create room for misinterpretation, with the audience having a wrong picture of the idea in their minds, and you might spend more time clarifying what the “core” idea is.

So, sometimes you go further than an L1 sketch.

On the farthest extreme of fidelity as an L4 sketch, you have the Lovable, or Claude Code generated high-fidelity prototypes/mockups. Especially with the recent AI tools for prototyping being adopted, it’s become far easier to generate high-fidelity prototypes with a quick prompt. The user can click, drag, move, explore all possible interactions. And the earlier notion that “high fidelity prototypes” take time to make and test, is no longer valid in this day and age.

As L4 sketches are highly-polished, and gives the appearance of being “finished”. In that state, you might get lesser feedback, but they might be more exact and directed… The users might say, “hey when I click on this, I want it to be dragged from X to Y…”. They have much more affordances in the medium which they can direct and provide constructive criticism on..

Between L1 prototypes (paper-napkin sketches) and L4 (Lovable prototypes) you also have a spectrum in between. And I think there are still highly relevant use cases for an “L2 sketch”. For this, I remixed the idea from Jason Freid’s Shape Up Book where he talks about “Fat marker sketches”. It’s still a high-level drawing, but it goes one-level deeper. This balances the need to help people “get” the idea without going too far into detail.

There is some nuance here for an L2 sketch as one needs to ensure that the right balance of vagueness and concreteness is maintained.

Let’s say you’re describing the UI (payment form preview), in case of L1 sketch you still wouldn’t provide much detail on the journey (you’re just putting a box describing the function, not HOW the UI behaves or responds). That’s still a black box. To describe the journey as well, you might have to go two levels deeper (L2)

L2 Fat marker sketches can be very effective in such cases; you just need to take more care to label them cleanly on the UI, describing the payment form as well as how the user interacts with it.

As the saying goes, ‘Medium is the message’, and while we might put in efforts to communicate the idea, it’s also a prudent decision to put efforts in finding the medium that represents this desired fidelity.

So we might just have to nudge the slider on the fidelity scale, to arrive at the right one: if it’s a napkin sketch of a wireframe diagram, or a highly fungible interactive Lovable prototype. I’m still learning to choose my mediums more wisely..

View original post →

Writing as moats for humans

4 minutes read

writing

Most writing on the internet is AI-writing now. The dark forest theory of internet, a re-hash of a concept popularised by the sci-fi author, Cixin Liu referring to this hostile digital landscape where most content is written by the bots, and to escape from this cybernetic fake-ness, users retreat to hidden, invite-only “private” communities to escape this chaos. We live in this hostile. digital. landscape:

Every now and then when I read something on the internet, to some extent, I know it’s AI generated. It smells strange and I can sense it in my bloodstream. My AI-sniff-test radar’s sensory perceptions got heightened after some test-runs with the usual suspects: you had Claude with it’s opening-line, “You’re absolutely right!”. Or the incessant — use — of — em-dashes — everywhere (even after prompting it not to use them em’).

It’s still not so clearly describable as to why a particular writing was solely written by a human. This intangible essence has not been codified yet (perhaps, that’s also why the LLMs have not caught up to this mode of writing yet). The larger proportion of people might just fail at detecting AI writing online; giving marketeers more confidence in injecting more such premium-mediocre AI slop. And as a result, we’re more exposed to soul-less, spine-less writing.

This is why, I think actual writing would remain as some of the last remaining human-moats. actual writing. I’m talking about the kind of writing that makes you “feel” something. Makes you stir up and take action.

I don’t think the AI-writing can one-shot such responses anytime soon.

Even if we go with the scaling hypothesis, and assume GPU cluster size to increase, datasets to expand leading to a cambrian explosion in intelligence; i’m deeply sus about these RLHF’d responses by the LLMs. The writing is less likely to be spiky as it’s just a next-word predictor serving as an average gaussian-mean for all the possible responses splattered on the internet. The AI-writing “evens out the edges”, and by doing so, the writing loses it’s essence.

AI-writing is not completely useless though. They’re really good at blending things. Sure, they can do Dostoevsky in the style of Jane Austen. Or generalise a 100,000 word essay in novel ways, and come with various other blended styles and formats. What LLMs allow is recombination.

If good writing is what we call as mere combination, recombination of sequence of texts in harmless ways; then sure, AI writing could pass this test. But this is clearly not the larger umbrella of writing jobs. AI-writing only serves mundane writing-jobs such as: “business” writing, dry boring policies, condensing meeting action items, making exam notes, regurgitating legal text/s, literature reviews, market research, etc. As Venkatesh Rao describes it in his recent essay, “writing is now toy-making, and reading is now playing with toys.” As all these writing jobs are mix-and-match combinatorics are still “in-distribution” work. It could be delegated/or automated out.

True writing is still “out-distribution”¹ ² and cannot be automated away³. It’s about thinking hard, and it’s not just a combinatrics problem. You still need to know how to play with all the text toys available in order to make the writing more spiky and opinionated.

And when you’re truly spiky, it clicks. Instead of 10,000 people saying meh..ok, you might have a 5 saying “wow!!!”. Good human-writing achieves long-tail resonance.

As Alex Guzey puts it, there are various such out-distribution pieces of work, which includes research, learning deeply, real writing etc.. ↩
Simone argues that, in fact, truly novel innovations (proofs of unsolved theorems) would always be “out-distribution” for the LLMs. This would be the case as points would exist outside the “convex hull”of an LLM’s training data. But this is not just the case for LLMs, but humans work the same way by creating genuinely new things by just “remixing” ↩
Some say that this might not be the case, and might cite the existence of AI-girlfriends as proof that AI has cracked the human-ness problem of writing. For this, I would argue that a chat response is different from a meaty essay. A chat response resembles more of a next-word prediction, which AI is better adapted at. ↩

View original post →

Beauty of second degree probes

3 minutes read

decision-making

I’ve been noticing this pattern again and again—just about enough to make this pattern a heuristic for my own understanding. “No lie can overcome a second degree probe”

If you probe a lie, just about enough, it reveals itself in due course. At work, I see this all the time. When a statement is just made, in a matter-of-fact way in which it assumes everyone already knows it, and when you ask “why it’s so?”, often, there is an explanation that’s provided to this statement. This is the first line of defense for a lie (or an assumption), and it might most likely exist.

But if you then question the explanation as well, and ask “why is this explanation this way, and why not X?”, you start seeing some cognitive dissonance in action. There apparently is not explanation to the explanation!

Some examples on top of my mind:

For a statement, such as “We need to restructure the team for better efficiency.”

First probe: “Why will this structure be more efficient?”

Response: “Because it reduces communication overhead.”

Second probe: “How exactly does it reduce overhead, and have we measured current communication costs?”

Response: …

Reveals no analysis was done, reveals it’s just a knee-jerk reaction to a directive from a C-level executive.

“Our customers prefer the premium version.”

First probe: “What data shows this preference?”

Response: “Higher conversion rates on the premium product page.”

Second probe: “But isn’t the premium page the default landing page, and how do we know preference versus just path of least resistance?”

Response: …

One might ask, well, it makes sense to do second-degree probes, but why not third-degree or fourth-degree probes?

The short answer here is that one could probe to various deeper levels, but they often don’t pass the test of succinctness, which is highly needed in meetings. You want to question the assumptions deeply, without being too intrusive.

It’s the right level of depth to know if there are any assumptions worth investigating further (perhaps not on the same meeting call, as you don’t want to be dismissed as a modern-day braggart Socrates here)

The moment I realised that this is a good heuristic, I started noticing this pattern everywhere (not just at work). Let’s take even investment advice:

Statement: “Real estate always appreciates over time.”

First probe: “What’s the historical data on that?”

Response: “Look at housing prices over the last 50 years.”

Second probe: “But what about Japan from 1990-2010, or specific regions during economic downturns, and how do we account for survivorship bias in the data?”

Response: “Umm…well…”

The second-degree probe is not a magic wand to shatter inherent assumptions, what it does tell us is that we don’t question our assumptions deeply enough. Sometimes it’s just a castle of glass, waiting for the right question to act as a pickaxe to shatter this whole foundation based on a false premise.

View original post →

Read raw transcripts

3 minutes read

knowledge

I opened up Claude one day, and asked to summarise Dostoevsky’s Crime and Punishment into one sentence; and it said:

A young, impoverished ex-student named Raskolnikov murders an elderly pawnbroker to test his theory that extraordinary people are above moral law, only to be consumed by guilt and psychological torment until he confesses and finds redemption through love and spiritual awakening.

Would this mean that I’ve saved 20 hours of reading, 30 days of thinking about the plotline twists, and 5 years of reflecting on the storyline to finally “get” what Raskolnikov did in this book?

No, absolutely not. I don’t think this tight compressed sentence is even a cheap substitute for what I’ve read.

I’ll give another example: More recently, a few team members picked up on the Microsoft copilot usage; and have been using it to process raw transcripts from interviews by asking the agents to ; "summarise all the transcripts, and then come up with 10 key insights you've identified.."

It’s not bad to do this, and perhaps it could save some time (I did this once or twice, and I later realised I mistook signal for noise). As I’ve shown you earlier with the compressed one-liner quote on Dostoevsky’s Crime and Punishment, it does a poor job at unearthing insights. These “compressions” have this diabolic nature of making 0 sense, and also total sense at the same time.

We can’t just offload the insight generation process to the LLMs; they’re not the same.

Compare the copilot gibberish to actual humans instead. Let’s have 10 humans in a room, who read through the same set of 10 interview transcripts, synthesize them together with a facilitator in charge, and come up with a list, and then compare the result with the copilot gibberish, you will see what I mean. What this might generate, might actually make more “sense”. It could provocate, titilate, or make you do a deeper “hmmmm…” when you read it.

We, as humans might be limited in computation, limited in memory, as well as limited in certain types of intelligence that the LLMs possess, and yet, because of these very limitations, we have the potency to generate unique insights.

The way each of us distill the raw transcripts might be totally different. When we read, something unique happens in our funny little brains, it connects with all our key memories and gets situated in a way, that in total, we might have an unique interpretation.

Which is why, I see high value in reading through raw transcripts.

And for these reasons, whenever I see these up-and-coming apps/startups talking about “killing books”, “killing meeting notes”, “killing note taking” “killing podcasts” by summarising everything into a few paragraphs, or even a few sentences. I truly mock at them. These “AI is taking over your jobs” propogandists run into the mistake of confusing compression with distillation. It’s not the same.

Let the LLMs do the grunt work of compression; we can safely delegate them at this task.

And we should continue reading more raw transcripts; reading footnotes; reading meta-commentaries; the “devil lies in the details”, and how our brain interprets them. Distillation might be the only remaining moat for us.

View original post →

Boundary objects as the new prototypes

3 minutes read

prototyping

Citing from one of those work situations in the past: Our team had a catchup to discuss the problem statement; but then right after that; the UX jumped the gun and presented a “prototype” to the team for inputs. They didn’t co-create the user-journey, nor did any user story mapping, neither were there any wireframing/low fidelity mockups that were sketched.

This did upset various other disciplines (product leads, business analysts, tech leads) as to how the UX could have presented something internally without agreeing and aligning on the direction in the first place.

I’ve been thinking about this particular example more recently; and have been in two minds as to whether this was a right approach or not.

After all, what are “prototypes” anyways? Aren’t they just tools for enquiry? Is presenting a prototype to the team before presenting a user-story map, that much of a cardinal sin?

How does this sequence of actions matter anyways if we’re getting closer to the truth?… A year ago, I would have been the first to slam the brakes: “No prototypes before consensus!” But the perspective has shifted a bit.

If you’re plugged into the Twitter / Hacker News echo chambers, you might have already seen some of those vibe-coded apps being spun up in hours, not weeks. They’re slick, high-fidelity, and disposable. Tools such as Lovable/Bolt have vaporized the old sunk costs. Now, building an MVP is less like constructing a cathedral and more like tossing up a pop-up shop—quick, cheap, and easy to tear down (and build again right away)

With this added context, coming back to the original story where the team I worked with felt violated when we skipped a few steps of co-creative user-story mapping and headed straight to building a prototype: would the thinking be different if it costs nothing to build a high-fidelity MVP?

I think it’s okay to do so.

I would definitely add a huge caveat by saying, “hey team, this is just a tool for enquiry for us to think about the user story map better, and we’re not skipping any steps, and even if this is absolutely wrong, we can start from a blank slate, AS THIS COSTS US NOTHING TO BUILD, IT’S QUITE FAST!!!”

I wish I could call this object a “prototype” (for something tangible you build before define what the problem is; or before you define what the user journey is). It’s not about skipping a few steps, but about reducing the number of steps we should take to reach the ideal state.

The traditional design-thinking narrative is that, you don’t “solutionize”, and prematurely optimize; but that’s not the case. These are cheap, disposable apps that help us get closer to what the solution could be. While searching for a good definition, I found the term “boundary object” close enough.

Boundary objects help as we don’t always have to go linearly from understanding the problem to finalising the solution; there could be a co-evolution where problem and solution can evolve hand-in-hand and reach an equilibrium state. Boundary objects help gain faster traction with this co-evolution.

Not discounting (or) discarding the traditional design process; with it’s emphasis on low-fidelity prototypes. It’s not going obsolete anytime soon. But boundary objects can help capture the nuances of user experience, helping elicit better feedback than wireframe-y sketches (sometimes). It need not be used everytime; but can be used in some occasions (it’s a straight jacket, not a life jacket).

View original post →

One way door decisions

3 minutes read

product

There are moments in life when you hit slow-burn-max mode, when you know a big decision is coming, and you can feel the weight of it. You stop everything else and think deeply about the problem you’re about to face.

Jeff Bezos calls these “one-way doors.” Most decisions are two-way doors, you can go through, try it out, and walk back if it doesn’t work. But some aren’t like that. They’re irreversible, or feel that way from where you’re standing. These decisions look different for everyone, but for me, they’ve shown up at inflection points: moving countries, choosing a first job, getting married.

You could argue none of these are truly irreversible. But zoom out to the perspective of a full lifetime, say, 100 years, and the cost of reversing such choices starts to feel steep. By 30, you’re probably halfway through your productive output. So these one-way doors feel even harder. There’s no silver bullet. The best you get is a series of tradeoffs—some heavy on one side, some on the other.

I still struggle with these decisions. I’m not claiming mastery. But I do have a general approach that helps me move forward when I’m staring down one of these one-way doors.

I start with “explore” mode. I gather inputs, talk to people, look for frameworks, test small hypotheses. Only once I’m confident that I’ve mapped enough of the space do I switch to “exploit” mode—where I narrow down and commit.

This mindset mirrors the secretary problem:

You’re interviewing candidates one by one, in random order, for a single role. After each interview, you must decide on the spot—hire or move on. You can’t go back. The optimal strategy is to reject the first 37% of candidates to gather a baseline, and then hire the next one who’s better than everyone you’ve seen so far. This gives you the best chance of hiring the top candidate.

I use the same logic when seeking opinions or inputs. If I plan to speak to 20 people before making a call, I treat the first 7 as pure exploration. I gather everything, say yes to all the ideas, take detailed notes—but I don’t commit. Only after that first phase do I shift into selection mode, looking for the best fit that exceeds the baseline I’ve now formed.

You never really know what you don’t know. And you can’t learn everything, time’s limited. So this 37% rule offers a structure. If I’ve got 6 months before a major decision, I spend the first 1.5 months on deep exploration. I talk to people with different lived experiences and world views. I soak in their pros and cons, their assumptions, their logic.

And when I hear a new viewpoint, I try to hold it as “true,” just for a while. I let it challenge whatever internal direction I’ve started drifting toward. This takes effort. The temptation is to build a fortress around early ideas—to protect them from challenge. But the harder (and better) path is to seek disconfirmatory evidence. Poke holes in your own arguments before someone else does.

As you gather more input, you’ll start to feel a tension between conflicting views. That’s a good thing. You can stretch your thinking, add nuance, let your stance evolve.

Eventually, after enough of these conversations—human or otherwise (ChatGPT, Perplexity, Deepseek, etc.), you arrive at something stable. You marinate on it. For me, that settling phase takes 4–7 days. The emotional churn quiets down, and something clear starts to take shape.

View original post →

Finished softwares should exist

2 minutes read

product

Some blockbuster software products should just end the cycle of endless iteration.

The conversation around product development needs to take seriously the idea of “finished” software, alongside the more common belief that “software is never perfect and always needs improvement.”

These are two different schools of thought and have a lot of friction to coexist with each other. I lean toward the second, but I still wish more people explored the first.

I see three variations of the “finished software” idea:

Freeze model: a clean, final version is frozen and maintained by the community.
Modular architecture: a stable core with optional plugins, like Obsidian or Figma, allows for open-ended customization without bloating the base.
Dynamic software: the standard model, where new features are added continuously under the banner of ongoing development.

My guess is that it evolved through sales-driven development: Every time a potential enterprise client would have asked the JIRA team for one more additional feature, the team would have been a “yes man”, to close the deal. Over time, everything would have added up to become what it’s today. Personally, using JIRA feels like there are 10 different traffic lights across 10 different roads with cars honking at each other, and you’re exactly at the inter-intersection.

It’s a get death by a thousand cuts. My sincere hope in writing this, was a public plea that no software should end up like JIRA. It’s a version of Ashby’s Law: increased complexity demands more control, and eventually the system collapses under its own weight.

One product can’t serve all use cases forever. When you try to make the same software work for startups and Fortune 500s, you end up bloating it for one group and alienating the other.

To keep software “finished,” you need a philosophy of refusal. You need to say no — even when it costs you. There is an optimal stopping point.

But saying no is hard. Additions look like growth. Subtraction doesn’t. And almost no one wants to stop, because the economics of growth, customer segmentation, and sales strategies reward motion, not stasis.

I wish Slack, Gmail and Google had just frozen. Instead, it kept going, adding agents and LLMs into every nook and corner of the app it could find. I doubt that will stop.

View original post →

Essay Quality Ranker

3 minutes read

obsidian

Ever found yourself with dozens of draft essays in Obsidian but no clear idea which ones need the most editing work? I did, and that’s why I built EditNext - an AI-powered plugin that ranks your markdown files based on how much editing they need.

The EditNext plugin uses LLMs and linguistic analysis to evaluate your drafts, providing a prioritized list of which documents deserve your attention first. It’s like having an editorial assistant that helps you focus your efforts where they’ll have the most impact.

You can find the EditNext plugin for Obsidian for installation instructions and detailed usage. This tool helps writers:

Identify which drafts need the most work
Understand specific weaknesses in each document
Track improvement as you edit documents
Save time by focusing on high-priority edits
Leverage AI insights without leaving Obsidian

The plugin analyzes your documents using a combination of AI evaluation, grammar checking, and readability metrics to generate a comprehensive editing priority score.

Example output:

File	Score	LLM	Grammar	Readability	Notes
[[Virtual bookshelves.md]]	46.8	40	56.2	57.9	The draft lacks a clear conclusion and could benefit from a more cohesive structure.
[[How I blog blog with Obsidian, Cloudflare, AstroJS, Github.md]]	50.2	40	100.0	31.1	The draft is generally well-structured but could benefit from clearer transitions between sections.
[[Conceptual Compression for LLMs.md]]	51.0	70	32.2	12.9	The draft lacks a clear thesis and cohesive flow, making it difficult for readers to follow the main argument.
[[chatsnip.md]]	51.0	30	100.0	65.1	The draft is generally clear but could benefit from more detailed explanations of features and user benefits.
[[ascii-todo-cli.md]]	54.7	40	100.0	53.6	The draft lacks a clear introduction and conclusion, making it feel somewhat disjointed.
[[posts/Essay Quality Ranker]]	54.8	40	100.0	53.8	The draft is generally clear but could benefit from improved organization and more detailed examples.

Install it from Obsidian’s Community Plugins:

Open Obsidian Settings → Community plugins
Disable Safe mode if needed
Search for “EditNext Ranker” and click Install
Enable the plugin and enter your OpenAI API key in settings

If you’re serious about improving your writing, this plugin offers a systematic approach to tackling your editing backlog. It’s especially useful for managing digital gardens, notes collections, or any large set of drafts.

View original post →

How I started building softwares with AI agents being non technical

How I use LLMs now

Model usage

UI prototyping

Workflows

What I haven’t explored yet (but would try soon)

Wrapping up 2025..

Legible and illegible tasks in organisations

L2 Fat marker sketches

Writing as moats for humans

Footnotes

Beauty of second degree probes

Read raw transcripts

Boundary objects as the new prototypes

One way door decisions

Finished softwares should exist

Essay Quality Ranker