Fungus Foray

Building Fungus Foray - a mushroom identification app, end to end, in five weeks

Product Strategy, AI Engineering, Product Development, Mobile, Infrastructure, Brand

Overview

https://www.fungusforay.com/

Fungus Foray is a mushroom identification app that walks foragers through a structured conversation about what they've found and identifies the most likely species from a dataset of 268 British fungi - with explicit safety annotations, confusion-species warnings and ambiguity warnings. It runs in the browser and as a native Android app.

It differs from other mushroom ID apps because it relies on neither the broad and messy training of frontier vision/language models nor the brittle structures of rigid rule sets. We have found a way to merge the strong points of both approaches, and it can be applied to many domains beyond mycology.

The key takeaways from this project are twofold:

  1. We have found a way to merge expert systems and LLMs to mitigate their individual weaknesses in domains that don't have broad training data.
  2. We have shown that one person can take a whole company from idea through design, research, code and test to release in a fifth of the time it used to take, or less.

The entire project (research, dataset construction, three architectural rewrites, mobile build, deployment, and the public marketing site) was completed in roughly five weeks by one person working with an AI co-developer.

This case study documents what that journey actually looked like, including the parts that did not work the first time.

The starting hypothesis

Field guides for mushrooms are excellent and almost completely useless to a beginner. They assume you already know at least what genus you are looking at, which is the bit beginners cannot do. Generic image-recognition apps muddle the problem in a different way: they answer with a confident species name from a single photo, with no way for the user to push back on partial information, ambiguous features, or the obvious safety question — am I about to poison myself?

The hypothesis was that an LLM, properly grounded in mycological data and constrained by safety rules, could conduct the kind of structured conversation a knowledgeable forager has with a beginner: "What colour are the gills? Free or attached? What does it smell like? Where was it growing?" - narrowing the candidate set, refusing to commit when the features don't fit, and flagging confusion species explicitly. We also aimed to build this against models cheaper than the frontier ones.

Five weeks later, that is what was shipped. Getting there involved completely rewriting the system three times as we worked through different approaches.

Architecture v1 - the rule engine approach (built in 48 hours)

The first instinct was the obvious one: build an expert system. Encode mycological diagnostic features as rules. Match user-described features against the rule base. Score matches and rank candidates.
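A minimal sketch of that match-score-rank loop, assuming a weighted-sum scoring scheme. The `Rule` shape, the feature names, and the four example rules are hypothetical; the app's actual rule base is not reproduced in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    species: str
    feature: str   # controlled-vocabulary feature name, e.g. "gill_attachment"
    value: str     # value the species is expected to show
    weight: float  # how diagnostic the feature is (assumed scheme)


# Hypothetical rules for illustration only.
RULES = [
    Rule("Lactarius torminosus", "gill_attachment", "decurrent", 1.0),
    Rule("Lactarius torminosus", "latex", "white", 2.0),
    Rule("Agaricus campestris", "gill_attachment", "free", 1.0),
    Rule("Agaricus campestris", "latex", "none", 2.0),
]


def rank_candidates(observed: dict, rules: list) -> list:
    """Sum the weights of matching rules per species, best score first."""
    scores = {}
    for rule in rules:
        scores.setdefault(rule.species, 0.0)
        if observed.get(rule.feature) == rule.value:
            scores[rule.species] += rule.weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Exact-match lookups like `observed.get(rule.feature) == rule.value` only work when the input already uses the engine's vocabulary. A description keyed on "colour: pink" or "milky stuff" scores zero against every rule here.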

By the end of day two, the rule engine was complete. 507 passing tests across 36 test files. 1,677 feature rules covering 20 genera.

This is what the speed of AI-assisted development looks like when the problem is well-defined and the requirements are stable. A traditional solo build would have spent two weeks on what took two days here, and that compression was real engineering - fully tested, properly structured, not a prototype.

It also turned out to be the wrong solution.

The pivot - what the Woolly Milkcap revealed

The first realistic test was a Woolly Milkcap, Lactarius torminosus. A user describing it would not say "convex cap with depressed centre, decurrent gills, white milk that turns yellow on exposure" - the language a rule engine needs. They would say "a pink mushroom growing under birch trees with milky stuff coming out when I broke it."

The rule engine could not handle natural language. Translating "pink" into the controlled vocabulary it expected was a problem the engine was fundamentally not built to solve. Adding a translation layer would mean rebuilding comprehension in deterministic code, and the moment a user reaches for a word outside the hardcoded vocabulary, everything breaks.

Architecture v2 - LLMs and the Death Cap problem

On day three, the entire system was rewritten LLM-first. A 35,000-line commit migrated 40,000 words of mycological knowledge from rules into a structured dataset the LLM could reason against. The rule engine became reference material that the LLM used.

The LLM itself is the identification system. The JSON data provides grounding and prevents hallucination, but it is the LLM's ability to interpret natural language that gives the system its power. Now a user can even invent new words for "pink", and as long as the LLM can plausibly infer what they mean, everything will work. Some examples that pass are "a mushroom the colour you get when red admits defeat", "a fungus coloured like Dawn's first apology" and "one the shade between shame and joy" - and yes, I asked the LLM which phrases it would understand to mean pink.

This is the kind of pivot that, in a traditional team, takes a month at least. With one person and an AI co-developer, it took a day.

The LLM-first system worked beautifully on most species. Then it identified a Death Cap as a Field Mushroom.

The Death Cap has white gills; the Field Mushroom's gills are pink, darkening to brown. The Death Cap has an olive-green cap; the Field Mushroom's is white or brown. Both features in the user's description directly contradicted the Field Mushroom identification. Yet the LLM, given the full 131,000-token dataset of 268 species in a single context window, picked Field Mushroom anyway.

Now we have a new problem that is purely practical, not architectural.

This is the lost-in-the-middle problem in long-context LLMs: information in the middle of a long prompt tends to receive less attention than information at either end. With a few hundred species in the prompt, the model misses things - and you do not know which things, because there is no intermediate artefact to inspect.

For a casual app this is annoying. For mushrooms this is potentially fatal.

Architecture v3 - the two-stage pipeline

The fix was structural, not a prompt tweak. The pipeline was split in two:

Stage 1 (broad recall). The LLM is asked, with no data constraints, what species this could plausibly be. It returns a shortlist using its training corpus of mycological literature. This stage plays to the model's strength at open-ended generative recall.

Stage 2 (focused comparison). Only the data for the shortlisted species is loaded into context — typically a few thousand tokens, not 131,000. The LLM is asked to apply rigorous diagnostic methodology against that small, focused dataset, and to refuse to commit when the features don't fit.
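The two stages can be sketched as follows. This is an illustration under stated assumptions, not the app's code: `LLM` is a hypothetical prompt-in, text-out callable standing in for whichever model is used, and the `DATASET` records, field names, and prompt wording are invented for the example.

```python
from typing import Callable

# Hypothetical stand-in for the per-species dataset; fields are assumptions.
DATASET = {
    "Amanita phalloides": {"common_name": "Death Cap", "cap": "olive-green"},
    "Agaricus campestris": {"common_name": "Field Mushroom", "cap": "white or brown"},
    "Lactarius torminosus": {"common_name": "Woolly Milkcap", "cap": "pink"},
}

LLM = Callable[[str], str]  # prompt in, text out; any provider could sit behind it


def stage1_shortlist(description: str, llm: LLM) -> list:
    """Broad recall: no dataset in the prompt, just ask for plausible species."""
    reply = llm(f"List plausible species, one scientific name per line, for: {description}")
    names = [line.strip() for line in reply.splitlines() if line.strip()]
    # Keep only species the dataset can ground in stage 2.
    return [name for name in names if name in DATASET]


def stage2_prompt(description: str, shortlist: list) -> str:
    """Focused comparison: load records for the shortlisted species only."""
    records = "\n".join(f"{name}: {DATASET[name]}" for name in shortlist)
    return (
        "Compare the description against these records only. "
        "If the features do not fit any record, say so instead of guessing.\n"
        f"Description: {description}\nRecords:\n{records}"
    )


def identify(description: str, llm: LLM) -> str:
    """Run both stages; the shortlist is an inspectable intermediate artefact."""
    shortlist = stage1_shortlist(description, llm)
    return llm(stage2_prompt(description, shortlist))
```

One side effect of this shape is that the shortlist can be logged and inspected between the two calls, which the single-prompt design did not allow.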

Token usage dropped 13x. Accuracy on the safety-critical test set went up. The Death Cap was no longer mistaken for a Field Mushroom.

The same model, the same domain knowledge, the same prompts. What changed was the architecture - specifically, the recognition that broad recall and rigorous comparison are different cognitive tasks and benefit from being done separately.

This is the kind of insight that comes from running the system, watching it fail, and being willing to rebuild the pipeline rather than tune around the failure. The cost of that rebuild, with AI assistance, was a day of work.

Compensating for what LLMs can't do

In the interest of openness, we should also admit to some further engineering: extra complexity we had to add to improve accuracy and safety.

Genus signal detection. Users often hint at the genus before they know they are doing it ("there was milk when I broke it" → Lactarius; "it was on a birch log" → narrows the genus dramatically). 53 regex patterns were built to detect these hints in user input and pass them as structured signals into Stage 1, sharpening the broad-recall step.
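A minimal sketch of this kind of signal detection. The two patterns and the `kind`/`value` signal shape are hypothetical; the app's 53 real patterns are not listed in this article.

```python
import re

# Hypothetical patterns for illustration: (regex, signal kind, signal value).
SIGNAL_PATTERNS = [
    (re.compile(r"\b(milk|milky|latex)\b", re.IGNORECASE), "genus_hint", "Lactarius"),
    (re.compile(r"\bbirch\b", re.IGNORECASE), "habitat_hint", "birch"),
]


def detect_signals(text: str) -> list:
    """Return structured hints found in free text, in pattern order."""
    signals = []
    for pattern, kind, value in SIGNAL_PATTERNS:
        if pattern.search(text):
            signals.append({"kind": kind, "value": value})
    return signals
```

Run on the Woolly Milkcap description from earlier ("a pink mushroom growing under birch trees with milky stuff coming out when I broke it"), this sketch yields one genus hint and one habitat hint, which could then be appended to the Stage 1 prompt as structured context.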

Synonym resolution. The dataset uses current scientific names. Users use folk names, old scientific names, and mistakes. A synonym resolver maps these to the canonical species before identification.
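One way such a resolver could work is an exact lookup on a normalised key, with a conservative fuzzy fallback for misspellings. The table entries, the `resolve` function, and the 0.85 cutoff below are assumptions made for the sketch, not the app's implementation.

```python
import difflib
import re

# Hypothetical synonym table: folk names plus one older scientific name.
SYNONYMS = {
    "death cap": "Amanita phalloides",
    "field mushroom": "Agaricus campestris",
    "psalliota campestris": "Agaricus campestris",
    "woolly milkcap": "Lactarius torminosus",
}
# Canonical names resolve to themselves.
SYNONYMS.update({name.lower(): name for name in set(SYNONYMS.values())})


def normalise(name: str) -> str:
    """Lower-case, turn hyphens into spaces, and collapse whitespace."""
    return re.sub(r"\s+", " ", name.lower().replace("-", " ")).strip()


def resolve(name: str, cutoff: float = 0.85):
    """Return the canonical species name, or None if nothing is close enough."""
    key = normalise(name)
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fuzzy fallback for misspellings; the cutoff is an assumed, strict value.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None
```

In a safety-critical setting a fuzzy match is probably best treated as a suggestion to confirm with the user rather than a silent correction.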

Safety annotations. Every species in the dataset carries explicit safety metadata: edibility level, poisonous status, confusion species, and required cooking treatment. When the LLM proposes a species, the safety annotations are surfaced unconditionally - never buried in the answer, never optional.
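"Surfaced unconditionally" can be read as: the safety block is assembled in code from the dataset and placed ahead of the model's text, so it does not depend on what the model chose to say. A sketch under that reading follows; the `SafetyAnnotation` fields mirror the four kinds of metadata listed above, but the names, values, and output wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyAnnotation:
    edibility: str                 # e.g. "edible", "deadly" (assumed labels)
    poisonous: bool
    confusion_species: tuple = ()
    cooking: str = ""              # required treatment, empty if none


# Illustrative entries only.
SAFETY = {
    "Amanita phalloides": SafetyAnnotation("deadly", True, ("Agaricus campestris",)),
    "Agaricus campestris": SafetyAnnotation("edible", False, ("Amanita phalloides",)),
}


def with_safety(species: str, llm_answer: str) -> str:
    """Put the safety block ahead of the model's answer, whatever the model said."""
    note = SAFETY.get(species)
    if note is None:
        # Fail closed: a species without safety data is treated as unsafe.
        return f"SAFETY: no safety data for {species}. Do not eat it.\n\n{llm_answer}"
    lines = [f"SAFETY: {species} - edibility: {note.edibility}"]
    if note.poisonous:
        lines.append("POISONOUS. Do not eat.")
    if note.confusion_species:
        lines.append("Can be confused with: " + ", ".join(note.confusion_species))
    if note.cooking:
        lines.append("Required cooking: " + note.cooking)
    return "\n".join(lines) + "\n\n" + llm_answer
```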

Test-driven safety. Ten safety-critical scenarios - including the Death Cap test - run on every change. A regression in one of these is a release blocker. This is the discipline that keeps the system honest as it evolves.
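A release gate of this kind might look like the sketch below. Only the Death Cap scenario comes from this article; the scenario format, the `must_not_be` field, and the `identify` callable are assumptions for illustration.

```python
from typing import Callable

# One scenario taken from the article; the record shape is hypothetical.
SAFETY_SCENARIOS = [
    {
        "name": "death_cap_not_field_mushroom",
        "description": "olive-green cap with white gills",
        "must_not_be": "Agaricus campestris",
    },
]


def failing_scenarios(identify: Callable[[str], str]) -> list:
    """Return the names of scenarios whose forbidden species was returned."""
    failures = []
    for scenario in SAFETY_SCENARIOS:
        if identify(scenario["description"]) == scenario["must_not_be"]:
            failures.append(scenario["name"])
    return failures


def release_allowed(identify: Callable[[str], str]) -> bool:
    """Any failing safety scenario blocks the release."""
    return not failing_scenarios(identify)
```

Wired into a test runner or CI step, `release_allowed` returning `False` would stop the build before deployment.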

Mobile, infrastructure, and release

With the identification engine solid, the remaining work was straightforward.

The web app was deployed to Vercel, with LLM calls routed through Vercel Functions under token-budget controls. Several LLMs were tested for cost and performance. The mobile build was done with Capacitor, wrapping the existing web app in a native Android shell with the right safe-area handling, touch targets, and offline behaviour. The marketing site at fungusforay.com was a separate small build covering brand, product story, and download links.

Each of those would traditionally be a week or two of solo work. With AI assistance, each was a day or two.

What this case is actually about

Fungus Foray is a real product solving a real problem. It is also a case study in what one person plus an AI co-developer can now build end-to-end, and in how skilled programmers can mitigate some of the limitations that LLMs still have.

The breakdown of the work, by traditional discipline:

  • Research and dataset construction - typically a small team of domain experts and data engineers
  • Architecture and AI engineering - typically an ML team
  • Product development (web) - typically two or three engineers
  • Mobile development - typically a separate mobile engineer
  • Infrastructure and deployment - typically a DevOps engineer
  • Brand and marketing site - typically a designer plus a frontend developer
  • QA and safety testing - typically a QA function

One person did all of this in five weeks. Not because they are extraordinary. Because the right combination of expert judgement and AI assistance now collapses what used to be a small team into a single role - provided that one person has the taste to know what good looks like, the judgement to know when to throw work away and start again, and the discipline to test what matters.

That combination — expert + AI — is the new unit of company creation in software. Fungus Foray is one example of it. The pattern generalises.

Use cases this approach is suited to

This is not a brag about velocity. It is an observation about a specific class of product: focused, opinionated, single-purpose tools where the core value is in the quality of judgement encoded in the system, not in feature breadth. For those, the expert-plus-AI model is currently producing better products faster than any traditional team structure.

If you are an operator or investor thinking about where this changes the economics, the implications are large and not yet priced in. The right conversation is not "can we use AI to ship faster". It is "what kinds of company are now buildable that previously required teams of fifteen, and what does that do to our capital allocation?"

That is the conversation Magilium increasingly has with clients on the back of cases like this one.