couch.cx

thought leadership

Personal LLM Knowledge Bases: Strategy, Capture, and Why PageStash Belongs in the Stack

Michael Couch · Apr 2026

Andrej Karpathy published an "idea file" for what he calls an LLM wiki—a pattern where you drop sources into a raw/ directory, point a capable model at the pile, and let it incrementally compile a linked Markdown wiki: summaries, concept pages, backlinks, maintenance. Most of the commentary that followed fixated on Obsidian, scripts, and vibe-coded search tools. Fair enough—but the strategic move is simpler: treat personal knowledge as an asset you operate against, not a folder you curate for its own sake.[1]

Why this matters strategically

Bookmarks, read-later queues, and podcast episodes you half-remember are dead knowledge: you paid attention once, then the signal dissolved. An LLM-maintained base inverts the economics. The same corpus can answer "what pricing frameworks showed up across everything I read this year?" by stitching a thread you never tagged—across formats, months, and sources—because the model has already done the slow work of reading, grouping, and linking. That is not vanity organization; it is optionality under uncertainty. When you brief yourself from your own reading instead of re-Googling generic SEO sludge, decisions get faster and more specific to your context.

The compounding loop is the point: new material gets ingested, connected to what is already there, and your questions produce artifacts (memos, slide outlines, diagrams) that you file back in. Memory grows from both sides—what you save and what you ask—so the base improves every time you use it. That is the same infrastructure instinct behind enterprise auto-research and Karpathy's autoresearch pattern: automate the expensive middle, keep humans on goals and review.

A practical framework (five moves)

You do not need to replicate Karpathy's exact stack on day one. You need a repeatable shape you can grow into:

1. Ingest to a canonical raw layer. Articles, papers, repos, datasets, images—everything lands in one place the model can enumerate. Consistency beats cleverness: one root, predictable paths, attachments where the model can see them.
2. Compile with an LLM, not by hand. Task the model to maintain the wiki: index pages, short summaries, concept articles, backlinks, and light taxonomy. Your job is to supply sources and constraints; the model does the filing and cross-linking.
3. Operate with questions, not browsing. Once the wiki is dense enough (Karpathy cites on the order of ~100 articles / hundreds of thousands of words for deep cross-cutting Q&A), you ask synthesis questions across the whole surface: themes across papers, forgotten connections, gaps on a topic. Prefer answers that cite where in your corpus they came from.
4. Reinvest outputs. Turn answers into Markdown pages, decks, or figures—and drop them back into the tree so the next query starts warmer. Exploration should add up.
5. Run periodic hygiene. Occasional "lint" passes—contradictions, stale summaries, missing links, candidate new articles—keep integrity from drifting as the base grows.
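The shape of that loop fits in a short sketch. Everything concrete here is an assumption for illustration: the `kb/raw` and `kb/wiki` paths, and the `llm_complete` callable standing in for whatever model call you actually use.

```python
from pathlib import Path

RAW = Path("kb/raw")    # move 1: canonical raw layer, one root, predictable paths
WIKI = Path("kb/wiki")  # move 2: the LLM-compiled Markdown wiki lives here

def ingest(source_path: str) -> Path:
    """Move 1: copy a source into the raw layer so the model can enumerate it."""
    src = Path(source_path)
    dest = RAW / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(src.read_bytes())
    return dest

def compile_wiki(llm_complete) -> None:
    """Move 2: hand the model the whole pile and let it file and cross-link.

    `llm_complete` is a placeholder: any callable taking a prompt string
    and returning Markdown."""
    sources = sorted(p for p in RAW.rglob("*") if p.is_file())
    prompt = (
        "Maintain a linked Markdown wiki over these sources: index page, "
        "short summaries, concept articles, backlinks.\n"
        + "\n".join(str(p) for p in sources)
    )
    WIKI.mkdir(parents=True, exist_ok=True)
    (WIKI / "index.md").write_text(llm_complete(prompt))

def reinvest(answer_markdown: str, name: str) -> Path:
    """Move 4: file a synthesis answer back into the tree so the next query starts warmer."""
    out = WIKI / "synth" / f"{name}.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(answer_markdown)
    return out
```

Moves 3 and 5 are just more calls through the same `llm_complete` seam: a question over the compiled wiki, and a periodic lint prompt over the same file list.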

Web capture: Obsidian Web Clipper vs PageStash

In the gist workflow, web articles become Markdown via the Obsidian Web Clipper extension, plus discipline around images and attachment paths so multimodal material stays coherent for the model.[1] Obsidian's clipper is excellent if you already live in Obsidian and enjoy wiring vault conventions yourself.[2]

PageStash targets the same outcome for the browser slice—clean capture, structure, and retrieval—without asking you to own the whole Obsidian + extension + hotkey assembly as your ingestion front door. Where the clipper is "save this tab into my vault as Markdown," PageStash is built around capture → structure → retrieve as a pipeline: normalize messy pages, keep metadata and text queryable, and feed downstream agents or a raw/ export you still compile with an LLM. For the strategic goal (a growing, model-addressable corpus), both paths land you at "structured text + context in one place." PageStash is the lever when you want that layer packaged instead of self-wired—especially if you are standardizing capture for a team or an agent that should not depend on a specific note app's extension surface.[3] Obsidian remains a strong reader and graph UI on top; the question is how ruthlessly you want to industrialize what hits raw/.
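To make the capture → structure → retrieve shape concrete, here is a hypothetical normalizer. This is not PageStash's actual API; it is a sketch of the kind of record any capture layer might write into raw/: JSON metadata kept queryable next to the plain text, one file per page.

```python
import json
import re
from dataclasses import dataclass, asdict
from datetime import date

@dataclass
class Capture:
    """One normalized page: metadata that stays queryable alongside the text."""
    url: str
    title: str
    captured: str  # ISO date of capture
    text: str

def normalize(url: str, title: str, page_text: str) -> Capture:
    """Collapse whitespace from a messy extraction; a stand-in for real
    readability-style content extraction."""
    text = re.sub(r"\s+", " ", page_text).strip()
    return Capture(url=url, title=title,
                   captured=date.today().isoformat(), text=text)

def export_record(cap: Capture) -> str:
    """Emit a raw/-ready record: one line of JSON front matter, blank line,
    then the plain text the compiling model reads."""
    meta = {k: v for k, v in asdict(cap).items() if k != "text"}
    return json.dumps(meta) + "\n\n" + cap.text
```

The design point is the separation: metadata is machine-parseable on its own line, so a downstream agent can filter by URL or date without re-reading every body.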

I wrote the build story for PageStash here: PageStash: Capture, Structure, Retrieve. The through-line with this essay is the same thesis I use for agentic infrastructure and personal agents: make knowledge machine-addressable so work compounds.

The product-shaped hole

Karpathy's own closing is the market map: room for "an incredible new product" instead of a hacky collection of scripts.[1] The wedge is not another prettier notes app—it is opinionated ingestion, structure, and agent-ready retrieval that runs quietly while you read. If you are serious about the five-move framework, fix the capture layer first. Everything downstream—compilation, Q&A, reinvestment—gets easier when raw/ is honest.

Related: The Karpathy Autoresearch Pattern, Enterprise Auto-Research, and PageStash.

References

  1. Karpathy, A. (2026). LLM Wiki — A pattern for building personal knowledge bases using LLMs (gist). gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
  2. Obsidian. Web Clipper. obsidian.md/web-clipper
  3. PageStash. Capture, structure, retrieve. pagestash.app

Topics

Knowledge Management · PageStash · Karpathy · LLM · Strategy · Agentic