Pencil-sketch hero image: an NVIDIA RTX-class GPU and a glowing purple gem (visual cue for the Obsidian vault), with file stacks in the background. ILP style.

Sorting a Filesystem Hoard With Local LLMs: What 2,300 Files Told Me About My Obsidian Vault

What do 2,300 random files on a knowledge worker’s laptop actually look like? I let a local LLM tell me.

TL;DR

I built a productized CLI called random-file-sorter from a 22-hour file-cleanup marathon. Ran it across my home directory on a Saturday afternoon. 2,332 files classified in 8 hours, 99.96% completion rate, fully local-LLM, zero file content sent to Anthropic. 1,614 of them (69%) belong in my Obsidian vault and have been quietly leaking out of it for 90 days.

The pipeline made 6,088 LLM classify calls across 17.8 GPU-hours of inference. Only 9% of files came back as DELETE. Routing better is the fix.


My home directory had 11,930 files needing cleanup. 84% of them were vendored noise (node_modules, .venv) that every cleanup tool I’d ever tried counted as real work and lied about scope. The other 2,332 files took a local LLM eight hours to classify, and the most interesting thing it found was that I’d been quietly building a second Obsidian vault in a ~/scratch/research/ directory for 90 days without noticing.


The Problem

Knowledge work creates file exhaust. Notes drafted during research, screenshots, PDFs of papers I might re-read, scripts written for one-shot probes, scraped HTML, exported calendars, photos for the company website, contractor invoices. Each one made sense at the time and none is individually a problem. Together they form a slow-moving disaster that consumes drive space and obscures the things I actually need.

The pipeline existed first as a one-shot 22-hour cleanup marathon. Productizing it (this post’s main subject) was about lifting hardcoded assumptions out and shipping a reusable CLI.

Two specific problems made the existing “I’ll get to it” approach untenable:

1. The vault leak. My Obsidian vault is the canonical place for notes. But ~/obsidian/ competes with ~/archive/, ~/scratch/research/, ~/scratch/reference-photos/, and a dozen other “data” directories. New notes land wherever I happen to be in the terminal, rather than where they should live. Over time the vault becomes a fraction of the actual note corpus.

2. The security floor. Everything I work on is at least slightly sensitive. Whatever tool I used had to read these files locally. Sending them to a cloud LLM, even temporarily, wasn’t an option.

So: local LLM only, recoverable trash, restartable from any stage, idempotent on re-run, zero file content off the laptop, and every decision observable in a human-readable CSV before any file moves.

What the Pipeline Has to Read

Different file types need different extraction techniques. A regex won’t tell you what’s in a PPTX slide or a PNG screenshot, and running a vision model on a plain-text markdown file is wasteful. Each format below demanded its own tool:

TypeExtraction techniqueWhy
.pdfpdftotext (Xpdf utility)Fast, deterministic, no LLM needed for text-layer PDFs
.docxpython-docxCleanly separates body text from author metadata
.pptxpython-pptxPer-slide extraction; one PPTX produces N snippet rows
.xlsxopenpyxlCell-level text plus sheet names
.htmlBeautifulSoup4Strips tags, keeps readable body
.png/.jpg/.heicVision LLM (gemma-3-12B multimodal)Image content can’t be regex’d; needs semantic understanding
.md/.txtDirect file readAlready structured text

Text formats get deterministic extraction: fast, cheap, offline. Images and scanned content go to a vision model because there is no other option.


How the Local LLM File Organizer Works (9 Stages, One SQLite DB)

random-file-sorter is a 27-script Python pipeline (about 6,580 lines of code) I built for my own use. It walks one or more directories, extracts a per-file text snippet locally, sends that snippet plus filesystem metadata to a local LLM for a routing verdict, and writes the verdict to a SQLite database. Verdicts get reviewed before any file is moved. Moves go through a recoverable trash archive.

Stages, each restartable, each idempotent:

StageWhat it does
inventoryWalks the configured scan roots, builds a files table with size, mtime, and sha256 prefix
extractPer-file text-snippet extractors (pdftotext, python-docx, EXIF for images, etc)
classify --stage primaryLLM verdict into one of 5 top-level buckets
entity-scanDeterministic regex sweep for known partner and vendor names
dedup --full-hashsha256 comparison across the inventory, picks canonical copy
enrich --slides --authors --exif --datesCPU-only metadata extraction (powers a planned semantic-search layer)
actions --source pass1Builds actions.csv from the verdicts (dry-run)
apply --executeExecutes the moves. Defaults to dry-run. Trash archive at ~/scratch/random-file-sorter-trash/<runid>/.
embed-screenshotsMatches Zoom screenshots to convo notes via calendar and filename timestamps

The full CLI has 14 subcommands. Each stage writes to its own DB table or sentinel file, so partial runs are normal and re-running is cheap.


The Decisions That Made It Work

Decision 1: Run the LLM Locally (llama.cpp + LM Studio)

Files get extracted to a snippet locally (Python, pdftotext, python-docx) and that snippet plus filesystem metadata goes to the classifier. The snippet itself goes to a local LLM running on hardware I own. Anthropic’s API never sees a single file body.

Requests go to an in-house router on a local VPN, which load-balances two local backends:

The router takes a task-type parameter (classify, extract, code, writing, vision) and chooses the right backend. The pipeline sends model: "local" and lets the router decide. I used qwen3.6-27B as the primary classifier (which I benchmarked previously in the local LLM model review) running at 4-bit quantization on the 3090 Ti.

Decision 1a: Why not just use OpenRouter or a cloud aggregator?

OpenRouter’s Zero Data Retention (ZDR, a per-request flag that instructs OpenRouter to not store prompt data server-side) feature is real and its per-provider logging policies are more transparent than most aggregators.

The architecture is what concerns me. ZDR governs which upstream providers get your traffic. OpenRouter’s own docs acknowledge the training opt-out “has no bearing on OpenRouter’s own policies” and defaults to assuming ambiguous endpoints “both retain and train on data.”

Every intermediary hop is a hop whose breach response, subpoena exposure, and training choices are outside my control. File contents from a home directory with partner names and financial data: local or nothing.

Decision 2: Bash Wrapper Over Package Manager

v0.2.0 had a pyproject.toml. v0.2.1 killed it. The CLI is now a plain bash script at ~/bin/random-file-sorter that execs python3 ~/code/random-file-sorter/cli.py "$@". No virtualenv, no install step, no version pinning, no .dist-info/. Edit the Python files in place and the changes go live.

Part of this is preference: I don’t want Node in my personal tooling stack. Part of it is a lesson from September 2025, when the npm ecosystem had one of its worst weeks. The chalk, debug, and ansi-styles packages (foundational dependencies pulled into much of the JavaScript ecosystem) were hijacked via a phished maintainer account and published with payloads that redirected in-flight cryptocurrency transactions.

Two months later, the Shai-Hulud worm began self-propagating through the registry, compromising 25,000+ repositories and exfiltrating developer credentials.

The Python ecosystem is not immune to supply chain attacks. But eliminating npm from personal projects cuts one large attack surface without any real tradeoff for solo tooling.

The reward: zero setup overhead, no 800MB-of-node_modules problem, no surprise breakage from a transitive dependency upgrade. The trade-off is that the tool isn’t packaged for reuse; it’s wired into my own paths and assumptions. That’s fine for a personal pipeline.

Decision 3: Wrap the Agent in Docker for Portability and Security

The Claude Code agent that wrote most of this code runs in a Docker container on my production homelab server, mounted into a working directory, OAuth-authenticated from a credential file the container can read but the network can’t reach. (I used the same Claude Code + Docker pattern for the anti-detect browser benchmark.)

The agent has shell access. It can read files, modify them, execute scripts, network out. But the container can’t reach my SSH keys, can’t see my browser sessions, can’t write outside its mount points, and can be torn down and rebuilt in seconds. The blast radius of an autonomous agent doing something unexpected is bounded by the container.

Decision 4: Make the Pipeline Restartable With a SQLite Cache

Every stage writes to SQLite, checks “already done?” before starting, and commits one file at a time. A pipeline interrupted mid-run loses at most one file’s worth of progress.

The seductive alternative: batch LLM calls, hold state in memory, write at the end. That gets 3x throughput and zero recoverability. I picked the boring choice and it paid off both times the LLM router went down mid-run.

Decision 5: Trash archive, never rm

Every “delete” goes to ~/scratch/random-file-sorter-trash/<runid>/ with its original path preserved. Restore is random-file-sorter restore <runid>. An irrecoverable wrong move is catastrophic, so I never make one.

This pattern is generalized later as a global rule: never rm, always mv to ~/trash/. It lives in my Claude Code config file as a Core Directive. For users who want a lighter-weight version of this rule without building a full pipeline, aliasing rm to mv ~/trash/ (or using trash-cli from homebrew/apt) captures most of the safety benefit.

Decision 6: Deterministic entity classifier runs BEFORE the LLM

Files whose path or filename matches a known entity (partner names, key vendor names, professional services firms, regulatory filing keywords, etc.) get routed by regex before the LLM ever sees them. About 136 patterns covering partners, vendors, financial and legal advisors, and regulatory filing keywords.

The LLM is good at semantic routing. It’s wasteful for “this filename contains a partner name so it’s client work.” Deterministic-first means the LLM only sees the genuinely ambiguous files.

Decision 7: Two-round audit pattern

After v0.3.2 shipped, I ran /4-code-test (a custom audit skill that runs the same code-review prompt through multiple LLMs from different model families, then surfaces only the findings that hit consensus), which found 3 blockers and 10 mediums. Fixed them, shipped v0.3.3. Ran /4-code-test again, expecting it to pass. It found 3 NEW blockers caused by the fixes. Shipped v0.3.4.

Fix waves introduce regressions. The second round found 3 new blockers caused by the v0.3.3 fixes. Plan for at least two rounds. This matters more under Opus 4.7 than under 4.6: the larger model catches subtle regressions the smaller one passes, so a second round under Opus 4.7 earns its cost.

Decision 8: Make Every Classifier Verdict Inspectable Before Any File Moves

The LLM emits a one-line why: field per file alongside a routing verdict and confidence score. All 2,332 verdicts land in actions.csv for review before apply --execute runs. Nothing moves until the CSV has been inspected.

A black-box classifier is unacceptable for files I might need to find later. If the routing logic is opaque, the only way to catch a mistake is to notice a missing file months down the line. The observable CSV makes the pipeline correctable in addition to fast.


Alternatives I Didn’t Use

Most existing tools are one-shot desktop apps or research demos. None of them are scriptable, restartable, or wirable into a cron.

  • AI File Sorter: Cross-platform desktop GUI, supports local and remote LLMs. Why I didn’t use it: GUI-only, no pipeline integration, no reviewable CSV before moves execute.
  • llama-fs: Self-organizing file system demo using Llama 3; renames and reorganizes files. Why I didn’t use it: research demo, Electron frontend, not designed for restartable runs or audit trails.
  • Local-File-Organizer: Ollama-based desktop tool, Llama3.2 + LLaVA for images. Why I didn’t use it: last release October 2024, no SQLite cache, no dry-run mode.
  • run-llama/file-organizer: CLI utility, organizes by folder but never renames files. Why I didn’t use it: no multi-stage pipeline, no entity pre-filter, no trash archive.
  • Hazel: Mac-only rules-based automation, commercial. Why I didn’t use it: no LLM, can’t handle semantic routing (“is this work or personal?”), requires manual rule authoring for every new file type.
  • Manual + cron + bash: Works fine under ~100 files. Why I didn’t use it: 2,332 files across 30 scan roots with mixed file types is past the point where hand-coded rules stay maintainable.

Local LLM File Organizer: By the Numbers

A single local-LLM file classifier ran 2,332 files across 30 scan roots over 8 hours, fully offline, achieving 99.96% completion at roughly 5 files/min averaged across the wall clock (6-8 files/min during active classify periods when both backends were online), with 64% of verdicts at >=0.9 confidence.

How Long It Took to Sort 2,300 Files

MetricValue
Initial inventory11,930 files
Vendored noise filtered out (node_modules, .venv)10,005 (84%)
Real corpus2,332 files
Classified2,332 / 2,333 (99.96%)
Wall clock~8 hours
Throughput5 files/min average across wall clock; 6-8 files/min during active classify (gap is router downtime)

How Many LLM Calls and GPU-Hours It Took

MetricValue
Successful classify calls in DB6,088
Average call latency10.5 seconds
Min / max call latency1.1s / 66.5s
Total LLM inference time1,067 minutes (~17.8 GPU-hours)
Primary backendubuntu1 (3090 Ti) running qwen3.6-27B Q4_K_M
Secondary backendmac-studio gemma-3-text-12B (4 parallel slots)
Backend latency, typicalubuntu1 17-22s, mac-studio 6-12s

File-type distribution (real corpus, post-noise-filter)

ExtensionCountShare
.md96041.2%
.json40417.3%
.html32113.8%
.py1325.7%
.png793.4%
.jsonl632.7%
.log482.1%
.sh451.9%
.csv431.8%
(no extension)411.8%
.txt391.7%
.patch271.2%

72% of the corpus is .md + .json + .html, overflow of written content, with the vault and the home dir competing for the same file type.

How Confident Was the Local LLM?

Most verdicts land at 0.9 confidence or higher.

ConfidenceCountShare
>= 0.91,49964.2%
0.7 – 0.983235.7%
0.5 – 0.710.04%
< 0.500%

The Pipeline’s Real Asset: The Metadata DB

The routing verdicts are the obvious output. The metadata database is the durable one.

While extracting content for routing, the pipeline accumulates a structured index of everything it touched: slide text, document authors, embedded dates, EXIF coordinates, file hashes. This turns the sorter from a one-time cleanup tool into a queryable corpus.

Once built, it enables retrospective queries: “show me all Q2 2026 files that mention a specific vendor,” “which PPTX decks list me as author,” “find all photos taken within 50km of a given city.” None of that requires re-running the LLM.

TableRowsWhat it indexes
files30,111All inventoried filesystem entries across all runs
slide_snippets55,880Per-page text from PPTX / PDF / DOCX (6x more granular than whole-file snippets)
body_dates39,076Date mentions extracted from snippet text
snippets8,749Whole-file extracted text snippets
doc_author8,067Author, last-modified-by, created-date, title from PPTX/DOCX/XLSX/PDF metadata
metadata6,089LLM classify verdicts
pass25,851Pass-2 sub-routes (28-leaf taxonomy)
image_exif2,397Camera model, GPS coords, datetime, dimensions for images
subroute720Work sub-routing decisions

55,880 slide snippet rows means the pipeline has read every slide in every deck I own, at page granularity. That’s the foundation of a local semantic search layer. The routing was the excuse; the index is the product.


What 2,332 Classified Files Revealed

Only 9% of Hoarded Files Were Worth Deleting

Only 9% of files came back as DELETE. The reflex “I’m being a packrat” framing was wrong. Routing better is the fix. The system was conservatively keeping things with value.

Most “Hoarded” Files Are From the Last 90 Days

51% of files dated March 2026, 33% May 2026. Only 10% older than February 2026. This is recent active work from the last 90 days that accumulated without getting filed properly.


What Broke and What I Learned

LM Studio auto-reloads models on any HTTP probe

Unloaded a model with lms unload. Sent a test curl to verify it was gone. LM Studio happily reloaded the model in response to the curl. Half an hour later, the router started routing classify requests to the wrong model.

# lms ps (reconstructed) -- output after unload + test curl
  Name                    Status     Context   Port
  gemma-3-12b-vision      loaded     8192      1234

The model I thought I’d unloaded was back. To verify “unloaded” you check lms ps, never an HTTP probe.

Claude Code sandbox PID namespace kills detached processes

Standard Unix daemon-detach (nohup setsid CMD & disown) does not survive when launched from inside Claude Code’s sandbox. Each Bash tool invocation gets its own PID namespace and when its PID 1 exits, all descendants die regardless of nohup.

# stderr (reconstructed) -- worker launched from sandbox, session ends
nohup: ignoring input
[1] 48291
# (session closes, PID 48291 and all children killed by PID namespace exit)

To launch a runner that should outlive the session, use dangerouslyDisableSandbox: true so the runner gets reparented to the actual host’s init.

Why ERROR rows silently killed pipeline retries

The pipeline’s per-file “already classified?” check matched on any prior metadata row, including ERROR rows from a previous failed run. This meant 277 files from a router outage couldn’t retry without manually purging them first.

# classify_primary.py output (reconstructed) -- re-run after outage
[INFO] file_id=1842 status=skip-exists (existing row: ERROR)
[INFO] file_id=1843 status=skip-exists (existing row: ERROR)
[INFO] file_id=1844 status=skip-exists (existing row: ERROR)

Patched: ERROR and PARSE_ERROR rows are now eligible for retry automatically. Commit 1428884.

The 1-slot GPU bottleneck (and why more workers made it worse)

The 3090 Ti running qwen3.6-27B with one inference slot serializes everything. Mac Studio with gemma-3-text-12B handles four parallel requests but it’s a smaller model. Combined throughput plateaued at 6-8 files/min with two workers. Four workers caused overload.

# nvidia-smi (representative -- during 4-worker overload)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08    Driver Version: 535.161.08   CUDA Version: 12.2   |
|-------------------------------+----------------------+----------------------+
| GPU  0    RTX 3090 Ti    On   | 00000000:01:00.0  On |                  N/A |
| 52%   82C    P2    387W / 450W|  23792MiB / 24564MiB |    100%      Default |
+-----------------------------------------------------------------------------+

GPU at 100%, inference queue backed up, workers timing out and recording 502s. Scaling this kind of job requires more inference slots on the big GPU; adding parallel client workers just overloads the same bottleneck.


Frequently Asked Questions

How do I clean up an Obsidian vault with a local LLM?

Walk every directory that competes with the vault (research dirs, archive folders, ephemeral scratch trees) through a local-LLM classifier that emits an OBSIDIAN / KEEP / DELETE verdict per file. Review the verdicts as a CSV before any moves. The pipeline this post describes did this for 2,332 files in 8 hours, with 69% routing to the vault.

How long does it take to sort 2,300 files locally?

The production run classified 2,332 files in approximately 8 hours of wall clock, averaging 5 files/minute end-to-end (6-8 files/minute during active classify periods, with the gap explained by two short router outages). LLM inference latency (avg 10.5s per call) dominates wall-clock time; I/O is negligible. Faster runs require more GPU inference slots.

Does any file content leave the machine?

No. File content stays entirely local. The local LLM runs on hardware I own, and Anthropic’s API is never invoked for file contents at runtime.

What runs locally vs. what goes to the cloud?

File extraction (pdftotext, python-docx), the SQLite cache, the trash archive, the LLM inference, and the routing decisions all run locally. Cloud LLMs play no role at runtime. Cloud was used only by Claude Code (the agent that wrote the pipeline code itself), running inside a Docker sandbox. Once the pipeline was built, it runs fully local.

When does this need an LLM vs. a regex vs. a CLI tool?

Use regex or a CLI extractor for structured metadata (timestamps, hashes, EXIF). Use a curated entity list for known names. Reserve the LLM for semantic judgments: “Is this work or personal?” and “Should it go in the vault?” Vision-only files (screenshots, photos) need a vision model; there is no other option.


Closing

The boring decisions are what made this work: trash archive instead of rm, SQLite for state, per-file commits, dry-run by default before any file moves, a Docker sandbox for the agent that wrote the code, and bash wrappers instead of dependency management.

Each choice individually is a coin-flip between “sensible” and “probably fine either way.” But they compound. The pipeline ran for 8 hours across two router outages and a model reload incident and still classified 2,332 out of 2,333 files. The boring decisions compounding is what produced that number.

A 22-hour cleanup marathon became an 8-hour re-run. The next will be shorter. By the third one it should be a weekly cron that takes 20 minutes.

Tools like this make accumulation observable. Once you can see it, you can decide whether to care.


Tools and Models Used

  • llama.cpp (github) – C/C++ inference engine for running quantized LLMs locally across Apple Silicon, NVIDIA GPUs, and CPU. Pricing: free, MIT license.
  • LM Studio (lmstudio.ai) – Desktop app for downloading and serving local LLMs with an OpenAI-compatible API endpoint. Pricing: free for personal and commercial use.
  • Claude Code (claude.com/download) – Anthropic’s official CLI for agentic coding tasks, integrated into the terminal rather than a browser. Pricing: included with Claude Pro/Max/Team/Enterprise plans.
  • Docker (docker.com) – Container platform that packages applications and dependencies into portable, reproducible images. Pricing: free personal tier; paid plans from $9/user/month.
  • Colima (github) – Minimal-setup container runtime for macOS that runs Docker or containerd without Docker Desktop. Pricing: free, MIT license.
  • SQLite (sqlite.org) – Embedded, serverless SQL database engine used here as the pipeline’s state store for file verdicts. Pricing: free, public domain.
  • Qwen 3 (huggingface.co/Qwen) – Alibaba’s open-weight model family with switchable thinking/non-thinking modes; the 27B variant handled primary classification. Pricing: free, Apache 2.0 license.
  • Gemma 3 (deepmind.google/models/gemma) – Google DeepMind’s open-weight model series; the 12B text and vision variants handled pass-2 classification and screenshot enrichment. Pricing: free, Google Gemma Terms of Use (permissive commercial use).

Open Questions Worth Your Input

Real questions for readers who run similar setups:

1. How do you keep a vault from leaking content into “data” directories? Convention? Tooling? Inotify watcher? 2. Do you run LLM-heavy automation on the same machine as your work, or in a separate box? 3. How do you handle the “vendored noise” problem (node_modules, .venv) in classifier pipelines? Skip-list? Sidecar .gitignore-style file? Inventory-time filter?

If you’ve solved any of these, I want to know.

How a CEO uses Claude Code and Hermes to do the knowledge work

A blank or generic config file means every session re-explains your workflow. These are the files I run daily as CEO of a cybersecurity company managing autonomous agents, cron jobs, and publishing pipelines.

  • CLAUDE.md template with session lifecycle, subagent strategy, and cost controls
  • 8 slash commands from my actual workflow (flush, project, morning, eod, and more)
  • Token cost calculator: find out what each session is actually costing you

One email when the pack ships. Occasional posts after that. Unsubscribe anytime.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *