Organising 50GB of files with vector embeddings — and an LLM safety net
Xoredge Engineering · Jun 1, 2026 · 5 min read

A leading European research university came to us with a familiar problem at an unfamiliar scale: roughly 50GB of files — papers, datasets, scanned forms, slide decks, images and years of project folders — accumulated across teams, with no consistent structure. Search was hopeless, duplicates were everywhere, and nobody could confidently say where a given document lived. We can't name the institution, but the engineering is worth sharing.
Folders don't scale; meaning does
The instinct is always to fix this with a tidier folder tree. But a hierarchy that makes sense to one team is a maze to another, and 50GB is far too much to sort by hand. The real problem isn't where files sit — it's that the system has no understanding of what they contain. So instead of moving files around, we taught the system to read them.
Vector embeddings: organising by meaning, not by filename
For every file we extract its text and generate a vector embedding — a numerical representation of its meaning. Files about similar topics end up close together in vector space regardless of what they were named or which folder they were dropped in. That single shift unlocks everything that follows:
- Semantic search — find documents by what they're about, not by guessing the exact filename.
- Automatic clustering — related material groups itself, surfacing structure no one had to design.
- Near-duplicate detection — almost-identical files sit on top of each other in vector space and can be merged or flagged.
Embeddings are stored in a vector database alongside each file's metadata, so search and classification run in milliseconds even across the whole archive.
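As a rough illustration of how semantic search and near-duplicate detection fall out of the same vectors, here is a minimal sketch in plain Python. The three-dimensional vectors, the file names, the `search` and `near_duplicates` helpers and the 0.98 threshold are all made up for the example; real embeddings have hundreds of dimensions, and a vector database would typically use an index rather than this brute-force scan.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: list[float], index: dict[str, list[float]], top_k: int = 3):
    """Rank every indexed file by similarity to the query vector."""
    scored = [(path, cosine(query, vec)) for path, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def near_duplicates(index: dict[str, list[float]], threshold: float = 0.98):
    """Return pairs of files whose vectors are almost identical."""
    return [
        (a, b)
        for (a, vec_a), (b, vec_b) in combinations(index.items(), 2)
        if cosine(vec_a, vec_b) >= threshold
    ]


# Toy vectors standing in for real embeddings (illustrative only).
index = {
    "thesis_final.pdf": [0.90, 0.10, 0.00],
    "thesis_final_v2.pdf": [0.89, 0.11, 0.01],
    "lecture_03_slides.pptx": [0.10, 0.90, 0.10],
    "budget_2021.xlsx": [0.00, 0.10, 0.95],
}

top_hits = search([1.0, 0.0, 0.0], index, top_k=2)
duplicate_pairs = near_duplicates(index)
```

Here the two thesis files rank highest for the query and are also reported as a near-duplicate pair, even though nothing in the code looks at their names.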
Four categories every file lands in
On top of the embeddings we apply a small, deliberate taxonomy. Every file is classified into exactly one of four top-level categories — few enough to stay obvious, broad enough to cover everything:
- Research — papers, datasets, results, analysis and anything tied to a study or publication.
- Teaching — lecture notes, slides, exercises and course material.
- Administrative — proposals, contracts, forms, reports and records.
- Media & assets — images, scans, diagrams, recordings and raw assets.
The category is suggested automatically from the embedding and confirmed by the classifier, so files are sorted consistently no matter who uploaded them or what they happened to call them.
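One plausible way to suggest a category from an embedding is a nearest-centroid lookup: average a few labelled example vectors per category, then pick the category whose centroid is closest. The sketch below assumes that approach; it is not necessarily how the production classifier works. The example vectors, `centroid` and `suggest_category` are hypothetical, and the margin over the runner-up is used here as a crude confidence signal.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def suggest_category(vec: list[float], centroids: dict[str, list[float]]):
    """Return (best category, margin of its score over the runner-up)."""
    ranked = sorted(
        ((cosine(vec, c), name) for name, c in centroids.items()),
        reverse=True,
    )
    (best_score, best_name), (second_score, _) = ranked[0], ranked[1]
    return best_name, best_score - second_score


# Toy labelled examples per category (illustrative only).
examples = {
    "Research": [[0.9, 0.1, 0.0, 0.0], [0.8, 0.0, 0.1, 0.1]],
    "Teaching": [[0.1, 0.9, 0.0, 0.0], [0.0, 0.8, 0.2, 0.0]],
    "Administrative": [[0.0, 0.1, 0.9, 0.0], [0.1, 0.0, 0.8, 0.1]],
    "Media & assets": [[0.0, 0.0, 0.1, 0.9], [0.1, 0.1, 0.0, 0.8]],
}
centroids = {name: centroid(vectors) for name, vectors in examples.items()}

clear_name, clear_margin = suggest_category([0.85, 0.05, 0.05, 0.05], centroids)
_, ambiguous_margin = suggest_category([0.5, 0.5, 0.0, 0.0], centroids)
```

A large margin means the suggestion is clear-cut; a tiny one, as with the second vector, is the kind of case worth passing to a stronger classifier or a human.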
A lifecycle folder for every project
Research work is organised around projects, and projects have a life. So every project gets its own lifecycle folder that follows the work from start to finish — rather than scattering its files across the four categories with no sense of state:
- Proposal — early drafts, scoping and funding documents.
- Active — the live working set the team touches day to day.
- Published — finalised outputs and deliverables.
- Archived — closed-out work, retained for the record and out of the way.
A file therefore has two coordinates: what it is (one of the four categories) and where its project is in its lifecycle. That combination is what makes a huge archive navigable — you can ask for "the published research outputs of this project" and get exactly that.
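The two-coordinate idea maps naturally onto a small record type. The sketch below is an assumption about how such a record might look, not the actual schema: `FileRecord`, `find`, the project name and the paths are invented for illustration, and the lifecycle stage is stored on each file for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Category(Enum):
    RESEARCH = "Research"
    TEACHING = "Teaching"
    ADMINISTRATIVE = "Administrative"
    MEDIA = "Media & assets"


class Stage(Enum):
    PROPOSAL = "Proposal"
    ACTIVE = "Active"
    PUBLISHED = "Published"
    ARCHIVED = "Archived"


@dataclass(frozen=True)
class FileRecord:
    path: str
    project: str
    category: Category  # what the file is
    stage: Stage        # where its project is in the lifecycle


def find(
    records: Iterable[FileRecord],
    *,
    project: Optional[str] = None,
    category: Optional[Category] = None,
    stage: Optional[Stage] = None,
) -> list[FileRecord]:
    """Filter records on any combination of the coordinates."""
    return [
        r
        for r in records
        if (project is None or r.project == project)
        and (category is None or r.category == category)
        and (stage is None or r.stage == stage)
    ]


# Hypothetical records for a made-up project called "atlas".
records = [
    FileRecord("atlas/paper.pdf", "atlas", Category.RESEARCH, Stage.PUBLISHED),
    FileRecord("atlas/draft.docx", "atlas", Category.RESEARCH, Stage.ACTIVE),
    FileRecord("atlas/grant.pdf", "atlas", Category.ADMINISTRATIVE, Stage.PROPOSAL),
    FileRecord("orbit/paper.pdf", "orbit", Category.RESEARCH, Stage.PUBLISHED),
]

published_research = find(
    records, project="atlas", category=Category.RESEARCH, stage=Stage.PUBLISHED
)
```

The final query is the "published research outputs of this project" request expressed as a filter on both coordinates.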
When embeddings fail, the LLM is the safety net
At 50GB across years of formats, a meaningful share of files won't embed cleanly: scanned PDFs with no text layer, corrupt or password-locked documents, obscure formats, near-empty files, or images where the meaning is visual rather than textual. A pipeline that silently drops these is worse than useless — those are often the files people most need to find later.
So embeddings are the fast path, and an LLM is the fallback. When text extraction or embedding fails, or the resulting classification is low-confidence, the file is routed to a language model that reads what it can (running OCR on a scan, summarising a stubborn document, or describing an image) and produces a classification and a text description we can embed instead. The flow is simple and explicit:
- Try to extract text and generate an embedding — the cheap, fast path that handles the vast majority of files.
- If extraction fails or confidence is low, hand the file to the LLM fallback (OCR, summarisation or vision) to recover its meaning.
- Embed and classify the recovered description, and flag the file for a quick human review when even the LLM is unsure.
Embeddings handle roughly 90% of files cheaply and quickly. The LLM rescues the awkward 10% or so that a naive pipeline would have lost, and anything it is still unsure about goes to a human reviewer, so nothing falls through the cracks.
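The routing described in this section can be sketched as a single function. Everything named here is hypothetical: `extract_text`, `classify` (standing in for "embed and classify") and `llm_describe` are injected callables, the `demo_*` functions are stubs so the example runs, and the 0.7 threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Outcome:
    path: str
    route: str          # "embedding" or "llm_fallback"
    category: str
    needs_review: bool


def process(
    path: str,
    extract_text: Callable[[str], Optional[str]],
    classify: Callable[[str], Tuple[str, float]],      # text -> (category, confidence)
    llm_describe: Callable[[str], Tuple[str, float]],  # path -> (description, confidence)
    min_confidence: float = 0.7,
) -> Outcome:
    # 1. Fast path: extract text, then embed and classify it.
    try:
        text = extract_text(path)
    except Exception:
        text = None  # extraction failure is routed to the fallback, never dropped
    if text and text.strip():
        category, confidence = classify(text)
        if confidence >= min_confidence:
            return Outcome(path, "embedding", category, needs_review=False)

    # 2. Fallback: ask the LLM (OCR, summary or vision) for a description.
    description, llm_confidence = llm_describe(path)

    # 3. Classify the recovered description; flag for review if either step is unsure.
    category, confidence = classify(description)
    unsure = min(llm_confidence, confidence) < min_confidence
    return Outcome(path, "llm_fallback", category, needs_review=unsure)


# Stub components so the sketch runs without any external service.
def demo_extract(path: str) -> Optional[str]:
    if path.endswith(".scan.pdf"):
        raise ValueError("no text layer")
    return "Results of the gene expression study"


def demo_classify(text: str) -> Tuple[str, float]:
    return ("Research", 0.9) if "study" in text.lower() else ("Administrative", 0.4)


def demo_llm(path: str) -> Tuple[str, float]:
    return ("Scanned consent form for a study", 0.8)


fast = process("paper.docx", demo_extract, demo_classify, demo_llm)
rescued = process("form.scan.pdf", demo_extract, demo_classify, demo_llm)
```

Passing the components in as callables keeps the routing logic testable on its own, independent of whichever extractor, embedding model or LLM sits behind them.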
Private by default, on their own infrastructure
A university's archive is sensitive, so none of it leaves their network. The whole pipeline runs self-hosted, and the LLM fallback is served through our Xoredge AI Platform pointed at local models — the same gateway that gives us automatic failover, cost-aware routing and full observability. They get modern AI organisation without their documents ever touching a third-party API.
The engineering that keeps it trustworthy
- Idempotent and resumable — the pipeline can re-run over 50GB without re-processing what hasn't changed.
- Human-in-the-loop on low confidence — uncertain classifications are queued for a one-click human check, not guessed silently.
- Deduplication built in — near-duplicates are detected and consolidated instead of multiplying.
- Observable end to end — throughput, failures and fallback rates are tracked so we know exactly how the archive is being processed.
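One common way to get idempotent, resumable runs is to key each file by a content hash and skip anything whose hash is already recorded. The sketch below assumes that approach and is not a description of the actual pipeline; `run_once`, the JSON manifest and its location are hypothetical choices.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_once(root: Path, manifest_path: Path, process: Callable[[Path], None]) -> list[str]:
    """Process only new or changed files under root; return the keys handled.

    The manifest should live outside root so it is not picked up as input.
    """
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    handled = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        key = path.relative_to(root).as_posix()
        digest = file_digest(path)
        if manifest.get(key) == digest:
            continue  # unchanged since the last run, so skip it
        process(path)
        manifest[key] = digest
        # Checkpoint after every file so an interrupted run resumes where it stopped.
        manifest_path.write_text(json.dumps(manifest, indent=2))
        handled.append(key)
    return handled
```

A second call over an unchanged directory returns an empty list, and editing a single file causes only that file to be processed again.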
The takeaway
Organising data at scale isn't about a neater folder tree — it's about giving the system an understanding of what each file means. Vector embeddings provide that understanding cheaply and instantly; four clear categories and a per-project lifecycle folder make the result navigable; and an LLM safety net ensures the files that resist embedding are recovered rather than lost. That combination turned 50GB of chaos into something a whole institution can actually search — privately, on their own infrastructure.