Personal Knowledge System — Architecture & Engineering Spec (v0.1)
Owner: You
Goal: A fast, reliable, single‑user (optionally team‑scalable) system to capture → organize → index → retrieve → act on personal knowledge (URLs, notes, docs, code, emails, recordings) with minimal friction and strong recall.
Seed Links to Capture First:
1) Product Objectives
Zero‑friction capture from anywhere (browser share, quick hotkey, email forward, GitHub link, files).
Trustworthy recall within ~200ms for keyword search and <1.5s for semantic (vector+rerank) over 100k–3M chunks.
Useful organization without busywork: auto‑enrichment (title, summary, entities), light tags/sources, smart notebooks, saved queries.
Actionable results: answer‑oriented views (snippets + source), “Send to…” (task manager, doc, email), and recap/brief generation.
Private by default: local or self‑host, end‑to‑end encryption at rest/backup.
2) System Overview
Ingestion → Normalization → Indexing → Storage → Retrieval → UX → Automation & Analytics
Ingestion: Browser extension, Quick‑capture app/CLI, Email ingest, File drop, Connectors (GitHub/Notion/Drive optional), Recording transcripts.
Normalization: Boilerplate removal, HTML → Markdown, PDF OCR, language detect, dedupe, chunking.
Enrichment: Title, summary, keywords, entities, dates, URL canonicalization, link graph, embeddings.
Indexing & Storage: Qdrant (vectors+payload), Postgres (metadata/relations), S3/Local FS (raw), Redis (hot cache), Meilisearch/BM25 (optional hybrid).
Retrieval: Hybrid (keyword + vector), time‑decay & diversity, re‑ranking, source‑aware filters; Q&A via small/medium LLM.
UX: Web app + extension + global hotkey palette; notebooks; saved searches; daily brief; export.
Automation: Scheduled re‑crawl, broken‑link repair, freshness checks, quality metrics (recall@k, MRR), backups.
3) Core Requirements
Functional
Capture: URL, selection quote, file, image/PDF, audio; add note/tag; offline queue.
Organize: auto‑entities, tags, notebooks, people, projects, sources; merge/alias entities.
Search: instant keyword, semantic, boolean, filters (time, type, source, people, project).
Preview: side‑by‑side reader with highlights and matched spans.
Q&A: retrieve‑then‑read with source‑pinned answers; 1‑click export.
Share: link or bundle export (markdown/zip); redaction mode.
Non‑Functional
Latency: p50 <200ms keyword, <1.5s hybrid+rerank at 200k chunks.
Scale: 2M chunks / 200GB raw.
Reliability: 99.9% monthly, crash‑safe queues, idempotent ingestion.
Privacy: local‑first, key‑managed encryption; audit log.
Portability: all data exportable (JSONL + Markdown + parquet).
4) Architecture (Logical)
[Browser Ext/CLI/Email/Files] ─▶ [Ingest Queue (Kafka/Redpanda/NATS)]
└▶ [Ingest Workers]
├─ Fetchers (HTML/PDF/Drive/GitHub)
├─ Extractors (boilerplate, OCR)
├─ Normalizers (md, language)
├─ Chunker (semantic + structural)
├─ Enrichment (summary, entities)
└─ Embeddings
[Postgres] ◀── metadata/links ── [Index Orchestrator] ── vectors+payload ──▶ [Qdrant]
    ▲
    │
[Object Store (S3/local)] — raw/originals
[API Gateway (FastAPI)] ─▶ [Search Service] ─▶ [Hybrid (BM25+Qdrant)] ─▶ [Reranker]
└▶ [Answerer (RAG)] ─▶ [LLM]
[Web App + Extension + CLI]
5) Data Model
5.1 Entities
Item: top‑level captured thing.
Chunk: retrieval unit (e.g., ~600–1200 tokens).
Entity: person/org/project/topic; deduped with aliases.
Notebook: saved set (rule‑based or manual).
Annotation: highlight, comment, task.
5.2 Postgres (DDL sketch)
-- Items
CREATE TABLE item (
  id UUID PRIMARY KEY,
  -- NOTE: statement head reconstructed; type/title/source_url/site inferred from the §5.3 payload fields
  type TEXT,            -- 'url','pdf','doc','note','email','audio','repo'
  title TEXT,
  source_url TEXT,
  site TEXT,
captured_at TIMESTAMPTZ NOT NULL,
created_by TEXT,
hash_sha256 TEXT UNIQUE,
language TEXT,
size_bytes BIGINT,
storage_uri TEXT, -- s3://... or file://...
extra JSONB
);
-- Chunks
CREATE TABLE chunk (
id UUID PRIMARY KEY,
item_id UUID REFERENCES item(id) ON DELETE CASCADE,
ord INT, -- order within item
text_md TEXT,
token_count INT,
section_path TEXT, -- e.g., h1>h2>p
extra JSONB
);
CREATE INDEX chunk_item_idx ON chunk(item_id);
-- Entities & Links
CREATE TABLE entity (
id UUID PRIMARY KEY,
kind TEXT CHECK (kind IN ('person','org','project','topic','repo','site')),
name TEXT,
canonical_name TEXT,
aliases TEXT[],
extra JSONB
);
CREATE TABLE item_entity (
item_id UUID REFERENCES item(id) ON DELETE CASCADE,
entity_id UUID REFERENCES entity(id) ON DELETE CASCADE,
rel TEXT,
PRIMARY KEY(item_id, entity_id, rel)
);
-- Annotations
CREATE TABLE annotation (
id UUID PRIMARY KEY,
item_id UUID REFERENCES item(id) ON DELETE CASCADE,
chunk_id UUID REFERENCES chunk(id) ON DELETE SET NULL,
kind TEXT CHECK (kind IN ('highlight','note','task')),
body TEXT,
created_at TIMESTAMPTZ NOT NULL,
extra JSONB
);
-- Notebooks (smart collections)
CREATE TABLE notebook (
id UUID PRIMARY KEY,
name TEXT UNIQUE,
rules JSONB, -- e.g., {"must": {"entities":["Qdrant"], "type":["url"]}, "time": {"after":"2025-01-01"}}
extra JSONB
);
-- Audit
CREATE TABLE event_log (
id BIGSERIAL PRIMARY KEY,
at TIMESTAMPTZ NOT NULL,
actor TEXT,
action TEXT,
target UUID,
details JSONB
);
5.3 Qdrant Collections
Collection: chunks_v1
Vector size: depends on embedding model (e.g., 1024–3072).
Distance: cosine.
Payload schema (examples):
{
"chunk_id": "uuid",
"item_id": "uuid",
"source_url": "string",
"site": "string",
"type": "url|pdf|doc|...",
"language": "en",
"captured_at": "2025-10-06T00:00:00Z",
"section_path": "h1>h2>p",
"entities": ["Qdrant","MCP"],
"keywords": ["vector db","hybrid search"],
"tags": ["tutorial","architecture"],
"notebooks": ["Vectors"],
"token_count": 820
}
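A minimal setup sketch for this collection using the official @qdrant/js-client-rest client. The 1024‑dim size is illustrative (it must match the chosen embedding model), and the indexed payload fields are assumptions drawn from the filter list in §7:

```ts
// Hypothetical one-off setup script (Node) using @qdrant/js-client-rest.
import { QdrantClient } from "@qdrant/js-client-rest";

const client = new QdrantClient({ url: process.env.QDRANT_URL ?? "http://localhost:6333" });

async function createChunksCollection(): Promise<void> {
  // Vector size must match the embedding model (1024 here is illustrative).
  await client.createCollection("chunks_v1", {
    vectors: { size: 1024, distance: "Cosine" },
  });
  // Index the payload fields we filter on most (assumed field set from §7).
  await client.createPayloadIndex("chunks_v1", {
    field_name: "captured_at",
    field_schema: "datetime",
  });
  for (const field of ["type", "site", "language", "entities", "tags"]) {
    await client.createPayloadIndex("chunks_v1", { field_name: field, field_schema: "keyword" });
  }
}

createChunksCollection().catch(console.error);
```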
6) Ingestion Pipelines
6.1 Sources
Browser extension: capture page, selected text, or link; attach note/tags; offline queue.
Email: forward to inbox@yourdomain → parse into item/email‑thread; keep headers, inline images, and attachments.
File drop: PDF/Doc/MD; text extraction via pdfminer, OCR via Tesseract; images via TrOCR.
Code/GitHub: repo URL → pull README, issues, wiki; chunk by file with language‑aware heuristics.
Recordings: whisper‑x (or equivalent) → diarization → chapters.
Connectors (later): Notion/Drive/Slack read‑only.
6.2 Normalization
HTML → Markdown using readability + turndown.
Boilerplate removal (nav/ads/toast/footers).
Canonical URLs via rel=canonical + rules (strip UTM, session ids); see the sketch after this list.
Dedupe via canonical‑URL hash + content SimHash (≥80% similarity).
Chunking: structure‑aware (H1/H2), target 600–1200 tokens, coalesce short siblings, keep section path.
Language detection (fastText) + routing to language‑specific embeddings.
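Sketch of the URL canonicalization rules above; a page's rel=canonical, when present, should win, and the tracking‑parameter list is an illustrative assumption:

```ts
// Canonicalize a URL for dedupe hashing: drop fragments, strip known
// tracking/session params, and sort the rest for a stable string.
// The parameter list below is illustrative; extend per site.
const TRACKING_PARAMS = [
  "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
  "gclid", "fbclid", "sessionid", "sid", "phpsessid",
];

export function canonicalizeUrl(raw: string): string {
  const url = new URL(raw); // also normalizes case, port, percent-encoding
  url.hash = ""; // fragments don't change the fetched document
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);
  url.searchParams.sort(); // stable param order → stable hash
  return url.toString();
}
```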
6.3 Enrichment
Summaries (<= 5 bullets + 1‑sentence TL;DR).
Keywords/entities via NER + topicality scoring; merge to canonical entities.
Embeddings for each chunk; fallback model for unusually long/short inputs (see the sketch after this list).
Link graph (outbound/inbound) + site/domain features.
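Since §10 targets Cloudflare, a per‑chunk embedding sketch via Workers AI. The model named here is one documented option (768 dims; the collection size in §5.3 must match), and the AI binding name is an assumption:

```ts
// Minimal structural type for the Workers AI binding (assumed binding name: AI).
interface AiBinding {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}
interface Env {
  AI: AiBinding;
}

// Embed a batch of chunk texts; one vector per input, in input order.
export async function embedChunks(env: Env, texts: string[]): Promise<number[][]> {
  const resp = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: texts });
  return resp.data;
}
```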
7) Retrieval & Ranking
Candidate generation (hybrid):
Keyword (BM25/Meilisearch) top‑k=100 filtered by type/time/source.
Vector (Qdrant ANN) top‑k=200 with payload filters (language, site, entities).
Union with source diversity and time decay (e.g., 0.96^months since capture); see the fusion sketch after this list.
Reranking: Cross‑encoder (e.g., bge-reranker-large or equivalent) → top‑20.
Answering (optional): small LLM with grounding pins (must cite k sources; prevent fabrication).
Snippets: sentence‑window around matched spans; highlight terms.
Filters & Facets: time, type, site, entities, tags, notebook.
Saved Searches: named queries with auto‑notebook feed.
Metrics: recall@20 on curated queries; MRR; time‑to‑first‑token; click‑through; “was this helpful?”
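Fusion sketch for the candidate‑generation step above. The 0.96 decay factor is from the list; the 0.4/0.6 weights and the per‑site cap are illustrative assumptions:

```ts
// Fuse keyword (BM25) and vector candidates, apply time decay, and keep
// source diversity by capping hits per site before reranking.
interface Candidate {
  chunkId: string;
  site: string;
  capturedAt: Date;
  bm25?: number;   // keyword score, normalized to 0..1
  vector?: number; // cosine similarity, 0..1
}

const MONTH_MS = 30 * 24 * 3600 * 1000;

export function fuse(cands: Candidate[], now = new Date(), perSiteCap = 3): Candidate[] {
  const scored = cands
    .map((c) => {
      const months = (now.getTime() - c.capturedAt.getTime()) / MONTH_MS;
      const decay = Math.pow(0.96, months); // 0.96^months since capture
      const base = 0.4 * (c.bm25 ?? 0) + 0.6 * (c.vector ?? 0); // illustrative weights
      return { c, score: base * decay };
    })
    .sort((a, b) => b.score - a.score);

  const perSite = new Map<string, number>();
  const out: Candidate[] = [];
  for (const { c } of scored) {
    const n = perSite.get(c.site) ?? 0;
    if (n >= perSiteCap) continue; // diversity: cap hits per site
    perSite.set(c.site, n + 1);
    out.push(c);
  }
  return out;
}
```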
8) Security, Privacy, Backup
Run‑anywhere: Docker compose; single‑tenant.
Auth: local account + passkey; optional SSO later.
Secrets: Vault/.env; rotatable.
Encryption: at‑rest for Postgres (pgcrypto), object store (SSE‑S3/age), Qdrant disk; TLS in transit.
Backups: nightly logical dump (pg), vector snapshot (Qdrant/Pinecone export), objects versioned; verify restores weekly.
Redaction: on export/share, mask emails/IDs; allow rules per notebook.
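Sketch of that export‑time masking pass; the patterns are deliberately conservative placeholders, with real rules living per notebook:

```ts
// Mask emails and long numeric identifiers in exported Markdown.
// Illustrative patterns only; tune per notebook redaction rules.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const LONG_ID_RE = /\b\d{9,}\b/g; // account numbers, phone-like ids

export function redact(md: string): string {
  return md
    .replace(EMAIL_RE, "[email redacted]")
    .replace(LONG_ID_RE, "[id redacted]");
}
```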
8.1 Cost‑aware Deployment Targets (Cloudflare‑first)
Goal: Max free‑tier leverage using Cloudflare + managed vector DBs you already have.
Cloudflare Workers: light ingest webhooks, fan‑out to Queues, presigned R2 writes, trigger batch jobs (Cron Triggers).
Cloudflare Queues: ingestion pipeline (URL→fetch→normalize), retry semantics, backpressure; consumer sketch below.
Cloudflare D1 (SQLite): relational metadata (items/chunks/entities/notebooks) where RPS and size fit; offload/replicate to Postgres later if needed.
Cloudflare R2: raw items (HTML, PDF, audio), normalized Markdown, and export bundles; bucket layout below.
Qdrant Cloud & Pinecone: pluggable vector backends; use one primary, one mirror.
Why D1? Cheap/global SQLite with HTTP access (good for single‑user scale). If we outgrow, we can mirror writes to Postgres via Workers.
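A minimal Queues consumer sketch for the fetch stage of that pipeline; the RAW_BUCKET binding and message shape are assumptions (types from @cloudflare/workers-types):

```ts
// Cloudflare Queues consumer: fetch each captured URL and stash the raw
// HTML in R2; normalization/enrichment run as later pipeline stages.
interface IngestMsg {
  url: string;
  note?: string;
  tags?: string[];
  nb?: string;
}

interface Env {
  RAW_BUCKET: R2Bucket;
}

export default {
  async queue(batch: MessageBatch<IngestMsg>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      try {
        const res = await fetch(msg.body.url);
        if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
        const key = `raw/${crypto.randomUUID()}.html`;
        await env.RAW_BUCKET.put(key, await res.arrayBuffer());
        msg.ack();
      } catch {
        msg.retry(); // Queues redelivers with backoff
      }
    }
  },
};
```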
9) Interfaces & UX
9.1 Web App
Global search bar with power‑syntax (site:, type:, in:note, entity:, project:, before:/after:); a parser sketch follows at the end of this subsection.
Results list with facets sidebar; previews in the right pane; keyboard‑first.
Item page: reader, highlights, extracted outline; related items; entities; notebook membership.
Notebooks: smart rules + manual pins; daily/weekly briefs; export.
Capture: quick note, paste URL, drag‑drop file.
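Tokenizer sketch for the power‑syntax above; quoting and escaping are deliberately omitted as a simplifying assumption:

```ts
// Split a query like `site:qdrant.tech type:url hybrid index after:2025-01-01`
// into structured filters plus free text. Unrecognized tokens become text.
const OPERATORS = new Set(["site", "type", "in", "entity", "project", "before", "after"]);

export function parseQuery(q: string): { filters: Record<string, string[]>; text: string } {
  const filters: Record<string, string[]> = {};
  const text: string[] = [];
  for (const tok of q.split(/\s+/).filter(Boolean)) {
    const i = tok.indexOf(":");
    const key = i > 0 ? tok.slice(0, i) : "";
    if (OPERATORS.has(key)) {
      (filters[key] ??= []).push(tok.slice(i + 1));
    } else {
      text.push(tok);
    }
  }
  return { filters, text: text.join(" ") };
}
```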
9.2 Browser Extension
1‑click Save Page / Save Selection / Save Link with note, tags, notebook.
“Save to Inbox” queue if offline.
9.3 CLI
pks add <url|file> --note "..." --tags qdrant,mcp --nb Research
pks search "hybrid index qdrant" --since 90d --type url
pks export --nb Qdrant --format md
10) Engineering Plan
10.1 Tech Stack (revised for Cloudflare)
Edge/API: Cloudflare Workers (TypeScript), Hono router.
Pipelines: Cloudflare Queues + Cron Triggers; optional self‑host worker for heavy OCR/embeddings.
DB: Cloudflare D1 (primary metadata); optional Postgres mirror (later).
Vector: Qdrant Cloud or Pinecone (selectable); adapter interface.
Object Storage: Cloudflare R2.
Cache: Cloudflare KV / Durable Object (session) / Redis (only if needed off‑edge).
Frontend: Next.js (deployed on Cloudflare Pages).
LLMs/Embeddings: pluggable—Cloudflare Workers AI or hosted/open models from a small VM.
10.2 Services & APIs
Worker API
POST /ingest/url → enqueue to Queues with {url, note, tags, nb}; sketch below.
POST /ingest/file → signed upload to R2, then enqueue.
GET /search → D1 facets + vector top‑k (Qdrant/Pinecone), optional rerank.
POST /answer → RAG with citations.
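Sketch of the first route with Hono; the INGEST_QUEUE binding name is an assumption:

```ts
// POST /ingest/url: validate minimally and enqueue for the pipeline.
import { Hono } from "hono";

type IngestBody = { url?: string; note?: string; tags?: string[]; nb?: string };

type Bindings = {
  INGEST_QUEUE: Queue<{ url: string; note?: string; tags?: string[]; nb?: string }>;
};

const app = new Hono<{ Bindings: Bindings }>();

app.post("/ingest/url", async (c) => {
  const body = await c.req.json<IngestBody>();
  if (!body.url) return c.json({ error: "url required" }, 400);
  await c.env.INGEST_QUEUE.send({ url: body.url, note: body.note, tags: body.tags, nb: body.nb });
  return c.json({ ok: true }, 202);
});

export default app;
```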
Vector Adapter Interface
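A minimal sketch of what this interface could look like, following the "one primary, one mirror" note in §8.1; all names and shapes here are assumptions:

```ts
// Pluggable vector backend: implemented once each for Qdrant Cloud and
// Pinecone; a mirroring wrapper writes to both and reads from the primary.
export interface VectorHit {
  chunkId: string;
  score: number;
  payload: Record<string, unknown>;
}

export interface VectorStore {
  upsert(points: { id: string; vector: number[]; payload: Record<string, unknown> }[]): Promise<void>;
  query(vector: number[], topK: number, filter?: Record<string, unknown>): Promise<VectorHit[]>;
  delete(ids: string[]): Promise<void>;
}

// Mirror writes to a secondary backend; serve reads from the primary.
export class MirroredStore implements VectorStore {
  constructor(private primary: VectorStore, private mirror: VectorStore) {}

  async upsert(points: Parameters<VectorStore["upsert"]>[0]): Promise<void> {
    await Promise.all([this.primary.upsert(points), this.mirror.upsert(points)]);
  }

  query(v: number[], k: number, f?: Record<string, unknown>): Promise<VectorHit[]> {
    return this.primary.query(v, k, f);
  }

  async delete(ids: string[]): Promise<void> {
    await Promise.all([this.primary.delete(ids), this.mirror.delete(ids)]);
  }
}
```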