Personal Knowledge System — Architecture & Engineering Spec (v0.1)

Owner: You
Goal: A fast, reliable, single‑user (optionally team‑scalable) system to capture → organize → index → retrieve → act on personal knowledge (URLs, notes, docs, code, emails, recordings) with minimal friction and strong recall.

Seed Links to Capture First:


1) Product Objectives

  1. Zero‑friction capture from anywhere (browser share, quick hotkey, email forward, GitHub link, files).

  2. Trustworthy recall within ~200ms for keyword search and <1.5s for semantic (vector+rerank) over 100k–3M chunks.

  3. Useful organization without busywork: auto‑enrichment (title, summary, entities), light tags/sources, smart notebooks, saved queries.

  4. Actionable results: answer‑oriented views (snippets + source), “Send to…” (task manager, doc, email), and recap/brief generation.

  5. Private by default: local or self‑host, end‑to‑end encryption at rest/backup.


2) System Overview

Ingestion → Normalization → Indexing → Storage → Retrieval → UX → Automation & Analytics

  • Ingestion: Browser extension, Quick‑capture app/CLI, Email ingest, File drop, Connectors (GitHub/Notion/Drive optional), Recording transcripts.

  • Normalization: Boilerplate removal, HTML → Markdown, PDF OCR, language detect, dedupe, chunking.

  • Enrichment: Title, summary, keywords, entities, dates, URL canonicalization, link graph, embeddings.

  • Indexing & Storage: Qdrant (vectors+payload), Postgres (metadata/relations), S3/Local FS (raw), Redis (hot cache), Meilisearch/BM25 (optional hybrid).

  • Retrieval: Hybrid (keyword + vector), time‑decay & diversity, re‑ranking, source‑aware filters; Q&A via small/medium LLM.

  • UX: Web app + extension + global hotkey palette; notebooks; saved searches; daily brief; export.

  • Automation: Scheduled re‑crawl, broken‑link repair, freshness checks, quality metrics (recall@k, MRR), backups.


3) Core Requirements

Functional

  • Capture: URL, selection quote, file, image/PDF, audio; add note/tag; offline queue.

  • Organize: auto‑entities, tags, notebooks, people, projects, sources; merge/alias entities.

  • Search: instant keyword, semantic, boolean, filters (time, type, source, people, project).

  • Preview: side‑by‑side reader with highlights and matched spans.

  • Q&A: retrieve‑then‑read with source‑pinned answers; 1‑click export.

  • Share: link or bundle export (markdown/zip); redaction mode.

Non‑Functional

  • Latency: p50 <200ms keyword, <1.5s hybrid+rerank at 200k chunks.

  • Scale: 2M chunks / 200GB raw.

  • Reliability: 99.9% monthly, crash‑safe queues, idempotent ingestion.

  • Privacy: local‑first, key‑managed encryption; audit log.

  • Portability: all data exportable (JSONL + Markdown + parquet).


4) Architecture (Logical)

[Browser Ext/CLI/Email/Files] ─▶ [Ingest Queue (Kafka/Redpanda/NATS)]
                                      └▶ [Ingest Workers]
                                           ├─ Fetchers (HTML/PDF/Drive/GitHub)
                                           ├─ Extractors (boilerplate, OCR)
                                           ├─ Normalizers (md, language)
                                           ├─ Chunker (semantic + structural)
                                           ├─ Enrichment (summary, entities)
                                           └─ Embeddings

[Postgres] ◀── metadata/links ── [Index Orchestrator] ── vectors+payload ──▶ [Qdrant]

[Object Store (S3/local)] — raw/originals

[API Gateway (FastAPI)] ─▶ [Search Service] ─▶ [Hybrid (BM25+Qdrant)] ─▶ [Reranker]
                                      └▶ [Answerer (RAG)] ─▶ [LLM]

[Web App + Extension + CLI]


5) Data Model

5.1 Entities

  • Item: top‑level captured thing.

  • Chunk: retrieval unit (e.g., ~600–1200 tokens).

  • Entity: person/org/project/topic; deduped with aliases.

  • Notebook: saved set (rule‑based or manual).

  • Annotation: highlight, comment, task.

5.2 Postgres (DDL sketch)

-- Items

CREATE TABLE item (

id UUID PRIMARY KEY,

type TEXT, -- url|pdf|doc|...

title TEXT,

source_url TEXT,

captured_at TIMESTAMPTZ NOT NULL,

created_by TEXT,

hash_sha256 TEXT UNIQUE,

language TEXT,

size_bytes BIGINT,

storage_uri TEXT, -- s3://... or file://...

extra JSONB

);

-- Chunks

CREATE TABLE chunk (

id UUID PRIMARY KEY,

item_id UUID REFERENCES item(id) ON DELETE CASCADE,

ord INT, -- order within item

text_md TEXT,

token_count INT,

section_path TEXT, -- e.g., h1>h2>p

extra JSONB

);

CREATE INDEX chunk_item_idx ON chunk(item_id);

-- Entities & Links

CREATE TABLE entity (

id UUID PRIMARY KEY,

kind TEXT CHECK (kind IN ('person','org','project','topic','repo','site')),

name TEXT,

canonical_name TEXT,

aliases TEXT[],

extra JSONB

);

CREATE TABLE item_entity (

item_id UUID REFERENCES item(id) ON DELETE CASCADE,

entity_id UUID REFERENCES entity(id) ON DELETE CASCADE,

rel TEXT,

PRIMARY KEY(item_id, entity_id, rel)

);

-- Annotations

CREATE TABLE annotation (

id UUID PRIMARY KEY,

item_id UUID REFERENCES item(id) ON DELETE CASCADE,

chunk_id UUID REFERENCES chunk(id) ON DELETE SET NULL,

kind TEXT CHECK (kind IN ('highlight','note','task')),

body TEXT,

created_at TIMESTAMPTZ NOT NULL,

extra JSONB

);

-- Notebooks (smart collections)

CREATE TABLE notebook (

id UUID PRIMARY KEY,

name TEXT UNIQUE,

rules JSONB, -- e.g., {"must": {"entities":["Qdrant"], "type":["url"]}, "time": {"after":"2025-01-01"}}

extra JSONB

);

-- Audit

CREATE TABLE event_log (

id BIGSERIAL PRIMARY KEY,

at TIMESTAMPTZ NOT NULL,

actor TEXT,

action TEXT,

target UUID,

details JSONB

);
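The rules JSONB on notebook drives smart collections. A minimal matcher sketch in TypeScript (the ItemMeta shape is an assumption; the rule field names follow the example in the DDL comment):

```typescript
// Sketch of a notebook-rule matcher; ItemMeta is an assumed shape.
interface ItemMeta {
  type: string;
  entities: string[];
  capturedAt: string; // ISO 8601 timestamp
}

interface NotebookRules {
  must?: { entities?: string[]; type?: string[] };
  time?: { after?: string; before?: string };
}

function matchesNotebook(item: ItemMeta, rules: NotebookRules): boolean {
  // All "must" constraints are conjunctive; absent fields match everything.
  if (rules.must?.entities && !rules.must.entities.every((e) => item.entities.includes(e))) return false;
  if (rules.must?.type && !rules.must.type.includes(item.type)) return false;
  if (rules.time?.after && Date.parse(item.capturedAt) < Date.parse(rules.time.after)) return false;
  if (rules.time?.before && Date.parse(item.capturedAt) > Date.parse(rules.time.before)) return false;
  return true;
}
```

Evaluating rules in the app layer keeps the DDL simple; a later optimization could compile rules to SQL predicates.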

5.3 Qdrant Collections

  • Collection: chunks_v1

  • Vector size: depends on embedding model (e.g., 1024–3072).

  • Distance: cosine.

  • Payload schema (examples):

{

"chunk_id": "uuid",

"item_id": "uuid",

"source_url": "string",

"site": "string",

"type": "url|pdf|doc|...",

"language": "en",

"captured_at": "2025-10-06T00:00:00Z",

"section_path": "h1>h2>p",

"entities": ["Qdrant","MCP"],

"keywords": ["vector db","hybrid search"],

"tags": ["tutorial","architecture"],

"notebooks": ["Vectors"],

"token_count": 820

}
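Typed in TypeScript (the stack's edge language), the payload above might read as follows; the type name is an assumption:

```typescript
// Typed mirror of the chunks_v1 payload schema; field names follow the
// JSON example above, the ChunkPayload name itself is an assumption.
interface ChunkPayload {
  chunk_id: string;
  item_id: string;
  source_url: string;
  site: string;
  type: string;        // "url" | "pdf" | "doc" | ...
  language: string;
  captured_at: string; // ISO 8601
  section_path: string;
  entities: string[];
  keywords: string[];
  tags: string[];
  notebooks: string[];
  token_count: number;
}

const example: ChunkPayload = {
  chunk_id: "uuid", item_id: "uuid",
  source_url: "https://example.com/post", site: "example.com",
  type: "url", language: "en",
  captured_at: "2025-10-06T00:00:00Z", section_path: "h1>h2>p",
  entities: ["Qdrant", "MCP"], keywords: ["vector db", "hybrid search"],
  tags: ["tutorial", "architecture"], notebooks: ["Vectors"],
  token_count: 820,
};
```

Sharing this type between the ingest workers and the vector adapter keeps payload writes and filter queries in sync.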


6) Ingestion Pipelines

6.1 Sources

  • Browser extension: capture page, selected text, or link; attach note/tags; offline queue.

  • Email: forward to inbox@yourdomain → parse into item/email‑thread, keep headers, inline images, and attachments.

  • File drop: PDF/Doc/MD; PDF text layer via pdfminer, scanned‑page OCR via Tesseract; images via TrOCR.

  • Code/GitHub: repo URL → pull README, issues, wiki; chunk by file with language‑aware heuristics.

  • Recordings: whisper‑x (or equivalent) → diarization → chapters.

  • Connectors (later): Notion/Drive/Slack read‑only.

6.2 Normalization

  • HTML → Markdown using readability + turndown.

  • Boilerplate removal (nav/ads/toast/footers).

  • Canonical URLs via rel=canonical + rules (strip UTM, session ids).

  • Deduping via canonical‑URL hash + content SimHash (≥80% similarity).

  • Chunking: structure‑aware (H1/H2), target 600–1200 tokens, coalesce short siblings, keep section path.

  • Language detection (fastText) + routing to language‑specific embeddings.
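The URL canonicalization step above might look like this; the tracking‑parameter blocklist is illustrative, not exhaustive:

```typescript
// Strip tracking/session parameters and normalize a URL before hashing
// for dedupe. The parameter blocklist is an illustrative assumption.
const TRACKING_PARAMS = /^(utm_|fbclid|gclid|mc_eid|sessionid|sid$)/i;

function canonicalizeUrl(raw: string): string {
  const url = new URL(raw); // also lowercases the hostname
  url.hash = "";            // fragments never change fetched content
  for (const key of [...url.searchParams.keys()]) {
    if (TRACKING_PARAMS.test(key)) url.searchParams.delete(key);
  }
  url.searchParams.sort();  // stable parameter order → stable hash
  return url.toString();
}
```

Any rel=canonical value found in the page would take precedence over this rule-based fallback.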

6.3 Enrichment

  • Summaries (≤5 bullets + 1‑sentence TL;DR).

  • Keywords/entities via NER + topicality scoring; merge to canonical entities.

  • Embeddings for each chunk; fallback model for unusually long or short chunks.

  • Link graph (outbound/inbound) + site/domain features.


7) Retrieval & Ranking

  1. Candidate generation (hybrid):

    • Keyword (BM25/Meilisearch) top‑k=100 filtered by type/time/source.

    • Vector (Qdrant ANN) top‑k=200 with sparsity filters (language, site, entities).

    • Union with source diversity and time decay (e.g., 0.96^months since capture).

  2. Reranking: Cross‑encoder (e.g., bge-reranker-large or equivalent) → top‑20.

  3. Answering (optional): small LLM with grounding pins (must cite k sources; prevent fabrication).

  4. Snippets: sentence‑window around matched spans; highlight terms.

  5. Filters & Facets: time, type, site, entities, tags, notebook.

  6. Saved Searches: named queries with auto‑notebook feed.

Metrics: recall@20 on curated queries; MRR; time‑to‑first‑token; click‑through; “was this helpful?”
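The candidate-union scoring with the 0.96^months time decay can be sketched as follows; the 0.4/0.6 keyword/vector weighting is an assumption, not a tuned value:

```typescript
// Blend normalized keyword (BM25) and vector scores, then apply the
// 0.96^months time decay from the candidate-generation step above.
const MS_PER_MONTH = 30 * 24 * 3600 * 1000; // calendar approximation

function decayedScore(
  bm25: number,          // BM25 score normalized to [0, 1]
  vector: number,        // cosine similarity in [0, 1]
  capturedAt: Date,
  now: Date = new Date(),
  keywordWeight = 0.4,   // assumed blend weight, to be tuned on recall@20
): number {
  const months = Math.max(0, (now.getTime() - capturedAt.getTime()) / MS_PER_MONTH);
  const blended = keywordWeight * bm25 + (1 - keywordWeight) * vector;
  return blended * Math.pow(0.96, months);
}
```

A freshly captured chunk keeps its blended score unchanged; a year-old chunk loses roughly 40% of it before reranking.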


8) Security, Privacy, Backup

  • Run‑anywhere: Docker compose; single‑tenant.

  • Auth: local account + passkey; optional SSO later.

  • Secrets: Vault/.env; rotateable.

  • Encryption: at‑rest for Postgres (pgcrypto), object store (SSE‑S3/age), Qdrant disk; TLS in transit.

  • Backups: nightly logical dump (pg), vector snapshot (Qdrant/Pinecone export), objects versioned; verify restores weekly.

  • Redaction: on export/share, mask emails/IDs; allow rules per notebook.
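A minimal sketch of the redaction pass, assuming simple pattern-based masking of emails and long numeric IDs (the patterns are illustrative; per-notebook rules would extend the list):

```typescript
// Mask emails and long numeric IDs on export/share, per the redaction
// mode above. Both regexes are illustrative assumptions.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.-]+/g;
const LONG_ID_RE = /\b\d{6,}\b/g;

function redact(text: string): string {
  return text.replace(EMAIL_RE, "[email]").replace(LONG_ID_RE, "[id]");
}
```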


8.1 Cost‑aware Deployment Targets (Cloudflare‑first)

Goal: Max free‑tier leverage using Cloudflare + managed vector DBs you already have.

  • Cloudflare Workers: light ingest webhooks, fan‑out to Queues, presigned R2 writes, trigger batch jobs (Cron Triggers).

  • Cloudflare Queues: ingestion pipeline (URL→fetch→normalize), retry semantics, backpressure.

  • Cloudflare D1 (SQLite): relational metadata (items/chunks/entities/notebooks) where RPS and size fit; offload to Postgres later if needed.

  • Cloudflare R2: raw items (HTML, PDF, audio), normalized Markdown, and export bundles; bucket layout below.

  • Qdrant Cloud & Pinecone: pluggable vector backends; use one primary, one mirror.

Why D1? Cheap, global SQLite with HTTP access (fine for single‑user scale). If we outgrow it, we can mirror writes to Postgres via Workers.


9) Interfaces & UX

9.1 Web App

  • Global search bar with power‑syntax (site:, type:, in:note, entity:, project:, before:/after:).

  • Results list with facets sidebar; previews right‑pane; keyboard‑first.

  • Item page: reader, highlights, extracted outline; related items; entities; notebook membership.

  • Notebooks: smart rules + manual pins; daily/weekly briefs; export.

  • Capture: quick note, paste URL, drag‑drop file.

9.2 Browser Extension

  • 1‑click Save Page / Save Selection / Save Link with note, tags, notebook.

  • “Save to Inbox” queue if offline.

9.3 CLI

  • pks add <url|file> --note "..." --tags qdrant,mcp --nb Research

  • pks search "hybrid index qdrant" --since 90d --type url

  • pks export --nb Qdrant --format md


10) Engineering Plan

10.1 Tech Stack (revised for Cloudflare)

  • Edge/API: Cloudflare Workers (TypeScript), Hono router.

  • Pipelines: Cloudflare Queues + Cron Triggers; optional self‑host worker for heavy OCR/embeddings.

  • DB: Cloudflare D1 (primary metadata); optional Postgres mirror (later).

  • Vector: Qdrant Cloud or Pinecone (selectable); adapter interface.

  • Object Storage: Cloudflare R2.

  • Cache: Cloudflare KV / Durable Object (session) / Redis (only if needed off‑edge).

  • Frontend: Next.js (deployed on Cloudflare Pages).

  • LLMs/Embeddings: pluggable—Cloudflare Workers AI or hosted/open models from a small VM.

10.2 Services & APIs

  • Worker API

    • POST /ingest/url → enqueue to Queues with {url, note, tags, nb}.

    • POST /ingest/file → signed upload to R2, then enqueue.

    • GET /search → D1 facets + vector top‑k (Qdrant/Pinecone), optional rerank.

    • POST /answer → RAG with citations.
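The ingest route can be sketched with the standard Request/Response types; the enqueue callback stands in for a Queues binding, and in the real Worker a Hono route would wrap the same logic:

```typescript
// Minimal routing sketch for POST /ingest/url; enqueue stands in for a
// Cloudflare Queues producer binding (an assumption for this sketch).
type Enqueue = (msg: unknown) => Promise<void>;

async function handle(req: Request, enqueue: Enqueue): Promise<Response> {
  const url = new URL(req.url);
  if (req.method === "POST" && url.pathname === "/ingest/url") {
    const body = (await req.json()) as { url: string; note?: string; tags?: string[]; nb?: string };
    await enqueue(body);                 // fan out to the ingestion pipeline
    return Response.json({ queued: true });
  }
  return new Response("not found", { status: 404 });
}
```

Keeping the handler a plain function of (Request, dependencies) makes it testable without a Workers runtime.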

  • Vector Adapter Interface
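A possible shape for that interface, with a brute-force in-memory adapter as a reference implementation (method names are assumptions; the Qdrant Cloud and Pinecone adapters would implement the same contract):

```typescript
// Pluggable vector-backend contract; an in-memory brute-force adapter
// serves as the reference implementation for tests.
interface VectorHit { id: string; score: number }

interface VectorAdapter {
  upsert(id: string, vector: number[], payload?: Record<string, unknown>): Promise<void>;
  query(vector: number[], topK: number): Promise<VectorHit[]>;
}

class InMemoryAdapter implements VectorAdapter {
  private points = new Map<string, number[]>();

  async upsert(id: string, vector: number[]): Promise<void> {
    this.points.set(id, vector);
  }

  async query(vector: number[], topK: number): Promise<VectorHit[]> {
    // Cosine similarity, matching the Qdrant distance choice above.
    const cosine = (a: number[], b: number[]) => {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    };
    return [...this.points.entries()]
      .map(([id, v]) => ({ id, score: cosine(vector, v) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, topK);
  }
}
```

The primary/mirror setup from 8.1 then becomes two adapter instances behind one write path.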