How I Built a RAG Assistant for Martech Zone With Cloudflare Workers, Vectorize, and Llama 3.3 for a Multilingual Chatbot

For years, I’ve watched AI chat interfaces land on sites with mixed results. Most are either a thin wrapper around a generic LLM — answering everything from broad training data with no specific knowledge of the site — or a glorified FAQ bot pretending to understand natural language. I wanted something between those: a proper retrieval-augmented generation (RAG) system that answers only from content I’ve actually published, cites its sources, and runs entirely on infrastructure I already trust.
Credit: This project wouldn’t exist without Joost de Valk’s ask-endpoint. Joost’s approach — indexing markdown source files at build time, bundling embeddings into a static site, blending keyword and semantic search — was the first end-to-end RAG implementation I could read cover-to-cover and actually understand.
I couldn’t ship his approach directly, though. Joost writes his blog in markdown-in-git. Martech Zone has 3,000+ posts in WordPress with active daily publishing. Rebuilding a static index on every save isn’t viable. So I flipped the architecture: WordPress pushes content to a Cloudflare Worker on save_post, the Worker chunks and embeds the text, and vectors land in Cloudflare Vectorize. Everything lives on Cloudflare and WordPress. No GitHub Actions, no external services, no cron-crawling my own sitemap.
The Chatbot Stack
- WordPress pushes post content to the Worker whenever a post is saved, published, or trashed. A companion plugin (Ask Martech Zone) handles the HTTPS call and signs it with HMAC-SHA256.
- Cloudflare Worker receives the push, chunks the text, generates embeddings via Workers AI (BAAI/bge-base-en-v1.5, 768-dim), stores chunk text in KV, and upserts vectors into Vectorize.
- Cloudflare Vectorize is the vector database — cosine similarity search across ~15,000 chunks from 4,000+ posts.
- Cloudflare KV holds the chunk payloads, an exact-title-match index, runtime tuning parameters, and feedback records.
- Workers AI → Llama 3.3 70B generates the final answer grounded in retrieved chunks, with bracketed citations pointing at the source posts.
On a cache-warm path, a question gets an answer in ~3–5 seconds. Total cost at my traffic runs to cents per thousand questions.
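To make the ingest side of that stack concrete, here is a minimal TypeScript sketch of how a pushed post could be split into overlapping chunks before embedding. The PostPayload shape, the chunkPost helper, the chunk ID format, and the window sizes are all assumptions for illustration; this post does not state the real chunking parameters.

```typescript
// Hypothetical shape of what the plugin pushes; field names are assumptions.
interface PostPayload {
  id: number;
  body: string;           // plain text, HTML already stripped
  seoDescription: string; // e.g. the Rank Math SEO description
  tags: string[];
}

interface Chunk {
  id: string;   // in this sketch, reused as both the KV key and the vector ID
  text: string; // the text that would be sent to the embedding model
}

// Split a post into overlapping word windows. The default sizes are
// illustrative, not the values the production Worker uses.
function chunkPost(post: PostPayload, maxWords = 200, overlap = 40): Chunk[] {
  if (overlap >= maxWords) {
    throw new Error("overlap must be smaller than maxWords");
  }
  const words = post.body.split(/\s+/).filter((w) => w.length > 0);
  // Author-curated signal (description, tags) is prepended to every chunk.
  const prefix = [post.seoDescription, post.tags.join(", ")]
    .filter((part) => part.length > 0)
    .join("\n");
  const chunks: Chunk[] = [];
  const step = maxWords - overlap;
  for (let start = 0; start < words.length; start += step) {
    const body = words.slice(start, start + maxWords).join(" ");
    const text = prefix.length > 0 ? prefix + "\n\n" + body : body;
    chunks.push({ id: "post-" + post.id + "-" + chunks.length, text });
    if (start + maxWords >= words.length) break; // this window reached the end
  }
  return chunks;
}
```

In the Worker, each chunk's text would then be embedded with bge-base-en-v1.5, stored in KV, and upserted into Vectorize under the same ID. The overlap makes it less likely that a passage straddling a boundary is cut in half in every chunk that contains it.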
The Path of a Single Question
One round trip from the visitor’s keyboard back to a cited answer.
Visitor Types a Question
The chat widget sits on every page as a floating button, and can also be embedded inline inside articles.
Chatbot Sends the Request
The chatbot captures the question and posts it to the chat service with the visitor’s language tag.
Cloudflare Worker
A serverless function handles all orchestration — translation, retrieval, and answer generation — in a single call.
Vectorize + KV Indexes
The site is embedded as English vectors plus deterministic lookup indexes, both built at publish time.
Scored Answer is Delivered
A concise reply in the visitor’s language, with numbered citations linking to the source articles on that language’s subdomain.
What’s Under the Hood
A compact stack — one platform does most of the work.
WordPress Plugin
Renders the floating button and inline embed, pushes new and edited posts into the knowledge base, and owns the widget UI.
Cloudflare Worker
The chat service itself. Receives questions, orchestrates the retrieval and answer steps, and returns JSON. Runs close to every visitor.
Translation model
Meta’s m2m100 (Workers AI). Translates the inbound question and outbound source titles — 100 languages supported.
Embedding model
BAAI’s bge-base-en-v1.5 (Workers AI). Turns chunks of article text and incoming queries into 768-dimensional vectors.
Answer model
Meta’s Llama 3.3 70B (Workers AI). Reads the retrieved excerpts and writes the citation-grounded reply in the visitor’s language.
Vector database
Cloudflare Vectorize. Stores the embedded chunks of every martech.zone post and answers similarity queries.
Lookup indexes
Cloudflare KV. Two namespaces: deterministic title/topic indexes for exact-match retrieval, plus an admin tuning panel’s state.
GTranslate (UI only)
Translates the widget’s static UI strings — title, placeholder, buttons — alongside the rest of the page. The chat answer itself comes from the LLM.
Why translate the question instead of re-embedding the library? Embedding the full archive in a multilingual model would mean re-indexing a decade of content. Translating each question to English costs fractions of a cent and a fraction of a second, and leaves the library untouched. The large language model can already write in the visitor’s language natively, so no second translation hop is needed.
What happens when there’s no good answer? If retrieval can’t surface a confident match, the Worker short-circuits with an honest “I don’t have specific information about that” response rather than letting the language model invent citations. Visitors are nudged to rephrase or try a different topic.
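The short-circuit described above can be sketched as a small pure function that runs between retrieval and generation. Everything named here (ScoredChunk, Decision, decide, the default of five chunks) is hypothetical; only the behavior, skipping the LLM when nothing clears the score floor, comes from the description.

```typescript
interface ScoredChunk {
  url: string;   // source post
  text: string;  // chunk text loaded from KV
  score: number; // final retrieval score, higher is better
}

type Decision =
  | { kind: "fallback"; message: string }
  | { kind: "generate"; context: string; sources: string[] };

const FALLBACK_MESSAGE = "I don't have specific information about that.";

// Decide whether to call the LLM at all. `scoreFloor` stands in for the
// configurable threshold; `maxChunks` is an assumed cap on prompt size.
function decide(matches: ScoredChunk[], scoreFloor: number, maxChunks = 5): Decision {
  const confident = matches
    .filter((m) => m.score >= scoreFloor)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks);
  if (confident.length === 0) {
    // No LLM call, so there is nothing that could invent a citation.
    return { kind: "fallback", message: FALLBACK_MESSAGE };
  }
  // Number each distinct source once so bracketed citations stay stable.
  const sources: string[] = [];
  const context = confident
    .map((m) => {
      let n = sources.indexOf(m.url);
      if (n === -1) {
        sources.push(m.url);
        n = sources.length - 1;
      }
      return "[" + (n + 1) + "] " + m.text;
    })
    .join("\n\n");
  return { kind: "generate", context, sources };
}
```

Because the check happens before generation, a weak retrieval costs no model call, and the citation numbers handed to the model always map to real source URLs.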
What This Build Actually Does Well
- Real-time indexing: Every post save triggers an HMAC-signed push to the Worker within ~200ms. New articles are searchable almost immediately — no crawl-and-wait.
- Title-exact-match shortcut: Every post’s title and URL slug are stored as KV keys that short-circuit directly to the matching URL. A query of “What is DMARC?” or just “DMARC” maps to the DMARC post before vector search runs, preventing the LLM from citing tangentially related articles.
- Query prefix stripping: “What is X” queries are stripped to just X before embedding. Otherwise, every “What is X” post clusters tightly in vector space, and retrieval returns the wrong acronym.
- Recency decay: Newer posts outrank semantically similar older posts via exponential decay, with a configurable baseline and half-life.
- Per-post-type weights: I can boost acronym posts, demote podcast transcripts, or push tool reviews — all from the admin without re-indexing a single vector.
- Custom field indexing: Rank Math’s SEO description and the post tags are prepended to each chunk so the author-curated signal reaches the embedding.
- Score threshold fallback: Below a configurable score floor, the LLM call is skipped entirely, and a “no specific info” response is returned. This prevents fabricated citations when retrieval is weak.
- Live test in admin: A query box in WordPress admin that shows a configurable number of top results with individual component scores: raw cosine similarity, content-type weight, recency factor, and title-match bonus. Tuning is grounded in data rather than guesswork.
- Rebuild title index: One-click, JS-driven loop that walks every indexed post and rewrites the title-match entries. Essential when rolling out new features to a pre-existing index.
- Feedback collection: Thumbs-up/thumbs-down buttons under every answer. Thumbs-down entries are stored with full context (question, answer, sources, country, user-agent family) for private review in the admin, with 6-month auto-expiration and no IP address tracking.
- MailPoet integration: After a set number of questions, the widget offers an email opt-in tied to a specific list. Double-opt-in aware — if the visitor submitted an email but hasn’t confirmed, a persistent banner reminds them across page reloads until MailPoet reports the address as confirmed.
- Theme-aware styling: The widget reads var(--brand-color) so the site’s existing light/dark mode flips the chat’s accent automatically with no JavaScript observers.
- Three mount modes: Floating corner button on every page, shortcode for embedding inside posts, and data-martech-ask="open" attributes for any CTA button or link elsewhere on the site.
- Expand-to-modal: Click the expand icon to grow the chat to 92% of the viewport with a dark backdrop, which is useful on desktop for a focused reading experience.
- Minimal Core Web Vitals impact: The widget renders no layout-shifting elements, loads no render-blocking resources, and only builds its panel HTML on first click. CWV is unaffected.
- Full GA4 observability: Eleven events cover open, submit, response, error, feedback rating, expand toggle, and every MailPoet subscribe state, all routed through gtag or dataLayer transparently.
- Developer surface: A documented REST endpoint on the Worker (HMAC for writes, CORS-open for /ask), a MartechAsk.open({query, submit}) JavaScript API, WordPress filters for classification and content transformation, and a Usage tab in the plugin admin documenting every integration point.
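Several of the retrieval features in the list above (prefix stripping, the title shortcut, recency decay, per-post-type weights) reduce to a few small functions. The sketch below is one way they could fit together, not the shipped code. The key format, the Tuning field names, and the way the components are combined (multiplicative weights plus an additive title bonus) are all assumptions, and a plain object stands in for the KV title index.

```typescript
// Strip a leading "What is ..." wrapper and trailing punctuation.
function stripQuestionPrefix(query: string): string {
  const trimmed = query.trim();
  const stripped = trimmed
    .replace(/^what(?:['\u2019]s|\s+is|\s+are)\s+(?:(?:an?|the)\s+)?/i, "")
    .replace(/[?.!\s]+$/, "");
  return stripped.length > 0 ? stripped : trimmed;
}

// Normalize a title or query into a lookup key (hypothetical key format).
function titleKey(text: string): string {
  const slug = text.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
  return "title:" + slug;
}

// Try the raw query first, then the stripped query. `index` stands in for KV.
function lookupTitle(query: string, index: Record<string, string>): string | null {
  const candidates = [titleKey(query), titleKey(stripQuestionPrefix(query))];
  for (const key of candidates) {
    if (Object.prototype.hasOwnProperty.call(index, key)) return index[key];
  }
  return null;
}

// Runtime tuning parameters; the names are assumptions.
interface Tuning {
  recencyBaseline: number; // floor of the recency factor, between 0 and 1
  halfLifeDays: number;    // age at which the decaying part halves
  typeWeights: Record<string, number>;
  titleBonus: number;
}

// Exponential decay from 1 (new post) toward the baseline (very old post).
function recencyFactor(ageDays: number, t: Tuning): number {
  const decay = Math.pow(0.5, Math.max(0, ageDays) / t.halfLifeDays);
  return t.recencyBaseline + (1 - t.recencyBaseline) * decay;
}

// Assumed combination: cosine * type weight * recency, plus a title bonus.
function blendedScore(
  cosine: number, postType: string, ageDays: number, titleMatched: boolean, t: Tuning
): number {
  const w = t.typeWeights[postType];
  const weight = w === undefined ? 1 : w; // unknown post types are neutral
  return cosine * weight * recencyFactor(ageDays, t) + (titleMatched ? t.titleBonus : 0);
}
```

Since every factor is applied at query time on top of the stored cosine similarity, changing a weight or the half-life only means updating the tuning record. No vector has to be re-embedded, which is what makes admin-side tuning cheap.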
This Is Absolutely a First Cut
Every piece of this is rudimentary. A production-grade build would:
- Use proper hybrid retrieval (BM25 + semantic) rather than pure vector with title-match bonuses
- Fine-tune the embedding model on martech-specific content (Workers AI’s managed models don’t support fine-tuning yet)
- Implement prompt-injection defenses beyond “be nice to the LLM”
- Cache common queries at the edge
- A/B test retrieval strategies against feedback data
- Handle edge cases around multilingual content, streaming responses, rate limiting, and adversarial input
What I shipped is the minimum viable configuration that answers real questions about my content correctly — the floor, not the ceiling. The feedback tab exists precisely so I can see where it’s still wrong and iterate.
What’s Next?
The biggest latent improvement is hard-negative re-ranking: when a visitor thumbs down an answer that cited a specific post, demote that post for similar queries. The feedback infrastructure already captures the data; the ranker just isn’t using it yet.
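As a rough illustration of what that re-ranking step might look like, the sketch below turns stored thumbs-down records into a score multiplier. It is speculative: it assumes the thumbed-down question is re-embedded so it can be compared with the incoming query, and the NegativeFeedback shape, the similarity floor, and the penalty strength are made-up values.

```typescript
// Hypothetical record derived from a thumbs-down entry.
interface NegativeFeedback {
  queryVector: number[]; // embedding of the thumbed-down question (assumed available)
  postUrl: string;       // post that the rejected answer cited
}

function cosineSimilarity(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Returns a multiplier in (0, 1]; 1 means the post is not demoted.
// A post is demoted only when the current query is close to a query
// whose answer cited that same post and was thumbed down.
function demotionFactor(
  queryVector: number[],
  postUrl: string,
  negatives: NegativeFeedback[],
  similarityFloor = 0.85, // assumed cutoff for "similar queries"
  penalty = 0.5           // assumed maximum demotion strength
): number {
  let factor = 1;
  for (const neg of negatives) {
    if (neg.postUrl !== postUrl) continue;
    const sim = cosineSimilarity(queryVector, neg.queryVector);
    if (sim >= similarityFloor) {
      factor = Math.min(factor, 1 - penalty * sim);
    }
  }
  return factor;
}
```

Multiplying a candidate's retrieval score by this factor would push the rejected post down for near-identical questions while leaving it untouched for unrelated ones.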
I also want to explore content-type-aware prompt templates — answering a definition query differently from a comparison query — and experiment with swapping Llama 3.3 70B for a smaller 8B model on definition lookups where retrieval is strong and generation cost dominates latency.
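A first pass at that routing could be as simple as the sketch below. The regex-based classifier, the 0.8 "strong retrieval" cutoff, and the function names are placeholders, and the two model identifiers are the Workers AI catalog names as I understand them, so treat them as assumptions to verify.

```typescript
// Assumed Workers AI model identifiers; check the catalog before relying on them.
const LARGE_MODEL = "@cf/meta/llama-3.3-70b-instruct-fp8-fast";
const SMALL_MODEL = "@cf/meta/llama-3.1-8b-instruct";

type QueryKind = "definition" | "comparison" | "general";

// Crude keyword heuristic; a real classifier would need to be tuned on feedback.
function classifyQuery(query: string): QueryKind {
  const q = query.trim().toLowerCase();
  if (/\b(vs\.?|versus|compared? (to|with)|difference between)\b/.test(q)) {
    return "comparison";
  }
  const looksLikeDefinition = /^(what(\s+is|\s+are|'s)|define)\s+/.test(q);
  const isBareTerm = q.split(/\s+/).length <= 2; // e.g. "DMARC"
  return looksLikeDefinition || isBareTerm ? "definition" : "general";
}

// Use the small model only when the query is a definition lookup
// AND retrieval is strong; everything else stays on the large model.
function pickModel(kind: QueryKind, topScore: number, strongScore = 0.8): string {
  return kind === "definition" && topScore >= strongScore ? SMALL_MODEL : LARGE_MODEL;
}
```

The same QueryKind value could also select the prompt template, so a comparison query asks the model for trade-offs while a definition query asks for a short, direct answer.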
Thanks again to Joost for lighting the path. Because this is highly customized to my theme and site, I doubt that open-sourcing it would be beneficial. If Joost wants it, I’ll send my plugin his way, though! For now, hit the button above, ask something, and send me the thumbs-down — you’re not breaking anything; you’re training me.







