Who is Daniel Astudillo?

Daniel Astudillo is a software engineer based in New York City. He currently works at S&P Global building data platforms and full-stack applications, and previously built payment and benefit systems at Visa.

What technologies does Daniel Astudillo work with?

Daniel works across the stack with React, TypeScript, and Next.js on the frontend, and .NET Core, C#, Spring Boot, and Java on the backend. He has deep experience with PostgreSQL, BigQuery, gRPC, and distributed messaging systems.

Where is Daniel Astudillo based?

Daniel is based in New York City, where he works as a software engineer at S&P Global.

What has Daniel Astudillo built?

Daniel has taken a data API from a 21-second worst case to roughly 200–300ms (Storage Write API, then PostgreSQL), built a real-time event pipeline processing 100K+ daily events at 99.99% uptime, and modernized Visa payment and eligibility APIs at production scale (including paths above 20M requests per month). He writes about this work on his blog.

What is Daniel Astudillo's educational background?

Daniel graduated from Williams College with a Bachelor of Arts in Computer Science and Mathematics.

December 2025 · Updated June 2026 · 10 min read · Case study

Self-hosting Llama-2 13B GPTQ on a GPU pod — and why I should have used SSE

RunPod + vLLM + Llama-2 13B GPTQ: auth-on-first-frame WebSockets, batch generate() chunked for UX, and the transport I wish I had chosen — HTTP streaming or SSE like experiment A.

vLLM
LLM
GPTQ
WebSocket
SSE
Infrastructure

#TL;DR

Experiment B in the same AI arc as experiment A (Gemini + SSE BFF). I deployed TheBloke/Llama-2-13B-GPTQ with vLLM on a GPU cloud pod (RunPod-style TCP port mapping) and exposed inference over WebSocket.

Hindsight: I should have used SSE or OpenAI-style HTTP streaming on the pod—the same pattern I already had in experiment A on POST /api/chat/message/stream. WebSocket bought protocol complexity without any bidirectional win.

Product context: Lessons from building a mobile events social platform.

Hard lessons:

Model path discovery — HuggingFace cache directories are not vLLM model paths.
Chat templates — Llama-2 needs [INST] <<SYS>> … formatting or outputs look “broken.”
Fake streaming — llm.generate() is batch; my chunks were cosmetic.
The right server — vllm serve exposes OpenAI-compatible streaming; I hand-rolled FastAPI WS instead.

#Why experiment B exists

Experiment A answered “can we ship AI features this sprint?” Experiment B answered:

What does self-hosting a 13B quantized model feel like in engineering hours and GPU rent?
Can we route matchmaking blurbs and lightweight chat through our own weights?
Where does transport matter once inference is slow and bursty?

I targeted Llama-2 13B GPTQ (4-bit): large enough for short social copy, small enough to fit a single datacenter GPU with quantization. vLLM provides continuous batching and a production-serving story via vllm serve — I used the embedded LLM Python class inside FastAPI instead, which is fine for a spike but skips the maintained streaming APIs.

#Architecture (intended vs what the repo proves)

Diagram: Mobile / BFF (intended consumer) → GPU pod (vLLM · FastAPI)

Intended: social backend calls the pod for inference on match/chat prompts.

What was actually wired:

The Fastify BFF (experiment A) talks to Gemini, not this pod — two parallel backends, not a chain.
The mobile client never opened /ws or called POST /generate.
The pod does ship POST /generate (non-streaming HTTP) for curl debugging alongside WebSocket.

So experiment B is a working inference service with no production client — still worth writing up because the failure modes (paths, templates, transport) are the learning.

#Bootstrapping the model (snapshot archaeology)

vLLM errors like “invalid repository ID or local directory” usually mean no config.json at MODEL_PATH.

HuggingFace Hub cache layout:

Text

models--TheBloke--Llama-2-13B-GPTQ/
  snapshots/
    <revision-hash>/
      config.json
      *.safetensors
      tokenizer.*

Pointing vLLM at the parent cache folder fails. Fixes:

Approach	When to use
Repo ID `TheBloke/Llama-2-13B-GPTQ`	Fresh pod; let vLLM download (quickstart)
Full snapshot path	Volume already populated
`snapshot_download` in entrypoint	Docker image with persistent `/workspace/models`

My entrypoint script runs huggingface_hub.snapshot_download when config.json is missing, then exports MODEL_PATH to the resolved snapshot — the right production habit.
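A minimal sketch of the path check described above, assuming the Hub cache layout shown earlier. The helper name `resolve_snapshot_dir` and the rule of preferring the most recently modified snapshot are illustrative assumptions, not the entrypoint script itself.

```python
from pathlib import Path
from typing import Optional


def resolve_snapshot_dir(path: str) -> Optional[Path]:
    """Return a directory that contains config.json, or None.

    Accepts either a snapshot directory or a Hub cache folder such as
    models--TheBloke--Llama-2-13B-GPTQ/ (hypothetical helper).
    """
    root = Path(path)
    if (root / "config.json").is_file():
        return root  # already a usable model directory

    snapshots = root / "snapshots"
    if not snapshots.is_dir():
        return None  # not a cache folder either; caller should download

    candidates = [
        snap for snap in snapshots.iterdir()
        if (snap / "config.json").is_file()
    ]
    if not candidates:
        return None
    # Assumption: when several revisions exist, take the newest one.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

When this returns `None`, the entrypoint would fall back to `huggingface_hub.snapshot_download`, which returns the local snapshot folder, and export that as MODEL_PATH.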

GPTQ background: GPTQ (Frantar et al., 2022) is a post-training quantization method that stores weights at low bit widths (4-bit here), so the model needs less VRAM at some cost in accuracy. That trade-off made it popular for running 13B-class models on single-GPU pods in hobby and staging environments in 2025–2026.

#Prompt formatting (the silent killer)

Llama-2 chat expects a template along the lines of:

Text

<s>[INST] <<SYS>>
{system}
<</SYS>>
 
{user} [/INST]

I implemented format_llama2_prompt() in Python and logged the first bytes of prompts and completions during debugging. Empty or nonsense outputs often traced back to the template, not to quantization.

vLLM’s docs emphasize that llm.generate does not apply chat templates automatically — you should use llm.chat or apply tokenizer.apply_chat_template (quickstart note). I learned that after manual string concatenation.

SamplingParams.stop included </s>, [INST], <<SYS>> to curb runaway generations, which matters when the prompt is hand-built and the model can start writing a new turn on its own.
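For illustration, here is one way the template and stop strings above could be expressed in plain Python. The body of `format_llama2_prompt` is a guess that matches the template shown, not the original function; the `include_bos` flag and the `trim_at_stop` helper are assumptions added for the example (many tokenizers add the BOS token themselves, so the literal `<s>` may be unwanted).

```python
from typing import Sequence

# Stop strings listed in the post's SamplingParams.stop.
STOP_SEQUENCES = ("</s>", "[INST]", "<<SYS>>")


def format_llama2_prompt(user: str, system: str = "", include_bos: bool = True) -> str:
    """Build a single-turn Llama-2 chat prompt (illustrative re-implementation)."""
    bos = "<s>" if include_bos else ""
    user = user.strip()
    if system:
        return f"{bos}[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user} [/INST]"
    return f"{bos}[INST] {user} [/INST]"


def trim_at_stop(text: str, stops: Sequence[str] = STOP_SEQUENCES) -> str:
    """Cut a completion at the earliest stop string (hypothetical helper).

    This mimics the visible effect of stop strings on already generated text;
    it does not save any decoding work the way engine-side stops do.
    """
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```

Logging `repr(prompt)` for a known input is a quick way to confirm the newlines and markers land where the template expects them.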

#WebSocket protocol (what I actually built)

After accept:

Auth frame: { "type": "auth", "api_key": "…" } — invalid key → close 1008.
Inference loop: { "type": "inference", "prompt": "…", "max_tokens": 256, "temperature": 0.7, … }.
Generation: outputs = llm.generate([formatted_prompt], sampling_params) — batch, returns full text.
“Streaming”: slice text into ~10-char JSON messages { "type": "chunk", "text": "…" } with asyncio.sleep(0.01) between sends, then { "type": "response", "done": true }.
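The "streaming" step above can be sketched as a plain generator. It assumes the completion string already exists in full, which is exactly why the chunks are cosmetic. The function name `fake_stream_frames` is hypothetical; the frame shapes follow the list above.

```python
import json
from typing import Iterator


def fake_stream_frames(full_text: str, chunk_size: int = 10) -> Iterator[str]:
    """Yield JSON frames for text that has ALREADY been fully generated."""
    for start in range(0, len(full_text), chunk_size):
        piece = full_text[start:start + chunk_size]
        yield json.dumps({"type": "chunk", "text": piece})
    # Terminal frame, sent once every slice has gone out.
    yield json.dumps({"type": "response", "done": True})
```

In the WebSocket handler each frame would be sent with `await ws.send_text(frame)` followed by the short `asyncio.sleep`. Nothing in this loop can start before `llm.generate()` has returned.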

Rate limit: per client IP, per minute bucket (MAX_REQ_PER_MIN, default 60).
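A per-IP, per-minute bucket can be sketched as a fixed-window counter. The class name, the injectable clock, and the cleanup of old buckets are assumptions for the example; only MAX_REQ_PER_MIN and its default come from the service.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple

MAX_REQ_PER_MIN = 60  # default named in the post


class MinuteBucketLimiter:
    """Fixed-window limiter keyed on (client_ip, minute bucket)."""

    def __init__(self, limit: int = MAX_REQ_PER_MIN,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.clock = clock  # injectable so tests do not need to sleep
        self.counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, client_ip: str) -> bool:
        bucket = int(self.clock() // 60)
        # Drop counters from earlier minutes so the dict does not grow forever.
        for key in [k for k in self.counts if k[1] < bucket]:
            del self.counts[key]
        key = (client_ip, bucket)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

A fixed window allows short bursts at minute boundaries (up to twice the limit across two adjacent buckets), which is usually acceptable for a spike but worth knowing before reusing it.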

The health endpoint returns the RunPod pod ID, internal and external ports, MODEL_PATH, and CUDA visibility, which is essential when the platform maps symmetrical TCP ports (RUNPOD_TCP_PORT_* env vars).
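A sketch of how such a health payload could be assembled from environment variables. Several details are assumptions: that each `RUNPOD_TCP_PORT_<internal>` variable holds the public port, that the pod ID lives in `RUNPOD_POD_ID`, that CUDA visibility is read from `CUDA_VISIBLE_DEVICES`, and that the service listens on internal port 7860. The function name and field names are hypothetical.

```python
import os
from typing import Any, Dict, Mapping

INTERNAL_PORT = 7860  # assumption: the port the FastAPI app binds to


def build_health(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect the facts needed to debug port mapping and model loading."""
    prefix = "RUNPOD_TCP_PORT_"
    port_map = {
        int(key[len(prefix):]): int(value)
        for key, value in env.items()
        # Skip anything that is not a clean numeric internal/public pair.
        if key.startswith(prefix) and key[len(prefix):].isdigit() and value.isdigit()
    }
    return {
        "status": "ok",
        "pod_id": env.get("RUNPOD_POD_ID"),
        "internal_port": INTERNAL_PORT,
        "external_port": port_map.get(INTERNAL_PORT),
        "tcp_port_map": port_map,
        "model_path": env.get("MODEL_PATH"),
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
    }
```

Returning both ports in one response means a failed curl against the public port can be checked against what the pod believes its mapping is.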

There is also POST /generate with the same formatting and sampling — the endpoint I should have extended for streaming instead of WS.

#Hindsight: I should have used SSE (or vLLM’s OpenAI server)

On experiment A I already had the correct pattern:

HTTP POST with JSON body
Response text/event-stream
data: {"chunk": "…"}\n\n until data: [DONE]\n\n
Under the hood: real generateContentStream

For the GPU pod, the same shape would be:

http

POST /v1/chat/completions
Authorization: Bearer …
Accept: text/event-stream

…or a minimal FastAPI route that streams newline-delimited JSON or SSE frames from vLLM’s async engine.
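The experiment A wire format listed above is small enough to sketch end to end. The helper names `sse_frames` and `read_sse` are hypothetical, and the reader assumes the whole stream is available as one string, which is a simplification for the example.

```python
import json
from typing import Iterable, Iterator


def sse_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Encode text chunks as SSE events, ending with the [DONE] sentinel."""
    for chunk in chunks:
        payload = json.dumps({"chunk": chunk})
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"


def read_sse(stream: str) -> Iterator[str]:
    """Decode the same format back into text chunks."""
    for event in stream.split("\n\n"):
        if not event.startswith("data: "):
            continue  # blank tail or comment lines
        payload = event[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)["chunk"]
```

On the server side, a generator like `sse_frames` can be handed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`. The value only appears when the chunks come from a real token stream.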

#WebSocket vs SSE for LLM output

Use this when choosing what the phone (or your BFF’s public edge) speaks—not what vLLM supports internally.

Dimension	WebSocket (custom `/ws`)	SSE / HTTP stream (experiment A pattern)	vLLM OpenAI `stream: true`
Traffic shape	Bidirectional socket	One POST up, many events down	Same as SSE from client view if proxied
Time to first token	Only if inference streams; my pod faked chunks after batch `generate()`	Real when BFF uses `generateContentStream` (experiment A)	Real partial tokens from the engine
Mobile / corp networks	Upgrade + long-lived connection; proxies sometimes kill idle WS	Looks like normal HTTPS	Terminate at BFF; phone still sees SSE
Auth	Custom first-frame `{ type: "auth" }`	`Authorization` header on POST	API key on server; optional BFF proxy
Retries / idempotency	Reconnect + resync framing per message	New POST per user turn; simple replay	New completion request per turn
Load balancers	Often needs sticky sessions	Standard HTTP semantics	Run behind BFF or `vllm serve` with TLS
Ops surface	Hand-rolled protocol + rate limits	Fastify route + MDN-documented SSE	Maintained server; less FastAPI glue
Legitimate WebSocket case	Multiplexed bidirectional channels (games, CRDTs)	LLM assistant output, logs, progress	vLLM Realtime `/v1/realtime` for audio, not chat blurbs
This repo	Built for experiment B; no mobile caller	Already correct on the Gemini BFF	What I would deploy on the pod today

MDN’s SSE guide states plainly: SSE is for when the server pushes events to the front-end — “you can't send events from a client to a server” on that channel. LLM chat is exactly that for the response half: one prompt up, many tokens down.

A WebSocket could be justified between the BFF and the pod if I wanted one persistent server-to-server connection and to avoid repeated TLS handshakes. Even then, nothing requires exposing WebSocket to the phone.

#What I should have deployed instead

vllm serve TheBloke/Llama-2-13B-GPTQ with --api-key (OpenAI-compatible server).
Client uses stream: true on chat completions — default JSON-SSE chunks with real partial outputs (online serving docs).
Optional thin FastAPI proxy if I need custom auth/logging — proxy streams, do not re-chunk batch output.

vLLM added a Realtime WebSocket at /v1/realtime in 2026 for incremental audio and multimodal streams (vLLM blog, Jan 2026). That is the legitimate WS case — not “print Llama match blurbs to a phone.”

That aligns experiment A and B on the wire while keeping inference backends swappable.
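On the consuming side, a BFF reading an OpenAI-style `stream: true` response mostly needs to pull `choices[0].delta.content` out of each `data:` line. The sketch below assumes that chunk shape and a `[DONE]` sentinel; the function name `iter_delta_text` is hypothetical, and the input can be any iterable of decoded lines from an HTTP client.

```python
import json
from typing import Iterable, Iterator


def iter_delta_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style chat completion stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        # Role-only and final chunks carry no content; skip them.
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

Each yielded delta can be forwarded to the phone as one SSE event in the experiment A format, so the mobile client never learns which backend produced it.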

#Fake streaming vs real latency

Batch generate() returns only after the full completion is ready, so my loop sent chunk frames after all the inference work was done. Users saw a typewriter effect; time-to-first-token did not improve. This is the difference between:

Transport streaming (SSE/WS framing), and
Inference streaming (model emits partial tokens as they are sampled)

vLLM’s serving stack is built for the second; my WebSocket layer only implemented the first.
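A toy model makes the distinction measurable without a GPU. Here "sampling" a token just appends to a work log, and the question is how much work has happened before the first chunk reaches the client. All names are hypothetical and no real inference is involved.

```python
from typing import Callable, Iterator, List

Stream = Callable[[int, List[str]], Iterator[str]]


def sample_tokens(n: int, log: List[str]) -> Iterator[str]:
    """Stand-in for the model: each sampled token is recorded in the log."""
    for i in range(n):
        log.append(f"sampled {i}")
        yield f"tok{i} "


def batch_then_slice(n: int, log: List[str]) -> Iterator[str]:
    """Transport streaming only: finish the completion, then slice it."""
    full = "".join(sample_tokens(n, log))
    for start in range(0, len(full), 10):
        yield full[start:start + 10]


def inference_stream(n: int, log: List[str]) -> Iterator[str]:
    """Inference streaming: hand each token over as soon as it is sampled."""
    yield from sample_tokens(n, log)


def tokens_sampled_before_first_chunk(make_stream: Stream, n: int) -> int:
    log: List[str] = []
    next(make_stream(n, log))  # pull exactly one chunk
    return len(log)
```

For a 50-token completion, `batch_then_slice` has sampled all 50 tokens before the first chunk appears, while `inference_stream` has sampled one. The typewriter effect looks the same in both cases; only the second shortens the wait.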

#Operations on GPU pods (2025–2026 lessons)

Cold start: first snapshot_download can take tens of minutes — bake models into the image or attach a persistent volume.
Port mapping: public port ≠ internal 7860; health JSON should document both.
VRAM: 13B GPTQ still fails to load if another process is holding GPU memory or the quantization setting does not match the checkpoint.
Cost gate: compare GPU $/hour + engineer time against Gemini Flash per-million-token pricing before claiming savings.
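The cost gate is simple arithmetic once the inputs are written down. The sketch below uses hypothetical function names, and any prices passed in are placeholders to be replaced with real quotes, not actual RunPod or Gemini pricing.

```python
def break_even_tokens_per_hour(gpu_usd_per_hour: float,
                               api_usd_per_million_tokens: float) -> float:
    """Tokens per hour the pod must serve before it matches the managed API."""
    return gpu_usd_per_hour / api_usd_per_million_tokens * 1_000_000


def monthly_cost_delta(gpu_usd_per_hour: float, hours: float, tokens: float,
                       api_usd_per_million_tokens: float,
                       engineer_usd: float = 0.0) -> float:
    """Self-hosted cost minus managed cost; positive means self-hosting costs more."""
    self_hosted = gpu_usd_per_hour * hours + engineer_usd
    managed = tokens / 1_000_000 * api_usd_per_million_tokens
    return self_hosted - managed
```

With placeholder inputs of 0.50 USD per GPU hour and 0.25 USD per million tokens, the break-even is 2,000,000 tokens per hour, sustained. A pod that idles most of the day does not get close, and that is before counting engineer time.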

#How the two experiments fit together

Text

Experiment A (Gemini BFF)     Experiment B (GPU pod)
─────────────────────────     ────────────────────────
Managed API                   Owned weights
Real SSE + real stream        WebSocket + batch generate
Matching + chat agents        Raw inference service
Ship-first                    Economics + control
Mobile client wired: no       Mobile client wired: no

The portfolio story is not “we use AI.” It is “I tried both managed and self-hosted paths, implemented streaming correctly on one, learned transport on the other, and can explain what ships next.”

#What I would rebuild today

Pod: vllm serve + OpenAI streaming client from the BFF.
Mobile: only ever sees SSE from the BFF (experiment A pattern).
Delete custom WS auth framing unless I need multiplexing.
Chat template: llm.chat or HF template — never hand-roll [INST] again.
Integrate or delete — a pod without a caller is a science project; a BFF route without a client is a sketch.

#GPU pod wire protocol (FastAPI + vLLM)

The RunPod service required API_KEY at boot and rejected connections without a first-frame auth handshake:

Python

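import json

# Inside the FastAPI WebSocket handler, right after `await ws.accept()`.
# The first frame must be a valid auth message; otherwise close with 1008 (policy violation).
auth_msg = await ws.receive_text()
auth_data = json.loads(auth_msg)
if auth_data.get("type") != "auth" or auth_data.get("api_key") != API_KEY:
    await ws.close(code=1008, reason="Invalid authentication")
    return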
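The same check can be pulled into a pure function so it is testable without a socket. This is a hardened variant rather than the code above: handling malformed JSON and comparing keys with `hmac.compare_digest` (constant-time) are additions, and the name `is_valid_auth_frame` is hypothetical.

```python
import hmac
import json


def is_valid_auth_frame(raw: str, api_key: str) -> bool:
    """Return True only for a well-formed auth frame carrying the right key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # garbage on the first frame is treated as a failed auth
    if not isinstance(data, dict) or data.get("type") != "auth":
        return False
    supplied = data.get("api_key")
    if not isinstance(supplied, str):
        return False
    # Constant-time comparison avoids leaking key prefixes through timing.
    return hmac.compare_digest(supplied.encode(), api_key.encode())
```

The handler would then close with code 1008 whenever this returns `False`, which also covers clients that send a non-JSON first frame.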

Rate limiting keyed on client_ip plus a minute bucket (MAX_REQ_PER_MIN, default 60) stopped runaway loops during load tests. Fake streaming, where llm.generate() returns the full completion and the server then slices it into WebSocket chunks, was the main reason mobile clients should never have spoken WebSocket directly. SSE from a BFF can instead relay real token streams from vLLM’s OpenAI-compatible endpoint.

#Manual Llama-2 chat template risk

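format_llama2_prompt hand-builds [INST] / <<SYS>> markers. That is only safe on paths that do not apply a template themselves, such as llm.generate(). If the same string goes through a path that applies the chat template (llm.chat() or the OpenAI-compatible chat endpoint), the prompt is wrapped twice and outputs come back empty or repetitive. The debug logs (Generated text length: 0) were the signal to migrate to llm.chat() or HF template APIs.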
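One way to make double-wrapping hard to do by accident is to pass role/content messages around instead of formatted strings, and to reject input that already carries template markers. The helper name `build_messages` and the marker list are assumptions for the sketch.

```python
from typing import Dict, List

TEMPLATE_MARKERS = ("[INST]", "[/INST]", "<<SYS>>", "<</SYS>>")


def build_messages(user: str, system: str = "") -> List[Dict[str, str]]:
    """Return role/content messages, refusing text that is already templated."""
    for text in (user, system):
        if any(marker in text for marker in TEMPLATE_MARKERS):
            raise ValueError("prompt already contains Llama-2 template markers")
    messages: List[Dict[str, str]] = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    return messages
```

A list in this shape is what `tokenizer.apply_chat_template` and `llm.chat()` take as input, so the template is applied once, by the library that owns it.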

#Closing thought

Self-hosting pays off when you expose an OpenAI-compatible surface and hide transport behind the BFF. Raw WebSockets to mobile for one-way token streams are a design you will rewrite—match the wire protocol to the direction of data.

#Reader field guide

Experiment B is for economics and control—not for giving the phone a new socket protocol.

Pod boot checklist

MODEL_PATH points at a HuggingFace snapshot directory containing config.json, not the cache parent folder
Chat template via llm.chat() or tokenizer.apply_chat_template—not hand-built [INST] strings unless you enjoy empty completions
Health endpoint reports internal vs public ports (RunPod-style TCP mapping)
Cold start plan: baked model volume or snapshot_download in entrypoint—not download on first user request
Rate limit and API key enforced before inference work

Transport checklist (what I would do today)

Run vllm serve with OpenAI-compatible server (docs)
BFF calls stream: true; mobile only ever hits BFF SSE (experiment A wire format)
Delete custom WebSocket auth framing unless you need server↔server multiplexing
Do not confuse transport streaming (chunk frames) with inference streaming (partial tokens)—batch generate() + slice is the former only

Your situation	Choose
Ship AI features this sprint on a social app	Experiment A — managed API + SSE
Need owned weights / residency	GPU pod + OpenAI stream; phone still uses SSE via BFF
True bidirectional binary frames (CRDT, game state)	WebSocket—but not for one-way LLM text
Incremental audio with vLLM Realtime	vLLM `/v1/realtime` WebSocket—not this chat spike

If the pod has no caller and the BFF still points at Gemini, treat experiment B as ops homework, not product scope.

#On this site

Post	Why
Building a Gemini AI backend with SSE (experiment A)	Managed API + real SSE — the transport pattern this pod should have copied
Lessons from building a mobile events social platform	Why AI was optional infrastructure, not a launch blocker
Building a collaborative editor with CRDTs	Legitimate WebSocket use case (binary CRDT frames, not token streaming)

#References (curated)

vLLM’s OpenAI-compatible server is the rebuild target; the Realtime API post is a reminder that WebSocket on the pod is for audio, not chat copy.

Reference	Notes
vLLM OpenAI-compatible server	`stream: true` from the BFF—stop hand-rolling FastAPI WebSockets for text.
vLLM online serving	Deployment patterns (ports, health) on GPU hosts like RunPod.
vLLM Realtime API (Jan 2026)	WebSocket when the modality is realtime audio—not matchmaking blurbs.
MDN: Server-sent events	What the phone should still speak after you add a self-hosted backend.
