Building a transformer-based LLM from scratch is usually the domain of Python (PyTorch) or systems languages like Rust and C++. But TypeScript, paired with Effect, turns out to be surprisingly capable.
Repo link: Effect GPT
This project is a full, from-scratch LLM implementation in TypeScript: tokenization, multi-head attention, transformer blocks, cross-entropy loss, backpropagation, and an Adam optimizer with gradient clipping. It can pre-train on a corpus and then instruction-tune on chat-style data.
I didn’t just want it to work. I wanted it to be robust, testable, and disciplined. Effect became the backbone that let me write a systems-level implementation in TypeScript while keeping side effects, concurrency, and failures explicit.
The architecture at a glance
The model is assembled as a sequential network:
Embeddings -> TransformerBlock x3 -> OutputProjection
export const MAX_SEQ_LEN = 80;
export const EMBEDDING_DIM = 128;
export const HIDDEN_DIM = 256;

const network = [
new Embeddings(vocabSize, EMBEDDING_DIM, MAX_SEQ_LEN, rng),
new TransformerBlock(EMBEDDING_DIM, HIDDEN_DIM, rng),
new TransformerBlock(EMBEDDING_DIM, HIDDEN_DIM, rng),
new TransformerBlock(EMBEDDING_DIM, HIDDEN_DIM, rng),
new OutputProjection(EMBEDDING_DIM, vocabSize, rng),
];
const llm = new LLM(vocab, network);

export class TransformerBlock implements ModelLayer {
readonly _tag = "TransformerBlock";
attention: SelfAttention;
feedForward: FeedForward;
norm1: LayerNorm;
norm2: LayerNorm;
constructor(
embeddingDim: number = EMBEDDING_DIM,
hiddenDim: number = HIDDEN_DIM,
rng: Rng,
) {
this.attention = new SelfAttention(embeddingDim, rng);
this.feedForward = new FeedForward(embeddingDim, hiddenDim, rng);
this.norm1 = new LayerNorm(embeddingDim);
this.norm2 = new LayerNorm(embeddingDim);
}
get parametersCount(): number {
return (
this.attention.parametersCount +
this.feedForward.parametersCount +
this.norm1.parametersCount +
this.norm2.parametersCount
);
}
forward(input: Tensor2D): Effect.Effect<Tensor2D, ShapeError> {
return Effect.gen(this, function* () {
const attentionOut: Tensor2D = yield* this.attention.forward(input);
const norm1Out: Tensor2D = yield* this.norm1.forward(attentionOut);
const ffnOut: Tensor2D = yield* this.feedForward.forward(norm1Out);
const norm2Out: Tensor2D = yield* this.norm2.forward(ffnOut);
return norm2Out;
});
}
backward(dOut: Tensor2D, lr: number): Effect.Effect<Tensor2D, ShapeError> {
return Effect.gen(this, function* () {
let grad: Tensor2D = yield* this.norm2.backward(dOut, lr);
grad = yield* this.feedForward.backward(grad, lr);
grad = yield* this.norm1.backward(grad, lr);
grad = yield* this.attention.backward(grad, lr);
return grad;
});
}
}

Keeping the core pure
A common failure mode in ML systems is mixing model math with IO and orchestration concerns. In this codebase, logging, metrics, and seeded randomness are modeled as Effect services via Context.GenericTag.
export const Logger = Context.GenericTag<LoggerServiceId, LoggerService>("LoggerService");
export const Metrics = Context.GenericTag<MetricsServiceId, MetricsService>("MetricsService");
export const Seed = Context.GenericTag<SeedServiceId, SeedService>("SeedService");

That means core model code does not reach into the terminal, filesystem, or global mutable state directly. It asks for capabilities. Runtime wiring provides concrete implementations later.
Dependency injection without boilerplate
Instead of global singletons or giant config objects, dependencies are composed with Layers.
const LoggerLayer = PrettyLoggerLive("info");
const seedValue = parseSeedArg(process.argv);
const SeedLayerLive = SeedLayer(seedValue);
const AppLayer = Layer.mergeAll(
BunFileSystem.layer,
BunTerminal.layer,
LoggerLayer,
InMemoryMetricsLive,
SeedLayerLive,
);

Training-related dependencies are layered the same way:
export const TrainingConfig = Context.GenericTag<TrainingConfigId, TrainingConfig>(
"TrainingConfig",
);
export const LLMService = Context.GenericTag<LLMServiceId, LLM>("LLMService");
export const PreprocessSettings = Context.GenericTag<PreprocessSettingsId, PreprocessSettings>(
"PreprocessSettings",
);
export const makeLLMLayer = (llm: LLM) => Layer.succeed(LLMService, llm);
export const makeTrainingConfigLayer = (config: TrainingConfig) =>
Layer.succeed(TrainingConfig, config);
export const makePreprocessSettingsLayer = (settings: PreprocessSettings) =>
Layer.succeed(PreprocessSettings, settings);

In tests, the same runtime can be replaced with test doubles:
export const TestServicesLayer = Layer.mergeAll(SilentLoggerLive, NoOpMetricsLive);

Two-phase training
Training is split into two distinct phases:
- Pre-training on raw corpus text.
- Instruction tuning on chat-style dataset examples.
yield* info("\n=== PRE-TRAINING MODEL ===");
yield* info(`Pre-training for ${PRETRAIN_EPOCHS} epochs with learning rate ${PRETRAIN_LR}`);
yield* trainStream(dataset.pretrainingStream).pipe(
Effect.provide(llmLayer),
Effect.provide(
makeTrainingConfigLayer({
epochs: PRETRAIN_EPOCHS,
learningRate: PRETRAIN_LR,
}),
),
Effect.provide(preprocessLayer),
);

Both phases run through the same training machinery and share model weights, so fine-tuning refines the already-learned representation rather than replacing it.
Streaming training pipeline
Loading all examples into memory is not viable. Effect Stream lets the pipeline stay lazy and bounded.
const preprocessed = makeStream().pipe(
Stream.mapError(TrainingError.fromUnknown),
Stream.mapChunks(Chunk.chunksOf(batchSize)),
Stream.flattenChunks,
Stream.mapEffect(preprocess, { concurrency }),
Stream.filterMap((value) => value),
);
const trainExample = ({ inputIds, targetIds }: { inputIds: number[]; targetIds: number[] }) =>
Effect.gen(function* () {
let input = T.fromArray(1, inputIds.length, inputIds);
for (const layer of llm.network) {
input = yield* mapShapeError(layer.forward(input));
}
const logits = input;
const probs = yield* wrapThrowing(() => softmaxRows(logits), mapShapeUnknown);
const loss = yield* wrapThrowing(
() => crossEntropyLoss(probs, targetIds),
mapShapeUnknown,
);
yield* Ref.update(totalLossRef, (current) => current + loss);
yield* Ref.update(totalExamplesRef, (current) => current + 1);
let grads = yield* wrapThrowing(() => dLogits(probs, targetIds), mapShapeUnknown);
clipGlobalL2(grads, clipNorm);
for (let i = llm.network.length - 1; i >= 0; i--) {
grads = yield* mapShapeError(
llm.network[i]!.backward(grads, config.learningRate),
);
}
const tokens = Ops.argmaxRows(probs);
const nextToken = tokens[tokens.length - 1];
if (nextToken === endTokenId.value) {
return;
}
});
yield* Effect.scoped(
Stream.runDrain(
Stream.mapEffect(preprocessed, trainExample, {
concurrency: trainConcurrency,
}),
),
);

Errors you can actually type
Unstructured exceptions make ML systems brittle. Here, failures are represented as explicit domain errors grouped into a TrainingError union.
export class TrainingDatasetError extends Data.TaggedError("TrainingDatasetError")<{
readonly cause: DatasetLoadError | DatasetParseError;
}> {}
export class TrainingShapeError extends Data.TaggedError("TrainingShapeError")<{
readonly cause: ShapeError;
}> {}
export class TrainingTokenizerError extends Data.TaggedError("TrainingTokenizerError")<{
readonly message: string;
readonly cause?: unknown;
}> {}
export class TrainingOptimizerError extends Data.TaggedError("TrainingOptimizerError")<{
readonly message: string;
readonly cause?: unknown;
}> {}
export class TrainingConfigError extends Data.TaggedError("TrainingConfigError")<{
readonly message: string;
readonly cause?: unknown;
}> {}
export class TrainingUnknownError extends Data.TaggedError("TrainingUnknownError")<{
readonly cause: unknown;
}> {}
export type TrainingError =
| TrainingDatasetError
| TrainingShapeError
| TrainingTokenizerError
| TrainingOptimizerError
| TrainingConfigError
| TrainingUnknownError;

export const TrainingError = {
dataset: (cause: DatasetLoadError | DatasetParseError): TrainingDatasetError =>
new TrainingDatasetError({ cause }),
shape: (cause: ShapeError): TrainingShapeError => new TrainingShapeError({ cause }),
tokenizer: (message: string, cause?: unknown): TrainingTokenizerError =>
new TrainingTokenizerError({ message, cause }),
optimizer: (message: string, cause?: unknown): TrainingOptimizerError =>
new TrainingOptimizerError({ message, cause }),
config: (message: string, cause?: unknown): TrainingConfigError =>
new TrainingConfigError({ message, cause }),
unknown: (cause: unknown): TrainingUnknownError => new TrainingUnknownError({ cause }),
fromUnknown: (error: unknown): TrainingError => {
if (error instanceof TrainingDatasetError) return error;
if (error instanceof TrainingShapeError) return error;
if (error instanceof TrainingTokenizerError) return error;
if (error instanceof TrainingOptimizerError) return error;
if (error instanceof TrainingConfigError) return error;
if (error && typeof error === "object") {
const candidate = error as { _tag?: string };
if (
candidate._tag === "DatasetLoadError" ||
candidate._tag === "DatasetParseError"
) {
return new TrainingDatasetError({
cause: error as DatasetLoadError | DatasetParseError,
});
}
if (candidate._tag === "ShapeError") {
return new TrainingShapeError({ cause: error as ShapeError });
}
}
return new TrainingUnknownError({ cause: error });
},
};

const mapShapeError = <A, R>(effect: Effect.Effect<A, ShapeError, R>) =>
effect.pipe(Effect.mapError(TrainingError.shape));

So instead of discovering hidden crash paths at runtime, the compiler forces explicit handling or intentional propagation.
Adam optimizer and gradient clipping
Each weight matrix has its own Adam state (m, v, timestep). Updates use bias-corrected moments with standard coefficients (beta1=0.9, beta2=0.999, epsilon=1e-8). Before backward updates, gradients are clipped by global L2 norm.
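The clipGlobalL2 helper used by the training loop is not shown in the post; a minimal sketch of global-norm clipping (assuming gradients expose a flat Float64Array data field, a simplification of the repo's Tensor2D) might be:

```typescript
// Scale all gradients down uniformly when their global L2 norm exceeds maxNorm.
const clipGlobalL2 = (grads: { data: Float64Array }, maxNorm: number): void => {
  let sumSquares = 0;
  for (const g of grads.data) sumSquares += g * g;
  const norm = Math.sqrt(sumSquares);
  if (norm > maxNorm) {
    const scale = maxNorm / norm;
    for (let i = 0; i < grads.data.length; i++) grads.data[i] *= scale;
  }
};

const grads = { data: Float64Array.from([3, 4]) }; // global norm = 5
clipGlobalL2(grads, 1.0);
// Norm is now 1; the direction is preserved: roughly [0.6, 0.8].
```

Clipping by the global norm, rather than per element, keeps the relative magnitudes of all gradients intact, which tames exploding gradients without biasing the update direction.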
const clipNorm = config.clipNorm ?? 5.0;

import type { Tensor2D } from "../tensor/Tensor2D";
import * as T from "../tensor/Tensor2D";
import { ShapeError } from "../tensor/ops";
export class Adam {
readonly beta1 = 0.9;
readonly beta2 = 0.999;
readonly epsilon = 1e-8;
timestep = 0;
m: Tensor2D;
v: Tensor2D;
private constructor(rows: number, cols: number) {
this.m = T.zeros(rows, cols);
this.v = T.zeros(rows, cols);
}
static make(rows: number, cols: number): Adam {
return new Adam(rows, cols);
}
step(params: Tensor2D, grads: Tensor2D, lr: number): void {
if (params.rows !== grads.rows || params.cols !== grads.cols) {
throw new ShapeError(
`Adam.step: params shape (${params.rows},${params.cols}) != grads shape (${grads.rows},${grads.cols})`,
);
}
if (this.m.rows !== params.rows || this.m.cols !== params.cols) {
throw new ShapeError(
`Adam.step: optimizer shape (${this.m.rows},${this.m.cols}) != params shape (${params.rows},${params.cols})`,
);
}
this.timestep += 1;
const beta1 = this.beta1;
const beta2 = this.beta2;
const oneMinusB1 = 1 - beta1;
const oneMinusB2 = 1 - beta2;
const mData = this.m.data;
const vData = this.v.data;
const pData = params.data;
const gData = grads.data;
// Bias corrections for the first and second moment estimates.
const biasCorrection1 = 1 - Math.pow(beta1, this.timestep);
const biasCorrection2 = 1 - Math.pow(beta2, this.timestep);
for (let i = 0; i < pData.length; i++) {
const g = gData[i]!;
// Exponential moving averages of the gradient and its square.
mData[i] = beta1 * mData[i]! + oneMinusB1 * g;
vData[i] = beta2 * vData[i]! + oneMinusB2 * g * g;
// Bias-corrected moments drive the parameter update.
const mHat = mData[i]! / biasCorrection1;
const vHat = vData[i]! / biasCorrection2;
pData[i] = pData[i]! - (lr * mHat) / (Math.sqrt(vHat) + this.epsilon);
}
}
}

Automatic resource management
Wrapping the training drain in Effect.scoped ensures acquired resources are released whether a run succeeds, fails, or gets interrupted.
yield* Effect.scoped(
Stream.runDrain(
Stream.mapEffect(preprocessed, trainExample, {
concurrency: trainConcurrency,
}),
),
);

Fiber-safe caching
With concurrent example processing, layer-level activation caches can collide if keyed poorly. Using Effect.fiberId as cache identity prevents cross-fiber contamination during forward/backward transitions.
Reproducible randomness
A SeedLayer provides seeded RNG when --seed is passed, and non-deterministic RNG otherwise.
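The seeded and systemRng factories referenced below are not shown in the post; a minimal seeded generator in the same spirit (a mulberry32-style PRNG, an assumption rather than the repo's implementation) could look like:

```typescript
interface Rng {
  next: () => number; // uniform in [0, 1)
}

// mulberry32: a tiny, fast 32-bit PRNG; deterministic for a given seed.
const seeded = (seed: number): Rng => {
  let state = seed >>> 0;
  return {
    next: () => {
      state = (state + 0x6d2b79f5) >>> 0;
      let t = state;
      t = Math.imul(t ^ (t >>> 15), t | 1);
      t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
      return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    },
  };
};

// Non-deterministic fallback when no --seed is given.
const systemRng = (): Rng => ({ next: () => Math.random() });

const a = seeded(42);
const b = seeded(42);
```

Two generators built from the same seed produce identical sequences, which is what makes --seed runs reproducible.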
const parseSeedArg = (argv: string[]): number | undefined => {
const seedIndex = argv.findIndex((arg) => arg === "--seed");
if (seedIndex >= 0 && seedIndex < argv.length - 1) {
const asNum = Number(argv[seedIndex + 1]);
return Number.isFinite(asNum) ? asNum : undefined;
}
return undefined;
};

const makeSeedService = (seed?: number): SeedService => {
const rng = seed === undefined ? systemRng() : seeded(seed);
return {
rng,
fork: () => {
const nextSeed = Math.floor(rng.next() * 0xffffffff);
return makeSeedService(seed === undefined ? undefined : nextSeed);
},
};
};
export const SeedLayer = (seed?: number): Layer.Layer<SeedServiceId> =>
Layer.succeed(Seed, makeSeedService(seed));

Metrics without global state
Metrics (counters, gauges, histograms, timings) are provided through a metrics service backed by Effect primitives and emitted as snapshots at run end.
const epochCounter = yield* counter("epochs_completed");
const lossGauge = yield* gauge("epoch_loss");
const examplesCounter = yield* counter("examples_processed");
const epochResult = yield* timed(
`epoch_${epoch}`,
Effect.gen(function* () {
const totalLossRef = yield* Ref.make(0);
const totalExamplesRef = yield* Ref.make(0);
}),
);

const metrics = yield* snapshot();
yield* info("Training complete", {
epochsCompleted: metrics.counters.find((c) => c.name === "epochs_completed")?.value,
totalExamples: metrics.counters.find((c) => c.name === "examples_processed")?.value,
finalLoss: metrics.gauges.find((g) => g.name === "epoch_loss")?.value,
});

const noOpCounter: Counter = {
inc: () => Effect.void,
get: () => Effect.succeed(0),
};
const noOpGauge: Gauge = {
set: () => Effect.void,
get: () => Effect.succeed(0),
};
const noOpHistogram: Histogram = {
observe: () => Effect.void,
getStats: () => Effect.succeed({ count: 0, sum: 0, min: 0, max: 0, mean: 0 }),
};
const noOpMetrics: MetricsService = {
counter: () => Effect.succeed(noOpCounter),
gauge: () => Effect.succeed(noOpGauge),
histogram: () => Effect.succeed(noOpHistogram),
timed: (_, effect) => Effect.map(effect, (value) => ({ value, durationMs: 0 })),
snapshot: () =>
Effect.succeed({
counters: [],
gauges: [],
histograms: [],
timings: [],
}),
};
export const NoOpMetricsLive: Layer.Layer<MetricsServiceId> = Layer.succeed(Metrics, noOpMetrics);

Fearless concurrency
Preprocessing and training steps are parallelized declaratively via stream concurrency options. Shared epoch accumulators use Effect Ref, so updates stay atomic.
const trainConcurrency = clampConcurrency(config.trainConcurrency, 4);

BunRuntime.runMain(program);

Closing
Effect made this TypeScript LLM project practical not just by running code, but by enforcing architecture: clear service boundaries, replaceable runtime layers, typed error channels, deterministic execution, and safe concurrency/cleanup.
The result is a stack that stays approachable while still being rigorous enough for serious systems and ML experimentation.