Project

Local model experiments

Idea · 2 min read · ai, models, evaluation, self-hosted

Experiment with local or self-hosted AI models so the model stays stable and the evaluation results are easier to trust.

High-level concept

I want a setup for experimenting with local or self-hosted AI models so the model itself stops being a moving target. Hosted models can change over time through silent updates, changes in quantization or routing, or other backend shifts. When results move, that makes it hard to tell whether the cause was my setup or the provider.

The goal is to remove that variable and make comparisons more meaningful. With a stable model target, I can measure prompts, harnesses, workflows, and evaluation methods against something that does not keep drifting underneath me.

The work only matters if the measurements are concrete, so defining success comes first (see "Defining success"). The first version should stay simple enough to trust and repeat, which is the same principle as in "Architecture for small tools".

Detailed steps

  • Pick one local or self-hosted model stack to start with.
  • Define the tasks I want to evaluate and the metrics that matter.
  • Build a small harness that can run the same prompts against the same model version.
  • Record the model version, parameters, hardware, and runtime so results stay reproducible.
  • Compare the local setup with a hosted baseline only as a reference point.
  • Write down where local models are good enough and where hosted models still win.
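
The harness and record-keeping steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a finished design: `RunConfig`, `run_eval`, `fingerprint`, and `fake_generate` are hypothetical names, the pass check is a naive substring match, and `fake_generate` stands in for whatever call the chosen model stack actually exposes. It uses only the Python standard library.

```python
import hashlib
import json
import platform
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunConfig:
    """Everything needed to say "this is the same model setup as last time"."""
    model_name: str            # hypothetical label, e.g. the weights file name
    model_sha256: str          # hash of the weights file, pins the exact version
    params: Dict[str, float]   # sampling parameters passed to the model
    runtime: str               # inference runtime and version, entered by hand


def sha256_file(path: str) -> str:
    """Hash a weights file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint(config: RunConfig) -> str:
    """Short stable ID: identical configs always map to the same ID."""
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def hardware_info() -> Dict[str, str]:
    """Basic host details; GPU details would need a stack-specific query."""
    return {
        "machine": platform.machine(),
        "system": platform.system(),
        "python": platform.python_version(),
    }


def run_eval(
    config: RunConfig,
    generate: Callable[[str, Dict[str, float]], str],
    cases: List[Dict[str, str]],
) -> dict:
    """Run every case against one fixed config and return a JSON-ready report."""
    results = []
    for case in cases:
        started = time.perf_counter()
        output = generate(case["prompt"], config.params)
        elapsed = time.perf_counter() - started
        results.append({
            "id": case["id"],
            "output": output,
            "passed": case["expected"] in output,  # naive metric, an assumption
            "seconds": round(elapsed, 4),
        })
    passed = sum(1 for r in results if r["passed"])
    return {
        "config": asdict(config),
        "config_id": fingerprint(config),
        "hardware": hardware_info(),
        "accuracy": passed / len(results) if results else 0.0,
        "results": results,
    }


def fake_generate(prompt: str, params: Dict[str, float]) -> str:
    """Deterministic stand-in for a real local model call."""
    return "4" if "2 + 2" in prompt else "unknown"


if __name__ == "__main__":
    config = RunConfig(
        model_name="demo-model",
        model_sha256="0" * 64,  # placeholder; use sha256_file() on real weights
        params={"temperature": 0.0},
        runtime="demo-runtime 0.0",
    )
    cases = [
        {"id": "arith", "prompt": "What is 2 + 2?", "expected": "4"},
        {"id": "geo", "prompt": "Capital of France?", "expected": "Paris"},
    ]
    print(json.dumps(run_eval(config, fake_generate, cases), indent=2))
```

Storing each report next to its `config_id` means two runs can only be compared as "same model" when the IDs match. A hosted baseline would plug in as a different `generate` function with its own config, which keeps it clearly labeled as a reference point rather than a stable target.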