Inference source for the benchmarks + Model selection criteria

Which source(s) should we use to run inference for the tests?

Something like OpenRouter? Should we benchmark some of the smaller models locally?
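If we do split between a hosted source and local runs, the routing could be as simple as a size cutoff. A rough sketch only: `ModelSpec`, `pick_backend`, and the 8B threshold are all made up for illustration, and nothing here calls a real provider API.

```python
from dataclasses import dataclass

# Hypothetical cutoff (billions of parameters); the real value would depend
# on whatever hardware we have available for local runs.
LOCAL_MAX_PARAMS_B = 8.0


@dataclass(frozen=True)
class ModelSpec:
    name: str
    params_b: float  # parameter count in billions


def pick_backend(model: ModelSpec, local_max_params_b: float = LOCAL_MAX_PARAMS_B) -> str:
    """Return "local" for models at or below the cutoff, "hosted" otherwise."""
    return "local" if model.params_b <= local_max_params_b else "hosted"
```

A fixed cutoff keeps the decision reproducible; we could swap it for a per-model override list if some small models turn out to be awkward to run locally.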

Should we add models based on what's hot or getting chatter on the net? Open to suggestions!

When we add a model, we only need to run that model against the current period's benchmark dataset.
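One way to picture this: keep a record of which (model, period) pairs already have results, and only run what is missing. A minimal sketch, assuming results are tracked as a set of pairs; `pending_runs`, the model names, and the period label are placeholders, not anything that exists yet.

```python
def pending_runs(models, period, completed):
    """Return the models that have no result yet for the given period.

    `completed` is a set of (model_name, period) pairs that were already run.
    """
    return [m for m in models if (m, period) not in completed]


# Hypothetical state: two models already benchmarked, one newly added.
completed = {("model-a", "period-1"), ("model-b", "period-1")}
models = ["model-a", "model-b", "model-c"]

print(pending_runs(models, "period-1", completed))  # ['model-c']
```

When a new period starts, every model comes back as pending for that period, so the same check covers both cases.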