Improving AI Benchmark Reliability