Carrying out even the simplest performance benchmark requires considerable knowledge of statistics and computer systems, as well as painstaking adherence to many error-prone steps; these are distinct skill sets, yet both are essential for obtaining statistically valid results. As a result, many performance measurements in peer-reviewed publications are flawed. Among other problems, they fall short of one or more of the following requirements: accuracy, precision, comparability, repeatability, and control of overhead. This is a serious problem because poor performance measurements misguide system design and optimization. We propose a collection of algorithms and heuristics that automate these steps, covering the collection, storage, analysis, and comparison of performance measurements. We implement these methods as a readily usable open-source software framework called Pilot, which can help reduce human error and shorten benchmark time. Evaluation of Pilot on various benchmarks shows that it can reduce the cost and complexity of running benchmarks and produce better measurement results.