Abstract
End-to-end imitation and reinforcement learning for self-driving vehicles have advanced rapidly in recent years. Despite this progress, there is a severe lack of standardized metrics for evaluating the performance of autonomous driving agents. Existing metrics generally fail to capture a wide range of driving behaviors or to compare the severity of different failure cases. In this work, we introduce the Quantitative Evaluation for Driving metric, or QED, which assigns a quantitative score from 0 to 100 capturing the quality of driving for any driving agent. QED assesses multiple aspects of driving behavior, including the ability to stay centered in the lane, avoid weaving and erratic behavior, follow the speed limit, and avoid collisions, and it can be applied under a wide range of driving scenarios. To demonstrate the effectiveness of QED, we compare the scores it generates against scores assigned by human evaluators on a total of 30 different drivers across 6 different towns in the CARLA driving simulator. In ``easy'' evaluation scenarios, where it is relatively straightforward to distinguish better drivers from worse ones, QED attains an average Pearson correlation of 0.96 and an average Spearman correlation of 0.97 against human evaluators. In ``hard'' evaluation scenarios, where ranking and scoring different types of bad driving behavior is far more ambiguous, QED attains an average Pearson correlation of 0.82 and an average Spearman correlation of 0.75 against human evaluators, both slightly higher than the agreement among human evaluators themselves. While QED may not capture every characteristic of good driving, we consider it an important foundation for reproducibility and standardization in the community.