In Table 9, the average success rate for 0.75 × d appears lower than for both 0.5 × d and 1.0 × d, yet the paper states that “reducing the expert’s hidden size to 0.75 × d achieves a good balance between performance and efficiency.”
Could you clarify how you concluded that 0.75 × d is the best trade-off? Was it chosen based on other considerations (e.g., stability, memory usage, or consistency across tasks)?
Thanks!
