In this paper we propose a new hardware architecture of modular exponentiation, which is based on the optimized Montgomery multiplication. At CHES 1999, Tenca introduced a new architecture for implementing the Montgomery multiplication which was later improved by Huang et al. at PKC 2008. In this paper we improve the architecture of Huang and the improved one occupies less hardware resource, at the same time we add the final subtraction of the Montgomery algorithm into the architecture in order to do the exponentiation computation. Finally we use this improved architecture to build a RSA coprocessor. Compared with the previous work, the new 1024-bit RSA coprocessor saved nearly 50% of area, and the area utilization is greatly improved. This design is the smallest design as we know in the literature, and we verified the correctness by huge test data.