The analysis of Cosmic Microwave Background (CMB) observations is a long-standing computational challenge, driven by the exponential growth in the size of the data sets being gathered. Since this growth is projected to continue for at least the next decade, it will be critical to extend the analysis algorithms and their implementations to petascale high-performance computing (HPC) systems and beyond. The most computationally intensive part of the analysis is generating and reducing Monte Carlo realizations of an experiment's data. In this work we take the current state-of-the-art simulation and mapping software and investigate its performance when scaled to tens of thousands of cores on a range of leading HPC systems, focusing in particular on the communication bottleneck that emerges at high concurrencies. We present a new communication strategy that removes this bottleneck, enabling CMB analyses of unprecedented scale and hence fidelity. Experimental results show that the new strategy speeds up communication by up to 116x.