Advances in semiconductor technology are enabling designs with several hundred million transistors. Since building sophisticated single processor based systems is a complex process from design, verification, and software development perspectives, the use of chip multiprocessing is inevitable in future microprocessors. In fact, the abundance of explicit loop-level parallelism in many embedded applications helps us identify chip multiprocessing as one of the most promising directions in designing systems for embedded applications. Another architectural trend that we observe in embedded systems, namely, multi-voltage processors, is driven by the need of reducing energy consumption during program execution. Practical implementations such as Transmeta's Crusoe and Intel's XScale tune processor voltage/frequency depending on current execution load. Considering these two trends, chip multiprocessing and voltage/frequency scaling, this paper presents an optimization strategy for an architecture that makes use of both chip parallelism and voltage scaling. In our proposal, the compiler takes advantage of heterogeneity in parallel execution between the loads of different processors and assigns different voltages/frequencies to different processors if doing so reduces energy consumption without increasing overall execution cycles significantly. Our experiments with a set of applications show that this optimization can bring large energy benefits without much performance loss.