The analysis of high-throughput metagenomic sequencing data poses significant computational challenges. Most current de novo assembly tools use the de Bruijn graph-based methodology. In prior work, a connected components decomposition of the de Bruijn graph and subsequent partitioning of sequence read data was shown to be an effective memory-reducing preprocessing step for de novo assembly of large metagenomic datasets. In this paper, we present METAPREP, a new end-to-end parallel implementation of a similar preprocessing step. METAPREP has efficient implementations of several computational subroutines (e.g., k-mer enumeration and counting, parallel sorting, graph connectivity) that occur in other genomic data analysis problems, and we show that our implementations are comparable to the state of the art. METAPREP is primarily designed to execute on large shared-memory multicore servers, but scales gracefully to multiple compute nodes and clusters with parallel I/O capabilities. With METAPREP, we can process the Iowa Continuous Corn soil metagenomics dataset, comprising 1.13 billion reads totaling 223 billion base pairs, in around 14 minutes using just 16 nodes of the NERSC Edison supercomputer. We also evaluate the performance impact of METAPREP on MEGAHIT, a parallel metagenome assembler.
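To make the k-mer enumeration and counting subroutine concrete, the following is a minimal serial sketch in Python; it is purely illustrative and does not reflect METAPREP's parallel implementation, and the function names are our own.

```python
from collections import Counter

def enumerate_kmers(read, k):
    """Yield all length-k substrings (k-mers) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def count_kmers(reads, k):
    """Count k-mer occurrences across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(enumerate_kmers(read, k))
    return counts

# Toy example with k = 3.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
print(counts["CGT"])  # "CGT" appears once in each read, so this prints 2
```

A production k-mer counter would additionally canonicalize each k-mer with its reverse complement and shard the hash table (or sort packed k-mer integers) across threads, which is where the parallel sorting subroutine comes in.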