Many applications call for learning to label individual objects in an image where the only information available to the learner is a dataset of images with their associated captions, i.e., words that describe the image content without specifically labeling the individual objects. We address this problem using a multi-modal hierarchical Dirichlet process model (MoM-HDP) - a nonparametric Bayesian model which provides a generalization for multi-model latent Dirichlet allocation model (MoM-LDA) used for similar problems in the past. We apply this model for predicting labels of objects in images containing multiple objects. During training, the model has access to an un-segmented image and its caption, but not the labels for each object in the image. The trained model is used to predict the label for each region of interest in a segmented image. MoM-HDP generalizes a multi-modal latent Dirichlet allocation model in that it allows the number of components of the mixture model to adapt to the data. The model parameters are efficiently estimated using variational inference. Our experiments show that MoM-HDP performs just as well as or better than the MoM-LDA model (regardless the choice of the number of clusters in the MoM-LDA model).