Learning Cross-modal Embeddings
for Cooking Recipes and Food Images

Amaia Salvador*
Nicholas Hynes*
Yusuf Aytar
Javier Marin
Ferda Ofli
Ingmar Weber
Antonio Torralba

♰ Universitat Politècnica de Catalunya
✦ Massachusetts Institute of Technology
✥ Qatar Computing Research Institute

CVPR 2017
* contributed equally

Download Paper


In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. Using these data, we train a neural network to find a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Additionally, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M dataset and food and cooking in general.

Recipe1M dataset

We present the large-scale Recipe1M dataset which contains one million structured cooking recipes with associated images.

Follow this link to download the dataset.

Below are the dataset statistics:

Joint embedding

We train a joint embedding composed of an encoder for each modality (ingredients, instructions and images).
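As a rough sketch of the idea (not the paper's exact architecture or loss), each modality gets its own encoder whose output is projected into a shared space, and the model is trained so that matched image–recipe pairs score higher than mismatched ones. All sizes, the margin, and the hinge formulation below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical pre-pooled features: image features (e.g. from a CNN) and
# recipe features (e.g. from ingredient/instruction encoders).
img_feats = rng.standard_normal((4, 1024))   # batch of 4 images
rec_feats = rng.standard_normal((4, 2048))   # their paired recipes

# One linear projection per modality into a shared embedding space.
d_embed = 256
W_img = rng.standard_normal((1024, d_embed)) * 0.02
W_rec = rng.standard_normal((2048, d_embed)) * 0.02

img_emb = l2_normalize(img_feats @ W_img)
rec_emb = l2_normalize(rec_feats @ W_rec)

# Cosine similarity between every image and every recipe in the batch.
sim = img_emb @ rec_emb.T                    # shape (4, 4)

# Alignment objective: pull matched pairs (the diagonal) together and
# push mismatched pairs apart by a margin (hinge on off-diagonal sims).
margin = 0.3
pos = np.diag(sim)
loss = np.maximum(0.0, margin + sim - pos[:, None])
np.fill_diagonal(loss, 0.0)
print(loss.mean())
```

In practice the projections would be learned by gradient descent together with the encoders; this snippet only shows the shape of the objective.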


im2recipe retrieval

We evaluate all the recipe representations for im2recipe retrieval. Given a food image, the task is to retrieve its recipe from a collection of test recipes.
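The standard way to score this task is by the rank of the true recipe among all test recipes, summarized as median rank and recall@K. A small self-contained sketch on toy embeddings (the test-set construction here is synthetic, not Recipe1M data):

```python
import numpy as np

def im2recipe_rank(img_emb, rec_emb):
    """Rank of each image's true recipe among all test recipes
    (1 = the correct recipe is the nearest neighbour)."""
    # Cosine similarity; embeddings are assumed L2-normalized.
    sim = img_emb @ rec_emb.T
    # For image i, count recipes scored at least as high as recipe i.
    true_sim = np.diag(sim)
    return (sim >= true_sim[:, None]).sum(axis=1)

rng = np.random.default_rng(1)
# Toy test set: paired, noisy embeddings in a shared 32-d space.
rec_emb = rng.standard_normal((100, 32))
img_emb = rec_emb + 0.1 * rng.standard_normal((100, 32))
rec_emb /= np.linalg.norm(rec_emb, axis=1, keepdims=True)
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)

ranks = im2recipe_rank(img_emb, rec_emb)
print("median rank:", np.median(ranks))
print("recall@10:", (ranks <= 10).mean())
```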

Comparison with humans

To better assess the quality of our embeddings, we also evaluate human performance on the im2recipe task.

Check out the paper for full details and more analysis.



Demo

[Coming soon!] Our online demo will let you upload your own food images and retrieve matching recipes from our dataset.

Embedding Analysis

We explore whether any semantic concepts emerge in the neuron activations and whether the embedding space has certain arithmetic properties.

Visualizing embedding units

We show the localized unit activations in both image and recipe embeddings. We find that certain units show localized semantic alignment between the embeddings of the two modalities.
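One simple way to screen for such units, assuming paired image and recipe embeddings, is to correlate each unit's activation across the two modalities over matched pairs. This is a toy illustration of the idea, not the visualization procedure used in the paper:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy paired embeddings: plant a shared "concept" signal in unit 7 of
# both modalities; all other units are independent noise.
n, d = 500, 64
concept = rng.standard_normal(n)
img_emb = rng.standard_normal((n, d))
rec_emb = rng.standard_normal((n, d))
img_emb[:, 7] += 2.0 * concept
rec_emb[:, 7] += 2.0 * concept

# Per-unit Pearson correlation between modalities over matched pairs.
img_c = img_emb - img_emb.mean(axis=0)
rec_c = rec_emb - rec_emb.mean(axis=0)
corr = (img_c * rec_c).sum(axis=0) / (
    np.linalg.norm(img_c, axis=0) * np.linalg.norm(rec_c, axis=0))

print("most aligned unit:", corr.argmax())  # the planted unit, 7
```

Units with high cross-modal correlation are the natural candidates to inspect with localized activation visualizations.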


Embedding arithmetic

We demonstrate the capabilities of our learned embeddings with simple arithmetic operations. In the context of food recipes, one would expect that:

v(chicken pizza) - v(pizza) + v(lasagna) = v(chicken lasagna)

where v represents the map into the embedding space.

We investigate whether our learned embeddings have such properties by applying this equation template to the averaged vectors of recipes whose titles contain the queried words. The figures below show results for same-modality and cross-modal embedding arithmetic.
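The mechanics of such a query can be sketched in a toy embedding space built from orthogonal concept vectors. This constructed space is purely illustrative (the paper averages embeddings of real recipes); it only shows how the arithmetic and nearest-neighbour lookup fit together:

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x)

# Toy dish embeddings composed additively from orthogonal concepts.
pizza, lasagna, salad, chicken = np.eye(4)

emb = {
    "pizza": pizza,
    "lasagna": lasagna,
    "salad": salad,
    "chicken pizza": unit(pizza + chicken),
    "chicken lasagna": unit(lasagna + chicken),
    "chicken salad": unit(salad + chicken),
}

# v(chicken pizza) - v(pizza) + v(lasagna), then nearest neighbour
# by cosine similarity over the vocabulary.
query = unit(emb["chicken pizza"] - emb["pizza"] + emb["lasagna"])
best = max(emb, key=lambda name: float(emb[name] @ query))
print(best)  # -> chicken lasagna
```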

Results are shown for four settings: image embeddings, recipe embeddings, and cross-modal arithmetic in both directions (I2R and R2I).

Code & Trained Models


Citation

@inproceedings{salvador2017learning,
  title={Learning Cross-modal Embeddings for Cooking Recipes and Food Images},
  author={Salvador, Amaia and Hynes, Nicholas and Aytar, Yusuf and Marin, Javier and
          Ofli, Ferda and Weber, Ingmar and Torralba, Antonio},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2017}
}


This work has been supported by CSAIL-QCRI collaboration projects and the framework of projects TEC2013-43935-R and TEC2016-75976-R, financed by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF).