|
704 | 704 | "cell_type": "markdown",
|
705 | 705 | "metadata": {},
|
706 | 706 | "source": [
|
707 |     | - "Note that most models on the Hub compute loss internally, so we actually don't have to specify anything there! Leaving the loss field blank will cause the model to read the `loss` head as its loss value.\n", |
| 707 | + "Next, we compile our model. Note that most Transformers models compute loss internally, so we actually don't have to specify anything for that argument! You can of course set your own loss function if you want, but by default our models will choose the 'obvious' loss that matches their task, such as cross-entropy in the case of language modelling. The built-in loss will also correctly handle things like masking the loss on padding tokens, or unlabelled tokens in the case of masked language modelling, so we recommend using it unless you're an advanced user!\n", |
708 | 708 | "\n",
|
709 |     | - "This is an unusual quirk of TensorFlow models in 🤗 Transformers, so it's worth elaborating on in a little more detail. All 🤗 Transformers models are capable of computing an appropriate loss for their task internally (for example, a CausalLM model will use a cross-entropy loss). To do this, the labels must be provided in the input dict (or equivalently, in the `columns` argument to `to_tf_dataset()`), so that they are visible to the model during the forward pass.\n", |
    | 709 | + "We also use the `jit_compile` argument to compile the model with [XLA](https://www.tensorflow.org/xla). XLA compilation adds a delay at the start of training, but this is quickly repaid by faster training iterations afterwards. It has one downside, though - if the shape of your input changes at all, it will need to rerun the compilation! This isn't a problem for us in this notebook, because all of our examples are exactly the same length. Be careful when that isn't true, though - if you have a variable sequence length in your batches, you might spend more time compiling your model than actually training it, especially for small datasets!\n", |
710 | 710 | "\n",
|
711 |     | - "This is quite different from the standard Keras way of handling losses, where labels are passed separately and not visible to the main body of the model, and loss is handled by a function that the user passes to `compile()`, which uses the model outputs and the label to compute a loss value.\n", |
712 |     | - "\n", |
713 |     | - "The approach we take is that if the user does not pass a loss to `compile()`, the model will assume you want the **internal** loss. If you are doing this, you should make sure that the labels column(s) are included in the **input dict** or in the `columns` argument to `to_tf_dataset`.\n", |
714 |     | - "\n", |
715 |     | - "If you want to use your own loss, that is of course possible too! If you do this, you should make sure your labels column(s) are passed like normal labels, either as the **second argument** to `model.fit()`, or in the `label_cols` argument to `to_tf_dataset`. " |
| 711 | + "If you encounter difficulties when training with XLA, it's a good idea to remove the `jit_compile` argument and see if that fixes things. In fact, when debugging, it can be helpful to skip graph compilation entirely with the `run_eagerly=True` argument to [`compile()`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile). This will let you identify the exact line of code where problems arise, but it will significantly reduce your performance, so make sure to remove it again when you've fixed the problem!" |
716 | 712 | ]
|
717 | 713 | },
|
718 | 714 | {
|
719 | 715 | "cell_type": "code",
|
720 |     | - "execution_count": 19, |
| 716 | + "execution_count": 1, |
721 | 717 | "metadata": {},
|
722 | 718 | "outputs": [
|
723 | 719 | {
|
724 |     | - "name": "stderr", |
725 |     | - "output_type": "stream", |
726 |     | - "text": [ |
727 |     | - "No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.\n" |
| 720 | + "ename": "NameError", |
| 721 | + "evalue": "name 'model' is not defined", |
| 722 | + "output_type": "error", |
| 723 | + "traceback": [ |
| 724 | + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", |
| 725 | + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", |
| 726 | + "Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mtensorflow\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mtf\u001b[39;00m\n\u001b[0;32m----> 3\u001b[0m \u001b[43mmodel\u001b[49m\u001b[38;5;241m.\u001b[39mcompile(optimizer\u001b[38;5;241m=\u001b[39moptimizer, jit_compile\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mTrue\u001b[39;00m)\n", |
| 727 | + "\u001b[0;31mNameError\u001b[0m: name 'model' is not defined" |
728 | 728 | ]
|
729 | 729 | }
|
730 | 730 | ],
|
731 | 731 | "source": [
|
732 | 732 | "import tensorflow as tf\n",
|
733 | 733 | "\n",
|
734 |     | - "model.compile(optimizer=optimizer)" |
| 734 | + "model.compile(optimizer=optimizer, jit_compile=True)" |
735 | 735 | ]
|
736 | 736 | },
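
As mentioned in the markdown cell above, if XLA gives you trouble you can fall back to eager execution while debugging. A minimal sketch of that variant (not part of the notebook itself; it assumes the `model` and `optimizer` objects created earlier):

```python
# Drop jit_compile and run op-by-op so errors point at the exact Python line.
# This is much slower, so switch back to jit_compile=True once things work.
model.compile(optimizer=optimizer, run_eagerly=True)
```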
|
737 | 737 | {
|
738 | 738 | "cell_type": "markdown",
|
739 | 739 | "metadata": {},
|
740 | 740 | "source": [
|
741 |     | - "Finally, we need to convert our datasets to a format Keras understands. The easiest way to do this is with the `to_tf_dataset()` method. Because all our inputs are the same length, no padding is required, so we can use the DefaultDataCollator. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='tf'` argument to get Tensorflow tensors out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code!" |
    | 741 | + "Next, we convert our datasets to `tf.data.Dataset`, which Keras understands natively. There are two ways to do this - we can use the slightly lower-level [`Dataset.to_tf_dataset()`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset) method, or we can use [`Model.prepare_tf_dataset()`](https://huggingface.co/docs/transformers/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset). The main difference between these two is that the `Model` method can inspect the model to determine which column names it can use as input, which means you don't need to specify them yourself. It also supplies a data collator by default which is appropriate for most tasks." |
742 | 742 | ]
|
743 | 743 | },
|
744 | 744 | {
|
|
747 | 747 | "metadata": {},
|
748 | 748 | "outputs": [],
|
749 | 749 | "source": [
|
750 |     | - "from transformers import DefaultDataCollator\n", |
751 |     | - "\n", |
752 |     | - "data_collator = DefaultDataCollator(return_tensors=\"tf\")\n", |
753 |     | - "\n", |
754 |     | - "train_set = lm_datasets[\"train\"].to_tf_dataset(\n", |
755 |     | - " columns=[\"attention_mask\", \"input_ids\", \"labels\"],\n", |
| 750 | + "train_set = model.prepare_tf_dataset(\n", |
| 751 | + " lm_datasets[\"train\"],\n", |
756 | 752 | " shuffle=True,\n",
|
757 | 753 | " batch_size=16,\n",
|
758 |     | - " collate_fn=data_collator,\n", |
759 | 754 | ")\n",
|
760 |     | - "validation_set = lm_datasets[\"validation\"].to_tf_dataset(\n", |
761 |     | - " columns=[\"attention_mask\", \"input_ids\", \"labels\"],\n", |
| 755 | + "\n", |
| 756 | + "validation_set = model.prepare_tf_dataset(\n", |
| 757 | + " lm_datasets[\"validation\"],\n", |
762 | 758 | " shuffle=False,\n",
|
763 | 759 | " batch_size=16,\n",
|
764 |     | - " collate_fn=data_collator,\n", |
765 | 760 | ")"
|
766 | 761 | ]
|
767 | 762 | },
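
If you prefer the lower-level route mentioned above, the equivalent call with `Dataset.to_tf_dataset()` looks roughly like the code this commit removes (shown here as a sketch; the column names are the ones this notebook's preprocessing produces):

```python
from transformers import DefaultDataCollator

# With to_tf_dataset() we specify the columns and the collator ourselves.
# return_tensors="tf" makes sure we get TensorFlow tensors, not torch ones.
data_collator = DefaultDataCollator(return_tensors="tf")

train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```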
|
|
914 | 909 | "```"
|
915 | 910 | ]
|
916 | 911 | },
|
| 912 | + { |
| 913 | + "cell_type": "markdown", |
| 914 | + "metadata": {}, |
| 915 | + "source": [ |
| 916 | + "## Inference" |
| 917 | + ] |
| 918 | + }, |
| 919 | + { |
| 920 | + "cell_type": "markdown", |
| 921 | + "metadata": {}, |
| 922 | + "source": [ |
| 923 | + "Models trained from scratch on small amounts of data will generally not output useful text - you'll need a much bigger dataset and a much longer training time before it starts writing text that you'd want to read! If you want to see an example of inference with causal language models, see the `language_modeling-tf` notebook, where we start with a pre-trained model and get higher-quality output much sooner as a result." |
| 924 | + ] |
| 925 | + }, |
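
If you do want to sample from the model anyway, for example as a sanity check of the training loop, a minimal generation sketch would look like this (it assumes the `model` and `tokenizer` objects from earlier in the notebook; expect low-quality text from a model trained from scratch on a small dataset):

```python
# Encode a prompt and greedily generate a short continuation.
inputs = tokenizer("The history of science", return_tensors="tf")
outputs = model.generate(inputs["input_ids"], max_length=32)
print(tokenizer.decode(outputs[0]))
```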
917 | 926 | {
|
918 | 927 | "cell_type": "markdown",
|
919 | 928 | "metadata": {
|
|
942 | 951 | },
|
943 | 952 | "outputs": [],
|
944 | 953 | "source": [
|
945 |     | - "model_checkpoint = \"bert-base-cased\"\n", |
946 |     | - "tokenizer_checkpoint = \"sgugger/bert-like-tokenizer\"" |
| 954 | + "model_checkpoint = \"bert-base-cased\"" |
947 | 955 | ]
|
948 | 956 | },
|
949 | 957 | {
|
|
976 | 984 | }
|
977 | 985 | ],
|
978 | 986 | "source": [
|
979 |     | - "tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)\n", |
| 987 | + "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)\n", |
980 | 988 | "tokenized_datasets = datasets.map(\n",
|
981 | 989 | " tokenize_function, batched=True, num_proc=4, remove_columns=[\"text\"]\n",
|
982 | 990 | ")"
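
The `tokenize_function` used here is defined earlier in the notebook, outside this diff; it is most likely a thin wrapper along these lines (shown purely as an assumption for context):

```python
def tokenize_function(examples):
    # Tokenize the raw text only; the texts are concatenated and chunked
    # into fixed-length blocks in a later preprocessing step.
    return tokenizer(examples["text"])
```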
|
|
1074 | 1082 | "cell_type": "markdown",
|
1075 | 1083 | "metadata": {},
|
1076 | 1084 | "source": [
|
1077 |     | - "And as before, we leave the `loss` argument blank to use the internal loss." |
| 1085 | + "And as before, we leave the `loss` argument blank to use the internal loss, and use `jit_compile` to enable XLA." |
1078 | 1086 | ]
|
1079 | 1087 | },
|
1080 | 1088 | {
|
|
1093 | 1101 | "source": [
|
1094 | 1102 | "import tensorflow as tf\n",
|
1095 | 1103 | "\n",
|
1096 |     | - "model.compile(optimizer=optimizer)" |
| 1104 | + "model.compile(optimizer=optimizer, jit_compile=True)" |
1097 | 1105 | ]
|
1098 | 1106 | },
|
1099 | 1107 | {
|
|
1128 | 1136 | "id": "bqHnWcYC3l_d"
|
1129 | 1137 | },
|
1130 | 1138 | "source": [
|
1131 |     | - "Now we pass our data collator to the `to_tf_dataset()` argument." |
    | 1139 | + "Now we pass our data collator to the `collate_fn` argument of `prepare_tf_dataset()`." |
1132 | 1140 | ]
|
1133 | 1141 | },
|
1134 | 1142 | {
|
|
1137 | 1145 | "metadata": {},
|
1138 | 1146 | "outputs": [],
|
1139 | 1147 | "source": [
|
1140 |     | - "train_set = lm_datasets[\"train\"].to_tf_dataset(\n", |
1141 |     | - " columns=[\"attention_mask\", \"input_ids\", \"labels\"],\n", |
| 1148 | + "train_set = model.prepare_tf_dataset(\n", |
| 1149 | + " lm_datasets[\"train\"],\n", |
1142 | 1150 | " shuffle=True,\n",
|
1143 | 1151 | " batch_size=16,\n",
|
1144 | 1152 | " collate_fn=data_collator,\n",
|
1145 | 1153 | ")\n",
|
1146 |     | - "validation_set = lm_datasets[\"validation\"].to_tf_dataset(\n", |
1147 |     | - " columns=[\"attention_mask\", \"input_ids\", \"labels\"],\n", |
| 1154 | + "\n", |
| 1155 | + "validation_set = model.prepare_tf_dataset(\n", |
| 1156 | + " lm_datasets[\"validation\"],\n", |
1148 | 1157 | " shuffle=False,\n",
|
1149 | 1158 | " batch_size=16,\n",
|
1150 | 1159 | " collate_fn=data_collator,\n",
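
The `data_collator` referenced in these cells is created earlier in the notebook, outside this diff. For the masked language modelling task it is presumably something along these lines (the exact arguments are an assumption):

```python
from transformers import DataCollatorForLanguageModeling

# Randomly masks tokens in each batch so the model has something to predict;
# the return_tensors choice here is a guess - NumPy output works fine with
# prepare_tf_dataset().
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="np"
)
```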
|
|
1265 | 1274 | "```"
|
1266 | 1275 | ]
|
1267 | 1276 | },
|
| 1277 | + { |
| 1278 | + "cell_type": "markdown", |
| 1279 | + "metadata": {}, |
| 1280 | + "source": [ |
| 1281 | + "## Inference" |
| 1282 | + ] |
| 1283 | + }, |
| 1284 | + { |
| 1285 | + "cell_type": "markdown", |
| 1286 | + "metadata": {}, |
| 1287 | + "source": [ |
| 1288 | + "As with the causal LM above, masked language models trained from scratch on small amounts of data will generally not be very good at their job - you'll need a much bigger dataset and a much longer training time to make a truly useful one! If you want to see an example of inference with masked language models, see the `language_modeling-tf` notebook, where we start with a pre-trained model and get higher-quality output much sooner as a result." |
| 1289 | + ] |
| 1290 | + }, |
1268 | 1291 | {
|
1269 | 1292 | "cell_type": "code",
|
1270 | 1293 | "execution_count": null,
|
|
1293 | 1316 | "name": "python",
|
1294 | 1317 | "nbconvert_exporter": "python",
|
1295 | 1318 | "pygments_lexer": "ipython3",
|
1296 |     | - "version": "3.10.0" |
| 1319 | + "version": "3.10.4" |
1297 | 1320 | }
|
1298 | 1321 | },
|
1299 | 1322 | "nbformat": 4,
|
|