Is there a missing section in 05: Building Your First HF Dataset?

mklomo · January 27

In the 05: Building Your First Hugging Face Dataset section of the course, the Cleaning Data page is skipped. That could be intentional; however, on the Building Dataset/DataPipe page of the notebook, the features change from lists (in the datasets) to tensors, and it is not clear how this was achieved. Could you kindly check this? Thanks so much for the updated material.

dvgodoy · January 29

Hi @mklomo ,

Thank you for pointing this out. You're absolutely right: a paragraph and its corresponding code snippet were missing. Our apologies for the confusion.

The output of the dataset should be a dictionary of lists at that point.

{'label': [[14390.0], [17000.0]],
 'cont_X': [[2019.0, 8307.0, 145.0, 39.20000076293945, 1.399999976158142],
  [2018.0, 19566.0, 145.0, 54.29999923706055, 2.0]],
 'cat_X': [[109, 1, 4], [1, 1, 0]]}

The missing snippet (below) sets the output format, so whenever the data is retrieved, it produces the desired dictionary of tensors.

datasets = datasets.with_format('torch')
datasets['train'][:2]

{'label': tensor([[14390.],
         [17000.]]),
 'cont_X': tensor([[2.0190e+03, 8.3070e+03, 1.4500e+02, 3.9200e+01, 1.4000e+00],
         [2.0180e+03, 1.9566e+04, 1.4500e+02, 5.4300e+01, 2.0000e+00]]),
 'cat_X': tensor([[109,   1,   4],
         [  1,   1,   0]])}

The content has already been corrected to include the missing snippet.

Let us know if you have any more questions.

mklomo · June 18

Thanks, @dvgodoy. I had to take a break from the course to finish my PhD comps, and this was really helpful.

A related question I have is why the material here does not cover Generative Adversarial Networks (GANs). If you can point us (myself and future learners) to any relevant resources on GANs, we would be grateful.

dvgodoy · June 26

Hi @mklomo ,

I'm glad you found it helpful!

Regarding GANs, they have been largely superseded by diffusion models. GANs were notoriously tricky to train, as one had to balance the training of two competing models (a generator and a discriminator).
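To make the "two competing models" point concrete, here is a rough sketch of a single GAN training step in PyTorch. It is a toy example on synthetic 1-D data, not taken from the course or the tutorials linked in this thread. The layer sizes, learning rates, and the `train_step` helper are all assumptions chosen for brevity.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy models: the generator maps 4-D noise to a 1-D sample,
# the discriminator maps a 1-D sample to a real/fake logit
generator = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real):
    """Hypothetical helper: one discriminator update, then one generator update."""
    batch_size = real.shape[0]
    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)

    # Discriminator update: label real samples 1 and generated samples 0
    fake = generator(torch.randn(batch_size, 4)).detach()  # no gradient into the generator
    d_loss = loss_fn(discriminator(real), ones) + loss_fn(discriminator(fake), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator output 1 for generated samples
    g_loss = loss_fn(discriminator(generator(torch.randn(batch_size, 4))), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Synthetic "real" data drawn from a Gaussian centered at 3.0
real_batch = torch.randn(32, 1) * 0.5 + 3.0
d_loss, g_loss = train_step(real_batch)
print(d_loss, g_loss)
```

The balancing act comes from the two losses pulling against each other: if either model gets too far ahead, the other receives little useful gradient.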

Generative models, especially for images, are a big area of their own, so we would need a full course to cover that much material.

Having said that, back in 2022 I presented a short tutorial on GANs at the ODSC Europe conference. You can find the materials here: https://github.com/dvgodoy/GANsNRoses_ODSC_Europe2022

And, if you're interested in diffusion models as well, there's a tutorial from 2023 here: https://github.com/dvgodoy/DiffusionModels101_ODSC_Europe2023

I hope it helps!
