Welcome to the Linux Foundation Forum!

Is there a missing section in 05: Building Your First HF Dataset?

In the 05: Building Your First Hugging Face Dataset section of the course, the Cleaning Data page is skipped. That could be intentional. However, on the Building Dataset/DataPipe page of the notebook, the features change from lists (in the datasets) to tensors, and it is not clear how this was achieved. Could you kindly confirm whether a step is missing? Thanks so much for the updated material.

Answers

    Hi @mklomo ,

    Thank you for pointing this out. You're absolutely right: a paragraph and its corresponding code snippet were missing. Our apologies for the confusion.

    At that point, retrieving data from the dataset should return a dictionary of lists.

    {'label': [[14390.0], [17000.0]],
     'cont_X': [[2019.0, 8307.0, 145.0, 39.20000076293945, 1.399999976158142],
                [2018.0, 19566.0, 145.0, 54.29999923706055, 2.0]],
     'cat_X': [[109, 1, 4], [1, 1, 0]]}

    The missing snippet (below) sets the output format to 'torch', so that whenever data is retrieved, it is returned as the desired dictionary of tensors.

    datasets = datasets.with_format('torch')
    datasets['train'][:2]

    {'label': tensor([[14390.],
                      [17000.]]),
     'cont_X': tensor([[2.0190e+03, 8.3070e+03, 1.4500e+02, 3.9200e+01, 1.4000e+00],
                       [2.0180e+03, 1.9566e+04, 1.4500e+02, 5.4300e+01, 2.0000e+00]]),
     'cat_X': tensor([[109,   1,   4],
                      [  1,   1,   0]])}

    The content has already been corrected to include the missing snippet.

    Let us know if you have any more questions.
