Is there a missing section in 05: Building Your First HF Dataset?

In the "05: Building Your First Hugging Face Dataset" section of the course, the Cleaning Data page is skipped. That might be intentional, but by the Building Dataset/DataPipe page of the notebook, the features have changed from lists (in the datasets) to tensors, and it is not clear how this was achieved. Could you please check this? Thanks so much for the updated material.

Answers

  • dvgodoy

    Hi @mklomo,

    Thank you for pointing this out. You're absolutely right: a paragraph and its corresponding code snippet were missing. Our apologies for the confusion.

    At that point, retrieving rows from the dataset should return a dictionary of lists:

    {'label': [[14390.0], [17000.0]],
     'cont_X': [[2019.0, 8307.0, 145.0, 39.20000076293945, 1.399999976158142],
      [2018.0, 19566.0, 145.0, 54.29999923706055, 2.0]],
     'cat_X': [[109, 1, 4], [1, 1, 0]]}
    

    The missing snippet (below) sets the output format, so whenever the data is retrieved, it produces the desired dictionary of tensors.

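    # Return PyTorch tensors instead of Python lists whenever rows are retrieved
    datasets = datasets.with_format('torch')
    # Retrieve the first two rows of the training split
    datasets['train'][:2]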
    
    {'label': tensor([[14390.],
             [17000.]]),
     'cont_X': tensor([[2.0190e+03, 8.3070e+03, 1.4500e+02, 3.9200e+01, 1.4000e+00],
             [2.0180e+03, 1.9566e+04, 1.4500e+02, 5.4300e+01, 2.0000e+00]]),
     'cat_X': tensor([[109, 1, 4],
              [ 1, 1, 0]])}
    

    The content has already been corrected to include the missing snippet.

    Let us know if you have any more questions.