
[Temporal Fusion Transformer (TFT)] Large dynamic ranges: Only time series with large values are trained. FP64 needed? #1222

Open
@Pyroluk

Description

Related to Temporal Fusion Transformer (TFT) in Time-Series Prediction Platform (TSPP)

Use case:
Predict time series of rankings of items within categories.
For example, item 1 occupies place 10 among the most viewed items in category A, while item 2 occupies place 100,000 in category A and place 500 in category B. The rankings shift freely from day to day; item 2 could occupy place 10 in category A the next week.

Describe the bug

  • Over half a million concurrent time series
  • Large dynamic ranges
  • Only time series with large target values are learned

When training on over half a million concurrent time series with vastly different dynamic ranges, only the time series with comparatively large values are learned. For example, time series with values from 10,000 to 500,000 are trained perfectly well, with very good predictions, but for time series with values between 0 and 500 the predictions are far off, look alike, and do not seem to be specific to any individual time series.
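
To make the failure mode concrete, here is a minimal sketch (my own illustration, not TSPP code) of why an unnormalized loss such as MSE lets large-valued series dominate: at equal relative error, the loss scales with the square of the series magnitude, so the gradient signal from small-valued series is negligible by comparison.

```python
import torch

torch.manual_seed(0)

# Two hypothetical series: one in [0, 500], one in [10_000, 500_000].
small = torch.rand(1000) * 500
large = torch.rand(1000) * 490_000 + 10_000

# Suppose the model is off by the same relative error (10%) on both.
pred_small = small * 1.1
pred_large = large * 1.1

mse = torch.nn.functional.mse_loss
print(f"small-valued series loss: {mse(pred_small, small).item():.3e}")  # ≈ 8e+02
print(f"large-valued series loss: {mse(pred_large, large).item():.3e}")  # ≈ 9e+08
# The large-valued series dominates the combined gradient by roughly six
# orders of magnitude even though the relative error is identical.
```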

What I tried:

  • Training on reciprocal target values turns this behavior around: time series with small values are then learned, but time series with large values are not (see the reciprocal-transform sketch after this list).
  • The scale_per_id feature of TSPP, which scales every time series separately, resulted in predictions for all time series (small and large) that look like straight lines close to each series' average value (see the per-id scaling sketch below).
  • Modifying the loss function to weight small target values more heavily resulted in a mixed bag, where neither the time series with small values nor those with large values were learned adequately (see the weighted-loss sketch below).
  • When training with AMP enabled, gradients explode after ~14 epochs (the loss becomes NaN), while training without AMP has been running stably for over 60 epochs (see the AMP training-step sketch below).
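
A minimal sketch of the reciprocal-value experiment from the first bullet (my own transform, not a TSPP feature); the hypothetical offset EPS keeps the transform finite at zero:

```python
import torch

EPS = 1.0  # hypothetical offset so 1/(y + EPS) stays finite at y = 0

def to_reciprocal(y: torch.Tensor) -> torch.Tensor:
    return 1.0 / (y + EPS)

def from_reciprocal(z: torch.Tensor) -> torch.Tensor:
    return 1.0 / z - EPS

y = torch.tensor([10.0, 500.0, 10_000.0, 500_000.0])
z = to_reciprocal(y)
print(z)                   # large y -> tiny z near 0, small y -> z near 0.1
print(from_reciprocal(z))  # round-trips back to y, up to FP32 precision
```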
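For the second bullet, this is roughly what I understand scale_per_id to do; the sketch below is my own re-implementation for illustration, and the actual TSPP code may differ:

```python
import torch

def scale_per_id(y: torch.Tensor, ids: torch.Tensor):
    """Standardize each series by its own mean/std; returns stats for inversion."""
    scaled = torch.empty_like(y)
    stats = {}
    for i in ids.unique().tolist():
        mask = ids == i
        mu = y[mask].mean()
        sigma = y[mask].std().clamp_min(1e-8)  # guard against constant series
        scaled[mask] = (y[mask] - mu) / sigma
        stats[i] = (mu.item(), sigma.item())
    return scaled, stats
```

After this transform every series is zero-mean and unit-variance, so a model that collapses to predicting roughly zero everywhere would invert back to a flat line at each series' mean, which matches the straight lines I observed.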
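For the third bullet, a sketch of the kind of loss modification I tried (my own change, not part of TSPP; the inverse-magnitude weight is one hypothetical choice among many):

```python
import torch

def weighted_quantile_loss(pred: torch.Tensor, target: torch.Tensor, q: float = 0.5):
    diff = target - pred
    qloss = torch.max(q * diff, (q - 1.0) * diff)  # standard pinball loss
    # Hypothetical weighting: inverse target magnitude, so small-valued
    # samples contribute gradients comparable to large-valued ones.
    weight = 1.0 / (target.abs() + 1.0)
    return (weight * qloss).mean()
```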
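For the fourth bullet, a sketch of the guard I would add around the training step to diagnose the AMP blow-up; this is standard torch.cuda.amp usage, not TSPP's actual training loop:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)             # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    scaler.step(optimizer)                 # skips the step if grads are inf/nan
    scaler.update()
    return loss
```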

All this leads me to believe that the dynamic range of FP32 might be too small in some parts of the network to represent the large dynamic ranges of my use case. If that is the case, which parts of the network would need to use FP64 instead?
I used a network with n_head: 10, hidden_size: 320, dropout: 0.1, and attn_dropout: 0.01. Is this too small for my use case?
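
For reference, a quick check in plain PyTorch (independent of TSPP) suggests the constraint is less FP32's dynamic range, which is enormous, than its roughly seven significant decimal digits and, under AMP, FP16's small maximum value:

```python
import torch

# FP32 covers roughly 1e-38 .. 3.4e38, so values in 0..500_000 fit easily:
print(torch.finfo(torch.float32).max)   # 3.4028e+38
# But FP32 carries only ~7 significant decimal digits:
print(torch.tensor(500_000.03, dtype=torch.float32).item())  # 500000.03125
print(torch.finfo(torch.float32).eps)   # 1.1921e-07 relative spacing
# Under AMP, activations may be FP16, whose maximum is only 65504 --
# raw targets up to 500_000 overflow it, which could explain the NaNs
# at epoch ~14 independently of any FP32 limitation:
print(torch.finfo(torch.float16).max)   # 65504.0
```

If FP64 still seems necessary for a specific submodule, it can be cast with .double() as an experiment, though its inputs must be cast to match and FP64 throughput on V100 is lower than FP32.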

To Reproduce

  1. Generate time series with vastly different dynamic ranges (a hypothetical generator is sketched below)
  2. Train a TFT network with TSPP
  3. Inspect the predictions for time series with only small values and for time series with only large values. Only the large-valued series are learned.
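
A hypothetical generator for step 1 (illustration only; any dataset whose per-series scales span several orders of magnitude should reproduce the behavior):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows = []
for series_id in range(100):
    scale = 10 ** rng.uniform(0, 6)              # per-series scale from 1 to 1e6
    walk = np.abs(np.cumsum(rng.normal(size=365))) * scale
    rows.append(pd.DataFrame({"id": series_id,
                              "day": np.arange(365),
                              "value": walk}))
df = pd.concat(rows, ignore_index=True)          # long format: id, day, value
```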

Expected behavior
Adequate predictions of values for all time series, not just the time series with comparatively large values.

Environment

  • Container version: pytorch:22.04-py3
  • GPUs in the system: 2x V100
  • CUDA driver version: 510.85.02
