You can see that we added 2 of these and now we track whether an inf or nan for forwarded_states was detected somewhere in between.

Actually, the detector already reports these because each of the calls in the example above is an `nn.Module`, but if you had some local direct calculations, this is how you'd do it.
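For example, here is a minimal sketch of guarding a local calculation with the same helper (the tensor and the intermediate calculation are illustrative, not taken from the example above):

```python
import torch
from transformers.debug_utils import detect_overflow

hidden_states = torch.randn(4, 16, dtype=torch.float16)
# a local, direct calculation that isn't wrapped in an nn.Module
forwarded_states = hidden_states * 1e4
# prints a report if any inf or nan entries are found in the tensor
detect_overflow(forwarded_states, "after manual scaling")
```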
Additionally, if you're instantiating the debugger in your own code, you can adjust the number of frames printed from its default, e.g.:

```python
from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
```
Specific batch absolute min and max value tracing
The same debugging class can be used for per-batch tracing with the underflow/overflow detection feature turned off.

Let's say you want to watch the absolute min and max values for all the ingredients of each forward call of a given batch, and only do that for batches 1 and 3. Then you instantiate this class as:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
```
And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.

Batches are 0-indexed.

This is helpful if you know that the program starts misbehaving after a certain batch number, so you can fast-forward right to that area. Here is a sample truncated output for such a configuration:

```
                  *** Starting batch number=1 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
                  decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
                  decoder T5Stack
  not a tensor output
                  lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
                  T5ForConditionalGeneration
  not a tensor output

                  *** Starting batch number=3 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]
```
Here you will get a huge number of frames dumped - as many as there were forward calls in your model - so it may or may not be what you want, but sometimes it can be easier to use for debugging purposes than a normal debugger. For example, if a problem starts happening at batch number 150, you can dump traces for batches 149 and 150 and compare where the numbers started to diverge.
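For instance, that comparison could be set up as follows (a sketch reusing the same constructor argument shown above; `model` is whatever model you're training):

```python
from transformers.debug_utils import DebugUnderflowOverflow

# trace the last known-good batch and the first problematic one so the two
# dumps can be compared to see where the numbers start to diverge
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[149, 150])
```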
You can also specify the batch number after which to stop the training, with:

```python
debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
```