Allows changing the attention implementation that is used (see the sketch after this list).

- **Auto** automatically chooses the implementation based on what is available on the system.
- **Eager** uses the vanilla attention implementation written in Python.
- **SDPA** uses PyTorch's scaled dot-product attention.
- **Flash Attention 2** explicitly uses FA2, which requires the `flash_attn` package.
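These option names match the values accepted by the `attn_implementation` argument in Hugging Face Transformers, so the setting presumably forwards the choice there; this is an assumption, not something stated above. The sketch below shows how the equivalent selection would look in plain code (the model id is illustrative only):

```python
# Minimal sketch, assuming the setting maps to Transformers' `attn_implementation`
# argument. The model id below is a placeholder, not taken from this document.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "example-org/example-model",          # hypothetical model id
    attn_implementation="sdpa",           # "eager", "sdpa", or "flash_attention_2"
)
```

Passing `"flash_attention_2"` instead of `"sdpa"` would fail unless the `flash_attn` package is installed and the hardware supports it, which mirrors the behavior described for the **Flash Attention 2** option above.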