DLA

NVDLA-based
High-performance convolution core with 2048 MACs
Supports various image input formats
Dedicated Depth-wise Convolution engine
Acceleration engine for Activation functions
Acceleration engine for Pooling
Acceleration engine for advanced Normalization functions
Memory-to-memory transformation acceleration for tensor reshape and copy operations
2 MB local on-chip SRAM, also accessible to other bus masters through an AXI slave port
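
As a rough illustration of what the MAC count and SRAM size above imply, the sketch below estimates peak throughput and checks whether a tensor fits in local SRAM. Only the 2048 MACs and the 2 MB SRAM come from the list; the clock frequency, the 2-ops-per-MAC convention, the reading of 2 MB as 2 * 1024 * 1024 bytes, and the helper names are assumptions made for illustration.

```python
MAC_COUNT = 2048               # from the feature list
SRAM_BYTES = 2 * 1024 * 1024   # 2 MB local SRAM (assumed to mean 2 MiB)


def peak_tops(clock_hz, macs=MAC_COUNT):
    """Theoretical peak in tera-ops/s, counting one MAC as 2 ops (assumption)."""
    return 2 * macs * clock_hz / 1e12


def fits_in_sram(shape, bytes_per_element=1, sram_bytes=SRAM_BYTES):
    """Return True if a dense tensor of the given shape fits in local SRAM."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * bytes_per_element <= sram_bytes


if __name__ == "__main__":
    # 1 GHz is a hypothetical clock; the list does not state one.
    print(peak_tops(1e9))
    # A 64 x 128 x 128 tensor with 1-byte elements occupies 1 MiB.
    print(fits_in_sram((64, 128, 128)))
```

These are upper bounds only: sustained throughput depends on MAC utilization and memory bandwidth, and the SRAM is shared with other bus masters.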