Projet_SETI_RISC-V/neorv32/docs/datasheet/cpu_cfu.adoc

<<<
:sectnums:
=== Custom Functions Unit (CFU)

The Custom Functions Unit is the central part of the <<_zxcfu_custom_instructions_extension_cfu>> and represents
the actual hardware module, which is used to implement _custom RISC-V instructions_. The concept of the NEORV32
CFU has been highly inspired by https://github.com/google/CFU-Playground[Google's CFU-Playground].

The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or
program memory requirements when implemented entirely in software. Some potential application fields and exemplary
use-cases might include:

* **AI:** sub-word / vector / SIMD operations like processing all four bytes of a 32-bit data word in parallel
* **Cryptographic:** bit substitution and permutation
* **Communication:** conversions like binary to gray-code; multiply-add operations
* **Image processing:** look-up-tables for color space transformations
* implementing instructions from **other RISC-V ISA extensions** that are not yet supported by the NEORV32

[NOTE]
The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators
(like block-based AES encryption). These kind of accelerators should be implemented as memory-mapped
<<_custom_functions_subsystem_cfs>>.
A comparison of all NEORV32-specific chip-internal hardware extension options is provided in the user guide section
https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].


:sectnums:
==== CFU Instruction Formats

The custom instructions executed by the CFU utilize a specific opcode space in the `rv32` 32-bit instruction
space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("_Guaranteed Non-Standard
Encoding Space_"). The NEORV32 CFU uses the `custom-x` opcodes to identify the instructions implemented
by the CFU and to differentiate between the different instruction formats.
The according binary encoding of these opcodes is shown below:

* `custom-0`: `0001011` (R3-type instructions, RISC-V standard)
* `custom-1`: `0101011` (R4-type instructions, RISC-V standard)
* `custom-2`: `1011011` (R5-type instruction A, NEORV32-specific)
* `custom-3`: `1111011` (R5-type instruction B, NEORV32-specific)

.CFU Instructions - Exceptions
[IMPORTANT]
The CPU control logic only analyzes the opcode of the custom instructions to check if the _entire_
instruction word is valid. All remaining bit-fields are **not checked** at all.
This also means that the MSBs of the register fields are **not checked** even if the `E` ISA extension
is enabled (for standard RISC-V instructions this would cause an exception).
Hence, a custom CFU instruction can never raise an illegal instruction exception. If the CFU is not
implemented at all (`Zxcfu` ISA extension is not enabled) any instruction with `custom-x` opcode
will raise an illegal instruction exception.


:sectnums:
==== CFU R3-Type Instructions

The R3-type CFU instructions operate on two source registers and return the processing result to the destination register.
The actual operation can be defined by using the `funct7` and `funct3` bit fields. These immediates can also be used to
pass additional data to the CFU like offsets, look-up-tables addresses or shift-amounts. However, the actual
functionality is entirely user-defined.

Example operation: `rd <= rs1 xnor rs2`

.CFU R3-type instruction format
image::cfu_r3type_instruction.png[align=center]

* `funct7`: 7-bit immediate (further operand data or function select)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0001011` (RISC-V "custom-0" opcode)

.RISC-V compatibility
[NOTE]
The CFU R3-type instruction format is compliant to the RISC-V ISA specification.

.Instruction encoding space
[NOTE]
By using the `funct7` and `funct3` bit fields entirely for selecting the actual operation a total of 1024 custom R3-type
instructions can be implemented (7-bit + 3-bit = 10 bit -> 1024 different values).


:sectnums:
==== CFU R4-Type Instructions

The R4-type CFU instructions operate on three source registers and return the processing result to the destination register.
The actual operation can be defined by using the `funct3` bit field. Alternatively, this immediate can also be used to
pass additional data to the CFU like offsets, look-up-tables addresses or shift-amounts. However, the actual
functionality is entirely user-defined.

Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]`

.CFU R4-type instruction format
image::cfu_r4type_instruction.png[align=center]

* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `funct3`: 3-bit immediate (further operand data or function select)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `0101011` (RISC-V "custom-1" opcode)

.RISC-V compatibility
[NOTE]
The CFU R4-type instruction format is compliant to the RISC-V ISA specification.

.Unused instruction bits
[NOTE]
The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bits are ignored
by the hardware (CFU and illegal instruction check logic) and should be set to all-zero to preserve compatibility with
future implementations.

.Instruction encoding space
[NOTE]
By using the `funct3` bit field entirely for selecting the actual operation a total of 8 custom R4-type instructions
can be implemented (3-bit -> 8 different values).


:sectnums:
==== CFU R5-Type Instructions

The R5-type CFU instructions operate on three source registers and return the processing result to the destination register.
As all bits of the instruction word are used to encode the five registers and the opcode, no further immediate bits
are available to specify the actual operation. There are two different R5-type instruction with two different opcodes
available. Hence, only two R5-type operations can be implemented out of the box.

Example operation: `rd <= rs1 & rs2 & rs3 & rs4`

.CFU R5-type instruction A format
image::cfu_r5type_instruction_a.png[align=center]

.CFU R5-type instruction B format
image::cfu_r5type_instruction_b.png[align=center]

* `rs4.hi` & `rs4.lo`: address of fourth source register (32-bit source data)
* `rs3`: address of third source register (32-bit source data)
* `rs2`: address of second source register (32-bit source data)
* `rs1`: address of first source register (32-bit source data)
* `rd`: address of destination register (for the 32-bit processing result)
* `opcode`: `1011011` (RISC-V "custom-2" opcode) and/or `1111011` (RISC-V "custom-3" opcode)

.RS4 bit field
[NOTE]
The `rs4` bit-field is split into two instruction word fields `rs4.hi` and `rs4.lo`. This allows a simple
decoding logic as the location of the remaining register fields is identical to other R-type instructions.

.RISC-V compatibility
[IMPORTANT]
The RISC-V ISA specifications does not specify a R5-type instruction format. Hence, this instruction
layout is NEORV32-specific.

.Instruction encoding space
[IMPORTANT]
There are no immediate fields in the CFU R5-type instruction so the actual operation is specified entirely
by the opcode resulting in just two different operations out of the box. However, another CFU instruction
(like a R3-type instruction) can be used to "program" the actual operation of a R5-type instruction by
writing operation information to a CFU-internal "command" register.


:sectnums:
==== Using Custom Instructions in Software

The custom instructions provided by the CFU can be used in plain C code by using **intrinsics**. Intrinsics
behave like "normal" functions but under the hood they are a set of macros that hide the complexity of inline assembly.
Using intrinsics removes the need to modify the compiler, built-in libraries or the assembler when including custom
instructions. Each intrinsic will result in a single 32-bit instruction word providing maximum code efficiency.

The NEORV32 software framework provides four pre-defined prototypes for custom instructions, which are defined in
`sw/lib/include/neorv32_cpu_cfu.h`:

.CFU instruction prototypes
[source,c]
----
neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instructions
neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3)    // R4-type instructions
neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4)     // R5-type instruction A
neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4)     // R5-type instruction B
----

The intrinsic functions always return a 32-bit value of type `uint32_t` (the processing result), which can be discarded
when not needed. Each intrinsic function requires several arguments depending on the instruction type/format:

* `funct7` - 7-bit immediate (R3-type only)
* `funct3` - 3-bit immediate (R3-type, R4-type)
* `rs1` - source operand 1, 32-bit (R3-type, R4-type)
* `rs2` - source operand 2, 32-bit (R3-type, R4-type)
* `rs3` - source operand 2, 32-bit (R3-type, R4-type, R5-type)
* `rs4` - source operand 2, 32-bit (R4-type, R4-type, R5-type)

The `funct3` and `funct7` bit-fields are used to pass 3-bit or 7-bit literals to the CFU. The `rs1`, `rs2` and `rs3`
arguments pass the actual data to the CFU. These register arguments can be populated with variables or literals.
The following example shows how to pass arguments when executing both CFU instruction types:

.CFU instruction usage example
[source,c]
----
uint32_t tmp = some_function();
...
uint32_t res = neorv32_cfu_r3_instr(0b0000000, 0b101, tmp, 123);
uint32_t foo = neorv32_cfu_r4_instr(0b011, tmp, res, some_array[i]);
uint32_t bar = neorv32_cfu_r5_instr_a(tmp, res, foo, tmp);
----

.CFU Example Program
[TIP]
There is an example program for the CFU, which shows how to use the _default_ CFU hardware module.
This example program is located in `sw/example/demo_cfu`.


:sectnums:
==== Custom Instructions Hardware

The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside
the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`.

.CFU Hardware Example & More Details
[TIP]
The default CFU hardware module already implement some exemplary instructions that are used for illustration
by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
is highly commented to explain the available signals and the handshake with the CPU pipeline.

.CFU hardware resource requirements
[WARNING]
Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using
the according operands for the CFU hardware) will add one or two additional read ports to the core's
register file increasing resource requirements.

CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of
the current clock cycle. Operations can also take several clock cycles to complete (like multiplications)
and may also include internal states and memories. The CFU's internal controller unit takes care of
interfacing the custom user logic to the CPU's pipeline.

.CFU Execution Time
[NOTE]
The CFU is not required to finish processing within a bound time. However, you should keep in mind that the
CPU is _stalled_ until the CFU has finished processing. This also means the CPU cannot react to pending
interrupts during this time affecting real-time behavior (interrupt requests will still be queued).
projet 2023-03-06 14:48:14 +01:00			`<<<`
			`:sectnums:`
			`=== Custom Functions Unit (CFU)`

			`The Custom Functions Unit is the central part of the <<_zxcfu_custom_instructions_extension_cfu>> and represents`
			`the actual hardware module, which is used to implement _custom RISC-V instructions_. The concept of the NEORV32`
			`CFU has been highly inspired by https://github.com/google/CFU-Playground[Google's CFU-Playground].`

			`The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or`
			`program memory requirements when implemented entirely in software. Some potential application fields and exemplary`
			`use-cases might include:`

			`* AI: sub-word / vector / SIMD operations like processing all four bytes of a 32-bit data word in parallel`
			`* Cryptographic: bit substitution and permutation`
			`* Communication: conversions like binary to gray-code; multiply-add operations`
			`* Image processing: look-up-tables for color space transformations`
			`* implementing instructions from other RISC-V ISA extensions that are not yet supported by the NEORV32`

			`[NOTE]`
			`The CFU is not intended for complex and _CPU-independent_ functional units that implement complete accelerators`
			`(like block-based AES encryption). These kind of accelerators should be implemented as memory-mapped`
			`<<_custom_functions_subsystem_cfs>>.`
			`A comparison of all NEORV32-specific chip-internal hardware extension options is provided in the user guide section`
			`https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules].`


			`:sectnums:`
			`==== CFU Instruction Formats`

			The custom instructions executed by the CFU utilize a specific opcode space in the `rv32` 32-bit instruction
			`space that has been explicitly reserved for user-defined extensions by the RISC-V specifications ("_Guaranteed Non-Standard`
			Encoding Space_"). The NEORV32 CFU uses the `custom-x` opcodes to identify the instructions implemented
			`by the CFU and to differentiate between the different instruction formats.`
			`The according binary encoding of these opcodes is shown below:`

			* `custom-0`: `0001011` (R3-type instructions, RISC-V standard)
			* `custom-1`: `0101011` (R4-type instructions, RISC-V standard)
			* `custom-2`: `1011011` (R5-type instruction A, NEORV32-specific)
			* `custom-3`: `1111011` (R5-type instruction B, NEORV32-specific)

			`.CFU Instructions - Exceptions`
			`[IMPORTANT]`
			`The CPU control logic only analyzes the opcode of the custom instructions to check if the _entire_`
			`instruction word is valid. All remaining bit-fields are not checked at all.`
			This also means that the MSBs of the register fields are not checked even if the `E` ISA extension
			`is enabled (for standard RISC-V instructions this would cause an exception).`
			`Hence, a custom CFU instruction can never raise an illegal instruction exception. If the CFU is not`
			implemented at all (`Zxcfu` ISA extension is not enabled) any instruction with `custom-x` opcode
			`will raise an illegal instruction exception.`


			`:sectnums:`
			`==== CFU R3-Type Instructions`

			`The R3-type CFU instructions operate on two source registers and return the processing result to the destination register.`
			The actual operation can be defined by using the `funct7` and `funct3` bit fields. These immediates can also be used to
			`pass additional data to the CFU like offsets, look-up-tables addresses or shift-amounts. However, the actual`
			`functionality is entirely user-defined.`

			Example operation: `rd <= rs1 xnor rs2`

			`.CFU R3-type instruction format`
			`image::cfu_r3type_instruction.png[align=center]`

			* `funct7`: 7-bit immediate (further operand data or function select)
			* `rs2`: address of second source register (32-bit source data)
			* `rs1`: address of first source register (32-bit source data)
			* `funct3`: 3-bit immediate (further operand data or function select)
			* `rd`: address of destination register (for the 32-bit processing result)
			* `opcode`: `0001011` (RISC-V "custom-0" opcode)

			`.RISC-V compatibility`
			`[NOTE]`
			`The CFU R3-type instruction format is compliant to the RISC-V ISA specification.`

			`.Instruction encoding space`
			`[NOTE]`
			By using the `funct7` and `funct3` bit fields entirely for selecting the actual operation a total of 1024 custom R3-type
			`instructions can be implemented (7-bit + 3-bit = 10 bit -> 1024 different values).`


			`:sectnums:`
			`==== CFU R4-Type Instructions`

			`The R4-type CFU instructions operate on three source registers and return the processing result to the destination register.`
			The actual operation can be defined by using the `funct3` bit field. Alternatively, this immediate can also be used to
			`pass additional data to the CFU like offsets, look-up-tables addresses or shift-amounts. However, the actual`
			`functionality is entirely user-defined.`

			Example operation: `rd <= (rs1 * rs2 + rs3)[31:0]`

			`.CFU R4-type instruction format`
			`image::cfu_r4type_instruction.png[align=center]`

			* `rs3`: address of third source register (32-bit source data)
			* `rs2`: address of second source register (32-bit source data)
			* `rs1`: address of first source register (32-bit source data)
			* `funct3`: 3-bit immediate (further operand data or function select)
			* `rd`: address of destination register (for the 32-bit processing result)
			* `opcode`: `0101011` (RISC-V "custom-1" opcode)

			`.RISC-V compatibility`
			`[NOTE]`
			`The CFU R4-type instruction format is compliant to the RISC-V ISA specification.`

			`.Unused instruction bits`
			`[NOTE]`
			`The RISC-V ISA specification defines bits [26:25] of the R4-type instruction word to be all-zero. These bits are ignored`
			`by the hardware (CFU and illegal instruction check logic) and should be set to all-zero to preserve compatibility with`
			`future implementations.`

			`.Instruction encoding space`
			`[NOTE]`
			By using the `funct3` bit field entirely for selecting the actual operation a total of 8 custom R4-type instructions
			`can be implemented (3-bit -> 8 different values).`


			`:sectnums:`
			`==== CFU R5-Type Instructions`

			`The R5-type CFU instructions operate on three source registers and return the processing result to the destination register.`
			`As all bits of the instruction word are used to encode the five registers and the opcode, no further immediate bits`
			`are available to specify the actual operation. There are two different R5-type instruction with two different opcodes`
			`available. Hence, only two R5-type operations can be implemented out of the box.`

			Example operation: `rd <= rs1 & rs2 & rs3 & rs4`

			`.CFU R5-type instruction A format`
			`image::cfu_r5type_instruction_a.png[align=center]`

			`.CFU R5-type instruction B format`
			`image::cfu_r5type_instruction_b.png[align=center]`

			* `rs4.hi` & `rs4.lo`: address of fourth source register (32-bit source data)
			* `rs3`: address of third source register (32-bit source data)
			* `rs2`: address of second source register (32-bit source data)
			* `rs1`: address of first source register (32-bit source data)
			* `rd`: address of destination register (for the 32-bit processing result)
			* `opcode`: `1011011` (RISC-V "custom-2" opcode) and/or `1111011` (RISC-V "custom-3" opcode)

			`.RS4 bit field`
			`[NOTE]`
			The `rs4` bit-field is split into two instruction word fields `rs4.hi` and `rs4.lo`. This allows a simple
			`decoding logic as the location of the remaining register fields is identical to other R-type instructions.`

			`.RISC-V compatibility`
			`[IMPORTANT]`
			`The RISC-V ISA specifications does not specify a R5-type instruction format. Hence, this instruction`
			`layout is NEORV32-specific.`

			`.Instruction encoding space`
			`[IMPORTANT]`
			`There are no immediate fields in the CFU R5-type instruction so the actual operation is specified entirely`
			`by the opcode resulting in just two different operations out of the box. However, another CFU instruction`
			`(like a R3-type instruction) can be used to "program" the actual operation of a R5-type instruction by`
			`writing operation information to a CFU-internal "command" register.`


			`:sectnums:`
			`==== Using Custom Instructions in Software`

			`The custom instructions provided by the CFU can be used in plain C code by using intrinsics. Intrinsics`
			`behave like "normal" functions but under the hood they are a set of macros that hide the complexity of inline assembly.`
			`Using intrinsics removes the need to modify the compiler, built-in libraries or the assembler when including custom`
			`instructions. Each intrinsic will result in a single 32-bit instruction word providing maximum code efficiency.`

			`The NEORV32 software framework provides four pre-defined prototypes for custom instructions, which are defined in`
			`sw/lib/include/neorv32_cpu_cfu.h`:

			`.CFU instruction prototypes`
			`[source,c]`
			`----`
			`neorv32_cfu_r3_instr(funct7, funct3, rs1, rs2) // R3-type instructions`
			`neorv32_cfu_r4_instr(funct3, rs1, rs2, rs3) // R4-type instructions`
			`neorv32_cfu_r5_instr_a(rs1, rs2, rs3, rs4) // R5-type instruction A`
			`neorv32_cfu_r5_instr_b(rs1, rs2, rs3, rs4) // R5-type instruction B`
			`----`

			The intrinsic functions always return a 32-bit value of type `uint32_t` (the processing result), which can be discarded
			`when not needed. Each intrinsic function requires several arguments depending on the instruction type/format:`

			* `funct7` - 7-bit immediate (R3-type only)
			* `funct3` - 3-bit immediate (R3-type, R4-type)
			* `rs1` - source operand 1, 32-bit (R3-type, R4-type)
			* `rs2` - source operand 2, 32-bit (R3-type, R4-type)
			* `rs3` - source operand 2, 32-bit (R3-type, R4-type, R5-type)
			* `rs4` - source operand 2, 32-bit (R4-type, R4-type, R5-type)

			The `funct3` and `funct7` bit-fields are used to pass 3-bit or 7-bit literals to the CFU. The `rs1`, `rs2` and `rs3`
			`arguments pass the actual data to the CFU. These register arguments can be populated with variables or literals.`
			`The following example shows how to pass arguments when executing both CFU instruction types:`

			`.CFU instruction usage example`
			`[source,c]`
			`----`
			`uint32_t tmp = some_function();`
			`...`
			`uint32_t res = neorv32_cfu_r3_instr(0b0000000, 0b101, tmp, 123);`
			`uint32_t foo = neorv32_cfu_r4_instr(0b011, tmp, res, some_array[i]);`
			`uint32_t bar = neorv32_cfu_r5_instr_a(tmp, res, foo, tmp);`
			`----`

			`.CFU Example Program`
			`[TIP]`
			`There is an example program for the CFU, which shows how to use the _default_ CFU hardware module.`
			This example program is located in `sw/example/demo_cfu`.


			`:sectnums:`
			`==== Custom Instructions Hardware`

			`The actual functionality of the CFU's custom instructions is defined by the user-defined logic inside`
			the CFU hardware module `rtl/core/neorv32_cpu_cp_cfu.vhd`.

			`.CFU Hardware Example & More Details`
			`[TIP]`
			`The default CFU hardware module already implement some exemplary instructions that are used for illustration`
			by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which
			`is highly commented to explain the available signals and the handshake with the CPU pipeline.`

			`.CFU hardware resource requirements`
			`[WARNING]`
			`Enabling the CFU and actually implementing R4-type and/or R5-type instructions (or more precisely, using`
			`the according operands for the CFU hardware) will add one or two additional read ports to the core's`
			`register file increasing resource requirements.`

			`CFU operations can be entirely combinatorial (like bit-reversal) so the result is available at the end of`
			`the current clock cycle. Operations can also take several clock cycles to complete (like multiplications)`
			`and may also include internal states and memories. The CFU's internal controller unit takes care of`
			`interfacing the custom user logic to the CPU's pipeline.`

			`.CFU Execution Time`
			`[NOTE]`
			`The CFU is not required to finish processing within a bound time. However, you should keep in mind that the`
			`CPU is _stalled_ until the CFU has finished processing. This also means the CPU cannot react to pending`
			`interrupts during this time affecting real-time behavior (interrupt requests will still be queued).`