I received an email from Jay Berg, who has written a customized client for the P4. Even so, the P4 still performs poorly, and this is why:
The P4 is designed to handle data a specific way for specific tasks.
If you analyse the SSE2 instruction set, the following rules become
quickly apparent.
- - - -
1. Data elements must be 8, 16, 32, or 64 bits in size.
Other than store and load, there are no SSE2 instructions for dealing
with more than 64 bits of data at a time. And most operations are
severely limited when dealing with 64-bit data elements. For example,
there is no "add/sub saturated" for 64-bit data, no compare of 64-bit
data, etc.
Extensive work has been done to determine an optimal method of doing
simple math (add, sub, multiply) on 128-bit data with SSE2
instructions. To date (after more than 3 months of research), we have
been unable to find any method faster than using the general CPU
registers for add/sub. The complications involve manual handling of
carry and borrow information, combined with the requirement that each
math result must then go through a modulo reduction.
2. Data handling must involve several steps on each data element, with
the data handled as sets (i.e. arrays of data).
The instruction set is designed with SIMD (Single Instruction,
Multiple Data). This implies that you are doing the same instruction
to multiple data elements in parallel operations.
Also, moving data to/from the MMX/XMM registers costs expensive CPU
cycles, and arranging the data into the proper format may cost
additional cycles. For example, if you are working with data that must
be rearranged between MMX/XMM registers (for example, moving portions
of data elements), you must use a large number of instructions.
Efforts have been made to group X/Y data points into groups of common
operations. This has resulted in a multiply function capable of
multiplying two 128-bit numbers in parallel (simultaneously). But we
have yet to find a method of doing SIMD-style add/sub operations that
is faster than single-stream use of the general CPU registers.
3. Instruction/data must be tight enough to keep the inner loops
totally within cache memory.
When going outside of the cache for instructions and/or data, the P4
suffers extreme speed penalties. Code must remain very tight as a
result.
Unfortunately, the client logic is quite extensive. Strong efforts
have been made to shrink the code so as to be able to keep the
majority of the inner loops within instruction cache. But this has
not been 100% successful. The logic is so extensive that the result
has been a compromise between straight-line code, loops (expensive due
to counters and conditional branches), and subroutine calls (expensive
due to out-of-cache execution and push/pop sequences).
4. Logic must avoid any usage of conditional jumps that may result in
failed branch predictions.
Code should be designed without conditional branches. Where they must
be used, the logic must be arranged so that execution usually follows
the path the branch predictor will guess.
Due to the complicated nature of the client, avoidance of conditional
branches has met with mixed results. There are a substantial number of
loops that have a 50/50 chance of repeating or not. Each time
execution follows the path that defeats the branch prediction, the P4
CPU comes to a screeching halt! Even worse is when it fails branch
prediction and repeats the loop once, then fails branch prediction
again and doesn't repeat the loop a second time! That is two failed
predictions, and the CPU loses a major number of cycles on each branch
prediction failure!
The best example of this occurs during the frequent mod(P) operations.
Since it is unknown whether a value will require a modulo reduction
until you actually test it, the test itself will frequently (50% of
the time) bring the CPU to a complete halt due to a missed branch
prediction.
- - - -
It quickly becomes apparent that we're breaking all of the P4 rules
in the ECCp109 client!
1. (Rule 1) We deal with 128-bit data elements. While it is possible
to break these into multiple smaller units, each smaller unit remains
connected to the others. Thus, adding two 128-bit data elements after
breaking them into DWords would mean manually handling the carry (CY)
information between the units. The P4 simply isn't designed to handle
the 128-bit numbers the client must deal with.
2. (Rule 2) The client is designed to handle one data element
operation at a time. This means that we're doing a set of operations
on a single X/Y point, rather than a series of operations on a group
of X/Y points. The client has been redesigned to handle data sets, but
this has not yielded good results because it breaks the data-element
size rule. The client's needs do not lend themselves to the P4's
capabilities.
3. (Rules 3-4) Client logic requires long series of operations on an
X/Y point. This exceeds cache size and results in a large number of
failed branch predictions. The client complexity does not lend itself
to the P4 cache size abilities.
- - - -
Unless the client can be recreated with the P4 rules in mind, there
is little chance of increasing the performance of the P4. And with
the nature of the ECCp109 challenge, it is extremely unlikely that a
redesign can be accomplished to meet the requirements of the P4 SIMD
design.
Note that such efforts are still underway and may yet produce results.
However, also note that this challenge is transitory in nature. Once
the solution has been found, all ECCp109 clients will be discarded.
Thus it does not pay to exceed the 90/10 rule (spend 10% of the effort
to achieve 90% of the results). Exceeding the 90/10 rule quickly
becomes very expensive in engineering time!
Jay Berg
jberg@eCompute.org
http://eCompute.org/eccp109