The Icon Bar: Programming: Test your optimization skills pt.1
|
Test your optimization skills pt.1 |
|
sirbod (19:02 24/10/2012) Phlamethrower (19:22 24/10/2012) nunfetishist (19:31 24/10/2012) sirbod (20:33 24/10/2012) nunfetishist (22:02 24/10/2012) Phlamethrower (00:28 25/10/2012) sirbod (08:59 25/10/2012) Phlamethrower (11:57 25/10/2012) sirbod (12:11 25/10/2012) Phlamethrower (12:20 25/10/2012) sirbod (12:29 25/10/2012) Phlamethrower (14:19 25/10/2012) sirbod (20:33 24/10/2012) Phlamethrower (21:28 24/10/2012)
|
|
Jon Abbott |
Message #121308, posted by sirbod at 19:02, 24/10/2012 |
Member
Posts: 563
|
This is just a bit of fun, but interesting if you're into ARM code. I was looking at the MFM code in ADFFS and thought, what an interesting problem with many solutions.
PROBLEM: Reduce a 32 bit value down to just it's even bits in as few cycles and registers as possible. ie 32 bits > 16 bits, containing input bits 0, 3, 5 ... in output bits 0, 1, 2 ...
INPUT: R0 = value OUTPUT: R0 = reduced value
eg. an input of AAAAAAAA becomes 0 output and 55555555 becomes FFFF
Here are some examples to solving the problem:
32 instructions, TST method
AND R1, R0, #1 TST R0, #1<<2 ORRNE R1, R1, #1<<1 TST R0, #1<<4 ORRNE R1, R1, #1<<2 TST R0, #1<<6 ORRNE R1, R1, #1<<3 TST R0, #1<<8 ORRNE R1, R1, #1<<4 TST R0, #1<<10 ORRNE R1, R1, #1<<5 TST R0, #1<<12 ORRNE R1, R1, #1<<6 TST R0, #1<<14 ORRNE R1, R1, #1<<7 TST R0, #1<<16 ORRNE R1, R1, #1<<8 TST R0, #1<<18 ORRNE R1, R1, #1<<9 TST R0, #1<<20 ORRNE R1, R1, #1<<10 TST R0, #1<<22 ORRNE R1, R1, #1<<11 TST R0, #1<<24 ORRNE R1, R1, #1<<12 TST R0, #1<<26 ORRNE R1, R1, #1<<13 TST R0, #1<<28 ORRNE R1, R1, #1<<14 TST R0, #1<<30 ORRNE R1, R1, #1<<15 MOV R0, R1
28 instructions, multi-flag method
MOV R1, #&AA ORR R1, R1, R1, LSL #8 ORR R1, R1, R1, LSL #16 BIC R0, R0, R1 ORR R1, R0, R0, LSL #1
MOV R0, R1, LSR #1 AND R0, R0, #%11 MOVS R1, R1, LSL #2 ORRCS R0, R0, #1<<15 ORRMI R0, R0, #1<<14
MOVS R1, R1, LSL #4 ORRCS R0, R0, #1<<13 ORRMI R0, R0, #1<<12
MOVS R1, R1, LSL #4 ORRCS R0, R0, #1<<11 ORRMI R0, R0, #1<<10
MOVS R1, R1, LSL #4 ORRCS R0, R0, #1<<9 ORRMI R0, R0, #1<<8
MOVS R1, R1, LSL #4 ORRCS R0, R0, #1<<7 ORRMI R0, R0, #1<<6
MOVS R1, R1, LSL #4 ORRCS R0, R0, #1<<5 ORRMI R0, R0, #1<<4
MOVS R1, R1, LSL #4 ORRCS R0, R0, #1<<3 ORRMI R0, R0, #1<<2
Current winner: Jeffrey Lee with 14 instructions
MOV R3, #&0F ORR R3, R3, R3, LSL #8 ORR R3, R3, R3, LSL #16 EOR R2,R3,R3,LSL #2 EOR R1,R2,R2,LSL #1 AND R0,R0,R1 ORR R0,R0,R0,LSR #1 AND R0,R0,R2 ORR R0,R0,R0,LSR #2 AND R0,R0,R3 ORR R0,R0,R0,LSR #4 AND R1,R0,#&FF0000 AND R0,R0,#&FF ORR R0,R0,R1,LSR #8
[Edited by sirbod at 13:12, 25/10/2012] |
|
[ Log in to reply ] |
|
Jeffrey Lee |
Message #121309, posted by Phlamethrower at 19:22, 24/10/2012, in reply to message #121308 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
Yay! I like these things.
4 instructions (plus one 64K lookup table )
LDRB R0,[lut,R1,LSR #16] MOV R1,R1,LSL #16 LDRB R1,[lut,R1,LSR #16] ORR R0,R1,R0,LSL #8 |
|
[ Log in to reply ] |
|
Rob Kendrick |
Message #121310, posted by nunfetishist at 19:31, 24/10/2012, in reply to message #121309 |
Today's phish is trout a la creme.
Posts: 525
|
My C compiler spat out this rather naïve, but brief, code: mov ip, #0 str lr, [sp, #-4]! mov r2, #31 mov r1, ip mov lr, #1 .L2: ands r3, r0, lr, asl r2 movne r3, #0 moveq r3, #1 orr r3, ip, r3, asl r1 sub r2, r2, #2 mov r3, r3, asl #16 cmn r2, #1 mov ip, r3, lsr #16 add r1, r1, #1 bne .L2 mov r0, ip ldr pc, [sp], #4 (Edit to remove code tags that TIB's bbcode doesn't support.)
[Edited by nunfetishist at 23:02, 24/10/2012] |
|
[ Log in to reply ] |
|
Jon Abbott |
Message #121311, posted by sirbod at 20:33, 24/10/2012, in reply to message #121309 |
Member
Posts: 563
|
4 instructions (plus one 64K lookup table )
LDRB R0,[lut,R1,LSR #16] MOV R1,R1,LSL #16 LDRB R1,[lut,R1,LSR #16] ORR R0,R1,R0,LSL #8 4 out of 5 - you drop a point for the 64K table! |
|
[ Log in to reply ] |
|
Jon Abbott |
Message #121312, posted by sirbod at 20:33, 24/10/2012, in reply to message #121310 |
Member
Posts: 563
|
My C compiler Instant fail! |
|
[ Log in to reply ] |
|
Jeffrey Lee |
Message #121315, posted by Phlamethrower at 21:28, 24/10/2012, in reply to message #121311 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
4 instructions (plus one 64K lookup table )
LDRB R0,[lut,R1,LSR #16] MOV R1,R1,LSL #16 LDRB R1,[lut,R1,LSR #16] ORR R0,R1,R0,LSL #8 4 out of 5 - you drop a point for the 64K table! How about 5 instructions and a 32K table? (i.e. just mask off the top bit of each halfword since you're not interested in them) |
|
[ Log in to reply ] |
|
Rob Kendrick |
Message #121317, posted by nunfetishist at 22:02, 24/10/2012, in reply to message #121312 |
Today's phish is trout a la creme.
Posts: 525
|
My C compiler Instant fail! Err, why? The code seems quite tight to me, self-explanatory, takes advantage of branch prediction on newer CPUs, and is quite tiny. Also, if targeting ARMv6 or later, you can use the sign-extend instructions to make it tighter still. |
|
[ Log in to reply ] |
|
Jeffrey Lee |
Message #121321, posted by Phlamethrower at 00:28, 25/10/2012, in reply to message #121317 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
Mask-and-shift approach, 8 instructions plus 3 registers needed for constants:
AND R0,R0,#&55555555 ORR R0,R0,R0,LSR #1 AND R0,R0,#&33333333 ORR R0,R0,R0,LSR #2 AND R0,R0,#&0F0F0F0F ORR R0,R0,R0,LSR #4 BIC R0,R0,#&0000FF00 ORR R0,R0,R0,LSR #8
Similar routine in NEON, 1.75 instructions (7 instructions long, but 4 words processed), 3 Q regs needed for constants:
VAND Q0,Q0,#&55555555555555555555555555555555 VSRA.U8 Q0,Q0,#1 VAND Q0,Q0,#&33333333333333333333333333333333 VSRA.U8 Q0,Q0,#2 VAND Q0,Q0,#&0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F VSRA.U16 Q0,Q0,#4 VUZP.8 Q0,Q0
|
|
[ Log in to reply ] |
|
Jon Abbott |
Message #121322, posted by sirbod at 08:59, 25/10/2012, in reply to message #121321 |
Member
Posts: 563
|
Mask-and-shift approach, 8 instructions plus 3 registers needed for constants:
AND R0,R0,#&55555555 ORR R0,R0,R0,LSR #1 AND R0,R0,#&33333333 ORR R0,R0,R0,LSR #2 AND R0,R0,#&0F0F0F0F ORR R0,R0,R0,LSR #4 BIC R0,R0,#&0000FF00 ORR R0,R0,R0,LSR #8
Impressive, 18 instructions in total.
EDIT: Correction, its 19 instructions as you need to clear two sets of bits at the end.
[Edited by sirbod at 11:22, 25/10/2012] |
|
[ Log in to reply ] |
|
Jeffrey Lee |
Message #121325, posted by Phlamethrower at 11:57, 25/10/2012, in reply to message #121322 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
19 instructions? Really?
LDR R1,=&55555555 LDR R2,=&33333333 LDR R3,=&0F0F0F0F AND R0,R0,R1 ORR R0,R0,R0,LSR #1 AND R0,R0,R2 ORR R0,R0,R0,LSR #2 AND R0,R0,R3 ORR R0,R0,R0,LSR #4 AND R1,R0,#&FF0000 AND R0,R0,#&FF ORR R0,R0,R1,LSR #8 MOV PC,LR
That's 13 instructions for an APCS compliant function with (up to) 3 extra words taken from the literal pool.
Or if you really care about your literal pool size:
LDR R3,=&0F0F0F0F EOR R2,R3,R3,LSL #2 EOR R1,R2,R2,LSL #1
[Edited by Phlamethrower at 12:59, 25/10/2012] |
|
[ Log in to reply ] |
|
Jon Abbott |
Message #121326, posted by sirbod at 12:11, 25/10/2012, in reply to message #121325 |
Member
Posts: 563
|
19 instructions? Really?
LDR R1,=&55555555 LDR R2,=&33333333 LDR R3,=&0F0F0F0F AND R0,R0,R1 ORR R0,R0,R0,LSR #1 AND R0,R0,R2 ORR R0,R0,R0,LSR #2 AND R0,R0,R3 ORR R0,R0,R0,LSR #4 AND R1,R0,#&FF0000 AND R0,R0,#&FF ORR R0,R0,R1,LSR #8 MOV PC,LR
That's 13 instructions for an APCS compliant function with (up to) 3 extra words taken from the literal pool.
Or if you really care about your literal pool size:
LDR R3,=&0F0F0F0F EOR R2,R3,R3,LSL #2 EOR R1,R2,R2,LSL #1 The LDR's are an extra cache miss, and as we're looking for the least cycles, its quicker to code the values. Using the additional optimizations above you get:
MOV R3, #&0F ORR R3, R3, R3, LSL #8 ORR R3, R3, R3, LSL #16 EOR R2,R3,R3,LSL #2 EOR R1,R2,R2,LSL #1 AND R0,R0,R1 ORR R0,R0,R0,LSR #1 AND R0,R0,R2 ORR R0,R0,R0,LSR #2 AND R0,R0,R3 ORR R0,R0,R0,LSR #4 AND R1,R0,#&FF0000 AND R0,R0,#&FF ORR R0,R0,R1,LSR #8
...14 cycles, this has to be the winner! |
|
[ Log in to reply ] |
|
Jeffrey Lee |
Message #121327, posted by Phlamethrower at 12:20, 25/10/2012, in reply to message #121326 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
I'm guessing this code must only need to work on one word of data at a time? You'd obviously get better performance if you could run it on a batch of data to avoid recalculating the constants all the time, and if you interleaved the processing of one word with another to avoid pipeline stalls due to RAW hazards (and to allow dual issue on A8/A9). |
|
[ Log in to reply ] |
|
Jon Abbott |
Message #121328, posted by sirbod at 12:29, 25/10/2012, in reply to message #121327 |
Member
Posts: 563
|
I'm guessing this code must only need to work on one word of data at a time? You'd obviously get better performance if you could run it on a batch of data to avoid recalculating the constants all the time, and if you interleaved the processing of one word with another to avoid pipeline stalls due to RAW hazards (and to allow dual issue on A8/A9). The problem was for a single word, but clearly it could be optimized more for batches of data.
May I use your earlier example in ADFFS? |
|
[ Log in to reply ] |
|
Jeffrey Lee |
Message #121330, posted by Phlamethrower at 14:19, 25/10/2012, in reply to message #121328 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
May I use your earlier example in ADFFS? Sure. It's hardly rocket science! |
|
[ Log in to reply ] |
|
|
The Icon Bar: Programming: Test your optimization skills pt.1 |