as3corelib is the defacto library for all common utility functions in ActionScript 3. While it’s a reasonably good codebase, it’s also unforgivably slow sometimes, which led to the creation of actionjson. Its SHA-1 implementation leaves a lot to be desired, and this article takes us through the steps I took to write new one, to help understand what makes AS3 slow.

Understanding SHA-1

First we need to understand how SHA-1 works. SHA-1 takes a series of bytes, and processes them in chunks of 64 bytes at a time. There’s some extra padding and data added to the input as well. Lots of operations take place during each chunk that ensure even the most insignificant changes have a radical butterfly effect on some variables that are returned in the form of a SHA-1 hash.

Here’s some pseudocode.

add some extra data to the end of the input
set the initial sha-1 values

for each 64-byte chunk do
  extend the chunk to 320 bytes of data

  perform first set of operations on chunk (x20)
  perform second set of operations on chunk (x20)
  perform third set of operations on chunk (x20)
  perform fourth set of operations on chunk (x20)
end

return sha-1 values as a hash

You don’t need to fully understand it, but just get the general idea. Get chunk, process chunk, move onto next chunk.

Here’s as3corelib’s implementation of SHA-1 as of this writing. It seems pretty reasonable, but I see a lot of function calls, memory allocations, and unnecessary dependencies. Surely we can do better.

Round One

Click here to see the first version.

short string x10000
- SHA1.hash: 400ms
- sha1: 1266ms (improvement: x0.315)
long string x30
- SHA1.hash: 262ms
- sha1: 880ms (improvement: x0.297)

This code isn’t half bad. Everything is reasonably easy to understand and modify. This kind of code is usually how I like to do things at first, before I start optimizing. It’s pretty slow though, roughly a third of the speed of SHA1.hash. If you’re familiar with AS3 optimization you can probably spot a lot of straightforward optimizations.

Reuse the w array, inline the bitshifts

Click here to see the diff.

short string x10000
- SHA1.hash: 408ms
- sha1: 405ms (improvement: x1.007)
long string x30
- SHA1.hash: 249ms
- sha1: 237ms (improvement: x1.05)

Much better. That made a huge difference and now it’s roughly on par with SHA1.hash. The bitshift function has been inlined, and the w array is being reused instead of recreated during each iteration. Reusing the w array barely has an impact on speed though, it’s the inlining of the bitshift function that accounts for the 3x speed boost. It’s still good to be careful with memory regardless, since passing off work onto the GC isn’t good for performance and is more difficult to profile.

Lesson learned: Function calls are expensive. Inlining function calls can offer some nice speed boosts when code needs optimizing.

Unroll those loops

Click here to see the diff.

short string x10000
- SHA1.hash: 407ms
- sha1: 381ms (improvement: x1.068)
long string x30
- SHA1.hash: 248ms
- sha1: 218ms (improvement: x1.137)

There’s not much left to inline that’ll make a big impact, so how about unrolling the loops? I wrote a small Python script to generate the code in the loops, so I don’t have to write it out by hand. Any changes to those lines of code will come from that script from now on. Overall though, this doesn’t do much.

Lesson learned: Unrolling loops doesn’t help much, but it’s still useful.

Turn the w array into local variables

Click here to see the diff.

short string x10000
- SHA1.hash: 418ms
- sha1: 118ms (improvement: x3.542)
long string x30
- SHA1.hash: 252ms
- sha1: 39ms (improvement: x6.461)

That w array has a constant size, so why not just convert each entry to a local variables? It gives us an impressive speed boost. In case you’re wondering why I didn’t convert w to a Vector I would prefer to stay compatible with Flash 9. Vectors are also not as fast as local variables, they only made it 2-3 times faster while local variables made it 3-6 times faster.

Lesson Learned: Array accesses are much more expensive then local variables. Even Vectors (which are typed and pretty fast when fixed) can’t compete with local variables.

Process a ByteArray instead of a String

Click here to see the diff.

short string x10000
- SHA1.hash: 405ms
- sha1: 83ms (improvement: x4.879)
long string x30
- SHA1.hash: 250ms
- sha1: 28ms (improvement: x8.928)

This trick I learned while making actionjson. Strings are generally slow, so putting data in a ByteArray and processing from there can often be orders of magnitude faster. sha1 gets a nice boost out of this.

Lesson Learned: ByteArrays are faster than Strings for data processing.

Reuse w variables

Click here to see the diff.

short string x10000
- SHA1.hash: 405ms
- sha1: 82ms (improvement: x4.939)
long string x30
- SHA1.hash: 252ms
- sha1: 28ms (improvement: x9)

Looking carefully at the SHA-1 algorithm, it only needs the last 16 w variables. For example if it’s setting the value of w76, the value of w60 is needed, but w59 is not used and will never be used again. Since this pattern repeats itself, the w variables can be reused, and we can reduce the number of w variables from 80 to 16. This doesn’t help much though.

Lesson Learned: Flash is pretty good at optimizing local variables already.

Reuse result of an operation, rather then getting it from the local variable later

Click here to see the diff.

short string x10000
- SHA1.hash: 404ms
- sha1: 80ms (improvement: x5.05)
long string x30
- SHA1.hash: 250ms
- sha1: 27ms (improvement: x9.259)

This is pretty minor, but since assignment operators return the value they’re setting, we can use that value instead of ignoring it.

// from this...
w0 = (w0 << 1) | (w0 >>> 31);
tmp = w0 + x;
// to this...
tmp = (w0 = (w0 << 1) | (w0 >>> 31)) + x;

This helps a little.

Lesson Learned: Reusing the results of assignment operations can save Flash some extra work.

Stop shifting around the values stored in a, b, c, d, and e

Click here to see the diff.

short string x10000
- SHA1.hash: 412ms
- sha1: 78ms (improvement: x5.282)
long string x30
- SHA1.hash: 251ms
- sha1: 27ms (improvement: x9.296)

This one is hard to explain, but basically each time the a-e variables are changed, it’s mostly data being moved around, while only two variables are actually changing. By changing which variables get modified and used after each iteration, we can only modify the ones that need to be changed. This just removes a lot of code that shifts around values (e.g. c = b, a = b). This has surprisingly little impact.

Lesson Learned: Flash is still pretty good at optimizing local variables.

Some misc improvements

Click here to see the diff.

short string x10000
- SHA1.hash: 406ms
- sha1: 73ms (improvement: x5.561)
long string x30
- SHA1.hash: 249ms
- sha1: 27ms (improvement: x9.222)

I’ve been focusing on the inner SHA-1 loop, since that’s where most of the work is done, but let’s try improving some of the code outside of it. Removing an extra variable doesn’t help much. There is a small boost from inlining and optimizing intToHex. There’s not much of an impact on the “long string” test, since this reduces overhead, which the “long string” test has less of.

Lesson Learned: Inlining – still good.

Reduce unnecessary conversions

Click here to see the diff.

short string x10000
- SHA1.hash: 417ms
- sha1: 70ms (improvement: x5.957)
long string x30
- SHA1.hash: 250ms
- sha1: 20ms (improvement: x12.5)

Now let’s get even deeper. Using Apparat‘s dump tool to examine the raw ABC, and you may notice these everywhere…

PushDouble(4.023233417E9)
ConvertUInt()

PushInt(271733878)
ConvertUInt()

What? There are no floats and ints in this code. 4.023233417E9 is 0xEFCDAB89, one of the constants. The other number is one of the uints used. Shouldn’t both these variables be a uint already and not need conversion? It appears that Flash is encoding uints as ints and floats, then converting them to uints. Weird. But since ints/uints in Flash are stored with two’s complement encoding all the uints can be converted to their int equivalent and the operations needed for SHA-1 will behave exactly the same. This dramatically reduces the ConvertUInt’s and ConvertInt’s in our code, and appears to have a pretty big impact. There’s still plenty of ConvertInt’s left, but I’m not sure how to get rid of them all.

Lesson Learned: SWFs have a uint number pool, but may not use it. This can impact uint performance.

Reuse the ByteArray instead of creating a new one

Click here to see the diff.

short string x10000
- SHA1.hash: 416ms
- sha1: 57ms (improvement: x7.298)
long string x30
- SHA1.hash: 258ms
- sha1: 20ms (improvement: x12.9)

The only thing preventing this code from using a constant amount of memory is that ByteArray. Let’s reuse it instead, so a ByteArray isn’t created each time. This helps GC performance, and also lowers the overhead of our parser.

Lesson Learned: Not creating new objects is almost always a good idea.

Grasping at straws.

Here’s where I start running out of ideas. These changes provide basically no improvement.

Optimizing some of the bitwise logic by reducing the number of operations doesn’t help.

Since it worked before, perhaps reusing the rest of the operations on the w variables like before might help. It doesn’t do much, if anything.

Converting some variables to consts doesn’t have any impact.

Optimize the byte padding

short string x10000
- SHA1.hash: 405ms
- sha1: 45ms (improvement: x9)
long string x30
- SHA1.hash: 252ms
- sha1: 20ms (improvement: x12.6)

Next come a couple improvements to the chunk of code that pads the ByteArray with zeros. This gave us a nice little boost from reduced overhead.

First try calling writeInt instead when possible, reducing the function calls on ByteArray. This helped a little.

Next try writing all those zeros at once. This helped a little as well.

Now try simplifying the formula that gives us how many zeros need to be added. This didn’t appear to do anything. Even though Math.ceil should be an expensive function call, removing it doesn’t help.

Lesson Learned: Less function calls are still a good idea, and Flash may inline Math.ceil in some situations.

Final results

Here’s the final version of sha1

I did all my testing on on Linux with Flash 11.1. I spent enough time writing this article and the SHA-1 implementation that Adobe released a new version of the Flash player (11.2). Here are the final results.

Linux (Fedora 16)
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Flash Projector 11.2.202.228 (non-debug)
short string x10000
- SHA1.hash: 417ms
- sha1: 45ms (improvement: x9.266)
long string x30
- SHA1.hash: 259ms
- sha1: 18ms (improvement: x14.388)

Windows 7
Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Flash Projector 11.2.202.228 (non-debug)
short string x10000
- SHA1.hash: 195ms
- sha1: 35ms (improvement: x5.571)
long string x30
- SHA1.hash: 117ms
- sha1: 15ms (improvement: x7.8)

OS X 10.6
Intel(R) Core(TM) i5 CPU @ 2.3 GHz
Flash Projector 11.2.202.228 (non-debug)
short string x10000
- SHA1.hash: 159ms
- sha1: 34ms (improvement: x4.676)
long string x30
- SHA1.hash: 78ms
- sha1: 15ms (improvement: x5.2)

Any ideas on how to improve this implementation further are welcome.

EDIT: Excluding support for domain memory (alchemy opcodes) since those features now cost money in some situations.

TL;DR

Use more typed local variables, and less function calls, get better Flash performance.

2 Responses to “How to write an SHA-1 implementation in ActionScript 3 that's 14x faster than Adobe's”

  1. ben w Says:

    should help to cast the zeros array lookups
    i.e.
    zeros[32 – result.length]
    to
    zeros[int(32 – result.length)]
    tends to be faster in almost all cases, worth a shot

    and might help to store that result.length in a variable as it is called a number of times in a row, might not make any saving but you never know 😉

    I liked the approach of fixing one thing at a time..I tend to do it all at once and probably never learn as much in the process as a result. Thanks for sharing.

  2. ben w Says:

    edit: just noticed that stuff was out of any loop, so above suggestions I doubt would make any difference at all…should learn to maximise my windows when reading things 🙂

Leave a Reply