September 28, 2014

How SWFWire works

This is an overview of a decompiler and a debugger I made in Flash a few years ago. This is mainly for my own reference.

There are three parts of SWFWire:

SWFWire Decompiler - A library for SWF and ABC decompilation
SWFWire Inspector - An app for viewing the decompiled contents of a SWF
SWFWire Debugger - An app for interactive debugging of SWFs

Disassembly #

The disassembler takes binary data, and parses it according to the SWF spec.

SWF #

The SWF file format is very well documented by Adobe. SWFs are binary files consisting of a header and a series of tags. The header gives you some basic metadata such as the target Flash Player version and stage dimensions. Each tag contains a type, length, and data section. The type tells you how to parse it, and the length lets you verify you’ve parsed that tag correctly, or skip it without knowing anything about the contents.
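
To make the tag structure concrete, here is a rough Python sketch (not SWFWire code) of walking the tag list. It assumes the record header layout from Adobe's spec: a little-endian 16-bit value whose upper 10 bits are the tag type and lower 6 bits are the length, with a length of 0x3F meaning that a 32-bit length follows. The name iter_tags is made up for this example, and it expects the already-decompressed bytes that come after the file header.

```python
import struct


def iter_tags(data, offset=0):
    """Yield (tag_code, body_bytes) for each tag found from `offset` onward.

    Hypothetical helper: `data` is the uncompressed tag stream and `offset`
    points just past the SWF file header.
    """
    while offset + 2 <= len(data):
        # Short record header: UI16, upper 10 bits = type, lower 6 bits = length.
        (code_and_length,) = struct.unpack_from("<H", data, offset)
        offset += 2
        tag_code = code_and_length >> 6
        length = code_and_length & 0x3F
        if length == 0x3F:
            # Long record header: the real length follows as a 32-bit integer.
            (length,) = struct.unpack_from("<I", data, offset)
            offset += 4
        # The length lets us slice out (or skip) the body without parsing it.
        yield tag_code, data[offset:offset + length]
        offset += length
        if tag_code == 0:  # the End tag closes the tag list
            break
```

Because every tag carries its own length, a parser can skip tag types it doesn't understand and still stay in sync with the stream.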

Basic data types #

The basic data types defined in the SWF file format are numbers, strings, bit fields, and byte arrays.

In SWFWire, the com.swfwire.decompiler.SWFByteArray class provides methods for parsing these.

Byte alignment #

Because bit fields have a length specified in bits rather than bytes, we have to keep track of the current bit position. Other data types are aligned to the next byte.

Here’s how we keep track of the bit position and align bytes in SWFWire:

private var bitPosition:uint = 0;
public function alignBytes():void
{
    if(bitPosition != 0)
    {
        bytes.position++;
        bitPosition = 0;
    }
}

Numbers #

There are a lot of different formats for storing numbers in a SWF, to optimize file size. One important thing to understand is that multi-byte integers are little-endian. This means that the least significant byte is stored first. For example, the bytes DE AD BE EF parsed as two 16-bit integers become 0xADDE and 0xEFBE.
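
A quick way to check little-endian byte order is Python's struct module, where the "<" prefix selects little-endian. This is only an illustration and has nothing to do with SWFWire's code:

```python
import struct

raw = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Two little-endian unsigned 16-bit integers: the first byte of each pair
# is the least significant one.
low, high = struct.unpack("<HH", raw)

# The same four bytes read as a single little-endian unsigned 32-bit integer.
(as_ui32,) = struct.unpack("<I", raw)

print(hex(low), hex(high), hex(as_ui32))
```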

Fixed size integers #

These come in four fixed sizes, each in signed and unsigned form: SI8, UI8, SI16, UI16, SI24, UI24, SI32, and UI32. We use the built-in ByteArray methods for parsing these, except for the 24-bit integers, which we have to read as 3 separate bytes and assemble in little-endian order ourselves.

public function readUI24():uint
{
    alignBytes();
    // Little-endian: the first byte read is the least significant.
    return bytes.readUnsignedByte() | bytes.readUnsignedByte() << 8 | bytes.readUnsignedByte() << 16;
}

Encoded integers #

Encoded integers improve efficiency by adjusting the number of bytes used for storage to fit the size of the integer being stored. When a byte is read, its high bit tells the parser whether to continue reading, up to a maximum of five bytes. The remaining seven bits of each byte hold the value, least significant group first.

Here’s how SWFWire parses this field:

public function readEncodedUI32():uint
{
    alignBytes();
    var result:uint;
    var bytesRead:uint;
    var currentByte:uint;
    var shouldContinue:Boolean = true;
    while(shouldContinue && bytesRead < 5)
    {
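        currentByte = bytes.readUnsignedByte();
        // filter7 masks the low 7 bits (presumably 0x7F); each byte adds 7 more bits.
        result = ((currentByte & filter7) << (7 * bytesRead)) | result;
        // The high bit says whether another byte follows.
        shouldContinue = ((currentByte >> 7) == 1);
        bytesRead++;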
    }
    return result;
}
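
For comparison, here is the same idea as a small Python sketch. It is not SWFWire code: read_encoded_u32 is a made-up name, and it returns the new offset alongside the value instead of keeping a position in a ByteArray.

```python
def read_encoded_u32(data, offset=0):
    """Decode a variable-length unsigned integer; return (value, new_offset)."""
    result = 0
    for i in range(5):  # never more than five bytes
        byte = data[offset]
        offset += 1
        # The low 7 bits carry data, least significant group first.
        result |= (byte & 0x7F) << (7 * i)
        # A clear high bit marks the last byte.
        if not byte & 0x80:
            break
    # Keep the result within 32 bits, like the uint in the ActionScript version.
    return result & 0xFFFFFFFF, offset
```

For example, the two bytes AC 02 decode to 300: 0x2C from the first byte, plus 2 shifted left by 7 from the second.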

Fixed-point decimal #

These are pretty straightforward. There’s an integer part and a fractional part, which we add to get the result.

public function readFixed8_8():Number
{
    alignBytes();
    var decimal:uint = bytes.readUnsignedByte();
    var result:Number = bytes.readByte();

    // The low byte counts 256ths, so divide by 0x100 rather than 0xFF.
    result += decimal / 0x100;

    return result;
}
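
As a sanity check on the arithmetic, here is a Python sketch of an 8.8 fixed-point read (read_fixed8_8 is just an illustrative name). In an 8.8 value the low byte counts 256ths, so a fraction byte of 0x80 should come out as exactly 0.5.

```python
import struct


def read_fixed8_8(data, offset=0):
    """Read an 8.8 fixed-point number: fraction byte first, then a signed integer byte."""
    fraction, integer = struct.unpack_from("<Bb", data, offset)
    return integer + fraction / 256


# Equivalent view: the two bytes form one signed 16-bit integer scaled by 1/256.
def read_fixed8_8_scaled(data, offset=0):
    (raw,) = struct.unpack_from("<h", data, offset)
    return raw / 256
```

Both functions give the same result. The second one makes it easier to see why the divisor is 256: the whole field is just an integer count of 1/256 steps.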

Floating-point decimal #

ByteArray provides us with almost everything we need to read these. But of course, there’s a 16-bit version. We can still get ByteArray to do some of the work by changing it to a 32-bit format, then calling readFloat/writeFloat.

public function readFloat16():Number
{
    var raw:uint = readUI16();

    var sign:uint = raw >> 15;
    var exp:int = (raw >> 10) & filter5;
    var sig:int = raw & filter10;

    //Handle infinity/NaN
    if(exp == 31)
    {
        exp = 255;
    }
    //Handle zero and denormalized values (flushed to zero)
    else if(exp == 0)
    {
        exp = 0;
        sig = 0;
    }
    //Handle normalized values: rebias the exponent for the 32-bit format
    else
    {
        exp += 111;
    }

    var temp:uint = sign << 31 | exp << 23 | sig << 13;

    return unsignedIntAsFloat32(temp);
}
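
Here is the same conversion as a Python sketch, using struct to reinterpret the 32-bit pattern as a float. It mirrors the ActionScript above rather than reproducing SWFWire exactly: the constant 111 there works out to 127 - 16, so this version assumes the 16-bit format uses an exponent bias of 16 and exposes it as a parameter. The function name is made up.

```python
import struct


def float16_to_float(raw, bias=16):
    """Widen a 16-bit float (1 sign, 5 exponent, 10 significand bits) to a Python float."""
    sign = raw >> 15
    exp = (raw >> 10) & 0x1F
    sig = raw & 0x3FF

    if exp == 31:
        exp = 255            # infinity / NaN keep an all-ones exponent
    elif exp == 0:
        sig = 0              # zero and denormals are flushed to zero
    else:
        exp += 127 - bias    # rebias for the 32-bit format

    # Assemble the 32-bit pattern; the 10-bit significand moves to the top of 23 bits.
    bits = sign << 31 | exp << 23 | sig << 13
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

With the default bias, 0x4000 decodes to 1.0 and 0x4200 to 1.5. Passing bias=15 gives the usual IEEE half-precision interpretation instead.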

Strings #

All Strings are UTF-8 encoded. Most strings are null-terminated. There’s also a version which specifies length first, which is trivial to parse.

Here’s how null-terminated strings are parsed in SWFWire:

public function readString():String
{
    alignBytes();
    var byteCount:uint = 1;
    while(bytes.readUnsignedByte())
    {
        byteCount++;
    }
    bytes.position -= byteCount;
    var result:String = bytes.readUTFBytes(byteCount);
    return result;
}
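
The same null-terminated read is short in Python, which may make the idea easier to see. This is an illustrative sketch (read_string here is not the SWFWire method): find the terminator, decode everything before it, and continue after it.

```python
def read_string(data, offset=0):
    """Read a null-terminated UTF-8 string; return (text, new_offset)."""
    end = data.index(b"\x00", offset)  # raises ValueError if no terminator exists
    text = data[offset:end].decode("utf-8")
    return text, end + 1               # skip past the terminator
```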

Bit fields #

Bit fields allow you to store numbers using a specific number of bits. For example, to store a flag, writeUB(1, 1) would require 1 bit, whereas writeBoolean(true) would require 8 bits.

public function readUB(length:uint):uint
{
    if(!length) return 0;

    var totalBytes:uint = Math.ceil((bitPosition + length) / 8);

    var iter:uint = 0;
    var currentByte:uint = 0;
    var result:uint = 0;

    while(iter < totalBytes)
    {
        currentByte = bytes.readUnsignedByte();
        result = (result << 8) | currentByte;
        iter++;
    }

    var newBitPosition:uint = ((bitPosition + length) % 8);

    var excessBits:uint = (totalBytes * 8 - (bitPosition + length));
    result = result >> excessBits;
    result = result & (~0 >>> -length);

    bitPosition = newBitPosition;
    if(bitPosition > 0)
    {
        bytes.position--;
    }
    return result;
}
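
The bookkeeping is easier to follow in a slower, bit-at-a-time version. The Python sketch below is not how SWFWire does it (the code above reads whole bytes and shifts); BitReader and its method names are invented for illustration. It assumes bits are consumed from the most significant end of each byte, and adds a signed variant that sign-extends the result.

```python
class BitReader:
    def __init__(self, data):
        self.data = data
        self.byte_pos = 0
        self.bit_pos = 0  # bits already consumed in the current byte

    def read_ub(self, length):
        """Read `length` bits as an unsigned integer, most significant bit first."""
        result = 0
        for _ in range(length):
            byte = self.data[self.byte_pos]
            bit = (byte >> (7 - self.bit_pos)) & 1
            result = (result << 1) | bit
            self.bit_pos += 1
            if self.bit_pos == 8:  # move on to the next byte
                self.bit_pos = 0
                self.byte_pos += 1
        return result

    def read_sb(self, length):
        """Read `length` bits as a signed (two's complement) integer."""
        value = self.read_ub(length)
        if length and value >> (length - 1):  # top bit set: negative
            value -= 1 << length
        return value

    def align(self):
        """Skip to the next byte boundary, as byte-aligned types require."""
        if self.bit_pos:
            self.bit_pos = 0
            self.byte_pos += 1
```

Reading 3 bits and then 7 bits from the bytes B4 C0 yields 5 and 83, leaving the reader two bits into the second byte until align() is called.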

Byte arrays #

These are just arrays of integer types.

ABC #

The most interesting SWF tag is the DoABC tag. It contains the ActionScript Byte Code (ABC), which has its own spec.

Constant pool #

Basically a huge array of all the constant values in the code. This consists of strings, numbers, namespaces, and references known as multinames.

Classes #

Class definitions are very simple. They contain references to the static properties, and a class initializer, which contains all the code in the top level of your class, such as initialization code for static properties.

Instances #

Instance definitions contain references to all the instance properties and methods of a class. They also specify the super class and implemented interfaces.

Method bodies #

These contain the actual code for each method, which SWFWire interprets as an array of instructions. They also contain some metadata that helps the runtime know how to run them efficiently, such as the number of local registers and maximum stack needed.

Instructions #

ABC instructions are kind of like ActionScript mixed with assembly. They operate on an operand stack, local registers, and a scope stack, but they still have a notion of objects and method calls.

Decompilation #

Flash compilers transform code from ActionScript into ABC. Each transformation has many possible correct answers, and optimizations can be made to improve performance. A decompiler works in a similar way, but in the opposite direction. Optimizations can be made to improve readability. In both cases, there is an art to finding new optimizations. If you read ABCToActionScript it will be clear that there’s more art than science.

Here’s what the instructions for an empty constructor look like:

getlocal0  
pushscope  
getlocal0  
constructsuper  argCount: 0
returnvoid

When the runtime enters a method, it sets local register 0 to this, and the following registers to the method arguments. The ABC spec specifies how each instruction modifies the stacks and local registers. Almost every method starts with getlocal0, pushscope. This pushes this onto the scope stack, so unqualified property references will be looked up on this object first. Next, we push this onto the operand stack and call constructsuper, which pops a value off the operand stack and calls the constructor of the base class on it. This is the compiler implementing the default ActionScript behavior of calling the superclass constructor when the constructor doesn’t call super() itself.
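
To see how the three pieces of VM state move, here is a toy Python interpreter for just the four instructions in that listing. It is a sketch for illustration only: the tuple encoding of instructions, the run function, and the super_constructor callback are all made up, and real ABC has many more opcodes.

```python
EMPTY_CONSTRUCTOR = [
    ("getlocal0",),
    ("pushscope",),
    ("getlocal0",),
    ("constructsuper", 0),
    ("returnvoid",),
]


def run(instructions, this, super_constructor):
    """Execute the toy instruction list; return the final (operand_stack, scope_stack)."""
    local_registers = [this]  # register 0 holds `this`; arguments would follow
    operand_stack = []
    scope_stack = []
    for op, *args in instructions:
        if op == "getlocal0":
            operand_stack.append(local_registers[0])
        elif op == "pushscope":
            scope_stack.append(operand_stack.pop())
        elif op == "constructsuper":
            (arg_count,) = args
            # Arguments are popped first (last argument on top), then the receiver.
            call_args = [operand_stack.pop() for _ in range(arg_count)][::-1]
            receiver = operand_stack.pop()
            super_constructor(receiver, *call_args)
        elif op == "returnvoid":
            break
        else:
            raise ValueError("unsupported instruction: " + op)
    return operand_stack, scope_stack
```

Running it on the empty constructor leaves the operand stack empty and the scope stack holding only this, with the base constructor called exactly once.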

Here’s a less trivial method, taken from SWFWire itself:

getlocal0  
pushscope  
getlocal1  
getproperty  bytes
coerce  com.swfwire.decompiler.abc.ABCByteArray
setlocal2  
findpropstrict  com.swfwire.decompiler.abc.tokens.MetadataInfoToken
constructprop  argCount: 0, index: 20
coerce  com.swfwire.decompiler.abc.tokens.MetadataInfoToken
setlocal3  
getlocal3  
getlocal2  
callproperty  argCount: 0, readU30
setproperty  name
getlocal3  
getlocal2  
callproperty  argCount: 0, readU30
setproperty  itemCount
getlocal3  
findpropstrict  Vector
getproperty  Vector
findpropstrict  com.swfwire.decompiler.abc.tokens.ItemInfoToken
getproperty  com.swfwire.decompiler.abc.tokens.ItemInfoToken
applytype  argCount: 1
getlocal3  
getproperty  itemCount
construct  argCount: 1
setproperty  items
pushbyte  byteValue: 0
convert_u  
setlocal  index: 4
pushbyte  byteValue: 0
convert_u  
setlocal  index: 4
jump  reference: 
label  
getlocal3  
getproperty  items
getlocal  index: 4
getlocal0  
getlocal1  
callproperty  argCount: 1, readItemInfoToken
setproperty  /263(1b) // This appears to be a compiler bug.
getlocal  index: 4
increment  
convert_u  
setlocal  index: 4
getlocal  index: 4
getlocal3  
getproperty  itemCount
iflt  reference: 
getlocal3  
returnvalue

In the actual byte code, references to properties, such as ItemInfoToken, are indexes into the multiname array of the constant pool. These are resolved by SWFWire when displaying the byte code to make it readable.

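Here’s the original source of a similar method, which reads a MethodBodyInfoToken rather than the MetadataInfoToken in the listing above: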

var bytes:ABCByteArray = context.bytes;

var methodBodyInfo:MethodBodyInfoToken = new MethodBodyInfoToken();

var iter:uint;
methodBodyInfo.method = bytes.readU30();
methodBodyInfo.maxStack = bytes.readU30();
methodBodyInfo.localCount = bytes.readU30();
methodBodyInfo.initScopeDepth = bytes.readU30();
methodBodyInfo.maxScopeDepth = bytes.readU30();
methodBodyInfo.codeLength = bytes.readU30();
methodBodyInfo.code = new ByteArray();
if(methodBodyInfo.codeLength > 0)
{
    bytes.readBytes(methodBodyInfo.code, 0, methodBodyInfo.codeLength);
}

methodBodyInfo.exceptionCount = bytes.readU30();
methodBodyInfo.exceptions = new Vector.<ExceptionInfoToken>(methodBodyInfo.exceptionCount);
for(iter = 0; iter < methodBodyInfo.exceptionCount; iter++)
{
    methodBodyInfo.exceptions[iter] = readExceptionInfoToken(context);
}
methodBodyInfo.traitCount = bytes.readU30();
methodBodyInfo.traits = new Vector.<TraitsInfoToken>(methodBodyInfo.traitCount);
for(iter = 0; iter < methodBodyInfo.traitCount; iter++)
{
    methodBodyInfo.traits[iter] = readTraitsInfoToken(context);
}

return methodBodyInfo;

Here’s the decompiled version from SWFWire:

var uint1:uint = 0;
var aBCByteArray1:com.swfwire.decompiler.abc.ABCByteArray = context.bytes;
var methodBodyInfoToken1:com.swfwire.decompiler.abc.tokens.MethodBodyInfoToken = new com.swfwire.decompiler.abc.tokens.MethodBodyInfoToken();
methodBodyInfoToken1.method = aBCByteArray1.readU30();
methodBodyInfoToken1.maxStack = aBCByteArray1.readU30();
methodBodyInfoToken1.localCount = aBCByteArray1.readU30();
methodBodyInfoToken1.initScopeDepth = aBCByteArray1.readU30();
methodBodyInfoToken1.maxScopeDepth = aBCByteArray1.readU30();
methodBodyInfoToken1.codeLength = aBCByteArray1.readU30();
methodBodyInfoToken1.code = new flash.utils.ByteArray();
if(methodBodyInfoToken1.codeLength > 0)
{
  aBCByteArray1.readBytes(methodBodyInfoToken1.code, 0, methodBodyInfoToken1.codeLength);
}
methodBodyInfoToken1.exceptionCount = aBCByteArray1.readU30();
methodBodyInfoToken1.exceptions = new Vector.<com.swfwire.decompiler.abc.tokens.ExceptionInfoToken>(methodBodyInfoToken1.exceptionCount);
uint1 = 0;
while(uint1 < methodBodyInfoToken1.exceptionCount)
{
  methodBodyInfoToken1.exceptions[uint1] = this.readExceptionInfoToken(context);
  uint1 = uint1 + 1;
}
methodBodyInfoToken1.traitCount = aBCByteArray1.readU30();
methodBodyInfoToken1.traits = new Vector.<com.swfwire.decompiler.abc.tokens.TraitsInfoToken>(methodBodyInfoToken1.traitCount);
uint1 = 0;
while(uint1 < methodBodyInfoToken1.traitCount)
{
  methodBodyInfoToken1.traits[uint1] = this.readTraitsInfoToken(context);
  uint1 = uint1 + 1;
}
return methodBodyInfoToken1;

Pretty close. The SWF doesn’t contain information about local variable names. Also, SWFWire doesn’t bother trying to create for loops since there is always an equivalent while loop.

So, how can SWFWire look at these instructions and create ActionScript code? Well, most instructions, such as method calls, can be easily translated into ActionScript if you know the state of the stacks and registers. All we need to do is keep track of the VM state, and write ActionScript when certain instructions are encountered. Unfortunately, branches complicate things. In the case of loops, we can encounter the same instruction twice with a different state. So, to prevent infinite loops while correctly evaluating the code, we need to detect when we arrive at an instruction we’ve already seen with a state we’ve already been in.
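
Here is a heavily simplified Python sketch of that approach, not SWFWire's actual algorithm. It tracks only an operand stack of expression strings, supports a handful of opcodes in a made-up tuple encoding, and uses a set of (instruction index, stack) pairs to stop when it reaches a state it has already explored. It does not try to rebuild while loops, and it leaves out registers and the scope stack entirely.

```python
def decompile(code, local_names):
    """Turn a toy instruction list into a list of ActionScript-like statements."""
    statements = []
    seen = set()       # (pc, operand stack) states already explored
    work = [(0, ())]   # states still to explore
    while work:
        pc, stack = work.pop()
        while pc < len(code):
            if (pc, stack) in seen:
                break  # same instruction, same state: nothing new to learn
            seen.add((pc, stack))
            op, *args = code[pc]
            pc += 1
            if op == "getlocal":
                stack += (local_names[args[0]],)
            elif op == "pushbyte":
                stack += (str(args[0]),)
            elif op == "getproperty":
                stack = stack[:-1] + (stack[-1] + "." + args[0],)
            elif op == "callproperty":
                name, argc = args
                first_arg = len(stack) - argc
                call = "%s.%s(%s)" % (stack[first_arg - 1], name, ", ".join(stack[first_arg:]))
                stack = stack[:first_arg - 1] + (call,)
            elif op == "setproperty":
                obj, value = stack[-2:]
                statements.append("%s.%s = %s;" % (obj, args[0], value))
                stack = stack[:-2]
            elif op == "returnvalue":
                statements.append("return %s;" % stack[-1])
                break
            elif op == "jump":
                pc = args[0]
            elif op == "iflt":
                stack = stack[:-2]             # the comparison pops both operands
                work.append((args[0], stack))  # explore the taken branch later
            else:
                raise ValueError("unsupported instruction: " + op)
    return statements
```

Fed the getlocal3, getlocal2, callproperty readU30, setproperty name sequence from the listing above (with registers 2 and 3 named bytes and token), it produces token.name = bytes.readU30();. When a backward iflt leads to an instruction index and stack it has already seen, the walk simply stops there instead of looping forever.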

Another option would have been to try to recognize patterns in the byte code and transform them blindly to ActionScript that is known to create that pattern. However, compiler bugs and obfuscators can break that kind of decompiler, causing it to generate invalid ActionScript.

Debugging #

What other useful things could we do with the ability to read and write byte code? How about automatically instrumenting a SWF with debug information? Oh, and our code is written in Flash too, so we can even load the instrumented SWF and play with it. We can also do fun things like instrument SWFs loaded with loadBytes. This means that it doesn’t matter if you embed your main SWF into another SWF and encrypt it to try to hide it from decompilers. If you call loadBytes, we can modify the SWF.

Here’s how we instrument each SWF:

Inject logging statements at the beginning of every function and at every return
Inject a debug call immediately after every object allocation
Replace calls to certain Flash APIs with calls to SWFWire APIs

Now we can trace every function call. We can store a reference to every newly allocated object in a Dictionary with weak keys, then iterate the Dictionary to get references to every object, and detect when they have been garbage collected. We can track calls to interesting Flash APIs, such as network requests.
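
The weak-key trick isn't specific to Flash. Here is a sketch of the same idea in Python using weakref.WeakKeyDictionary, whose entries vanish once nothing else references the key. The AllocationTracker class, its method names, and the Sprite stand-in are all invented for this example.

```python
import gc
import weakref


class AllocationTracker:
    """Remember every tracked object without keeping it alive."""

    def __init__(self):
        self._live = weakref.WeakKeyDictionary()
        self._next_id = 0

    def track(self, obj, site):
        # Returns obj so the call can wrap an allocation expression in place.
        self._live[obj] = (self._next_id, site)
        self._next_id += 1
        return obj

    def live_objects(self):
        """Return (id, allocation site) for every object not yet collected."""
        return sorted(self._live.values())


class Sprite:  # stand-in for any allocated object
    pass


tracker = AllocationTracker()
first = tracker.track(Sprite(), "Main.init")
second = tracker.track(Sprite(), "Main.update")
before = tracker.live_objects()

del second    # drop the only strong reference
gc.collect()  # make collection explicit rather than relying on refcounting
after = tracker.live_objects()
```

Comparing before and after shows which allocations have been collected, without the tracker itself ever keeping an object alive.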

That’s basically how the debugger works, but the details are a little more complicated. For example, branch instructions reference other instructions by index. If we insert a new instruction, all branch indexes after our new instruction will be wrong. We get around these problems by changing index references to Dictionary references before modifying the instructions, then converting them back to indices afterward.
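
The index fix-up can be sketched in a few lines of Python. This is only an illustration of the idea, not SWFWire's implementation: it uses direct object references rather than a Dictionary, the Instruction class and insert_instructions function are hypothetical, and it assumes the inserted instructions are not branches themselves.

```python
class Instruction:
    def __init__(self, op, target=None):
        self.op = op
        self.target = target  # an index, or an Instruction while linked


def insert_instructions(code, insertions):
    """Insert new instructions and keep branch targets pointing at the same instruction.

    `insertions` maps an index in `code` to a list of instructions to place before it.
    """
    # 1. Replace index targets with references to the instruction objects.
    for ins in code:
        if ins.target is not None:
            ins.target = code[ins.target]

    # 2. Insert freely; object references don't care about positions.
    result = []
    for index, ins in enumerate(code):
        result.extend(insertions.get(index, []))
        result.append(ins)

    # 3. Convert references back to indices in the new list.
    position = {id(ins): i for i, ins in enumerate(result)}
    for ins in result:
        if ins.target is not None:
            ins.target = position[id(ins.target)]
    return result
```

One consequence of this particular sketch: a branch that targeted an instruction still lands on that instruction, so anything inserted in front of it is skipped when control arrives via the branch.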

I will now proceed to purge all information related to Flash from my brain.
