Creating Assembly Language DLL Modules for Windows – DEVELOPPARADISE

This article explains how to create fully functional .DLL and .LIB modules for Windows using pure assembly language. While much of the discussion centers on working with Visual Studio, the .DLL and .LIB modules you create can be integrated into any language that allows their use. What’s generated are standard .DLLs, with nothing to distinguish the final product from .DLL modules created any other way.

Visual Studio only allows inline assembly in 32-bit mode, and it doesn’t even allow that much in 64-bit mode. In the latter, you have to use the enormously complex and confusing intrinsics instead. Either way, you’re only getting a fraction of the power that ASM offers for handling processor-intensive tasks effectively. Most of what you can do in a full-fledged ASM .DLL module can’t be touched with inline assembly or intrinsics.

Creating an all-assembly .DLL module is nowhere near as complicated as many may think – you can do it with Notepad alone (assuming you have a suitable assembler and linker). It opens up the full power of the language – including functions, macros, and a host of other benefits that are unavailable in Visual Studio (or any other environment that lets you work with some form of ASM).

In Windows, the Portable Executable format – PE – is used universally for executables, drivers, and .DLLs. The only real difference between them is what the loader chooses to do with them. There are other subtle changes in various fields within the files but the overall format is identical between all three file types.

Getting Started – the Main Module

ASM comments use the ; character. There is no open/close comment pair, although you can use:

comment ^
This is comment text.
It can run on forever as many politicians do.
Vote for me and put the Purple Party in power!
^

The ^ character can be anything, but keep in mind that whatever is used will close the comment block the first time it’s encountered by the assembler. So pick something that isn’t going to be part of the comment block itself.

Aside from comments, the first line of your ASM source file should be:

.data

Once declared, you can then include files that contain data declarations, or enter those declarations directly. The .data block ends implicitly when the .code directive is encountered.

Although they can go anywhere, I typically throw my macro definitions into the data block as well. Do what works for you, but macros, of course, must be defined before they’re actually used.

After your data declarations comes:

.code

Now you’re in the code segment.

At this point, you’re thinking well duh!, but that’s about the extent of how complex an assembly language app is so get used to “easy.”

For .DLL modules, the traditional entry point is DllMain, and you’ll have to declare it as a function:

DllMain   proc   ; 64-bit function

                   … code goes here …


DllMain   endp   ; End function

If you’re using 32-bit code, the declaration is:

DllMain   proc   near   hInstDll:dword, fdwReason:dword, lpReserved:dword

… code goes here …

               ret   12   ; Return to caller  

DllMain endp ; End function

In the 32-bit version, the parameters hInstDll, fdwReason, and lpReserved can be accessed directly, by name. In the 64-bit version, the 64-bit calling convention is followed, which means that on entry into the DllMain function:

RCX = hInstDll
RDX = fdwReason
R8  = lpReserved

The Windows loader sets up the parameters to pass so when entering DllMain, the input values will always be as specified above. The loader doesn’t know or care which language was used to create a .DLL module. If it’s formatted correctly, DllMain will enter with standard parameters passed.

The fdwReason parameter will contain one of only four possible values: DLL_PROCESS_ATTACH (1), DLL_PROCESS_DETACH (0), DLL_THREAD_ATTACH (2), and DLL_THREAD_DETACH (3). These values should be declared as constants somewhere in your data segment (or before, if you prefer, as equates are insensitive to which segment they live in), as follows:

DLL_PROCESS_DETACH    equ     0
DLL_PROCESS_ATTACH    equ     1
DLL_THREAD_ATTACH     equ     2
DLL_THREAD_DETACH     equ     3

This allows you to work with these values by their standard names instead of using hard-coded integers.

Handling fdwReason Within DllMain

The fdwReason parameter answers the question “why are you calling me?” I employ a method of message routing that I’ve used primarily in window callback functions, but I also apply it to DllMain functions. I’ve done it since the dawn of mankind. This method looks up the incoming message (in this case fdwReason) in a lookup table, and jumps to the address stored at the same position in a router table. This allows lookup/router table pairs to be employed in place of switch statements. Since all the values in both tables are in static memory, oodles of code are saved over using the brute-force switch statement, which compares the incoming message against a list of possible values, one at a time. In addition to hard-coding the values in the code stream, the switch approach is also empirically slow and inefficient. The impact of using the lookup method in DllMain will actually be negligible, considering the function is only called twice for each process or thread that attaches to it, but I use it nonetheless if for no other reason than it involves a lot less coding.

The lookup table for the fdwReason value is shown below – don’t type it in because it’s going to be ditched shortly; it’s presented here for informational purposes:

dll_reasons    qword     ( dll_reasons_e - dll_reasons_s ) / 8  ; Count of values in the table
dll_reasons_s  qword     DLL_PROCESS_DETACH                     ; DLL_PROCESS_DETACH
               qword     DLL_PROCESS_ATTACH                     ; DLL_PROCESS_ATTACH
               qword     DLL_THREAD_ATTACH                      ; DLL_THREAD_ATTACH
               qword     DLL_THREAD_DETACH                      ; DLL_THREAD_DETACH
dll_reasons_e  label     qword                                  ; Reference label

The router table is listed below:

dll_router     qword     DllMain_P_Detach                       ; DLL_PROCESS_DETACH = 0
               qword     DllMain_P_Attach                       ; DLL_PROCESS_ATTACH = 1
               qword     DllMain_T_Attach                       ; DLL_THREAD_ATTACH = 2
               qword     DllMain_T_Detach                       ; DLL_THREAD_DETACH = 3

To perform the lookup, the CPU provides the instruction repnz scasq. It’s short for repeat while zero flag clear (or repeat while not zero): scan string quadword. The instruction searches 64-bit quadwords (qwords) beginning at the location pointed to by the RDI register, for a count of RCX, comparing the value in RAX against each successive qword. This register usage is hardwired for this instruction so it cannot change. After setting the RAX, RCX, and RDI registers, the instruction is issued. The CPU then scans qword after qword beginning at memory location RDI, decrementing RCX each scan; RDI auto-advances by 8 bytes with each scan as long as the direction flag is clear, which it is on entry to any function under the Windows calling conventions. This continues until either a match is found or the count in RCX reaches zero. Since the scan of a matching value must complete to determine that it’s a match, RDI will always point just past the matching value.

Coding of this process is shown below:

; RDI and RSI are nonvolatile in the 64-bit calling convention; save them first and restore them before returning
lea   rdi, dll_reasons      ; Set the scan start pointer
mov   rcx, [rdi]            ; Load the first qword – the entry count for the table
scasq                       ; Skip over the entry count
mov   rsi, rdi              ; Save the location of table entry 0
mov   rax, fdwReason        ; Set the scan target (the value to search for)
repnz scasq                 ; Execute the scan
jnz   <no_match>            ; Not found – do <whatever>
sub   rdi, rsi              ; Set byte count into table, remembering target was passed over
sub   rdi, 8                ; Undo the scan overshoot (one qword = 8 bytes)
lea   rax, dll_router       ; Get the router table offset
jmp   qword ptr [rdi + rax] ; Jump to the target process

This code handles tables of any size. It eliminates the need to check the incoming message (fdwReason, in this case) against each possible value one at a time, and allows for very clean source that lists all the possible values in one place. The dll_reasons table is set up to allow the addition of an unlimited number of new entries without constantly having to update the table’s entry count – it updates automatically.
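For readers who think more easily in C, the scan can be modeled roughly as follows. This is only a sketch of what repnz scasq does, not code that belongs in the DLL; the names scan_offset and DLL_REASON_COUNT are made up for illustration. It shows why the pointer ends up one qword (8 bytes) past the match, and how the byte offset into the router table falls out of that.

```c
#include <stddef.h>
#include <stdint.h>

/* Reason codes, with the values listed in the article. */
enum {
    DLL_PROCESS_DETACH = 0,
    DLL_PROCESS_ATTACH = 1,
    DLL_THREAD_ATTACH  = 2,
    DLL_THREAD_DETACH  = 3
};

/* Lookup table: the values to search for (dll_reasons_s up to dll_reasons_e). */
static const uint64_t dll_reasons[] = {
    DLL_PROCESS_DETACH, DLL_PROCESS_ATTACH,
    DLL_THREAD_ATTACH,  DLL_THREAD_DETACH
};

/* Entry count derived from the table size, like ( dll_reasons_e - dll_reasons_s ) / 8. */
#define DLL_REASON_COUNT (sizeof dll_reasons / sizeof dll_reasons[0])

/*
 * Model of "repnz scasq": compare target against successive qwords,
 * advancing the pointer after every compare, match or not.
 * Returns the byte offset of the matching entry, or -1 if nothing matched.
 */
ptrdiff_t scan_offset(const uint64_t *table, size_t count, uint64_t target)
{
    const uint64_t *start = table;   /* rsi: location of table entry 0 */
    const uint64_t *p = table;       /* rdi: the scan pointer          */
    int found = 0;

    while (count != 0 && !found) {
        found = (*p == target);      /* the compare sets ZF on a match  */
        p++;                         /* rdi advances 8 bytes regardless */
        count--;                     /* rcx counts down                 */
    }
    if (!found)
        return -1;                   /* corresponds to the jnz branch   */

    /* p now sits one entry past the match, so back off by one qword. */
    return (ptrdiff_t)((const char *)p - (const char *)start)
           - (ptrdiff_t)sizeof(uint64_t);
}
```

With the table above, scan_offset(dll_reasons, DLL_REASON_COUNT, DLL_THREAD_ATTACH) yields 16, the byte offset of the third router entry.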

Now that I’ve presented the lookup process, I’m going to abandon the lookup part. There’s no point in using it in this relatively unique case. Since fdwReason can only have the values 0, 1, 2, or 3, simply multiply the value by 8 (shift left 3 bits) and use that value as an offset directly into the router table. In the almost-exclusive case of handling fdwReason within DllMain, using the lookup table can’t return any information that isn’t already contained in fdwReason.

mov rax, fdwReason          ; Get the incoming value
shl rax, 3                  ; Scale to * 8 for qword size
lea r10, dll_router         ; Get the router table address
jmp qword ptr [r10 + rax]   ; Jump to the target process

The above handles the routing for the fdwReason value perfectly.
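The direct-indexing idea can be pictured the same way in C. The sketch below assumes hypothetical handler functions (on_process_attach and friends), a hypothetical route_reason wrapper, and a last_reason_handled variable that exists only so the routing can be observed. The range check is an addition for safety and is not part of the assembly snippet above.

```c
#include <stddef.h>
#include <stdint.h>

/* Records which handler ran last, so the routing can be observed. */
int last_reason_handled = -1;

/* Hypothetical handlers: each notes its reason code and reports success (1). */
int on_process_detach(void) { last_reason_handled = 0; return 1; }
int on_process_attach(void) { last_reason_handled = 1; return 1; }
int on_thread_attach(void)  { last_reason_handled = 2; return 1; }
int on_thread_detach(void)  { last_reason_handled = 3; return 1; }

/* Router table: entry n handles fdwReason == n, mirroring dll_router. */
int (*const router_model[])(void) = {
    on_process_detach,   /* DLL_PROCESS_DETACH = 0 */
    on_process_attach,   /* DLL_PROCESS_ATTACH = 1 */
    on_thread_attach,    /* DLL_THREAD_ATTACH  = 2 */
    on_thread_detach     /* DLL_THREAD_DETACH  = 3 */
};

/* Index the table directly with the reason code and return the handler's result. */
int route_reason(uint32_t fdwReason)
{
    size_t count = sizeof router_model / sizeof router_model[0];

    if (fdwReason >= count)
        return 0;        /* unknown reason: report failure (an assumption) */
    return router_model[fdwReason]();
}
```

No search happens here at all: the reason code is the table index, which is exactly why the lookup table adds nothing in this particular case.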

Moving On…

Once you’ve executed the desired handler for the incoming fdwReason value, you can simply exit DllMain. The final return value must be present in RAX (EAX in 32-bit code) on return from DllMain. 1 means success, 0 means failure. The loader only examines this value for DLL_PROCESS_ATTACH: if it receives a return code of 0 there, it will unload the library and call it a wash, so the proper return value is critical.

External Functions

Linking to Windows libraries requires declaring the functions you’re going to be calling as external. Nobody likes name mangling, even though the 64-bit Windows libraries have dropped all the @24 type stuff at the end of the function name. But you still have __imp_ preceding each function name and nobody wants to work with that every time a function is called.

Assembly’s text equates rescue the developer from this nightmare. The full declaration for the Windows function LoadLibrary is shown below – noting that the A and W variants still must be specified when any of the parameters going into a function are strings.

First, the 64-bit version:

extrn        __imp_LoadLibraryA:qword   ; In 64 bit mode, externals are always declared as qwords
LoadLibrary  textequ   <__imp_LoadLibraryA>

The 32 bit version looks like this:

extrn        _imp__LoadLibraryA@4:dword ; In 32 bit mode, externals are always declared as dwords
LoadLibrary  textequ   <_imp__LoadLibraryA@4>

Count the underscores carefully! They’re not the same between 32- and 64-bit declarations.

For those not familiar with abhorrent name mangling in 32-bit mode, the number after the @ is the total size, in bytes, of the parameters passed to the function. When every parameter is 32 bits wide, that is the number of parameters times four. LoadLibrary takes one parameter so it’s declared with @4.

Once the externals are declared, you can simply:

call LoadLibrary   ; Load the target library

and you’re off and running.
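The arithmetic behind the 32-bit @ suffix is easy to check with a few lines of C. This is a sketch only: stdcall_decorate is a hypothetical helper, it assumes every parameter is 32 bits wide, and it builds just the Name@N portion, leaving the import prefix to be added as shown in the declarations above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Writes "Name@N" into out, where N is the total parameter size in bytes.
 * Assumes every parameter is a 32-bit value, so N = dword_params * 4.
 * Returns the number of characters snprintf would have written.
 */
int stdcall_decorate(const char *name, unsigned dword_params,
                     char *out, size_t out_size)
{
    return snprintf(out, out_size, "%s@%u", name, dword_params * 4u);
}
```

LoadLibraryA takes one parameter and comes out as LoadLibraryA@4; MessageBoxA takes four and comes out as MessageBoxA@16.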

64-Bit Parameter Passing

There are many articles online covering 64-bit parameter passing. In 64-bit mode, the first four parameters are passed in registers, not on the stack. It can become confusing when it comes to floats, but the following table should clarify:

Parameter Number    Integer or Pointer    Float or Double
1                   RCX                   XMM0
2                   RDX                   XMM1
3                   R8                    XMM2
4                   R9                    XMM3

Both single- and double-precision float values are passed in the XMM registers. Under the standard 64-bit calling convention, the YMM and ZMM registers are not used for passing parameters.

All parameters after the fourth are passed on the stack, so register usage is irrelevant for them; floats and doubles beyond the fourth parameter are simply placed on the stack by value. Any value that doesn’t fit in 8 bytes (a 128-bit value, for example) must be passed by reference (a pointer to the value is passed instead of the actual value).

If the function foo were being called with the following C++ code:

int hr = foo ( "Hello", "world!", 3.14f, 95 );

“Hello” would be pointed to by RCX, “world!” would be pointed to by RDX, 3.14 would be contained in XMM2 (it’s the third parameter), and the integer value 95 would be in R9.
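The position-based register assignment can be written down as a small C lookup. This is a sketch under the standard Windows 64-bit convention; x64_param_location and arg_kind are hypothetical names, and the index is zero-based (0 is the first parameter).

```c
#include <stddef.h>

/* Whether a parameter is an integer/pointer or a float/double. */
typedef enum { ARG_INTEGER, ARG_FLOAT } arg_kind;

/*
 * Returns where the parameter at the given zero-based index is passed.
 * The slot is chosen by position alone; the kind only selects between
 * the general-purpose register and the XMM register for that slot.
 */
const char *x64_param_location(size_t index, arg_kind kind)
{
    static const char *const int_regs[4]   = { "RCX",  "RDX",  "R8",   "R9"   };
    static const char *const float_regs[4] = { "XMM0", "XMM1", "XMM2", "XMM3" };

    if (index >= 4)
        return "stack";          /* fifth parameter onward */
    return (kind == ARG_FLOAT) ? float_regs[index] : int_regs[index];
}
```

For the foo call above, indexes 0 and 1 (the two strings) map to RCX and RDX, index 2 as a float maps to XMM2, and index 3 as an integer maps to R9. Note that XMM0 and XMM1 are simply skipped, because the register is picked by position and not by how many floats came before.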

I’ve written in-depth about the 64-bit calling convention, as have many others. My article on the subject is on CodeProject at Nightmare on (Overwh)Elm Street: The 64-bit Calling Convention.

For 32-bit code, parameters are passed on the stack. The function that’s called will clear them when it returns so no cleanup is required. Remember to push parameters in reverse order, last parameter first. 32-bit code for calling foo, as shown in the C++ code above, might appear as follows (register contents don’t matter, since foo looks only at the stack for its parameter data):

push     95
push     pi          ; "pi" is a 32-bit real4 variable initialized at 3.14
push     offset world_string
push     offset hello_string
call     foo
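Why the pushes run in reverse is easiest to see with a toy stack. The C sketch below is a model only, with hypothetical names (model_stack, model_push, model_param); it ignores the return address that call pushes and treats every parameter as one 32-bit slot.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_STACK_SLOTS 16

typedef struct {
    uint32_t slots[MODEL_STACK_SLOTS];
    size_t top;      /* index of the most recent push; MODEL_STACK_SLOTS when empty */
} model_stack;

void model_stack_init(model_stack *s)
{
    s->top = MODEL_STACK_SLOTS;
}

/* Like "push": the stack grows toward lower addresses (lower indices here). */
int model_push(model_stack *s, uint32_t value)
{
    if (s->top == 0)
        return 0;                    /* model stack is full */
    s->slots[--s->top] = value;
    return 1;
}

/*
 * Parameter n (zero-based) as the callee would find it, counting upward
 * from the top of the stack. The caller must keep n within what was pushed.
 */
uint32_t model_param(const model_stack *s, size_t n)
{
    return s->slots[s->top + n];
}
```

Pushing the last parameter first leaves the first parameter at the lowest address, so the callee finds parameter 0 at the top of the stack, parameter 1 just above it, and so on.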

Wrapping Up DllMain

When you’re done coding the function, there’s little left to do beyond closing out DllMain:

ret                 ; Return from DllMain (for 32-bit code, use ret 12)

DllMain    endp     ; End procedure

The last line of the main module is simply end for 64-bit code, and end DllMain for 32-bit code.

Compiling the Project

The batch file to compile the code is shown below:

@echo off

rem Set this value to the location of rc.exe under the VC directory; it contains the RC.EXE executable
set rc_directory="C:/Program Files (x86)/Windows Kits/10/bin/x86

rem Set this value to the location of ml64.exe under the VC directory
set ml_directory="C:/Program Files (x86)/Microsoft Visual Studio 14.0/VC/bin/x86_amd64

rem Set this value to the location of link.exe under the VC directory; it contains the LINK.EXE executable
set link_directory="C:/Program Files (x86)/Microsoft Visual Studio 14.0/VC/bin

rem Set this directory to the INCLUDE location for ASM source
set asm_source_directory="C:/[your ASM directory]

rem Set this directory to the include path for Windows libraries. Use c:/dir1;c:/dir2 format for multiple directories.
set lib_directory="C:/Program Files (x86)/Windows Kits/10/Lib/10.0.10586.0/um/x64

rem Each directory value above begins with a quote; the matching closing quote is supplied where the value is used below.

%rc_directory%/rc.exe" %asm_source_directory%/resource.rc"
%ml_directory%/ml64.exe" /c /Cp /Cx /Fm /FR /W2 /Zd /Zf /Zi /Ta %asm_source_directory%/your_dll.asm" /I%asm_source_directory%" > %asm_source_directory%/asm_errors.txt"
%link_directory%/link.exe" %asm_source_directory%/your_dll.obj" /debug /def:%asm_source_directory%/your_dll.def" /entry:DllMain /manifest:no /machine:x64 /map /dll /out:%asm_source_directory%/your_dll.dll" /pdb:%asm_source_directory%/your_dll.pdb" /subsystem:windows,6.0 /LIBPATH:%lib_directory%" user32.lib kernel32.lib

rem Use /machine:x86 in place of /machine:x64 for 32-bit code.
rem Use /debug for debug symbols; use /debug:none for a release version.

rem copy *.dll [wherever you want the .DLL and .LIB files to copy to]"
type %asm_source_directory%/asm_errors.txt"

There are so many variations in directory setups between any two developers that common sense will have to be applied to get the batch file functioning correctly. What’s in it is straightforward and should not pose any problems.


This article has been aimed at guiding you through the process of creating an all-assembly .DLL module for the extreme situations that might warrant it. Beyond what’s been covered here, the rest of what’s involved is simply creating your functions as you normally would in ASM.

Using ASM in high-demand situations can potentially save thousands and thousands of cumulative man-hours for a team of developers over being restricted to inline assembly or intrinsics – depending, of course, on the specific situation. For example, in today’s world, it’s considered normal for compiles that inherently require seconds to complete to take hours. Creating ASM-coded .DLL modules opens up the power of the entire language. If you have Visual Studio, you can use ml64.exe and the C++ linker; if you use another language you can use its linker and any number of alternative assemblers available online. Of course, if you use another assembler besides Microsoft’s, you’ll have to adjust the code and data snippets shown here for the particulars of that assembler.

Creating just one .DLL project using the direction in this article should be all that’s needed to make any developer comfortable with the process, illustrating how relatively simple the task really is.

There are situations in the real world that demand special handling. Being properly armed with the capability to apply ASM to a given task when it’s called for can translate to enormous time savings for many developers and end users over time. Assessing the situation realistically, Windows is so bogged down with machine-specific dependencies, OS dependencies, and dependencies on dependencies, with ten gazillion versions of all the above on any given day, nothing about adding an assembly language .DLL to your project should be out of line.