Fast Memory Copy in C# Using x86 Assembly

Source: Internet, 2015-06-10 17:55:40

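Hi everyone, I'm Oleksandr Karpov. This is my first published article, and I hope you like it.

Here I will show and explain how to copy data very quickly in C# and .NET using assembly. In my case, I use it in an application that creates video from images, video, and sound.

And if you have some other routine that you need to run as assembly from C#, this approach gives you a quick and simple way to do it.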

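Background

To fully understand this article, you should be familiar with assembly language, memory alignment, C#, Windows, and advanced .NET techniques.

To speed up copying data (copy-paste), the memory addresses need to be aligned to 16 bytes. Otherwise the speed barely changes (about 1.02 times faster in my example).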
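Whether an address meets the 16-byte requirement can be checked with a simple mask on its low four bits, which is also what the `and eax,0x0F` instruction in the machine code does. The following is a minimal sketch; the names `AlignmentHelper`, `IsAligned16`, and `AlignUp16` are hypothetical and are not part of the original code.

```csharp
using System;

static class AlignmentHelper
{
    // True when the address is a multiple of 16 (its low four bits are zero).
    public static bool IsAligned16(IntPtr address)
    {
        return (address.ToInt64() & 0xF) == 0;
    }

    // Rounds an address up to the next multiple of 16.
    // An address that is already aligned is returned unchanged.
    public static IntPtr AlignUp16(IntPtr address)
    {
        long value = address.ToInt64();
        return new IntPtr((value + 15) & ~15L);
    }
}
```

One possible use is to allocate 15 bytes more than needed from unmanaged memory and pass the allocation through `AlignUp16` to get an aligned start address, keeping the original pointer around for freeing.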

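Both Pentium III+ (KNI/MMX2) and AMD Athlon (AMD EMMX) processors support the SSE instructions used by the code in this article.

On a test machine with a Pentium Dual-Core E5800 3.2GHz and 4GB of dual-channel RAM, copying between 16-byte-aligned addresses was 1.5 times faster than the standard method, while with unaligned addresses the speed was almost unchanged (1.02 times).

This is a complete demo that shows both the performance test and how to use the method.

The FastMemCopy class contains all the logic for the fast memory copy.

First, create a default Windows Forms application project and put two buttons and a PictureBox control on the form, because we will test with images.

Declare a few fields first: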

string bitmapPath;
Bitmap bmp, bmp2;
BitmapData bmpd, bmpd2;
byte[] buffer = null;

Now create two methods to handle the button click events.

The standard method:

private void btnStandard_Click(object sender, EventArgs e)
{
    using (OpenFileDialog ofd = new OpenFileDialog())
    {
        if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)
            return;

        bitmapPath = ofd.FileName;
    }

    //open the selected image and create an empty image of the same size
    OpenImage();

    //lock the bits of both images so they can be read and written
    UnlockBitmap();

    //copy data from one image to another by the standard method
    CopyImage();

    //release the bits so the images can be displayed
    LockBitmap();

    //let's see what we have
    pictureBox1.Image = bmp2;
}

The fast method:

private void btnFast_Click(object sender, EventArgs e)
{
    using (OpenFileDialog ofd = new OpenFileDialog())
    {
        if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)
            return;

        bitmapPath = ofd.FileName;
    }

    //open the selected image and create an empty image of the same size
    OpenImage();

    //lock the bits of both images so they can be read and written
    UnlockBitmap();

    //copy data from one image to another with our fast method
    FastCopyImage();

    //release the bits so the images can be displayed
    LockBitmap();

    //let's see what we have
    pictureBox1.Image = bmp2;
}

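OK, now we have the buttons and their event handlers. Next, let's implement the methods that open the image and lock and unlock it, as well as the standard copy method.

Open an image: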

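void OpenImage()
{
    pictureBox1.Image = null;
    buffer = null;
    if (bmp != null)
    {
        bmp.Dispose();
        bmp = null;
    }
    if (bmp2 != null)
    {
        bmp2.Dispose();
        bmp2 = null;
    }
    GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);

    bmp = (Bitmap)Bitmap.FromFile(bitmapPath);

    buffer = new byte[bmp.Width * 4 * bmp.Height];
    //NOTE: UnsafeAddrOfPinnedArrayElement expects an array that is already pinned.
    //The buffer is not pinned here, so this relies on the GC not moving it;
    //robust code would pin it first (for example with GCHandle.Alloc).
    bmp2 = new Bitmap(bmp.Width, bmp.Height, bmp.Width * 4, PixelFormat.Format32bppArgb,
        Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0));
}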
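OpenImage takes the address of a managed array without pinning it. One way to get a stable address is to pin the array with a `GCHandle` for as long as native code uses it. The following is a minimal sketch under that assumption; `PinnedBufferSketch` and `WithPinned` are hypothetical names, not part of the original code.

```csharp
using System;
using System.Runtime.InteropServices;

static class PinnedBufferSketch
{
    // Pins a managed array, passes its stable address to the callback,
    // and unpins the array again when the callback returns or throws.
    public static void WithPinned(byte[] buffer, Action<IntPtr> use)
    {
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            use(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free();
        }
    }
}
```

For a bitmap that wraps the buffer, as bmp2 does, the handle would have to stay allocated for the whole lifetime of the bitmap rather than for a single callback, so in that case the `GCHandle` is better kept in a field and freed when the bitmap is disposed.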

Lock and unlock the bitmaps:

void UnlockBitmap()
{
    bmpd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,
        PixelFormat.Format32bppArgb);
    bmpd2 = bmp2.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,
        PixelFormat.Format32bppArgb);
}

void LockBitmap()
{
    bmp.UnlockBits(bmpd);
    bmp2.UnlockBits(bmpd2);
}

Copy data from one image to another and show the measured time:

void CopyImage()
{
    //start stopwatch
    Stopwatch sw = new Stopwatch();
    sw.Start();

    //copy-paste data 10 times
    for (int i = 0; i < 10; i++)
    {
        System.Runtime.InteropServices.Marshal.Copy(bmpd.Scan0, buffer, 0, buffer.Length);
    }

    //stop stopwatch
    sw.Stop();

    //show measured time
    MessageBox.Show(sw.ElapsedTicks.ToString());
}

That is the standard copy method. It is not complicated at all: we use the well-known System.Runtime.InteropServices.Marshal.Copy method.
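The test code reports raw ElapsedTicks, and the length of a tick depends on the timer frequency of the machine. To compare runs across machines it can help to convert ticks to milliseconds with `Stopwatch.Frequency`. The following is a minimal sketch; `TimingSketch` and its methods are hypothetical helpers, not part of the original code.

```csharp
using System;
using System.Diagnostics;

static class TimingSketch
{
    // Converts Stopwatch ticks to milliseconds.
    // Stopwatch.Frequency is the number of ticks per second.
    public static double TicksToMilliseconds(long ticks)
    {
        return ticks * 1000.0 / Stopwatch.Frequency;
    }

    // Runs the action the given number of times and
    // returns the total elapsed time in milliseconds.
    public static double MeasureMilliseconds(Action action, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        sw.Stop();
        return TicksToMilliseconds(sw.ElapsedTicks);
    }
}
```

For example, `TimingSketch.MeasureMilliseconds(() => Marshal.Copy(bmpd.Scan0, buffer, 0, buffer.Length), 10)` would time the same ten copies as CopyImage.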

And one more intermediate method (middle-method) that calls the fast copy logic:

void FastCopyImage()
{
    FastMemCopy.FastMemoryCopy(bmpd.Scan0, bmpd2.Scan0, buffer.Length);
}

Now let's implement the FastMemCopy class. Here is the class declaration, along with some types we will use inside it:

internal static class FastMemCopy
{
    [Flags]
    private enum AllocationTypes : uint
    {
        Commit = 0x1000, Reserve = 0x2000,
        Reset = 0x80000, LargePages = 0x20000000,
        Physical = 0x400000, TopDown = 0x100000,
        WriteWatch = 0x200000
    }

    [Flags]
    private enum MemoryProtections : uint
    {
        Execute = 0x10, ExecuteRead = 0x20,
        ExecuteReadWrite = 0x40, ExecuteWriteCopy = 0x80,
        NoAccess = 0x01, ReadOnly = 0x02,
        ReadWrite = 0x04, WriteCopy = 0x08,
        GuartModifierflag = 0x100, NoCacheModifierflag = 0x200,
        WriteCombineModifierflag = 0x400
    }

    [Flags]
    private enum FreeTypes : uint
    {
        Decommit = 0x4000, Release = 0x8000
    }

    [UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
    private unsafe delegate void FastMemCopyDelegate();

    private static class NativeMethods
    {
        [DllImport("kernel32.dll", SetLastError = true)]
        internal static extern IntPtr VirtualAlloc(
            IntPtr lpAddress,
            UIntPtr dwSize,
            AllocationTypes flAllocationType,
            MemoryProtections flProtect);

        [DllImport("kernel32")]
        [return: MarshalAs(UnmanagedType.Bool)]
        internal static extern bool VirtualFree(
            IntPtr lpAddress,
            uint dwSize,
            FreeTypes flFreeType);
    }

Now declare the method itself:

public static unsafe void FastMemoryCopy(IntPtr src, IntPtr dst, int nBytes)
{
    if (IntPtr.Size == 4)
    {
        //we are in 32 bit mode

        //allocate executable memory for our asm method
        IntPtr p = NativeMethods.VirtualAlloc(
            IntPtr.Zero,
            new UIntPtr((uint)x86_FastMemCopy_New.Length),
            AllocationTypes.Commit | AllocationTypes.Reserve,
            MemoryProtections.ExecuteReadWrite);

        //remember the base address: p is moved below, and VirtualFree needs the base
        IntPtr codeBase = p;

        try
        {
            //copy our method bytes to allocated memory
            Marshal.Copy(x86_FastMemCopy_New, 0, p, x86_FastMemCopy_New.Length);

            //make a delegate to our method
            FastMemCopyDelegate _fastmemcopy =
                (FastMemCopyDelegate)Marshal.GetDelegateForFunctionPointer(p,
                    typeof(FastMemCopyDelegate));

            //offset to the end of our method block
            p += x86_FastMemCopy_New.Length;

            //store length param
            p -= 8;
            Marshal.Copy(BitConverter.GetBytes((long)nBytes), 0, p, 4);

            //store destination address param
            p -= 8;
            Marshal.Copy(BitConverter.GetBytes((long)dst), 0, p, 4);

            //store source address param
            p -= 8;
            Marshal.Copy(BitConverter.GetBytes((long)src), 0, p, 4);

            //start stopwatch
            Stopwatch sw = new Stopwatch();
            sw.Start();

            //copy-paste all data 10 times
            for (int i = 0; i < 10; i++)
                _fastmemcopy();

            //stop stopwatch
            sw.Stop();

            //get message with measured time
            System.Windows.Forms.MessageBox.Show(sw.ElapsedTicks.ToString());
        }
        catch (Exception ex)
        {
            //if any exception
            System.Windows.Forms.MessageBox.Show(ex.Message);
        }
        finally
        {
            //free allocated memory: with Release, VirtualFree requires
            //the base address of the allocation and a size of 0
            NativeMethods.VirtualFree(codeBase, 0, FreeTypes.Release);
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        }
    }
    else if (IntPtr.Size == 8)
    {
        throw new ApplicationException("x64 is not supported yet!");
    }
}

The assembly code is represented as a byte array with comments:

private static byte[] x86_FastMemCopy_New = new byte[]
{
    0x90, //nop do nothing
    0x60, //pushad save all general-purpose registers on the stack
    0x95, //xchg ebp, eax eax contains the memory address of our method
    0x8B, 0xB5, 0x5A, 0x01, 0x00, 0x00, //mov esi,[ebp][00000015A] get source buffer address
    0x89, 0xF0, //mov eax,esi
    0x83, 0xE0, 0x0F, //and eax,00F will check if it is 16 byte aligned
    0x8B, 0xBD, 0x62, 0x01, 0x00, 0x00, //mov edi,[ebp][000000162] get destination address
    0x89, 0xFB, //mov ebx,edi
    0x83, 0xE3, 0x0F, //and ebx,00F will check if it is 16 byte aligned
    0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] get number of bytes to copy
    0xC1, 0xE9, 0x07, //shr ecx,7 divide length by 128
    0x85, 0xC9, //test ecx,ecx check if zero
    0x0F, 0x84, 0x1C, 0x01, 0x00, 0x00, //jz 000000146 ↓ copy the rest
    0x0F, 0x18, 0x06, //prefetchnta [esi] pre-fetch non-temporal source data for reading
    0x85, 0xC0, //test eax,eax check if source address is 16 byte aligned
    0x0F, 0x84, 0x8B, 0x00, 0x00, 0x00, //jz 0000000C0 ↓ go to copy if aligned

    0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] pre-fetch more source data
    0x0F, 0x10, 0x06, //movups xmm0,[esi] copy 16 bytes of source data
    0x0F, 0x10, 0x4E, 0x10, //movups xmm1,[esi][010] copy 16 more bytes
    0x0F, 0x10, 0x56, 0x20, //movups xmm2,[esi][020] copy more
    0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] pre-fetch more
    0x0F, 0x10, 0x5E, 0x30, //movups xmm3,[esi][030]
    0x0F, 0x10, 0x66, 0x40, //movups xmm4,[esi][040]
    0x0F, 0x10, 0x6E, 0x50, //movups xmm5,[esi][050]
    0x0F, 0x10, 0x76, 0x60, //movups xmm6,[esi][060]
    0x0F, 0x10, 0x7E, 0x70, //movups xmm7,[esi][070] we've copied 128 bytes of source data
    0x85, 0xDB, //test ebx,ebx check if destination address is 16 byte aligned
    0x74, 0x21, //jz 000000087 ↓ go to paste if aligned
    0x0F, 0x11, 0x07, //movups [edi],xmm0 paste first 16 bytes to non-aligned destination address
    0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 paste more
    0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2
    0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3
    0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4
    0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5
    0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6
    0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 we've pasted 128 bytes of source data
    0xEB, 0x1F, //jmps 0000000A6 ↓ continue
    0x0F, 0x2B, 0x07, //movntps [edi],xmm0 paste first 16 bytes to aligned destination address
    0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 paste more
    0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2
    0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3
    0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4
    0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5
    0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6
    0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 we've pasted 128 bytes of source data
    0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 increment source address by 128
    0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 increment destination address by 128
    0x83, 0xE9, 0x01, //sub ecx,1 decrement counter
    0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 000000035 ↑ continue if not zero
    0xE9, 0x86, 0x00, 0x00, 0x00, //jmp 000000146 ↓ go to copy the rest of data

    0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] pre-fetch source data
    0x0F, 0x28, 0x06, //movaps xmm0,[esi] copy 128 bytes from aligned source address
    0x0F, 0x28, 0x4E, 0x10, //movaps xmm1,[esi][010] copy more
    0x0F, 0x28, 0x56, 0x20, //movaps xmm2,[esi][020]
    0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] pre-fetch more data
    0x0F, 0x28, 0x5E, 0x30, //movaps xmm3,[esi][030]
    0x0F, 0x28, 0x66, 0x40, //movaps xmm4,[esi][040]
    0x0F, 0x28, 0x6E, 0x50, //movaps xmm5,[esi][050]
    0x0F, 0x28, 0x76, 0x60, //movaps xmm6,[esi][060]
    0x0F, 0x28, 0x7E, 0x70, //movaps xmm7,[esi][070] we've copied 128 bytes of source data
    0x85, 0xDB, //test ebx,ebx check if destination address is 16 byte aligned
    0x74, 0x21, //jz 000000112 ↓ go to paste if aligned
    0x0F, 0x11, 0x07, //movups [edi],xmm0 paste 16 bytes to non-aligned destination address
    0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 paste more
    0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2
    0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3
    0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4
    0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5
    0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6
    0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 we've pasted 128 bytes of data
    0xEB, 0x1F, //jmps 000000131 ↓ continue copy-paste
    0x0F, 0x2B, 0x07, //movntps [edi],xmm0 paste 16 bytes to aligned destination address
    0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 paste more
    0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2
    0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3
    0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4
    0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5
    0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6
    0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 we've pasted 128 bytes of data
    0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 increment source address by 128
    0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 increment destination address by 128
    0x83, 0xE9, 0x01, //sub ecx,1 decrement counter
    0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 0000000C0 ↑ continue copy-paste if non-zero

    0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] get number of bytes to copy
    0x83, 0xE1, 0x7F, //and ecx,07F get the remaining number of bytes
    0x85, 0xC9, //test ecx,ecx check if there are bytes
    0x74, 0x02, //jz 000000155 ↓ exit if there are no more bytes
    0xF3, 0xA4, //rep movsb copy the rest of the bytes
    0x0F, 0xAE, 0xF8, //sfence performs a serializing operation on all store-to-memory instructions
    0x61, //popad restore the general-purpose registers
    0xC3, //retn return from our method to C#

    0x00, 0x00, 0x00, 0x00, //source buffer address
    0x00, 0x00, 0x00, 0x00,

    0x00, 0x00, 0x00, 0x00, //destination buffer address
    0x00, 0x00, 0x00, 0x00,

    0x00, 0x00, 0x00, 0x00, //number of bytes to copy-paste
    0x00, 0x00, 0x00, 0x00
};
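The three parameters live in 8-byte slots at the very end of the array, which is why FastMemoryCopy walks backwards from the end in steps of 8. By my count the listing is 370 bytes long: 0x15A (346) bytes of instructions followed by the three slots, which matches the displacements 0x15A, 0x162, and 0x16A in the `mov` instructions. The following minimal sketch computes those offsets; `ParamSlotSketch` and its members are hypothetical names, and the 370-byte length is my own count rather than a figure from the article.

```csharp
using System;

static class ParamSlotSketch
{
    // Each parameter slot is 8 bytes wide; the 32-bit code reads only the low 4 bytes.
    const int SlotSize = 8;

    // Offsets of the slots, counted back from the end of the code array.
    public static int SourceOffset(int codeLength)      { return codeLength - 3 * SlotSize; }
    public static int DestinationOffset(int codeLength) { return codeLength - 2 * SlotSize; }
    public static int LengthOffset(int codeLength)      { return codeLength - 1 * SlotSize; }

    // Writes a 32-bit value into the array at the given offset,
    // in the platform byte order (little-endian on x86).
    public static void WriteParam(byte[] code, int offset, int value)
    {
        byte[] bytes = BitConverter.GetBytes(value);
        Buffer.BlockCopy(bytes, 0, code, offset, 4);
    }
}
```

If the machine code is ever edited, the displacements inside the `mov` instructions have to be kept in step with these offsets, because they are hard-coded in the bytes.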

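We call the assembly method through the delegate created earlier.

The method currently works only in 32-bit mode; I will implement a 64-bit version later.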
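Until a 64-bit version of the machine code exists, a caller could fall back to a plain managed copy when `IntPtr.Size == 8` instead of throwing. The following is a minimal sketch of such a fallback; `FallbackCopySketch` is a hypothetical name and not part of the original code. It stages the data through a managed buffer with `Marshal.Copy`, so it should not be expected to match the speed of the SSE routine.

```csharp
using System;
using System.Runtime.InteropServices;

static class FallbackCopySketch
{
    // Copies nBytes from src to dst in chunks through a managed staging buffer.
    // Works in both 32-bit and 64-bit processes.
    public static void Copy(IntPtr src, IntPtr dst, int nBytes)
    {
        if (nBytes <= 0)
            return;

        byte[] chunk = new byte[Math.Min(nBytes, 64 * 1024)];
        int done = 0;
        while (done < nBytes)
        {
            int count = Math.Min(chunk.Length, nBytes - done);
            // unmanaged source -> managed chunk
            Marshal.Copy(IntPtr.Add(src, done), chunk, 0, count);
            // managed chunk -> unmanaged destination
            Marshal.Copy(chunk, 0, IntPtr.Add(dst, done), count);
            done += count;
        }
    }
}
```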

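Anyone who is interested can add it to the source code (the article contains almost all of the code).

While implementing and testing this method, I found that the prefetchnta instruction is not described very clearly, even in Intel's manual, so I tried to figure it out myself and with the help of Google. Pay attention to the descriptions of movntps and movaps: they work only with memory addresses aligned to 16 bytes.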

Original English article: C# - Fast memory copy method with x86 assembly usage

Translation source: http://www.oschina.net/translate/csharp-fast-memory-copy-method-with-x-assembly-usa
