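PDFium is Chromium's PDF rendering engine, released under the BSD 3-Clause license. It is based on rendering code from Foxit Software (福昕软件), which Google open-sourced in cooperation with Foxit.
Functions whose declarations are marked with FPDF_EXPORT can be called from C# through P/Invoke wrappers.
Wrapping a function (using FPDF_LoadMemDocument as an example):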
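    // Load a PDF (native declaration)
    FPDF_EXPORT FPDF_DOCUMENT FPDF_CALLCONV
    FPDF_LoadMemDocument(const void* data_buf, int size, FPDF_BYTESTRING password);

    // Corresponding C# wrapper
    // Note: depending on the parameter type, use [MarshalAs(...)] to tell the
    // runtime how to marshal the parameter or field to unmanaged code.
    // PDFium does not copy data_buf: the buffer must stay valid (for example,
    // pinned with GCHandle) for as long as the document is open.
    [DllImport("pdfium", EntryPoint = "FPDF_LoadMemDocument", CallingConvention = CallingConvention.Cdecl)]
    internal static extern FpdfDocumentT FPDF_LoadMemDocument(
        [MarshalAs(UnmanagedType.LPArray)] byte[] data_buf,
        int size,
        [MarshalAs(UnmanagedType.LPStr)] string password);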
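Common DllImport parameters:
- CallingConvention: the calling convention of the native function. The common values are CallingConvention.Cdecl and CallingConvention.StdCall.
- CharSet: the character set used for string parameters: CharSet.Ansi, CharSet.Unicode, or CharSet.Auto.
- ExactSpelling: if true, the entry point name is used exactly as written. If false, the runtime may append a suffix ("A" or "W") according to the character set.
- SetLastError: if true, indicates that the called function sets the last error code through SetLastError; the runtime preserves it so that it can be read with Marshal.GetLastWin32Error. This is especially useful for Windows API functions.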
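C++ to C# data type mapping:
C++ | C# | Notes |
bool | bool | C++ bool is 1 byte; C# bool is marshaled as a 4-byte BOOL by default, so mark it with [MarshalAs(UnmanagedType.U1)] when the native type is a real C++ bool |
char | char | 8 bits in C++, 16 bits in C#; use byte or sbyte for raw 8-bit data |
wchar_t | char | UTF-16 encoded on Windows |
int | int | |
long | int | 32 bits on Windows, so it corresponds to C# int; on 64-bit Linux and macOS it is 64 bits and corresponds to C# long |
long long | long | |
float | float | |
double | double | |
std::string | string | Crosses the boundary as char*; read it with Marshal.PtrToStringAnsi(ptr) |
std::wstring | string | Crosses the boundary as wchar_t*; read it with Marshal.PtrToStringUni(wptr) |
unsigned char | byte | |
unsigned short | ushort | |
unsigned int | uint | |
unsigned long | uint | 32 bits on Windows; on 64-bit Linux and macOS it is 64 bits and corresponds to C# ulong |
unsigned long long | ulong | |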
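The managed side of the table can be checked directly, because the sizes of the C# built-in types are fixed by the language. The following is a minimal sketch; the class name TypeSizes is only illustrative and is not part of the wrapper described in this post.

```csharp
using System;

public static class TypeSizes
{
    // sizeof on built-in types is a compile-time constant in C#.
    public const int IntBytes = sizeof(int);    // matches a 32-bit C++ int
    public const int LongBytes = sizeof(long);  // matches C++ long long, not Windows long
    public const int CharBytes = sizeof(char);  // one UTF-16 code unit
    public const int BoolBytes = sizeof(bool);  // managed size; marshaled size may differ

    public static void Main()
    {
        Console.WriteLine($"int={IntBytes} long={LongBytes} char={CharBytes} bool={BoolBytes}");
        // Pointer size depends on the process bitness (4 or 8).
        Console.WriteLine($"pointer={IntPtr.Size}");
    }
}
```

The first line printed is `int=4 long=8 char=2 bool=1`. The native side is what varies: C++ long and unsigned long are the usual source of mismatches between Windows and 64-bit Unix builds. On .NET 6 and later, the CLong and CULong types in System.Runtime.InteropServices exist for this case, if targeting that runtime is an option.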
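C# definitions for pointers returned by C++:
- Each kind of object returned by C++ gets its own C# struct type that wraps the pointer. A typed handle is easier to recognize than a bare IntPtr.
- Each struct implements the IHandle interface so that releasing memory can be handled uniformly.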
    using System;
    using System.Runtime.InteropServices;
    using System.Threading;

    /// <summary>Handle to a FpdfDocumentT</summary>
    [StructLayout(LayoutKind.Sequential)]
    public struct FpdfDocumentT : IHandle<FpdfDocumentT>
    {
        IntPtr _ptr;

        /// <summary>Gets a value indicating whether the handle is <c>null</c>.</summary>
        public bool IsNull => _ptr == IntPtr.Zero;

        public override string ToString() => "FpdfDocumentT: 0x" + _ptr.ToString("X16");

        /// <summary>Gets a handle representing <c>null</c>.</summary>
        public static FpdfDocumentT Null => new FpdfDocumentT();

        FpdfDocumentT(IntPtr ptr) { _ptr = ptr; }

        // Atomically clears this handle and returns a copy that still holds the old pointer.
        FpdfDocumentT IHandle<FpdfDocumentT>.SetToNull()
            => new FpdfDocumentT(Interlocked.Exchange(ref _ptr, IntPtr.Zero));
    }

    public interface IHandle<T>
    {
        bool IsNull { get; }
        T SetToNull();
    }
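One way the SetToNull pattern can be used for uniform release is sketched below. DemoHandle and the Release helper are hypothetical names introduced only for this illustration; in a real wrapper the close callback would call the matching native function, such as FPDF_CloseDocument for a document handle.

```csharp
using System;
using System.Threading;

public interface IHandle<T>
{
    bool IsNull { get; }
    T SetToNull();
}

// Hypothetical handle type that follows the same pattern as FpdfDocumentT.
public struct DemoHandle : IHandle<DemoHandle>
{
    IntPtr _ptr;
    public DemoHandle(IntPtr ptr) { _ptr = ptr; }
    public bool IsNull => _ptr == IntPtr.Zero;
    public DemoHandle SetToNull()
        => new DemoHandle(Interlocked.Exchange(ref _ptr, IntPtr.Zero));
}

public static class HandleDemo
{
    // Hypothetical helper: releases a handle at most once, whatever its concrete type.
    public static bool Release<T>(ref T handle, Action<T> close) where T : struct, IHandle<T>
    {
        T old = handle.SetToNull();   // the caller's handle is now null
        if (old.IsNull) return false; // already released or never assigned
        close(old);                   // the copy still carries the original pointer
        return true;
    }

    public static void Main()
    {
        int closeCalls = 0;
        var h = new DemoHandle(new IntPtr(0x1234)); // fake pointer value, never dereferenced
        bool first = Release(ref h, _ => closeCalls++);
        bool second = Release(ref h, _ => closeCalls++);
        Console.WriteLine($"{first} {second} {closeCalls} {h.IsNull}");
    }
}
```

This prints `True False 1 True`: the second call finds the handle already cleared, so the close callback runs only once.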
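The mapping between the functions PDFium exposes and their C# wrappers is shown in the figure below:
Cross-platform support:
PDFium is built separately for each platform, so the wrapper has to choose the right binary for the environment it runs in (tested only on Windows and Ubuntu 22.04.2).
Detecting the operating system: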
    private static Platforms? _currentPlatform;

    public static Platforms CurrentPlatform
    {
        get
        {
            if (_currentPlatform != null) return _currentPlatform.Value;
    #if NET5_0_OR_GREATER
            // Probe well-known OS markers first.
            string windir = Environment.GetEnvironmentVariable("windir");
            if (!string.IsNullOrEmpty(windir) && windir.Contains("\\") && Directory.Exists(windir))
            {
                _currentPlatform = Platforms.Windows;
            }
            else if (File.Exists("/proc/sys/kernel/ostype"))
            {
                if (File.ReadAllText("/proc/sys/kernel/ostype").StartsWith("Linux", StringComparison.OrdinalIgnoreCase))
                    _currentPlatform = Platforms.Linux;
            }
            else if (File.Exists("/System/Library/CoreServices/SystemVersion.plist"))
            {
                _currentPlatform = Platforms.OSX;
            }
            if (_currentPlatform != null) return _currentPlatform.Value;

            // Fall back to what the runtime reports.
            if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
                _currentPlatform = Platforms.Linux;
            else if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX))
                _currentPlatform = Platforms.OSX;
            else
                _currentPlatform = Platforms.Windows;
    #else
            _currentPlatform = Platforms.Windows;
    #endif
            return _currentPlatform.Value;
        }
        set { _currentPlatform = value; }
    }

    public enum Platforms
    {
        /// <summary>Represents the Linux operating system.</summary>
        Linux,
        /// <summary>Represents the OSX operating system.</summary>
        OSX,
        /// <summary>Represents the Windows operating system.</summary>
        Windows
    }
Loading PDFium:
    public static IntPtr LoadLibrary(string path)
    {
        if (string.IsNullOrWhiteSpace(path)) return IntPtr.Zero;
        switch (CurrentPlatform)
        {
            case Platforms.Linux: return dlopenLinux(path, 2); // 2 = RTLD_NOW
            case Platforms.OSX:   return dlopenOSX(path, 2);   // 2 = RTLD_NOW
            default:              return LoadLibraryWin(path);
        }
    }

    public static bool FreeLibrary(IntPtr handle)
    {
        if (handle == IntPtr.Zero) return false;
        switch (CurrentPlatform)
        {
            case Platforms.Linux: return dlcloseLinux(handle) == 0;
            case Platforms.OSX:   return dlcloseOSX(handle) == 0;
            default:              return FreeLibraryWin(handle);
        }
    }

    // LoadLibraryW takes a UTF-16 string, so the character set is stated explicitly.
    [DllImport("kernel32", EntryPoint = "LoadLibraryW", SetLastError = true, CharSet = CharSet.Unicode)]
    private static extern IntPtr LoadLibraryWin([MarshalAs(UnmanagedType.LPWStr)] string lpFileName);

    [DllImport("libdl.so.2", EntryPoint = "dlopen")]
    private static extern IntPtr dlopenLinux(string filename, int flags);

    [DllImport("libdl.dylib", EntryPoint = "dlopen")]
    private static extern IntPtr dlopenOSX(string filename, int flags);

    [DllImport("Kernel32.dll", EntryPoint = "FreeLibrary", SetLastError = true)]
    private static extern bool FreeLibraryWin(IntPtr handle);

    [DllImport("libdl.so.2", EntryPoint = "dlclose")]
    private static extern int dlcloseLinux(IntPtr handle);

    [DllImport("libdl.dylib", EntryPoint = "dlclose")]
    private static extern int dlcloseOSX(IntPtr handle);
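As a possible alternative on .NET Core 3.0 and later, the runtime's own System.Runtime.InteropServices.NativeLibrary class hides the per-platform loader calls. The sketch below is not what the wrapper in this post uses; the class name NativeLoader and the library name "pdfium" are placeholders, and a real application would pass the full path of the platform-specific binary.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeLoader
{
    // Returns IntPtr.Zero instead of throwing when the library cannot be loaded.
    public static IntPtr TryLoadLibrary(string path)
    {
        if (string.IsNullOrWhiteSpace(path)) return IntPtr.Zero;
        return NativeLibrary.TryLoad(path, out IntPtr handle) ? handle : IntPtr.Zero;
    }

    // Mirrors the FreeLibrary wrapper: false for a null handle, true after freeing.
    public static bool FreeLibrary(IntPtr handle)
    {
        if (handle == IntPtr.Zero) return false;
        NativeLibrary.Free(handle);
        return true;
    }

    public static void Main()
    {
        // Placeholder name: replace with the path to the PDFium binary for this platform.
        IntPtr handle = TryLoadLibrary("pdfium");
        Console.WriteLine(handle == IntPtr.Zero ? "pdfium not loaded" : "pdfium loaded");
        FreeLibrary(handle);
    }
}
```

The trade-off is the runtime requirement: the hand-written dlopen and LoadLibraryW declarations also work on older frameworks, while NativeLibrary removes the need to detect the platform for loading.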
Problems encountered:
Problem 1: returned bytes decode as garbled text
PDFium functions return strings in one of the following encodings:
- encoded in 7-bit ASCII
FPDF_EXPORT unsigned long FPDF_CALLCONV FPDFAction_GetURIPath(FPDF_DOCUMENT document, FPDF_ACTION action, void* buffer, unsigned long buflen);
    public delegate int GetStringHandler(ref byte buffer, int length);

    // Usage
    GetAsciiString((ref byte buffer, int length) =>
        (int)Internal.FPDFActionGetURIPath(document, action, out buffer, (uint)length));

    public static string GetAsciiString(GetStringHandler handler)
    {
        // First call with an empty buffer returns the required size, terminator included.
        byte b = 0;
        int length = handler(ref b, 0);
        if (length == 0) return null;
        // Second call fills the buffer.
        var buffer = new byte[length];
        handler(ref buffer[0], length);
        return Encoding.ASCII.GetString(buffer, 0, length - 1);
    }
- the |buffer| is always in UTF-8 encoding.
FPDF_EXPORT unsigned long FPDF_CALLCONV FPDFFont_GetFontName(FPDF_FONT font, char* buffer, unsigned long length);
    // Usage
    GetUtf8String((ref byte buffer, int length) =>
        (int)Internal.FPDFFontGetFontName(font, out buffer, (uint)length));

    public static string GetUtf8String(GetStringHandler handler)
    {
        byte b = 0;
        int length = handler(ref b, 0);
        // Without this check, buffer[0] below throws on an empty result.
        if (length == 0) return null;
        var buffer = new byte[length];
        handler(ref buffer[0], length);
        return Encoding.UTF8.GetString(buffer, 0, length - 1);
    }
- the |buffer| is always in UTF-16LE encoding
FPDF_EXPORT unsigned long FPDF_CALLCONV FPDFBookmark_GetTitle(FPDF_BOOKMARK bookmark, void* buffer, unsigned long buflen);
    // Usage
    GetUtf16String((ref byte buffer, int length) =>
        (int)Internal.FPDFBookmarkGetTitle(bookmark, out buffer, (uint)length), sizeof(byte), true);

    public static string GetUtf16String(GetStringHandler handler, int lengthUnit, bool lengthIncludesTerminator)
    {
        byte b = 0;
        int length = handler(ref b, 0);
        if (length == 0) return null;
        var buffer = new byte[length * lengthUnit];
        handler(ref buffer[0], length);
        length *= lengthUnit;
        // The UTF-16 terminator is two bytes.
        if (lengthIncludesTerminator) length -= 2;
        return Encoding.Unicode.GetString(buffer, 0, length);
    }
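The two-call pattern shared by the three helpers above can be tried without the native library. In the sketch below, FakeGetTitle is a hypothetical stand-in for a PDFium getter such as FPDFBookmark_GetTitle: it is assumed to return the required size in bytes, terminator included, and to fill the buffer only when the buffer is large enough.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class TwoCallDemo
{
    public delegate int GetStringHandler(ref byte buffer, int length);

    // Hypothetical stand-in for a native getter that writes UTF-16LE text.
    public static int FakeGetTitle(string title, ref byte buffer, int length)
    {
        byte[] data = Encoding.Unicode.GetBytes(title + "\0");
        if (length >= data.Length)
        {
            // View the caller's memory as a span and copy the bytes into it.
            Span<byte> target = MemoryMarshal.CreateSpan(ref buffer, length);
            data.AsSpan().CopyTo(target);
        }
        return data.Length; // required size in bytes, terminator included
    }

    public static string GetUtf16String(GetStringHandler handler)
    {
        byte probe = 0;
        int byteCount = handler(ref probe, 0);        // call 1: ask for the size
        if (byteCount <= 2) return string.Empty;      // nothing but the terminator
        var buffer = new byte[byteCount];
        handler(ref buffer[0], byteCount);            // call 2: fill the buffer
        return Encoding.Unicode.GetString(buffer, 0, byteCount - 2); // drop the 2-byte terminator
    }

    public static void Main()
    {
        string title = GetUtf16String(
            (ref byte buf, int len) => FakeGetTitle("Chapter 1", ref buf, len));
        Console.WriteLine(title);
    }
}
```

Decoding the same bytes with Encoding.ASCII or Encoding.UTF8 instead of Encoding.Unicode is what produces the garbled output described in this problem, so the decoder has to follow the encoding stated in each function's header comment.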
Problem 2: extracting data from an IntPtr
Example: a pointer to an array of structs
    /*
     * Function: FPDF_GetDefaultTTFMap
     *   Returns a pointer to the default character set to TT Font name map. The
     *   map is an array of FPDF_CharsetFontMap structs, with its end indicated
     *   by a { -1, NULL } entry.
     * Parameters:
     *   None.
     * Return Value:
     *   Pointer to the Charset Font Map.
     */
    FPDF_EXPORT const FPDF_CharsetFontMap* FPDF_CALLCONV FPDF_GetDefaultTTFMap();
    public static FpdfCharsetFontMap[] GetDefaultTTFMaps()
    {
        var ptr = FPDFGetDefaultTTFMap();
        var result = new List<FpdfCharsetFontMap>();
        if (ptr == IntPtr.Zero) return result.ToArray();

        var size = Marshal.SizeOf(typeof(FpdfCharsetFontMap));
        for (int i = 0; ; i++)
        {
            // Address of the i-th element of the native array.
            IntPtr current = IntPtr.Add(ptr, i * size);
            var element = (FpdfCharsetFontMap)Marshal.PtrToStructure(current, typeof(FpdfCharsetFontMap));
            // The array ends with a { -1, NULL } entry.
            if (element.Charset == -1) break;
            result.Add(element);
        }
        return result.ToArray();
    }
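The pointer walk can be exercised without PDFium by building a small sentinel-terminated array in unmanaged memory and reading it back. In this sketch, CharsetFontMap is an assumed mirror of the native struct (an int followed by a const char*), and RoundTrip is a hypothetical helper that exists only for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Assumed mirror of FPDF_CharsetFontMap: { int charset; const char* fontname; }
[StructLayout(LayoutKind.Sequential)]
public struct CharsetFontMap
{
    public int Charset;
    public IntPtr Fontname;
}

public static class SentinelArrayDemo
{
    // Reads structs starting at 'first' until an entry with Charset == -1 is found.
    public static List<CharsetFontMap> ReadUntilSentinel(IntPtr first)
    {
        var result = new List<CharsetFontMap>();
        if (first == IntPtr.Zero) return result;
        int size = Marshal.SizeOf<CharsetFontMap>();
        for (int i = 0; ; i++)
        {
            var item = Marshal.PtrToStructure<CharsetFontMap>(IntPtr.Add(first, i * size));
            if (item.Charset == -1) break;
            result.Add(item);
        }
        return result;
    }

    // Hypothetical helper: writes the given charsets plus a sentinel, then reads them back.
    public static List<int> RoundTrip(params int[] charsets)
    {
        int size = Marshal.SizeOf<CharsetFontMap>();
        IntPtr block = Marshal.AllocHGlobal(size * (charsets.Length + 1));
        try
        {
            for (int i = 0; i < charsets.Length; i++)
                Marshal.StructureToPtr(new CharsetFontMap { Charset = charsets[i] },
                                       IntPtr.Add(block, i * size), false);
            Marshal.StructureToPtr(new CharsetFontMap { Charset = -1 },
                                   IntPtr.Add(block, charsets.Length * size), false);
            return ReadUntilSentinel(block).ConvertAll(m => m.Charset);
        }
        finally
        {
            Marshal.FreeHGlobal(block); // this memory was allocated here, so it is freed here
        }
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", RoundTrip(0, 128, 134)));
    }
}
```

This prints `0, 128, 134`. Using Marshal.SizeOf for the stride matters because the struct contains a pointer, so its size and padding differ between 32-bit and 64-bit processes.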
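Problem 3: "Attempted to read or write protected memory. This is often an indication that other memory is corrupt."
This error is usually caused by a mismatched data type, for example a C++ parameter of type int that is declared as long in C#.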
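Problem 4: memory leaks
- Large objects: objects of 85,000 bytes or more live on the large object heap, which is collected only during generation 2 collections. If the application allocates large objects but few small ones, generation 2 collections run rarely, so large objects that are no longer referenced can stay allocated for a long time and memory use keeps growing.
- Unmanaged resources follow the rule that whoever creates a resource releases it. For example, a handle created on behalf of managed code must be released by managed code.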
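One common way to enforce the "creator releases" rule in .NET is to wrap the resource in a SafeHandle, so that release happens exactly once on Dispose or, as a fallback, on finalization. The sketch below is an assumption about how that could look rather than the approach used in this post; UnmanagedBuffer is a hypothetical type, and Marshal.AllocHGlobal stands in for a native function that creates a handle.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical wrapper that owns one block of unmanaged memory.
public sealed class UnmanagedBuffer : SafeHandle
{
    static int _liveCount;
    public static int LiveCount => _liveCount;

    public UnmanagedBuffer(int bytes) : base(IntPtr.Zero, ownsHandle: true)
    {
        SetHandle(Marshal.AllocHGlobal(bytes)); // created here, so released here
        Interlocked.Increment(ref _liveCount);
    }

    public override bool IsInvalid => handle == IntPtr.Zero;

    // Called by the runtime once, from Dispose or from the finalizer.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        Interlocked.Decrement(ref _liveCount);
        return true;
    }
}

public static class OwnershipDemo
{
    public static void Main()
    {
        using (var buffer = new UnmanagedBuffer(1024))
        {
            IntPtr p = buffer.DangerousGetHandle();
            Marshal.WriteInt32(p, 42);
            Console.WriteLine(Marshal.ReadInt32(p)); // 42
        }
        Console.WriteLine(UnmanagedBuffer.LiveCount); // 0: the block was freed by Dispose
    }
}
```

For PDFium handles the same shape would apply with the allocation and release calls replaced by the matching pair of native functions, for example a load function and its close function.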
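Problem 5: transformation matrices
PDF represents its content in a two-dimensional coordinate system. The coordinates of each point can be written as a vector: (x, y). A transformation matrix changes the default coordinate system and maps the original coordinates (x, y) into the new coordinate system: (x', y'). Depending on how the coordinate system is changed, this effectively rotates, scales, moves (translates), or shears the object.
Taking translation as an example:
Translation moves the coordinate system by a given offset. The operation produces a new coordinate system that is shifted by e along the x axis and by f along the y axis.
The coordinates of a point in the original coordinate system are (240 651 1). We want to translate the coordinate system 10 points to the left and 20 points up. The required transformation matrix therefore uses e = -10 and f = 20, and the resulting coordinates are (230 671 1).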
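The translation can be written out as a matrix product. This is a reconstruction that assumes the PDF convention of row vectors multiplied by a 3×3 matrix whose last row holds the offsets, with x growing to the right and y growing upward, so that "10 to the left" means e = -10 and "20 up" means f = 20.

```latex
\[
T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ e & f & 1 \end{pmatrix},
\qquad
\begin{pmatrix} x' & y' & 1 \end{pmatrix}
= \begin{pmatrix} x & y & 1 \end{pmatrix} T
= \begin{pmatrix} x + e & y + f & 1 \end{pmatrix}
\]

\[
\begin{pmatrix} 240 & 651 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -10 & 20 & 1 \end{pmatrix}
= \begin{pmatrix} 230 & 671 & 1 \end{pmatrix}
\]
```

The third component stays 1, which is what lets a translation be expressed as a matrix multiplication at all.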
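As the figure below shows, the coordinates changed as planned. All other pixels of the image are transformed in the same way.
Diagram of the translated coordinates: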
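Problem 6: embedding fonts
PDFium loads a font and embeds it in the PDF through the FPDFTextLoadFont wrapper (FPDFText_LoadFont). However, PDFium provides no function that embeds only the characters actually used, so every byte of the font is embedded and the edited PDF file becomes very large.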
    [DllImport("pdfium", EntryPoint = "FPDFText_LoadFont", CallingConvention = CallingConvention.Cdecl)]
    internal static extern FpdfFontT FPDFTextLoadFont(
        FpdfDocumentT document,
        [MarshalAs(UnmanagedType.LPArray)] byte[] data,
        uint size,
        uint font_type,
        bool cid);
The workaround is to create a font subset so that only the glyphs actually used are embedded:
    // Requires: System, System.Collections.Generic, System.IO, System.Windows.Media (WPF)

    /// <summary>
    /// Creates a font subset that contains only the glyphs used by the given text.
    /// </summary>
    /// <param name="fontPath">Path of the font file.</param>
    /// <param name="sourceText">Text whose characters must be kept.</param>
    /// <returns>The bytes of the subset font.</returns>
    public static byte[] CreateSubSet(this string fontPath, string sourceText)
    {
        if (!File.Exists(fontPath))
            throw new ArgumentException($"{fontPath} not found");

        var glyphTypeface = new GlyphTypeface(new Uri(fontPath, UriKind.RelativeOrAbsolute));
        var glyphIndexes = new HashSet<ushort>();
        foreach (char c in sourceText)
        {
            // Skip characters the font has no glyph for; the indexer would throw KeyNotFoundException.
            if (glyphTypeface.CharacterToGlyphMap.TryGetValue(c, out ushort glyphIndex))
                glyphIndexes.Add(glyphIndex);
        }
        return glyphTypeface.ComputeSubset(glyphIndexes);
    }
That's all for this post.