code

Monday, 24 July 2017

Pyramid TIFF writing.

Pyramid TIFFs are not well defined.

The following is my approach to this (Works with QGIS). Some of this was found by reading the GDAL GeoTIFF driver the rest by trial and error.

When we open a TIFF file for writing the first TIFF directory is created.

Before we write any layers out we record where that layer is:

const toff_t nBaseDirOffset = TIFFCurrentDirOffset(tif);

We then write out each layer. The first layer is the highest resolution with each subsequent layer a reduced resolution layer.

This is the start of the loop for writing out pyramid layers.

For the reduced resolution layers we write out the following (we don't write this bit out for the highest resolution layer):

// This is a bit of a hack to cause (*tif->tif_cleanup)(tif); to be called.
// See https://trac.osgeo.org/gdal/ticket/2055
TIFFSetField(tif, TIFFTAG_COMPRESSION, COMPRESSION_NONE);

// Create a new directory entry - this is done for the reduced resolution images.
TIFFCreateDirectory(tif);
TIFFSetField(tif, TIFFTAG_SUBFILETYPE, FILETYPE_REDUCEDIMAGE);

The basic TIFF settings:
TIFFSetField(tif, TIFFTAG_IMAGEWIDTH, width);
TIFFSetField(tif, TIFFTAG_IMAGELENGTH, height);
TIFFSetField(tif, TIFFTAG_BITSPERSAMPLE, 8);
TIFFSetField(tif, TIFFTAG_SAMPLESPERPIXEL, bpp);
TIFFSetField(tif, TIFFTAG_COMPRESSION, COMPRESSION_LZW);
TIFFSetField(tif, TIFFTAG_ORIENTATION, ORIENTATION_TOPLEFT);
TIFFSetField(tif, TIFFTAG_PLANARCONFIG, PLANARCONFIG_CONTIG);
TIFFSetField(tif, TIFFTAG_RESOLUTIONUNIT, RESUNIT_NONE);
TIFFSetField(tif, TIFFTAG_PHOTOMETRIC, PHOTOMETRIC_RGB);
Write out the basic pixel sizing and extent information.
Write out GeoTIFF tags and metadata
Write out tile size information.
Write the tile data however may times you need too.
After writing the tile data we need to write out the TIFF directory:
TIFFWriteDirectory(tif);
For simplicity set the top level Directory:
TIFFSetSubDirectory(tif, nBaseDirOffset);

Write the next level out by going to the start of the loop (above).

Tuesday, 13 June 2017

Mesa GLSL i965 inaccurate pow(x) function (pre 12.1 versions)

The pow(x) method appears to not be particularly accurate on HD4600 running Mesa 10.2.7.

What I am seeing is the following:

What I should be seeing is:

The customer is unable to upgrade so a software pow implementation was required.

The software version is based on code by Sun Microsystems, as such the copyright is reproduced here for the code.

/*
 * ====================================================
 * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
 *
 * Developed at SunPro, a Sun Microsystems, Inc. business.
 * Permission to use, copy, modify, and distribute this
 * software is freely granted, provided that this notice
 * is preserved.
 * ====================================================
 */

A copysignf function is required to be implemented:
float copysignf(in float x, in float y)
{
 uint ix = floatBitsToUint(x);
 uint iy = floatBitsToUint(y);
 return uintBitsToFloat((ix&uint(0x7fffffff))|(iy&uint(0x80000000)));
}

The scalebnf function is required as well:
float scalbnf (in float x, in int n)
{
  const float
  two25   =  3.355443200e+07, /* 0x4c000000 */
  twom25  =  2.9802322388e-08, /* 0x33000000 */
  huge   = 1.0e+30,
  tiny   = 1.0e-30;

  int ix = floatBitsToInt(x);
  int k = (ix&0x7f800000)>>23;  /* extract exponent */
  if (k==0) {    /* 0 or subnormal x */
    if ((ix&0x7fffffff)==0) return x; /* +-0 */
    x *= two25;
    ix = floatBitsToInt(x);
    k = ((ix&0x7f800000)>>23) - 25;
    if (n< -50000) return tiny*x;  /*underflow*/
  }
  if (k==0xff) return x+x;  /* NaN or Inf */
  k = k+n;
  if (k >  0xfe) return huge*copysignf(huge,x); /* overflow  */
  if (k > 0)     /* normal result */
  {
    x = intBitsToFloat((ix&0x807fffff)|(k<<23));
    return x;
  }
  if (k <= -25) {
    if (n > 50000)  /* in case integer overflow in n+k */
      return huge*copysignf(huge,x); /*overflow*/
    else return tiny*copysignf(tiny,x); /*underflow*/
  }
  k += 25;    /* subnormal result */
  x = intBitsToFloat((ix&0x807fffff)|(k<<23));
  return x*twom25;
}

Finally the powf function:
float powf(in float x, in float y)
{
  const float huge = 1.0e+30, tiny = 1.0e-30;
  const float
  zero    =  0.0,
  one  =  1.0,
  two  =  2.0,
  two24  =  16777216.0, //0x1.0p+24, /* 16777216.0, 0x4b800000 */
    /* poly coefs for (3/2)*(log(x)-2s-2/3*s**3 */
  L1  =  6.0000002384e-01, /* 0x3f19999a */
  L2  =  4.2857143283e-01, /* 0x3edb6db7 */
  L3  =  3.3333334327e-01, /* 0x3eaaaaab */
  L4  =  2.7272811532e-01, /* 0x3e8ba305 */
  L5  =  2.3066075146e-01, /* 0x3e6c3255 */
  L6  =  2.0697501302e-01, /* 0x3e53f142 */
  P1   =  1.6666667163e-01, /* 0x3e2aaaab */
  P2   = -2.7777778450e-03, /* 0xbb360b61 */
  P3   =  6.6137559770e-05, /* 0x388ab355 */
  P4   = -1.6533901999e-06, /* 0xb5ddea0e */
  P5   =  4.1381369442e-08, /* 0x3331bb4c */
  lg2  =  6.9314718246e-01, /* 0x3f317218 */
  lg2_h  =  6.93145752e-01, /* 0x3f317200 */
  lg2_l  =  1.42860654e-06, /* 0x35bfbe8c */
  ovt =  4.2995665694e-08, /* -(128-log2(ovfl+.5ulp)) */
  cp    =  9.6179670095e-01, /* 0x3f76384f =2/(3ln2) */
  cp_h  =  9.6179199219e-01, /* 0x3f763800 =head of cp */
  cp_l  =  4.7017383622e-06, /* 0x369dc3a0 =tail of cp_h */
  ivln2    =  1.4426950216e+00, /* 0x3fb8aa3b =1/ln2 */
  ivln2_h  =  1.4426879883e+00, /* 0x3fb8aa00 =16b 1/ln2*/
  ivln2_l  =  7.0526075433e-06; /* 0x36eca570 =1/ln2 tail*/

  float bp[2];
  bp[0] = 1.0;
  bp[1] = 1.5;
  float dp_h[2], dp_l[2];

  dp_h[0] = 0.0;
  dp_h[1] = 5.84960938e-01; /* 0x3f15c000 */
  dp_l[0] = 0.0;
  dp_l[1] = 1.56322085e-06; /* 0x35d1cfdc */

  float z,ax,z_h,z_l,p_h,p_l;
  float y1,t1,t2,r,s,t,u,v,w;
  int i,j,k,yisint,n;
  int hx,hy,ix,iy,is;

  hx = floatBitsToInt(x);
  hy = floatBitsToInt(y);
  ix = hx&0x7fffffff;
  iy = hy&0x7fffffff;

  /* y==zero: x**0 = 1 */
  if(iy==0)
    return one;

  /* +-NaN return x+y */
  if(ix > int(0x7f800000) ||
    iy > int(0x7f800000))
  return x+y;

  /* determine if y is an odd int when x < 0
   * yisint = 0  ... y is not an integer
   * yisint = 1  ... y is an odd int
   * yisint = 2  ... y is an even int
   */
  yisint  = 0;
  if(hx<0) {
    if(iy>=0x4b800000)
      yisint = 2; /* even integer y */
    else if(iy>=0x3f800000) {
      k = (iy>>23)-0x7f;     /* exponent */
      j = iy>>(23-k);
      if((j<<(23-k))==iy) yisint = 2-(j&1);
    }
  }

  /* special value of y */
  if (iy==0x7f800000) {  /* y is +-inf */
    if (ix==0x3f800000)
      return  y - y;  /* inf**+-1 is NaN */
    else if (ix > 0x3f800000)/* (|x|>1)**+-inf = inf,0 */
      return (hy>=0)? y: zero;
    else      /* (|x|<1)**-,+inf = inf,0 */
      return (hy<0)?-y: zero;
  }
  if(iy==0x3f800000) {  /* y is  +-1 */
    if(hy<0)
      return one/x; else return x;
  }
  if(hy==0x40000000) return x*x; /* y is  2 */
  if(hy==0x3f000000) {  /* y is  0.5 */
    if(hx>=0)  /* x >= +0 */
      return sqrt(x);
  }

  ax   = abs(x);
  /* special value of x */
  if(ix==0x7f800000||ix==0||ix==0x3f800000){
    z = ax;      /*x is +-0,+-inf,+-1*/
    if(hy<0) z = one/z;  /* z = (1/|x|) */
    if(hx<0) {
      if(((ix-0x3f800000)|yisint)==0) {
        z = (z-z)/(z-z); /* (-1)**non-int is NaN */
      } else if(yisint==1)
        z = -z;    /* (x<0)**odd = -(|x|**odd) */
    }
    return z;
  }

  uint hX = uint(hx);
  hX = hX >> 31;
  int nN = int(hX) -1;

  /* (x<0)**(non-int) is NaN */
  if((nN|yisint)==0) return (x-x)/(x-x);

  /* |y| is huge */
  if(iy>0x4d000000) { /* if |y| > 2**27 */
    /* over/underflow if x is not close to one */
    if(ix<0x3f7ffff8) return (hy<0)? huge*huge:tiny*tiny;
    if(ix>0x3f800007) return (hy>0)? huge*huge:tiny*tiny;
    /* now |1-x| is tiny <= 2**-20, suffice to compute
     log(x) by x-x^2/2+x^3/3-x^4/4 */
    t = x-1;    /* t has 20 trailing zeros */
    w = (t*t)*(0.5-t*(0.333333333333-t*0.25));
    u = ivln2_h*t;  /* ivln2_h has 16 sig. bits */
    v = t*ivln2_l-w*ivln2;
    t1 = u+v;
    is = floatBitsToInt(t1);
    t1 = intBitsToFloat(is&0xfffff000);
    t2 = v-(t1-u);
  } else {
    float s2,s_h,s_l,t_h,t_l;
    n = 0;
    /* take care subnormal number */
    if(ix<0x00800000)
    {
      ax *= two24; n -= 24; ix = floatBitsToInt(ax);
    }
    n  += ((ix)>>23)-0x7f;
    j  = ix&0x007fffff;
    /* determine interval */
    ix = j|0x3f800000;    /* normalize ix */
    if(j<=0x1cc471) k=0;  /* |x|<sqrt(3/2) */
    else if(j<0x5db3d7) k=1;  /* |x|<sqrt(3)   */
    else {k=0;n+=1;ix -= 0x00800000;}
    ax = intBitsToFloat(ix);

    /* compute s = s_h+s_l = (x-1)/(x+1) or (x-1.5)/(x+1.5) */
    u = ax-bp[k];    /* bp[0]=1.0, bp[1]=1.5 */
    v = one/(ax+bp[k]);
    s = u*v;
    s_h = s;
    is = floatBitsToInt(s_h);
    s_h = intBitsToFloat(is&0xfffff000);
    /* t_h=ax+bp[k] High */
    t_h = intBitsToFloat(((ix>>1)|0x20000000)+0x0040000+(k<<21));
    t_l = ax - (t_h-bp[k]);
    s_l = v*((u-s_h*t_h)-s_h*t_l);
    /* compute log(ax) */
    s2 = s*s;
    r = s2*s2*(L1+s2*(L2+s2*(L3+s2*(L4+s2*(L5+s2*L6)))));
    r += s_l*(s_h+s);
    s2  = s_h*s_h;
    t_h = 3.0+s2+r;
    is = floatBitsToInt(t_h);
    t_h = intBitsToFloat(is&0xfffff000);
    t_l = r-((t_h-3.0)-s2);
    /* u+v = s*(1+...) */
    u = s_h*t_h;
    v = s_l*t_h+t_l*s;
    /* 2/(3log2)*(s+...) */
    p_h = u+v;
    is = floatBitsToInt(p_h);
    p_h = intBitsToFloat(is&0xfffff000);
    p_l = v-(p_h-u);
    z_h = cp_h*p_h;    /* cp_h+cp_l = 2/(3*log2) */
    z_l = cp_l*p_h+p_l*cp+dp_l[k];
    /* log2(ax) = (s+..)*2/(3*log2) = n + dp_h + z_h + z_l */
    t = n;
    t1 = (((z_h+z_l)+dp_h[k])+t);
    is = floatBitsToInt(t1);
    t1 = intBitsToFloat(is&0xfffff000);
    t2 = z_l-(((t1-t)-dp_h[k])-z_h);
  }

  s = one; /* s (sign of result -ve**odd) = -1 else = 1 */

  hX = uint(hx);
  hX = hX >> 31;
  nN = int(hX) -1;

  if((nN|(yisint-1))==0)
    s = -one;  /* (-ve)**(odd int) */

  /* split up y into y1+y2 and compute (y1+y2)*(t1+t2) */
  is = floatBitsToInt(y);
  y1 = intBitsToFloat(is&0xfffff000);
  p_l = (y-y1)*t1+y*t2;
  p_h = y1*t1;
  z = p_l+p_h;
  j = floatBitsToInt(z);
  if (j>0x43000000)        /* if z > 128 */
    return s*huge*huge;        /* overflow */
  else if (j==0x43000000) {      /* if z == 128 */
    if(p_l+ovt>z-p_h) return s*huge*huge;  /* overflow */
  }
  else if ((j&0x7fffffff)>0x43160000)    /* z <= -150 */
    return s*tiny*tiny;        /* underflow */
  else if (uint(j)==0xc3160000u){  /* z == -150 */
    if(p_l<=z-p_h) return s*tiny*tiny;    /* underflow */
  }
  /*
   * compute 2**(p_h+p_l)
   */
  i = j&0x7fffffff;
  k = (i>>23)-0x7f;
  n = 0;
  if(i>0x3f000000) {    /* if |z| > 0.5, set n = [z+0.5] */
    n = j+(0x00800000>>(k+1));
    k = ((n&0x7fffffff)>>23)-0x7f;  /* new k for n */
    t = intBitsToFloat(n&~(0x007fffff>>k));
    n = ((n&0x007fffff)|0x00800000)>>(23-k);
    if(j<0) n = -n;
    p_h -= t;
  }
  t = p_l+p_h;
  is = floatBitsToInt(t);
  t = intBitsToFloat(is&0xfffff000);
  u = t*lg2_h;
  v = (p_l-(t-p_h))*lg2+t*lg2_l;
  z = u+v;
  w = v-(z-u);
  t  = z*z;
  t1  = z - t*(P1+t*(P2+t*(P3+t*(P4+t*P5))));
  r  = (z*t1)/(t1-two)-(w+z*w);
  z  = one-(r-z);
  j = floatBitsToInt(z);
  j += (n<<23);
  if((j>>23)<=0) z = scalbnf(z,n);  /* subnormal output */
  else z = intBitsToFloat(j);
  return s*z;
}

This will potentially kill performance so this should only be used for the offending driver/GPU.

The problem with Mesa has been resolved by around Mesa version 12.1.


Notes: The GIS data is from Natural Earth.

Optimising a GLSL shader (notes)

The following are a list of potential optimisations for shaders:

Do as little work as possible:
  • Don't continually calculate something that is essentially static, pass the value to the shader using a uniform.
  • If you have to calculate something every frame store the result in a uniform or a texture so the result is available to any shader that has to be run on every vertex or fragment.
  • If you need to calculate a lot of values at the start of the frame consider doing this using a shader and store the results in a texture.
  • Don't create and destroy resources.
  • Set the data as little as possible.
Keep shaders simple and as small as possible:
  • Don't create large all encompassing shaders. These take time to compile and optimise. Use the shader pre-processor to switch in different behaviors. I implemented an '#include' concept for my shaders.
  • Consider using ARB_separate_shader_objects.
  • Consider using shader sub-routines - in Mesa this is usually implemented using a 'if then else' construct with a uniform controlling the program flow. With 'Mesa >= 16', NVIDIA/AMD I've not seen any impact on performance using sub-routines with sensibly sized shaders.
  • Because shader sub-routines may result in a large all encompassing shader as a side effect of the target implementation you have to validate that this will not impact your application.
  • NVIDIA/AMD and Mesa 16.x (i965 driver) and newer I would use shader sub-routines. Older versions of Mesa or when using a different driver I would be careful and test.
  • Turn on OpenGL debugging and look at the output produced. Mesa will at least warn you about registry spills which basically means that your shader is a bit too big or that there is a problem in the compiler/optimiser.
Math:
  • Avoid iterative algorithms. Consider taking a different approach to performing a calculation such as using a min/max polynomial. Don't use a Taylor Series.
  • Know your sin/cos/tan/... properties and relationships.
  • Consider what GLSL functions are available and use the appropriate one, e.g. inversesqrtfma, etc...
  • Don't calculate something unless you have too.
  • Consider re-arranging the algorithm to reduce the number of operations or expensive math calls.
  • GLSL is close enough to C so test on the CPU and use a performance analysis tool.
  • Avoid using division.
  • Consider using MAD instruction.
  • Consider evaluation order of operators.
  • Keep floats together and vec's together.
  • pow(x, y) can be implemented as:
exp2(log2(x) * y)
  • exp(x) can be implemented as:
exp2(x * 1.442695)
  • log(x) can be implemented as:
log2(x * 0.695147)
  • sign() if we don't care about zero.
(x>0) ? 1 : -1
  • sign(x) * y can be implemented as
(x >= 0) ? y : -y

It is reported that conditional assignment is faster than calling sign().
  • atan can be implemented as:
float ATAN(in float y) {
  bool s = (1.0 > abs(y));
  if (s)
    return M_PI_2 - atan(1.0, y);
  return atan(y, 1.0);
}
  • atan2 may not be as robust as you think see Stackoverflow robust atan2
  • pow is known to have accuracy issues on i965 - I can vouch for this.
Debugging:
  • Hook up all the debugging extensions. You will be amazed at how much information most OpenGL drivers will provide to you.
  • For math algorithms implement on the CPU first.
  • If any of the standard math library functions fail to work then look at the symptoms and try and first work out what the pattern for the error is. Based on this you can potentially narrow down the issue to a handful of functions. Implement each function in turn using soft float implementation.
  • If you are really stuck on this then I am available for contact work (damian dot dixon @ acm dot org).
Code flow:
  • Avoid if and while statements.
  • Use the preprocessor to deal with code flow if at all possible.

pragma in a shader

There is no defined list.

Each implementation has it's own set.

Some common ones are:
  • #pragma debug(on)
  • #pragma debug(off)
  • #pragma optimize(on)
  • #pragma optimize(off)
  • #pragma invariant(all)
NVIDIA specific pragmas:
  • #pragma optionNV(fastmath on)
  • #pragma optionNV(fastprecision on)
  • #pragma optionNV(ifcvt none)
  • #pragma optionNV(inline all)
  • #pragma optionNV(strict on)
  • #pragma optionNV(unroll all)
There are also some environment variables that can be defined (options typically controlled globally with NVEmulate):
  • __GL_WriteProgramObjectAssembly
  • __GL_WriteProgramObjectSource
  • __GL_WriteInfoLog
  • __GL_ShaderPortabilityWarnings

Discussion

Some of the above optimisations use features that are not available on all OpenGL implementations.

You should consider what your lowest common denominator is and program to that level. You can then check for additional extension support and enable optimisations as needed.

This approach will ensure that your application works on the majority of hardware that is circulation. In the business or defense world people usually have some awful hardware that can't be changed.

Additional References:

Wednesday, 7 June 2017

Intel GPU sin/cos accuracy

While developing a system that re-projects GIS vector and raster data in realtime I found that the Intel GPU (HD4000) sin/cos was not particularly accurate (projecting data from WGS84 to Stereographic, Mercator etc...).

Because the accuracy was so poor I had to look for a different way of calculating sin/cos on the GPU.

Subsequently patches were made to Mesa and an environment variable was added to enable a slightly more precise variant of sin/cos, see:
The patch however still does not provide sufficient accuracy for display of GIS data.

The approach I took was to:

  • Use a min/max polynomial for sin.
  • To adjust the input value to be in the range [-PI/2, PI/2].
  • To take advantage of fma instruction in glsl.

You can take advantage of knowledge about sin/cos rules and identities as follows (pseudo code):
  • To calculate cos by calculating sin:
cos_value = calcSin(M_PI_2 - value);
  • To calculate sin and cos of a value:
sin_value = calcSin(value);
cos_value = sqrt( 1 + -(sin_value * sin_value) );

In variably you need both sin and cos when doing projection calculations and sqrt on a GPU is not as slow as you may think!


References:


XRender Debugging and Programming

There is very little detailed information on how to use XRender extension.

In this case the following is helpful:
The specification for Xrender is a bit lacking:
In a previous post I was looking at a BadMatch error returned by a call to XRenderCreatePicture.

The issue is that the drawable and Xrender format is likely to not match (you can look at the Xserver and client libraries source code to see what can cause a BadMatch).

The Xrender implementation in the XServer can be found here.

The file that implements the RenderCreatePicture is called render.c.

In this case the following two parameters must match:
  • Drawable depth
  • Visual depth
Basically this means that the XRenderPictFormat must match the Visual depth and that this must match the Drawable depth.

You can get the Visual information in the client application from:

int nitems = 0;
int vmask = VisualIDMask | VisualScreenMask;
XVisualInfo *visInfo = XGetVisualInfo(display, vmask, &vinfo, &nitems);

You can get the depth of the drawable using:

Window    root;
int       x, y;
unsigned int  awidth, aheight, border, adepth;
if ( !XGetGeometry( display, drawable, &root, &x, &y, &awidth, &aheight,
                          &border, &adepth ) )

You can get the correct format for the Visual using:

XRenderPictFormat *drawableFormat = XRenderFindVisualFormat( display, visual );

Tuesday, 6 June 2017

Parsing comma/space separated values in C++

I use environment variables for changing the behaviour of an application. I try not to have too many variables as it becomes difficult to remember what they are. As such I set a number of parameters into a single environment variable using a comma/space separated list and read the list as follows:
#include 
using namespace boost;

const char *st = getenv("APP_VERBOSE");
if (st != NULL)
{
  std::string stV(st);
  char_separator sep(", ");
  tokenizer< char_separator > tokens(stV, sep);
  FOREACH( const string &t, tokens)
  {
    if (t == "verbose")
    {
      // set a verbose flag
    }
    else if (t == "help")
    {
      // output all the options...
    }
  }
}
I am using boost here because it is more portable.

X11 Debugging Notes

I have been programming using Xlib off and on for for many years. I even wrote the Xlib implementation of a GIS graphics engine that was used in the back end of search and rescue planes (Aurora and associated programmes). The latest version will use a Modern OpenGL graphics engine that I had a big hand in implementing.

The one website I continually go back too when I need to debug X11 is:
In particular this page:
The main reason is for the detailed information on how to debug a problem when something goes wrong with your program.

While this site is pretty old the information is still current, at least for Xlib and Motif, not that many people are using these in anger directly very much now.

A couple of things have changed:
  • You should probably add the following to your application to cause Xlib protocol to be synchronous:
XSynchronize(display, True);
  • gdb is now used in preference to dbx (old Sun debugger, now Oracle debugger).
(gdb) break _XError
Function "_XError" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (_XError) pending.

If you find that your problem goes away when the protocol is synchronous then your problem is going to be a lot more difficult to fix.

In my case this is usually related to my programs being multi-threaded with multiple connections to the XServer (not all Xlib libraries are thread safe even if you follow the rules that are known about). This is partly why xcb was created.

Either way what you get will be a stack trace back to the problem:

(gdb) where
#0  0xf73106a0 in _XError () from /lib/libX11.so.6
#1  0xf730cff6 in handle_error () from /lib/libX11.so.6
#2  0xf730d0be in handle_response () from /lib/libX11.so.6
#3  0xf730e328 in _XReply () from /lib/libX11.so.6
#4  0xf7309677 in XSync () from /lib/libX11.so.6
#5  0xf7309713 in _XSyncFunction () from /lib/libX11.so.6
#6  0xf49aa46c in XRenderCreatePicture () from /lib/libXrender.so.1
The actual error reported is:
(gdb) c
Continuing.
X Error of failed request:  BadMatch (invalid parameter attributes)
  Major opcode of failed request:  138 (RENDER)
  Minor opcode of failed request:  4 (RenderCreatePicture)
  Serial number of failed request:  94
  Current serial number in output stream:  95

To be honest sometimes the errors make no sense and you have to search on the internet for answers. In this case the following is helpful:
The specification for Xrender is a bit lacking:
A subsequent post covers fixing the above issue with XRenderCreatePicture.

The additional complication in all this is that most modern Linux systems are now using an Xlib library implemented using xcb. This does not however sort out the threading mess that Xlib is known for.