theme:

# DIY: Gamma and Zeta Function Implementation

in developing my language, kscript, i wanted to have the Riemann Zeta Function and Gamma Function available as part of the standard library. so, i implemented it!

our basic workflow is:

• write a python script that can generate some C code that corresponds to both functions
• run that python script with enough precision for double in C, with correct error bounds

let’s get to it!

## Definitions

i’ll assume you’re more or less familiar with what the Zeta and Gamma functions are, but i’ll also provide a definition we can work with:

$\Gamma(x) = (x - 1)! = \int_{0}^{\infty} t^{x-1} e^{-t} dt$ $\zeta(x) = \sum_{n=1}^{\infty} \frac{1}{n^x}$

we use the following reflection formulas to define the value elsewhere:

$\zeta(x) = 2 (2 \pi) ^ {x - 1} \sin(\frac{x \pi}{2}) \Gamma(1 - x) \zeta(1 - x)$ $\Gamma(x) = \sin(\pi x) \Gamma(1 - x)$

## Goal

our goal is to define C functions with the following signatures, which evaluate the specific function at a particular point:


double my_zeta(double x);
double complex my_czeta(double complex x);

double my_gamma(double x);
double complex my_cgamma(double complex x);

// equivalent to log(gamma(x))
double my_lgamma(double x);
double complex my_lcgamma(double complex x);



we would like this to be self contained, and distributable to any other C99 project. further, the results should be accurate to the requested precision (double in C is typically IEEE 64 bit)

we include lgamma functions to compute the logarithm of the gamma function; we won’t go into optimizing for this case too much, but i need this for kscript so we will also generate it (to generate it yourself, include --lgamma in your arguments to the script)

## Implementation

the source code i used is available for free: view on GitHub

i used this paper to form the basis of my implementation for the Zeta function. specifically, section 1.2 entitled Convergence of alternating series method. we’ll also need an implementation of the Gamma Function, which i’ve linked papers to help us. note that we can use C’s tgamma function for real number computations, but we’ll have to roll our own for complex numbers (we’ll implement both, for completeness). i won’t go into all of the derivations for all the formula (those are covered in the papers i linked if you’re interested); i’ll try and just breifly cover the motivation and basic algebra between formulas

references:

we will dynamically generate C99 code that will be suitable to machine precision via a Python script, which will use the Python package gmpy2 (pip3 install gmpy2)

we start by importing things, declaring an argument format, and having some built ins:

# std library
import sys
import math
import argparse

# for cached functions (may help performance)
from functools import lru_cache

# gmpy2: multiprecision
import gmpy2
from gmpy2 import mpfr, const_pi, exp, sqrt, log, log10

parser = argparse.ArgumentParser(formatter_class=lambda prog: argparse.HelpFormatter(prog, max_help_position=40))

parser.add_argument('--prec', help='Internal precision to use in all computations (in bits)', default=1024, type=int)
parser.add_argument('--prefix', help='Prefix to the C style functions to generate (include the "_"!)', default="my_")
parser.add_argument('--lgamma', help='Whether or not to include an implementation of the lgamma function', action='store_true')

args = parser.parse_args()

# set precision to the requested one
gmpy2.get_context().precision = args.prec

# pi & e, but to full precision within gmpy2
pi = const_pi()
e = exp(1)

# factorial, product of all integers <= x
@lru_cache
def factorial(x):
return math.factorial(x)

# double factorial, if n is even, product of all even numbers <= n, otherwise the product of all odd numbers <= n
@lru_cache
def double_factorial(n):
if n <= 1:
return 1
else:
return n * double_factorial(n - 2)

# n choose k, i.e.
@lru_cache
def choose(n, k):
if k < 0:
return 0
return math.comb(n, k)

# return the amount of digits that are accurate in a given error bound
# example: digits_accurate(.001) returns 3, since it is down to 3 places
def digits_accurate(x):
return float(max([-log10(x), 0]))



these are all pretty self explanatory; especially with the comments. we use @lru_cache to cache results of functions to reduce overhead when repeatedly calling with the same arguments.

## Gamma Function

to create a table-based approximation (specifically, the Lanczos Approximation), we start with an observation that the Gamma function can be written as:

$\Gamma(x+1) \approx \sqrt{2 \pi} (x+g+\frac{1}{2})^{x+\frac{1}{2}} e^{-(x+g+\frac{1}{2})} A_g(x)$

where

$A_g(x) = \frac{1}{2}p_0(g) + p_1(g)\frac{x}{x+1} + p_2(g)\frac{x(x-1)}{(x+1)(x+2)} + ...$

where $g$ is a specifically chosen number to maximize accuracy, $n$ is the number of terms, and the $p_i$ are coefficients computed for the best fit

this works fine on paper, but when actually implementing the C code there is a number of problems:

• the number of floating-point multiplies is $O(n^2)$ at run time (think about all the fractions in $A_g$), which will make it slower to execute, as well as introduce more error
• those multiplications will result in fractions which have very different magnitudes; doing that and summing their results will make roundoff errors much more prevalent
• more intermediate adds $x, x+1, x+2, …$, means more register usage and for large tables, maybe even stack variables (that’s really bad for performance!)

Wouldn’t it be nice to re-arrange the $A_g$ function into something that looks like the following:

$A_g'(x) = a_0 + \frac{a_1}{x+1} + \frac{a_2}{x+2} + ... \frac{a_{n-1}}{x+n-1}$

Therefore minimizing and simplifying the resulting code? Yes it would! But how do we do that?

Well; we view the summation as a Partial Fraction Summation; think of the following equation:

$\frac{1}{x+1} + \frac{1}{x+2} = \frac{(x + 1) + (x + 2)}{(x+1)(x+2)} = \frac{2x+3}{(x+1)(x+2)}$

We can algebraically manipulate the rational expression to yield either a summation of divisions, or a division of products & sums. This is actually very similar to our above expression that we want for $A_g’(x)$

I’ll skip all the murky details here, but essentially we’ll end up solving a matrix equation that will tell us what our $a_i$ should be (similar to how, in the simple expression, if we are given $\frac{2x+3}{(x+1)(x+2)}$, our result $a_i$ should be $[1, 1]$). That matrix equation is defined via:

$a = \mathbf{B} \mathbf{C} \mathbf{D_c} \mathbf{D_r} p$

Where $a, p$ are vectors of the coefficients mentioned in the above formulas, and the rest are $n \times n$ matrices generated to describe the partial fraction decomposition.

Further, the error may be calculated as:

$\textrm{err} = |\Gamma(x) - approx(x)| \leq |\frac{\pi}{2}(\frac{e^g}{\sqrt{2}} - (\frac{p_0}{2} + \sum_{j=1}^{n}(-1)^{j} p_j))|$

We would like to ensure that $$\textrm{err} \leq 10^{-14}$$, which is among the limits of IEEE 64 bit floating point (i.e. a double in C)

Here’s my code implementing the actual formulae:

# generates a table for the Gamma function, used for approximation
# returns (coefs, errbound())
@lru_cache
def get_gamma_table(n, g):

# we need to generate the array of coefficients a such that:
# Gamma(x) = (x + g + 0.5)^(x+0.5) / (e^(x+g-0.5)) * L_g(x)
# L_g(x) = a[0] + sum(a[k] / (z + k) for k in range(1, N))

# essentially, we construct some matrices from number-theoretic functions
#   and we can generate the coefficients of the partial fraction terms 1 / (z + k)
# This greatly simplifies from the native z(z-1).../((z+1)(z+2)...) form, which
#   would require a lot more operations to implement internally (see definition of Ag(z) on wikipedia)

# calculate an element for the 'B' matrix (n x n)
def getB(i, j):
if i == 0:
return 1
elif i > 0 and j >= i:
return (-1) ** (j - i) * choose(i + j - 1, j - i)
else:
return 0

# calculate an element for the 'C' matrix (n x n)
def getC(i, j):
if i == j and i == 0:
return mpfr(0.5)
elif j > i:
return 0
else:
# this is the closed form instead of calculating a sum via the 'S' symbol mentioned in some places
return int((-1) ** (i - j) * 4 ** j * i * factorial(i + j - 1) / (factorial(i - j) * factorial(2 * j)))

# calculate an element for the 'Dc' matrix (n x n)
@lru_cache
def getDc(i, j):
if i != j:
# it's a diagonal matrix, so return 0 for all non-diagonal elements
return 0
else:
# otherwise, compute via the formula given
return 2 * double_factorial(2 * i - 1)

# calculate an element for the 'Dr' matrix (n x n)
@lru_cache
def getDr(i, j):
# it's diagonal, so filter out non-diagonal efforts
if i != j:
return 0
elif i == 0:
return 1
else:
# guaranteed to be a integer, so cast it (so no precision is lost)
return -int(factorial(2 * i) / (2 * factorial(i) * factorial(i - 1)))

# generate matrices from their generator functions as 2D lists
# NOTE: this obviously isn't very efficient, but it allows arbitrary precision elements,
#   which numpy does not
# these matrices are size <100, so it won't be that bad anyway
B  = [[getB (i, j) for j in range(n)] for i in range(n)]
C  = [[getC (i, j) for j in range(n)] for i in range(n)]
Dc = [[getDc(i, j) for j in range(n)] for i in range(n)]
Dr = [[getDr(i, j) for j in range(n)] for i in range(n)]

# the f vector, defined as F but without the double rising factorial (which Dc has)
# i left this in here instead of combining here to be more accurate to
#   the method given in 4
f_gn = [sqrt(2) * (e / (2 * (i + g) + 1)) ** (i + 0.5) for i in range(n)]

# multiply matrices X*Y*...
def matmul(X, Y, *args):
if args:
return matmul(matmul(X, Y), *args)
else:
# nonrecursive
assert len(X[0]) == len(Y)
M, N, K = len(X), len(Y[0]), len(Y)

# GEMM kernel (very inefficient; but doesn't matter due to AP floats & small matrix sizes)
return [[sum(X[i][k] * Y[k][j] for k in range(K)) for j in range(N)] for i in range(M)]

# normalization factor; we multiply everything by this so it is pretty close to 1.0
W = exp(g) / sqrt(2 * pi)

# get the resulting coefficients
# NOTE: we should get a column vector back, so return the 0th element of each row to get the coefficients
a = list(map(lambda x: W * x[0], matmul(Dr, B, C, Dc, [[f_gn[i]] for i in range(n)])))

# compute 'p' coefficients (only needed for the error bound function)
p = [sum([getC(2 * j, 2 * j) * f_gn[j] * Dc[j][j] for j in range(i)]) for i in range(n)]

# error bound; does not depend on 'x'
def errbound():
# given: err <= |pi/2*W*( sqrt(pi) - u*a )|
# compute dot product u * a
dot_ua = (a[0] + sum([2 * a[i] / mpfr(2 * i - 1) for i in range(1, n)])) / W
# compute full formula
return abs(pi / 2 * W) * abs(sqrt(pi) - dot_ua)

# return them
return a, errbound


Great! Now we can generate a function (in C code), that should look like:

I define macros for constants such as PI for kscript; but you may want to use your own; check the script to see how i precompute constants. The script will generate constants for $\pi$, $\log \pi$, $\sqrt{2 \pi}$, $\log\sqrt{2 \pi}$ in full precision, so we don’t have to worry about that (defines them as MY_PI, MY_LOG_PI, etc)

// evaluate the gamma function at a given point
double my_gamma(double x) {
if (x <= 0) {
// check for poles
if (x == (int)x) return INFINITY;

// use reflection formula, since it won't converge otherwise
// Gamma(x) = pi / (sin(pi * x) * Gamma(1 - x))
return MY_PI / (sin(MY_PI * x) * my_gamma(1 - x));
} else {
// shift off by 1 to make indexing cleaner
x -= 1.0;

// constant
static const double g = ...;
// length (n) used
static const int a_n = ...;
// array of data
static const long double a[] = {
... table values ...
};

// keep track of sum
long double sum = a[0];
int i;
for (i = 1; i < a_n; ++i) {
sum += a[i] / (x + i);
}

// temporary variable
double tmp = x + g + 0.5;
return sqrt(2 * MY_PI) * pow(tmp, x + 0.5) * exp(-tmp) * sum;
}
}


Similarly, we can do the complex version (replacing sin with csin, etc):

double complex my_cgamma(double complex x) {
double x_re = creal(x), x_im = cimag(x);

// short circuit for real only
if (x_im == 0.0) return my_gamma(x_re);

if (x_re < 0) {
// use reflection formula, since it won't converge otherwise
// Gamma(x) = pi / (sin(pi * x) * Gamma(1 - x))
return MY_PI / (csin(MY_PI * x) * my_cgamma(1 - x));
} else {
// shift off by 1 to make indexing cleaner
x -= 1.0;

// constant
static const double g = ...;
// length (n) used
static const int a_n = ...;
// array of data
static const long double a[] = {
... table values ...
};

// keep track of sum
long double complex sum = a[0];
int i;
for (i = 1; i < a_n; ++i) {
sum += a[i] / (x + i);
}

// temporary variable
double complex tmp = x + g + 0.5;
return MY_SQRT_2PI * pow(tmp, x + 0.5) * exp(-tmp) * sum;
}
}


For the sake of brevity; i won’t post the implementation of my_*lgamma; it is very similar to these. run the script yourself to see it’s output for that!

## Zeta Function

The hard part is over; the Zeta function is actually (in my opinion) easier to generate a table for. it is based on a (quite) simple formula:

$\zeta(x) = \frac{1}{d_0(1 - 2^{1-x})} \sum_{k=1}^{n}\frac{(-1)^{k-1}d_k}{k^x} + \gamma_n(x)$

Where:

$n$ is the number of terms in the approximation, $d$ is a vector of coefficients (similar to $a$ in the Gamma approximation) best fit for a particular model, and $gamma_n(x)$ is the error term

The form of $d_k$ is much simpler to compute; it is given by:

$d_k = n \sum_{j=k}^{n}\frac{(n+j-1)!4^j}{(n-j)!(2j)!}$

The error term is bounded by:

$|\gamma_n(x)| \leq \frac{2}{(3+\sqrt{8})^n|\Gamma(x)||1-2^{1-x}|} \leq \frac{3(1+2|\Im(x)|e^{\frac{|\Im(x)|\pi}{2}})}{(3+\sqrt{8})^n|1-2^{1-x}|}$

And, therefore, again in this approximation, we would require that $\gamma_n(x) \leq 10^{-14}$, to give us a sufficiently accurate approximation


# -- ZETA --

# generates a zeta table which can be used for calculation
# returns (coefs, errbound(x)), which are the coefficients for the table as well as an error bound function
@lru_cache
def get_zeta_table(n):

# the error bound of the approximation technique for a given input x
def errbound(x):
# absolute value of the imaginary component
t = abs(complex(x).imag)

# calculate error term
et = (3 / (3 + sqrt(8)) ** n) * ((1 + 2 * t) * exp(t * pi / 2)) / (1 - 2 ** (1 - x))
return abs(et)

# compute d_k, from Proposition #1 in the paper
def d(k):
res = 0
for j in range(k, n+1):
num, den = factorial(n + j - 1) * 4 ** j, (factorial(n - j) * factorial(2 * j))
res += mpfr(num) / den

return n * res

# d(0) is the value by which we normalize; this is more efficient
#   and reduces loss of precision while doing arithmetic in C
d_norm = d(0)

# list of 'd's to sum later (normalized to d(0))
d = [d(k) / d_norm for k in range(1, n + 1)]

return d, errbound



to generate C code for real inputs, we can generate a single table (since, the errbound() function relies on the imaginary component of the input, and does not change significantly with the real portion).

double my_zeta(double x) {
if (x < 0) {
// check for negative even integers (which are exactly 0)
int ix = (int)x;
if (ix == x && ix % 2 == 0) return 0.0;

// use reflection formula with the functional equation
// Zeta(x) = 2 * (2*PI)^(x-1) * sin(x * PI/2) * Gamma(1-x) * Zeta(1 - x)
return 2 * pow(2 *MY_PI, x - 1) * sin(x * MY_PI / 2) * my_gamma(1 - x) * my_zeta(1 - x);
} else {
// for x >= 0, summation with the coefficients will work fine

// the 'n' used in computation
static const int n = ...;

// d_k for k == 0 through n-1 (0-indexing)
// NOTE: we bake in the alternating sign here, instead of at runtime (cheaper this way)
static const long double d[] = {
... table values ...
};

// use long double will prevent some rounding errors
long double sum = 0.0;

int i;
// compute elementwise sum
for (i = 0; i < n; ++i) {
sum += d[i] * pow(i + 1, -x);
}

// divide by the normalization factor in Proposition 1 (without d0, since everything has been normalized by that)
sum /= 1 - pow(2, 1 - x);

return (double)sum;
}
}


Note that our zeta function may call the gamma function we defined; this is neccessary due to the fact that the summation will not converge for negative values of $x$

For complex inputs; we need to realize that the maximum error term increases as the imaginary component; for example, the author estimates that it is required that $n \geq 1.3d + 0.9\Im(x)$, if $d$ digits are required for accuracy. However; for $x$ with small imaginary components, we should use the smallest $n$ that will work:


# set to whatever the required digits of precision is
goal_digits = 14.0

# dictiorary where k, v indicates:
# when imag(x) <= k, v may be used as the list of d_k for sufficient accuracy
imag_d_map = { }
# current table length
n = 4

# get table bounds
d, errbound = get_zeta_table(n)

# find tables for imag(x) <= 2 ** p
for p in range(0, 5):

val_r = 2.0
val_i = 2 ** p

while digits_accurate(errbound(val_r + val_i * 1j)) < goal_digits:
n += 4
d, errbound = get_zeta_table(n)

imag_d_map[val_i] = d



This code creates a mapping that maps a threshold of the imaginary component to the corresponding minimal table size (and always has $n \equiv 0 \mod 4$), we then loop in our python script to generate the following generated code:

double complex my_czeta(double complex x) {
// get real and imaginary components
double x_re = creal(x), x_im = cimag(x);

// use the real-only version of the function if it is a real argument
if (x_im == 0.0) return my_zeta(x_re);

if (x_re < 0) {

// use reflection formula with the functional equation
// Zeta(x) = 2 * (2*PI)^(x-1) * sin(x * PI/2) * Gamma(1-x) * Zeta(1 - x)
return 2 * cpow(2 * MY_PI, x - 1) * csin(x * MY_PI / 2) * my_cgamma(1 - x) * my_czeta(1 - x);

} else {
// when the real component is >= 0, summation with the coefficients will work fine

/* Generated Tables */

// the 'n' used in computation
// NOTE: this table is useful for abs(imag(x)) <= 1
static const int n_0 = 28;

// d_k for k == 0 through n-1 (0-indexing)
// NOTE: we bake in the alternating sign here, instead of at runtime (cheaper this way)
static const long double d_0[] = {
... table values ...
};

// the 'n' used in computation
// NOTE: this table is useful for abs(imag(x)) <= 2
static const int n_1 = 28;

// d_k for k == 0 through n-1 (0-indexing)
// NOTE: we bake in the alternating sign here, instead of at runtime (cheaper this way)
static const long double d_1[] = {
... table values ...
};

// the 'n' used in computation
// NOTE: this table is useful for abs(imag(x)) <= 4
static const int n_2 = 28;

// d_k for k == 0 through n-1 (0-indexing)
// NOTE: we bake in the alternating sign here, instead of at runtime (cheaper this way)
static const long double d_2[] = {
... table values ...
};

// the 'n' used in computation
// NOTE: this table is useful for abs(imag(x)) <= 16
static const int n_4 = 40;

// d_k for k == 0 through n-1 (0-indexing)
// NOTE: we bake in the alternating sign here, instead of at runtime (cheaper this way)
static const long double d_4[] = {
... table values ...
};

// absolute value of the imaginary portion (used for discriminating amongst tables)
double x_im_abs = fabs(x_im);

// the sum of all elements in the series
long double complex sum = 0.0;
int i;
if (x_im_abs <= 1) {
for (i = 0; i < n_0; ++i) {
sum += d_0[i] * cpow(i + 1, -x);
}
} else if (x_im_abs <= 2) {
for (i = 0; i < n_1; ++i) {
sum += d_1[i] * cpow(i + 1, -x);
}
} else if (x_im_abs <= 4) {
for (i = 0; i < n_2; ++i) {
sum += d_2[i] * cpow(i + 1, -x);
}
} else if (x_im_abs <= 8) {
for (i = 0; i < n_3; ++i) {
sum += d_3[i] * cpow(i + 1, -x);
}
} else {
for (i = 0; i < n_4; ++i) {
sum += d_4[i] * cpow(i + 1, -x);
}
}

// transform by the normalization factor
sum /= 1 - cpow(2, 1 - x);

return sum;
}
}


The real code goes up to $2^{10}$, but you get the point; essentially, the smallest possible table is generated for every power of two threshold of the imaginary component. For example, $n = 144$ is required for $\Im(x) \leq 128$. If we used that for all numbers, we would be $\frac{144}{28} \approx 5.14$ times slower than we need to be! This will help in performance.

## Testing

To test this, i wrote a small program using the generated code. You can check out the full source code (mg.c). I tested values of $x=\sigma+it$, for $\sigma, t \in [0, 256)$, and compared the time. I also compared the built in tgamma function discussed earlier, and measured how accurate my implementation was relative to it; here are the results summarized:

gcc -std=c99 -Ofast -fno-math-errno t.c -lm -o test_gz
./test_gz
out:# -- ACCURACY
out:|my_gamma(x)-tgamma(x)| <= 0.000000 , at x=0.000480, accurate to 14.76 digits
out:
out:# -- SPEED
out:my_gamma(x), x in [0, 1)      :    0.039 us/iter
out:my_gamma(x), x in [0, 4)      :    0.038 us/iter
out:my_gamma(x), x in [0, 16)     :    0.037 us/iter
out:my_gamma(x), x in [0, 256)    :    0.052 us/iter
out:my_cgamma(x), x in [0i, 1i)   :    0.137 us/iter
out:my_cgamma(x), x in [0i, 4i)   :    0.135 us/iter
out:my_cgamma(x), x in [0i, 16i)  :    0.139 us/iter
out:my_cgamma(x), x in [0i, 256i) :    0.155 us/iter
out:my_zeta(x), x in [0, 1)       :    0.395 us/iter
out:my_zeta(x), x in [0, 4)       :    0.396 us/iter
out:my_zeta(x), x in [0, 16)      :    0.397 us/iter
out:my_zeta(x), x in [0, 256)     :    0.416 us/iter
out:my_czeta(x), x in [0i, 1i)    :    1.356 us/iter
out:my_czeta(x), x in [0i, 4i)    :    1.494 us/iter
out:my_czeta(x), x in [0i, 16i)   :    2.005 us/iter
out:my_czeta(x), x in [0i, 256i)  :   11.456 us/iter
out:tgamma(x), x in [0, 1)        :    0.029 us/iter
out:tgamma(x), x in [0, 4)        :    0.036 us/iter
out:tgamma(x), x in [0, 16)       :    0.058 us/iter
out:tgamma(x), x in [0, 256)      :    0.050 us/iter


Feel free to compile it on your machine and email me results; i’d be happy to include them.

My implementation and glibc’s implementation of the gamma function agree everywhere up to 14 digits, which is plenty accurate (we could check Wolfram alpha exactly to see whether i was closer or they were, but they are both fine for our purposes).

The generated source code i use in kscript (as well as for the demo) is available here, as a single file, feel free to use in non-commercial projects.

I hope you’ve enjoyed the blog, and can use these implementations for your own project. The C code is very simple and should be pretty easy to port to other languages (JavaScript, Python, C#, etc, etc).