In the first two parts of this series, I looked at the basics of the interface I created between rmathlib and kdb+. In this post, I’ll go through some of the convenience functions I wrote to emulate some basic R functionality.
NOTE: I wrote these functions as a learning exercise to familiarise myself with q/kdb+ a bit more – they are very simplistic and I am sure there are much better ways to do them (I know the very nice qml project does a lot of this stuff, plus interfaces to external BLAS libraries).
First, load the library:
q)\l rmath.q
seq – sequence function
The first convenience function is seq, which is like R’s seq() or NumPy’s arange() in that it takes a start and end point and generates a range of numbers.
q)seq[1;100]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26...
The function is just a curried wrapper around seqn, which also takes a step size:
q)seqn[1;10;.5]
1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10
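To make the relationship between the two functions concrete, here is a minimal sketch of how they could be written. This is an illustration rather than the code in rmath_aux.q, and the names mySeqn and mySeq are made up so they don't clash with the library. The step count relies on floor, so it may be sensitive to floating-point rounding for steps such as 0.1.

```q
/ Hypothetical re-implementations for illustration only
/ mySeqn[a;b;s]: a, a+s, a+2s, ... not exceeding b
mySeqn:{[a;b;s] a+s*til 1+floor (b-a)%s}
/ mySeq fixes the step at 1 by projecting mySeqn (q's form of partial application)
mySeq:mySeqn[;;1]

mySeq[1;5]        / 1 2 3 4 5
mySeqn[1;3;.5]    / 1 1.5 2 2.5 3
```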
This is handy when working with some of the probability functions shown last time. For instance, the command below calculates the standard normal CDF at evenly spaced points from -3 to 3:
q)each[{pnorm[x;0;1]}] seqn[-3;3;.5]
0.001349898 0.006209665 0.02275013 0.0668072...
table – Symbol Tabulation
R provides a handy function, table(), to show summary count tables of factor levels. The q version here can be used as follows:
q)t:([] sym:1000?`A`B`C`D)
q)table[`sym;t]
B| 241
A| 244
C| 256
D| 259
This tabulates the column sym from the table t. The summary is ordered by increasing frequency count.
NOTE: this really has no advantage over the standard
q)`x xasc select count i by sym from t
sym| x
---| ---
B | 241
A | 244
C | 256
D | 259
except some slight brevity.
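For reference, a tabulation like this can be sketched with q built-ins alone. The function below is a hypothetical stand-in (myTable is not the library's name, and the real implementation may differ): it groups the column values, counts each group, and sorts the resulting dictionary by count.

```q
/ Hypothetical sketch: tabulate column c of table t, ordered by increasing count
myTable:{[c;t] asc count each group t c}

t:([] sym:`A`B`B`C`C`C)
myTable[`sym;t]   / dictionary A 1, B 2, C 3
```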
range – Min/Max
The range function simply returns the boundaries of a set of values:
q)x:rnorm[10000]
q)range[x]
-3.685814 4.211363
q)abs (-) . range[x] / absolute range
7.897177
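A function with this behaviour is a one-liner in q. The sketch below uses a made-up name, myRange, and is only meant to show the shape of the result (a two-item list), which is why the width calculation above can apply subtraction across it.

```q
/ Hypothetical sketch: return (minimum;maximum) of a list
myRange:{(min x;max x)}

myRange 3 1 4 1 5            / 1 5
(-) . reverse myRange 3 1 4 1 5   / 4, i.e. max minus min
```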
summary – Min/Max/Mean/Median/IQR
The summary function provides summary stats, a la R’s summary() function:
q)x:norm[10000;3;2]
q)summary x
min | -4.755305
1q | 1.59379
median| 2.972523
mean | 2.966736
3q | 4.336589
max | 10.00284
quantile – Quantile Calculations
A very simple quantile calc:
q)x:norm[10000;3;2]
q)quantile[x;.5]
2.973137
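As an illustration of what a quantile calculation involves, here is a sketch that interpolates linearly between neighbouring order statistics, and a summary built on top of it. Both are assumptions for the sake of example: myQuantile and mySummary are made-up names, the interpolation rule is one of several in common use, and the library's own quantile may well use a simpler rule. The labels q1 and q3 are also chosen here and differ from the 1q and 3q labels in the output above.

```q
/ Hypothetical sketch: quantile with linear interpolation between order statistics
myQuantile:{[x;p]
  s:asc x;                   / sorted copy of the data
  h:p*(count s)-1;           / fractional position of the quantile in s
  lo:floor h;                / index at or just below that position
  hi:(lo+1)&(count s)-1;     / next index, clamped to the last element
  s[lo]+(h-lo)*s[hi]-s[lo]}  / interpolate between the two neighbours

/ Summary built from q built-ins plus myQuantile
mySummary:{[x] `min`q1`median`mean`q3`max!(min x;myQuantile[x;.25];med x;avg x;myQuantile[x;.75];max x)}

myQuantile[1 2 3 4f;.5]   / 2.5, the same value med returns
```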
hist – Bin Count
Very crude bin count – specify the data and the number of bins:
q)hist[x;10]
-4.919383| 14
-3.279589| 101
-1.639794| 601
0 | 1856
1.639794 | 3043
3.279589 | 2696
4.919383 | 1329
6.559177 | 319
8.198972 | 40
9.838766 | 1
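One way to produce a bin count of this kind is sketched below. It is not the library's hist (the bin edges in the output above are aligned differently); myHist is a made-up name, and the sketch simply splits the data range into n equal-width bins keyed by their left edge. It assumes the data are not all identical, since the bin width would then be zero.

```q
/ Hypothetical sketch: count values of x falling in n equal-width bins
myHist:{[x;n]
  w:(max[x]-min x)%n;          / bin width
  b:(n-1)&floor (x-min x)%w;   / bin index per value, with the maximum placed in the last bin
  c:@[n#0;b;+;1];              / add 1 to the count of a bin for every value that lands in it
  (min[x]+w*til n)!c}          / dictionary from left bin edge to count

myHist[0 1 2 3 4 5 6 7 8 9 10f;5]   / edges 0 2 4 6 8 with counts 2 2 2 2 3
```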
diag – Identity Matrix Generation
q)diag 10
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1
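An identity matrix can be built in q by comparing a list of indices with itself, one row at a time. The sketch below uses a made-up name, myDiag, and casts the boolean result to long integers; the library's own version may do this differently.

```q
/ Hypothetical sketch: n x n identity matrix
/ each row compares til n against one index, giving a single 1b on the diagonal
myDiag:{[n] "j"$(til n)=/:til n}

myDiag 3   / rows 1 0 0, 0 1 0 and 0 0 1
```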
Scale functions
Sometimes it’s very useful to scale an input data set – e.g. when feeding multiple inputs into a statistical model, large differences between the relative scales of the inputs, combined with finite-precision computer arithmetic, can result in some inputs being dwarfed by others. The scale function just adjusts the input as follows: \(X_{s}=(X-\mu)/\sigma\).
The example below scales two inputs with different ranges:
q)x:norm[10;0;1]; y:norm[10;5;3]
q)x
1.920868 -1.594028 -0.02312519 1.079606 -0.5310111 0.2762119 0.1218428 0.9584264 -0.4244091 -0.7981221
q)y
10.69666 2.357529 8.93505 3.65696 5.218461 3.246216 5.971919 7.557135 1.412827 1.246241
q)range x
-1.594028 1.920868
q)range y
1.246241 10.69666
q)range scale x
-1.74507 1.878671
q)range scale y
-1.232262 1.845551
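The formula maps directly onto q built-ins. The sketch below (myScale is a made-up name) uses dev, q's population standard deviation, for \(\sigma\); the scaled ranges shown above appear consistent with that choice, whereas R's scale() divides by the sample standard deviation.

```q
/ Hypothetical sketch: centre on the mean and divide by the population standard deviation
myScale:{(x-avg x)%dev x}

s:myScale 1 2 3 4f
avg s   / approximately 0
dev s   / approximately 1
```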
There are other useful scaling measures, including the min/max scale: \( \frac{x-\min(x)}{\max(x)-\min(x)} \). This is implemented using the minmax function:
q)minmax x
1 0 0.4469273 0.760658 0.302432 0.5320897 0.4881712 0.726182 0.3327606 0.226438
q)range minmax x
0 1f
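Written out in q, the min/max formula is a single expression. The version below is a sketch with a made-up name, myMinmax, and it assumes the input contains at least two distinct values so that the denominator is non-zero.

```q
/ Hypothetical sketch: rescale x linearly so that min maps to 0 and max maps to 1
myMinmax:{(x-min x)%max[x]-min x}

myMinmax 2 4 6f   / 0 0.5 1
```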
There are other functions which are useful for scaling, e.g. the RMS (root mean square): \( \sqrt{\frac{\sum{x_i^2}}{N}}\):
q)x:rnorm 1000
q)rms x
1.021065
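The formula reads almost word for word in q: square, average, take the root. The sketch below uses a made-up name, myRms, and is not necessarily how the library defines rms.

```q
/ Hypothetical sketch: root mean square of a numeric list
myRms:{sqrt avg x*x}

myRms 1 1 1 1f   / 1
myRms 3 4f       / the square root of 12.5
```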
nchoosek – Combinations
The nchoosek function calculates the number of combinations of N items chosen k at a time (i.e. \(\binom{N}{k}\)):
q)nchoosek[100;5]
7.528752e+07
q)each[{nchoosek[10;x]}] seq[1;10]
10 45 120 210 252 210 120 45 10 1f
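One way to compute this without large factorials is the multiplicative form \(\binom{N}{k}=\prod_{i=1}^{k}\frac{N-k+i}{i}\). The sketch below applies it directly; myNchoosek is a made-up name, and the library's nchoosek may instead be built on the gamma or beta functions from rmathlib. Like the outputs above, the result is a float, so large values are approximate.

```q
/ Hypothetical sketch: binomial coefficient via the multiplicative formula
/ numerators are n-k+1 .. n, denominators are 1 .. k
myNchoosek:{[n;k] prd ((n-k)+1+til k)%1+til k}

myNchoosek[10;3]    / approximately 120
myNchoosek[100;5]   / approximately 75287520
```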
The source file is available here: https://github.com/rwinston/kdb-rmathlib/blob/master/rmath_aux.q.