This repo is no longer actively maintained. Don't be disappointed though, check out https://github.com/waylonflinn/bvec instead!
Bdot does big dot products (by making your RAM bigger on the inside). It's based on Bcolz and includes transparent disk-based storage.
Supports `matrix . vector` and `matrix . matrix` products for the most common numpy numeric data types (`numpy.int64`, `numpy.int32`, `numpy.float64`, `numpy.float32`).
Install with pip:

```
pip install bdot
```

or build from source (requires bcolz >= 0.9.0):

```
python setup.py build_ext --inplace
python setup.py install
```
Multiply a matrix (`carray`) with a vector (`numpy.ndarray`), returns a vector (`numpy.ndarray`):

```python
import bdot
import numpy as np

matrix = np.random.random_integers(0, 12000, size=(300000, 100))
bcarray = bdot.carray(matrix, chunklen=2**13, cparams=bdot.cparams(clevel=2))

v = bcarray[0]

result = bcarray.dot(v)
expected = matrix.dot(v)

# should return True
(expected == result).all()
```
Multiply a matrix (`carray`) with the transpose of a matrix (`carray`), returns a matrix (`carray`):

```python
import bdot
import numpy as np

matrix = np.random.random_integers(0, 120, size=(1000, 100))
bcarray1 = bdot.carray(matrix, chunklen=2**9, cparams=bdot.cparams(clevel=2))
bcarray2 = bdot.carray(matrix, chunklen=2**9, cparams=bdot.cparams(clevel=2))

# calculates bcarray1 . bcarray2.T (transpose)
result = bcarray1.dot(bcarray2)
expected = matrix.dot(matrix.T)

# should return True
(expected == result).all()
```
Save really big results directly to disk:

```python
# create correctly sized container (helper method, not required)
output = bcarray1.empty_like_dot(bcarray2, rootdir='/path/to/bcolz/output')

# generate results directly on disk
bcarray1.dot(bcarray2, out=output)

# make sure the last bits get written
output.flush()
```
The `out` parameter can also be used to get `carray` output with an `ndarray` vector input. If you don't want disk-based storage, just leave out the `rootdir` parameter. You can also use your own `carray` container, as long as it's the correct shape.
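For instance, building on the matrix-matrix snippet above, you can keep the result in memory by dropping `rootdir`, or supply your own container for the `matrix . vector` case. This is a minimal sketch, not part of the examples above; in particular, the hand-built vector container assumes `bdot.carray` accepts a 1-D numpy array the way `bcolz.carray` does.

```python
# in-memory result container: same helper as above, just without rootdir
output = bcarray1.empty_like_dot(bcarray2)
bcarray1.dot(bcarray2, out=output)

# carray output for an ndarray vector input: a hand-built container
# with the correct shape (one entry per row of bcarray1).
# Assumes bdot.carray accepts a 1-D numpy array, like bcolz.carray does.
v = matrix[0]
vector_out = bdot.carray(np.zeros(len(matrix), dtype=matrix.dtype), chunklen=2**9)
bcarray1.dot(v, out=vector_out)
```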
Run the tests with:

```
nosetests bdot
```
Benchmarks were run on the data structures generated by the code above; they are very informal and vary a bit across data sets.

- Size: numpy ~229 MB, bdot ~64 MB (compression ratio: 3.5)
- Speed: numpy ~33 ms, bdot ~48 ms (68% of numpy's speed)
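The numbers above will of course differ across machines. A rough sketch of how such a comparison could be reproduced, reusing `matrix`, `bcarray`, and `v` from the `matrix . vector` example:

```python
import timeit

# reuse matrix, bcarray and v from the matrix . vector example above
numpy_ms = timeit.timeit(lambda: matrix.dot(v), number=10) / 10 * 1000
bdot_ms = timeit.timeit(lambda: bcarray.dot(v), number=10) / 10 * 1000

print("numpy: ~%.0f ms, bdot: ~%.0f ms (%.0f%% of numpy's speed)"
      % (numpy_ms, bdot_ms, 100 * numpy_ms / bdot_ms))
```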
This project has three goals, each slightly more fantastic than the last:

- Allow computation on (compressed) data which is (~5-10x) larger than RAM at approximately the same speed as `numpy.dot`
- Allow computation on (slightly compressed) data at speeds that improve on `numpy.dot`
- Allow computation on (compressed) data which resides on disk at some sizable percentage (~30-50%) of the speed of `numpy.dot`

So far, the first goal has been met.
This library wouldn't be possible without all the talented people who worked hard to create Bcolz (and the libraries on which it's based). Initial code was also heavily influenced by Bquery.
Awesome TARDIS can be found here