
So the file format is a lot better than CSV files, but in principle it's basically just a bunch of files. Maybe a better analogy would have been a big folder of feather/hdf5/etc files.

(Incidentally, I'm a big fan of the folder/s3 bucket/etc full of CSV/binary files and use it whenever possible.)
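The "folder full of CSV files" pattern can be sketched with nothing but the standard library; a minimal version (the function name and the assumption that all files share one header are mine, not from the thread) might look like:

```python
import csv
import glob
import os

def read_csv_folder(path):
    """Yield one dict per row across every CSV file in a directory.

    Assumes all files share the same header row. This is the whole
    "database": the filesystem is the catalog, the files are the tables.
    """
    for name in sorted(glob.glob(os.path.join(path, "*.csv"))):
        with open(name, newline="") as f:
            for row in csv.DictReader(f):
                yield row
```

Swapping the directory for an S3 prefix (via boto3 or s3fs) keeps the same shape.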

I agree - it's absolutely better than that, but it's a lot closer to that model than to the Riak model of querying a distributed system that sends you the data.

I stand corrected on single box and pricing - it's been a while since I've used it.



kdb+/q/k are used for IoT applications [1], not just fintech. After all, it is all time series data.

The benchmarks given in a response above by srpeck [2] show spark/shark to be 230 times slower than a k4 query, and using 50GB of RAM vs. 0.2GB for k4. If RiakTS is relying on spark/shark as its in-memory database engine, it is already at a big disadvantage compared to k in terms of speed, and in all the RAM that is going to be required on those distributed servers.

I will have to look at the DDL/math functions available in RiakTS too, since that is how you get your work done regardless of speed of access.

[1] http://www.kdnuggets.com/2016/04/kxcon2016-kdb-conference-ma...

[2] http://kparc.com/q4/readme.txt


Very cool, I stand corrected. I hope one day I have another opportunity to play with KDB.

As for the speed advantage, you'll get a similar speed advantage with python/pandas and a big folder of CSV files. For all of Spark's claims about "speed", it's really just reducing the speed penalty of Hadoop from 500x to 50x. (Here 500x and 50x are relative to the performance of loading flat files from a local disk.)


Do you really mean flat CSV text files? I get the simplicity of that, but it seems really expensive (in speed and size). I'm used to tables with more than a dozen columns, and with kdb+ you can pull in only the columns of interest, and only the rows of interest (thanks to on-disk sorting and grouping), which is a smaller subset, often much smaller.
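The cost difference is easy to see in code: with a row-oriented text CSV you still parse every field of every row even if you only want two columns, whereas a columnar store like kdb+ reads just those columns' bytes off disk. A toy illustration (function and column names are hypothetical):

```python
import csv
import io

def select_columns(csv_text, wanted):
    """Keep only the named columns from a CSV.

    Note that DictReader still splits every field of every row before
    we throw the rest away; that parse cost is what a columnar,
    on-disk-sorted layout avoids entirely.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{k: row[k] for k in wanted} for row in reader]
```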


By number, my data sets are usually in CSV. I could probably get some additional advantage via HDF5, but a gzipped CSV is usually good enough and simpler. By volume (i.e. my 2 or 3 biggest data sets), it will probably be mostly HDF5. I haven't tried feather yet, but it looks pretty nice.
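The gzipped-CSV workflow is a two-function affair in the standard library; a sketch (helper names are mine) of the round trip:

```python
import csv
import gzip

def write_csv_gz(path, header, rows):
    """Write rows to a gzip-compressed CSV; repetitive text compresses well."""
    with gzip.open(path, "wt", newline="") as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)

def read_csv_gz(path):
    """Read a gzip-compressed CSV back as a list of dicts."""
    with gzip.open(path, "rt", newline="") as f:
        return list(csv.DictReader(f))
```

pandas can also read `.csv.gz` directly (`pd.read_csv(path, compression="gzip")`), so the files stay usable without any custom tooling.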

KDB would probably be better, but don't underestimate what you can do with just a bunch of files.


RiakTS does not rely on any external data storage (other than our fork of Google's leveldb) or processing tool, so Spark's performance is irrelevant.



