My last post is a few years old, and not because I lack good ideas to write about, but because of the poor time management I'm trying to improve. It's January, time to make some New Year's resolutions :), so I will try to post more often.
Recently, on one project, I had to load several CSV files from a remote Linux host into a DB2 database residing on another Linux host. Sounds easy, but the files were quite big (several GiB each), so I tried to do it in the most efficient way. Here is my approach:
- Compress the file on the source system; text files usually have low entropy, so the compression ratio is very good.
- Create a named pipe on the target host
- Send the data over the network using ssh, still in compressed form
- Uncompress the data on the fly and redirect the uncompressed output into the named pipe
- Use the named pipe as an input file for the DB2 loading tool
I believe this is the most efficient way to load data from a remote host, and it has a few other advantages. You don't need space to uncompress the file on the target host; in fact, you don't need any space for the compressed file on the target at all. The approach also saves a lot of network traffic, because the data travels over the network in compressed form. It was very useful in my scenario: I was loading into columnar databases (DB2 BLU and Sybase IQ), which usually compress the data very well themselves, and storage on the target was quite restricted.
Here is an example:
source$ gzip file.csv
target$ mkfifo -m 0666 /load/file.csv
source$ cat file.csv.gz | ssh target 'gunzip -c - > /load/file.csv'
target$ db2 "load FROM /load/file.csv ... "
Note that the last two commands have to run concurrently, in separate sessions: a named pipe blocks each side until both a writer and a reader have opened it.
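The named-pipe mechanics can be demonstrated locally, without ssh or DB2. In this sketch a plain `cat` into a file stands in for the load utility, and all file names are temporary ones created just for the demonstration:

```shell
#!/bin/sh
# Local demonstration of the technique: a gzipped file is decompressed
# on the fly into a named pipe while a concurrent reader (standing in
# for the DB2 load utility) consumes it.
set -e
dir=$(mktemp -d)

# "Source host": create a small CSV and compress it.
printf 'id,name\n1,alice\n2,bob\n' > "$dir/file.csv"
gzip "$dir/file.csv"                        # leaves only file.csv.gz

# "Target host": create the named pipe.
mkfifo -m 0666 "$dir/pipe.csv"

# The reader must already be waiting on the pipe, otherwise the writer
# blocks; here it runs in the background instead of a second session.
cat "$dir/pipe.csv" > "$dir/loaded.csv" &

# Uncompress on the fly into the pipe (the ssh hop is omitted locally).
gunzip -c "$dir/file.csv.gz" > "$dir/pipe.csv"
wait

loaded=$(cat "$dir/loaded.csv")
echo "$loaded"
rm -rf "$dir"
```

The data passes through the pipe without ever existing uncompressed on disk, which is exactly what makes the approach attractive for multi-GiB loads.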