Thursday, 13 August 2015

HDFS - loading a massive number of files

For testing purposes I'm trying to load a massive number of small files into HDFS: 1 million (1,000,000) files, each between 1 KB and 100 KB in size. I generated the files with an R script on a Linux system, all in one folder. Every file contains a header with product information followed by a varying number of columns of numeric data.

The problem arises when I try to upload these local files into HDFS with the command:

hdfs dfs -copyFromLocal /home/user/Documents/smallData /

Then I get one of the following Java heap errors:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing the Java heap size even more? Perhaps a better way to load this much data into HDFS?
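
One direction that might avoid raising the heap is to upload the folder in batches, so that no single client process has to handle all 1,000,000 files at once. The following is only a minimal sketch under assumptions, not a verified fix: the function name upload_in_batches, the UPLOAD_CMD override, the batch size of 1000 and the target directory /smallData are all hypothetical, and it assumes the files sit directly in one local folder and that the target directory already exists in HDFS.

```shell
#!/usr/bin/env bash
# Sketch: upload the files of one local folder to HDFS in fixed-size batches.
# Usage: upload_in_batches SRC_DIR DEST_DIR [BATCH_SIZE]
# UPLOAD_CMD is a hypothetical override (handy for dry runs);
# by default each batch is sent with `hdfs dfs -put`.
upload_in_batches() {
  local src_dir=$1
  local dest_dir=$2
  local batch_size=${3:-1000}
  local upload_cmd="${UPLOAD_CMD:-hdfs dfs -put}"

  # -print0 / -0 keep file names with spaces intact.
  # xargs starts one upload command per batch of at most BATCH_SIZE files.
  find "$src_dir" -maxdepth 1 -type f -print0 |
    UPLOAD_CMD="$upload_cmd" xargs -0 -n "$batch_size" sh -c '
      dest=$1
      shift
      # Skip the call entirely if the batch is empty.
      [ "$#" -gt 0 ] || exit 0
      # Left unquoted on purpose so "hdfs dfs -put" splits into words.
      $UPLOAD_CMD "$@" "$dest"
    ' _ "$dest_dir"
}
```

It could be called as `hdfs dfs -mkdir -p /smallData` followed by `upload_in_batches /home/user/Documents/smallData /smallData 1000`. Running it first with `UPLOAD_CMD=echo` only prints the commands that would be executed. Each batch starts a new client JVM, so a larger batch size means fewer startups but more memory per call.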

I'm very thankful for every helpful comment!



via Chebli Mohamed
