-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Java Heap Out of Memory Error #11
Comments
Hi, and thanks for the nice feedback 🙂 I'm guessing that the file you're opening is an .xlsx file which means that 218MB is it's compressed size. The library has to decompress the entire file so that it can read it (this is so the formula evaluation works), which can lead to OOM errors. I'm looking at having a simplified version which lets the library stream the data out without the formula evaluation. One of the things you could try is to use a memory optimized sku, or a standard sku with more memory available. There's also the option of setting the JVM options to allow for a bigger heap size, |
Hi @soleyjh did this help out at all? |
I also faced this issue, trying to think of a workaround, can we disable the formula evaluation as part of the configuration and just read static data from Excel? |
It's definitely doable, I'll need to handle the cell evaluation slightly differently based on the config, but let me have a look and see what I can do with it |
Okay, the code is in there to handle this now in the 0.1.13 branch, so you can pass in an option like this .option("evaluateFormulae", "false") What this will do instead is extract the formula itself (e.g. "A7*2"), but it won't attempt to evaluate it. If you let me know which spark version you're targeting I can create a jar for you from that branch to test with. Or check out the branch and build it yourself if fancy :) |
I'm going to test it on my local by checking out the branch |
Alright, I have some insights, it helped a LOT but I still have a Java Heap OOM. The file is XLSX approx 300MB. I processed it with https://github.com/crealytics/spark-excel using the streaming reader config and it works:
I don't know how we can add this streaming as an option to this project. |
It is something I've looked at before, but you lose a lot of the features using the streaming reader. My current thought is to have a different parser which uses the stream reader and a heavy check on the options provided. Is processing locally your usage environment? |
That makes sense. I'm currently performing my testing by running Spark on K8S driver 8CPU 64GB RAM + executors same size. |
What's the JVM heap size set to? |
It's set up to 4GB |
Any success if you increase that size? |
I'm going to get the 0.1.13 release out and then start looking at the 0.1.14 with a view to creating a companion parser which uses the SXSSF (streaming) reader. It means that it won't have the ability to do formula evaluation, and merged cells over a certain size, but it should reduce the memory required by the POI library |
For info, there's a branch called |
Excellent work @dazfuller!! Will review and perform some testing on that branch |
Hi! And thanks so much for writing this great package!
When I run the following command:
df = spark.read.format("com.elastacloud.spark.excel").option("cellAddress", "A1").load(file_location)
I get the following error:
java.lang.OutOfMemoryError: Java heap space
The excel file is 218MB roughly 750K rows with ~50 or so fields (no long text strings).
and I'm running in Azure Databricks: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
Running: Standard_F4 with 8GB Memory and 4 Cores
I downloaded the following JAR and Installed on the Cluster (Above): spark_excel_3_1_2_0_1_7.jar
Any Help or Advice would be appreciated.
The text was updated successfully, but these errors were encountered: