I know the recommended strategy is to use EMR or EMR Serverless. However, I have a particular use case where I only need to run a fairly small PySpark job and need quick results. I've already gotten my job working on EMR Serverless. I want to configure a Lambda to perform the same functionality, given the needs above. I've used the Spark on AWS Lambda example code as a guide. I'm trying to make my Lambda compatible with EMR/EMR Serverless 7.1.0, using the same library versions as specified there.
I've been able to create a container image Lambda and have gotten most of the runtime issues out of the way, but I'm now hitting a "too many open files" error:
2024-06-07T17:47:36.185Z ERROR StatusLogger Error creating converter for d
2024-06-07T17:47:36.185Z java.lang.reflect.InvocationTargetException
2024-06-07T17:47:36.185Z at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2024-06-07T17:47:36.185Z at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
2024-06-07T17:47:36.185Z at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2024-06-07T17:47:36.185Z at java.base/java.lang.reflect.Method.invoke(Method.java:568)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.createConverter(PatternParser.java:590)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.finalizeConverter(PatternParser.java:657)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:420)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.pattern.PatternParser.parse(PatternParser.java:177)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout$SerializerBuilder.build(PatternLayout.java:473)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:139)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:60)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.layout.PatternLayout$Builder.build(PatternLayout.java:766)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.config.AbstractConfiguration.setToDefault(AbstractConfiguration.java:745)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:47)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:84)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.createContext(ClassLoaderContextSelector.java:254)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:218)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:140)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:123)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:230)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:47)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.LogManager.getContext(LogManager.java:176)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:666)
2024-06-07T17:47:36.185Z at org.apache.logging.log4j.LogManager.getRootLogger(LogManager.java:700)
2024-06-07T17:47:36.185Z at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129)
2024-06-07T17:47:36.185Z at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:114)
2024-06-07T17:47:36.185Z at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:108)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit.initializeLogIfNecessary(SparkSubmit.scala:76)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:84)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
2024-06-07T17:47:36.185Z at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2024-06-07T17:47:36.185Z Caused by: java.lang.Error: java.io.FileNotFoundException: /usr/lib/jvm/java-17-amazon-corretto.x86_64/lib/tzdb.dat (Too many open files)
I think the problem is due to my including both AWS SDK for Java versions noted in the documentation above (v2 2.23.18 and v1 1.12.656), which results in over a thousand jar files in my PySpark jars directory. I would like to cull this list, but determining which jar files are actually needed would be a tedious process. Unfortunately, the EMR/EMR Serverless code is not public, so I don't know exactly which libraries are required. Does anyone know which libraries are needed, or whether I can limit the set of jar files I include? I cannot increase the number of open file descriptors because Lambda caps it at 1024.
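To see where the jar count is coming from before trying to cull anything, I've been inventorying the jars directory and grouping them by artifact prefix, which makes the heavy dependency families (like the per-service aws-java-sdk-* jars from SDK v1) obvious. A rough sketch, assuming Spark is installed under $SPARK_HOME in the image (adjust the path to wherever your Dockerfile puts it):

```python
import os
from collections import Counter

def summarize_jars(jar_dir):
    """Count the jars in jar_dir and group them by artifact prefix so the
    heaviest dependency families (e.g. aws-java-sdk-*) stand out."""
    jars = [f for f in os.listdir(jar_dir) if f.endswith(".jar")]
    # Strip the trailing -<version>.jar segment to get the artifact name.
    families = Counter(j.rsplit("-", 1)[0] for j in jars)
    # AWS SDK for Java v1 artifacts are conventionally named aws-java-sdk-*.
    v1_jars = [j for j in jars if j.startswith("aws-java-sdk")]
    return len(jars), families, len(v1_jars)

if __name__ == "__main__":
    # Hypothetical install location; point this at your image's jars dir.
    jar_dir = os.path.join(os.environ.get("SPARK_HOME", "/opt/spark"), "jars")
    if os.path.isdir(jar_dir):
        total, families, v1_count = summarize_jars(jar_dir)
        print(f"total jars: {total}, AWS SDK v1 jars: {v1_count}")
        for name, count in families.most_common(10):
            print(f"{count:4d}  {name}")
```

If the v1 per-service jars dominate the count, deleting the ones for services the job never touches (in a Dockerfile `RUN rm` step) is one way to shrink the list without guessing at Spark's own dependencies.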
Or is there another issue I should be aware of?
Hi Yokesh, I definitely understand that I cannot achieve full EMR Spark functionality on Lambda. My original aim was just to get PySpark running on Lambda. The problem is that I'm getting the aforementioned "too many open files" error, which could be due to my adding the Delta Lake libraries, etc. Is there any way of overcoming that?
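One thing I've found useful while chasing this is logging the descriptor limit and the current open-descriptor count from inside the handler, to confirm the process really is bumping against Lambda's 1024 cap rather than something else. A small Linux-only diagnostic sketch (Lambda runs on Amazon Linux, so /proc/self/fd is available):

```python
import os
import resource

def fd_report():
    """Return (soft limit, hard limit, current open fd count) for this
    process. Counting entries in /proc/self/fd is Linux-specific."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return soft, hard, open_fds

if __name__ == "__main__":
    soft, hard, open_fds = fd_report()
    print(f"RLIMIT_NOFILE soft={soft} hard={hard}, currently open={open_fds}")
```

Calling this just before and just after launching spark-submit shows how close the JVM startup (which maps every jar on the classpath) pushes the process toward the limit.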