Default EMR Spark Python environment contains two different versions of the dateutil package

I noticed that when you create a new EMR cluster with Spark, the default Python environment includes two different distributions that both provide the "dateutil" import:

py-dateutil==2.2
python-dateutil==2.8.1

The latter seems to be the correct, official dateutil distribution: https://pypi.org/project/python-dateutil/

The former has a single release and has not been updated since 2014: https://pypi.org/project/py-dateutil/

Is this intentional? If not, can py-dateutil be removed from the default Python environment? The overlap leads to confusing situations like the linked issue, where you think you have one version of dateutil installed but the import actually resolves to an older one: https://github.com/dagster-io/dagster/issues/22586

To reproduce, create a brand-new EMR cluster with Spark, SSH into it, and run "pip freeze".
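
To make the overlap easier to spot in a long "pip freeze" listing, here is a minimal sketch that filters the output for distributions whose name ends in "dateutil". The function name `dateutil_distributions` and the third sample line are made up for illustration; the first two sample lines are the ones reported above. Matching on the name is only a heuristic and does not prove which files each distribution installs.

```python
import re


def dateutil_distributions(freeze_output):
    """Return (name, version) pairs from `pip freeze` text whose name ends in 'dateutil'."""
    matches = []
    for line in freeze_output.splitlines():
        name, sep, version = line.strip().partition("==")
        if not sep:
            continue  # skip blank lines, editable installs, and URL requirements
        # Normalize the name: runs of '-', '_' and '.' are treated as equal, case-insensitively
        normalized = re.sub(r"[-_.]+", "-", name).lower()
        if normalized.endswith("dateutil"):
            matches.append((name, version))
    return matches


# Sample input: the last line is a hypothetical placeholder, not from the cluster
sample = "py-dateutil==2.2\npython-dateutil==2.8.1\nsome-other-package==1.0\n"
print(dateutil_distributions(sample))
# → [('py-dateutil', '2.2'), ('python-dateutil', '2.8.1')]
```

On the cluster itself, the text passed in would be the captured output of "pip freeze" rather than the sample string.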

dgibson
asked 2 months ago, 467 views
1 Answer

Hello,

py-dateutil is installed by EMR as part of provisioning. Under /usr/local/lib/python3.6/site-packages, you can see the file "py_dateutil-2.2.egg-info". This pre-installed package is given precedence at the time of the Puppet deployment. More details are provided here
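
As a rough way to verify this on a node, the sketch below lists the dateutil-related metadata entries in a site-packages directory and then reports which copy of dateutil the interpreter actually imports. The helper name `dateutil_metadata_entries` is hypothetical, and the demo runs against a temporary directory; only "py_dateutil-2.2.egg-info" comes from the path above, while the second metadata name is an assumed example of what python-dateutil might leave behind.

```python
import os
import tempfile


def dateutil_metadata_entries(site_packages):
    """List metadata entries in site_packages that belong to a dateutil distribution."""
    suffixes = (".egg-info", ".dist-info")
    return sorted(
        entry
        for entry in os.listdir(site_packages)
        if "dateutil" in entry.lower() and entry.endswith(suffixes)
    )


# Demo against a throwaway directory so the sketch runs anywhere.
# "python_dateutil-2.8.1.dist-info" is an assumed name, not taken from the cluster.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("py_dateutil-2.2.egg-info", "python_dateutil-2.8.1.dist-info", "dateutil"):
        open(os.path.join(demo_dir, name), "w").close()
    print(dateutil_metadata_entries(demo_dir))

# Report which copy the interpreter actually imports, if any.
try:
    import dateutil
    print(getattr(dateutil, "__version__", "unknown"), getattr(dateutil, "__file__", None))
except ImportError:
    print("dateutil is not importable in this environment")
```

On an EMR node, the directory argument would be the site-packages path mentioned above instead of the temporary directory.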

We do not recommend changing these package versions, as these combinations have not been tested, and we cannot predict what incompatibilities may arise with the services running in EMR. However, if you are willing to perform rigorous testing on your end to ensure everything is working correctly, you can proceed with caution.

AWS
SUPPORT ENGINEER
answered 2 months ago
  • I’m aware that both py-dateutil and python-dateutil are among the default packages that EMR adds during provisioning. I haven’t made any changes to the default packages.

    My question is whether it is intentional that the same import (“dateutil”) is provided by two different packages (py-dateutil and python-dateutil) at different versions: one at 2.2 and one at 2.8.1.

  • The docs for dateutil reference python-dateutil and do not mention py-dateutil: https://dateutil.readthedocs.io/en/stable/