Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Customize Spliterator implementations for better parallelism #290

Closed
wants to merge 6 commits into from

Conversation

daniel-shuy
Copy link

@daniel-shuy daniel-shuy commented Dec 22, 2021

Why

There are many classes in Apache POI that implements Iterable, but only implements Iterable#iterator() without overriding Iterable#spliterator().

The default implementation of Iterable#spliterator() is:

default Spliterator<T> spliterator() {
    return Spliterators.spliteratorUnknownSize(iterator(), 0);
}

Spliterator#spliteratorUnknownSize(Iterator, int) returns a Spliterator with no initial size estimate, which has poor splitting capabilities. The default implementation has to do this because not all Iterables have a fixed size.

This is important when trying to convert the Iterable to a parallel Stream. For example, when trying to process Rows in a XSSFSheet (XSSFSheet implements Iterable<Row>) in parallel, the performance is terrible:

StreamSupport.stream(sheet.spliterator(), true)
    .forEach(row -> {
        // ...
    });

The XSSFSheet's Iterator is backed by a TreeMap#values():

private final SortedMap<Integer, XSSFRow> _rows = new TreeMap<>();

/**
* @return an iterator of the PHYSICAL rows. Meaning the 3rd element may not
* be the third row if say for instance the second row is undefined.
* Call getRowNum() on each row if you care which one it is.
*/
@Override
@SuppressWarnings("unchecked")
public Iterator<Row> rowIterator() {
return (Iterator<Row>)(Iterator<? extends Row>) _rows.values().iterator();
}
/**
* Alias for {@link #rowIterator()} to
* allow foreach loops
*/
@Override
public Iterator<Row> iterator() {
return rowIterator();
}

TreeMap#values() returns a Collection, and thus has a fixed size. Its Spliterator is hence sized and has good splitting capabilities, allowing it to be properly parallelized. Therefore an easy way to customize XSSFSheet#spliterator() is to simply delegate to the underlying TreeMap#values(), eg.

@Override
@SuppressWarnings("unchecked")
public Spliterator<Row> spliterator() {
    return (Spliterator<Row>)(Spliterator<? extends Row>) _rows.values().spliterator();
}

How

  • For classes that expose an Iterator factory method with a backing Collection/Iterable, I simply delegated the Spliterator factory method to the underlying Collection/Iterable.
  • For classes that expose an Iterator factory method but do not have a backing Collection/Iterable, but have a fixed size, I customized the Spliterator factory method to return a sized Spliterator using Spliterators#spliterator(Iterator, long, int).
  • Classes that expose an Iterator factory method but do not have a fixed size remain unchanged.

I also took the liberty to perform the following refactors:

  • Move iterator() implementation in child classes to Workbook/Sheet/Row interface using default methods (the iterator() method is an alias of sheetIterator()/rowIterator()/cellIterator(), and is currently being duplicated in all child classes)
  • Add Iterable interface to IntMapper and XDDFTextParagraph (so that they can be iterated using an enhanced for loop)

None of the changes break source/binary compatibility.


For anyone interested, the current workaround is create a sized Spliterator using Sheet#getPhysicalNumberOfRows(), eg.

Spliterator<Row> spliterator = Spliterators.spliterator(sheet.iterator(), sheet.getPhysicalNumberOfRows(), 0);
StreamSupport.stream(spliterator, true)
    .forEach(row -> {
        // ...
    });

This however, is still less optimal than using the backing Collection's Spliterator.

@@ -373,6 +373,14 @@ public PackageRelationshipCollection getRelationships(String typeFilter) {
return relationshipsByID.values().iterator();
}

/**
* Get this collection's spliterator.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add @since POI 5.2.0 on any new public methods?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you keep the explicit imports and remove the * import?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this was automatically formatted by my IDE, will remove it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*
* @return a spliterator of the sheets.
*/
default Spliterator<Sheet> sheetSpliterator() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe, we only need spliterator() - I know we have 2 iterator methods but that is some legacy stuff - I don't think new stuff needs to add 2 separate methods

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will remove it, I agree its better to have 1 method

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@pjfanning
Copy link
Contributor

@daniel-shuy generally looks good - but could you add some test coverage for some of the new methods? We don't necessarily need full coverage but some regression tests would be a pre-req for a merge.

*/
@Override
public Spliterator<Sheet> spliterator() {
return new SheetSpliterator();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this better than Spliterators.spliterator(sheets, Spliterator.ORDERED)?

Copy link
Author

@daniel-shuy daniel-shuy Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its actually the same, because SheetSpliterator is delegating to sheets.spliterator(), which is actually calling Spliterators.spliterator(this, Spliterator.ORDERED).

IMO its still better to delegate to the backing Collection's spliterator(), so that if the backing Collection changes, there is no risk of the Spliterator's characteristics going out of sync.

I added SheetSpliterator to be consistent with sheetIterator(), which creates an instance of SheetIterator to delegate sheets.iterator(). But now that I think about it, its probably better to simply do a cast, reducing an object instantiation, eg.

@Override
@SuppressWarnings("unchecked")
public Spliterator<Sheet> spliterator() {
    return (Spliterator<Sheet>)(Spliterator<? extends Sheet>) sheets.spliterator();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you'll need to cast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately sheets.spliterator() returns Spliterator<XSSFSheet>

Copy link
Contributor

@pjfanning pjfanning Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try omitting the cast and see if it compiles? I just added spliterators in another project and was suprised that I didn't need to cast but it worked without the cast - pjfanning/excel-streaming-reader@4ac5fb8

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, it doesn't work. Using Spliterators.spliterator(this, Spliterator.ORDERED) avoids having to cast because Spliterators.spliterator(Collection, int) takes in Collection<? extends T> and returns Spliterator<T>. If only Iterable<T>#spliterator() returned Spliterator<? extends T> instead of Spliterator<T> 🤦‍♂️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough - use the cast

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Iterable#spliterator() default implementation has poor splitting capabilities, is unsized, and does not report any spliterator characteristics
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants