Customize Spliterator implementations for better parallelism #290

daniel-shuy · 2021-12-22T12:13:52Z

Why

Many classes in Apache POI implement Iterable, but only implement Iterable#iterator() without overriding Iterable#spliterator().

The default implementation of Iterable#spliterator() is:

default Spliterator<T> spliterator() {
    return Spliterators.spliteratorUnknownSize(iterator(), 0);
}

Spliterators#spliteratorUnknownSize(Iterator, int) returns a Spliterator with no initial size estimate, which has poor splitting capabilities. The default implementation has to do this because not all Iterables have a fixed size.
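
To make the difference concrete, here is a minimal sketch that uses only the JDK. IteratorOnly is a hypothetical stand-in for a class that implements iterator() but not spliterator(); it is not a POI class. It compares what the inherited spliterator reports with what a Collection's spliterator reports.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;

public class DefaultSpliteratorDemo {

    /** Hypothetical Iterable that implements only iterator(). */
    static final class IteratorOnly<T> implements Iterable<T> {
        private final List<T> items;

        IteratorOnly(List<T> items) {
            this.items = items;
        }

        @Override
        public Iterator<T> iterator() {
            return items.iterator();
        }
        // spliterator() is inherited from Iterable and wraps iterator() with unknown size.
    }

    /** Whether the inherited Iterable#spliterator() reports SIZED. */
    public static boolean defaultIsSized(List<Integer> data) {
        return new IteratorOnly<>(data).spliterator().hasCharacteristics(Spliterator.SIZED);
    }

    /** Size estimate of the inherited Iterable#spliterator(). */
    public static long defaultEstimate(List<Integer> data) {
        return new IteratorOnly<>(data).spliterator().estimateSize();
    }

    /** Whether the Collection's own spliterator reports SIZED. */
    public static boolean collectionIsSized(List<Integer> data) {
        return data.spliterator().hasCharacteristics(Spliterator.SIZED);
    }

    /** Size estimate of the Collection's own spliterator. */
    public static long collectionEstimate(List<Integer> data) {
        return data.spliterator().estimateSize();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4);
        System.out.println("default sized:       " + defaultIsSized(data));
        System.out.println("default estimate:    " + defaultEstimate(data));
        System.out.println("collection sized:    " + collectionIsSized(data));
        System.out.println("collection estimate: " + collectionEstimate(data));
    }
}
```

The inherited spliterator is not SIZED and estimates Long.MAX_VALUE elements, so the stream framework cannot divide the work evenly up front. The Collection's spliterator knows its exact size.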

This matters when converting the Iterable to a parallel Stream. For example, when processing Rows in an XSSFSheet (XSSFSheet implements Iterable<Row>) in parallel, the performance is terrible:

StreamSupport.stream(sheet.spliterator(), true)
    .forEach(row -> {
        // ...
    });

The XSSFSheet's Iterator is backed by a TreeMap#values():

poi/poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFSheet.java

Line 95 in cf1354d

private final SortedMap<Integer, XSSFRow> _rows = new TreeMap<>();

poi/poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFSheet.java

Lines 2049 to 2067 in cf1354d

    
/**
 * @return an iterator of the PHYSICAL rows.  Meaning the 3rd element may not
 * be the third row if say for instance the second row is undefined.
 * Call getRowNum() on each row if you care which one it is.
 */
@Override
@SuppressWarnings("unchecked")
public Iterator<Row> rowIterator() {
    return (Iterator<Row>)(Iterator<? extends Row>) _rows.values().iterator();
}

/**
 * Alias for {@link #rowIterator()} to
 *  allow foreach loops
 */
@Override
public Iterator<Row> iterator() {
    return rowIterator();
}

TreeMap#values() returns a Collection, and thus has a fixed size. Its Spliterator is hence sized and has good splitting capabilities, allowing it to be properly parallelized. Therefore an easy way to customize XSSFSheet#spliterator() is to simply delegate to the underlying TreeMap#values(), e.g.

@Override
@SuppressWarnings("unchecked")
public Spliterator<Row> spliterator() {
    return (Spliterator<Row>)(Spliterator<? extends Row>) _rows.values().spliterator();
}
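
The same delegation pattern can be tried outside POI. The sketch below is self-contained and uses hypothetical Row, SimpleRow and SimpleSheet types (stand-ins, not the POI classes) backed by a TreeMap, to show that the delegated spliterator is sized and can drive a parallel stream.

```java
import java.util.Iterator;
import java.util.SortedMap;
import java.util.Spliterator;
import java.util.TreeMap;
import java.util.stream.StreamSupport;

public class DelegatingSheetDemo {

    /** Hypothetical stand-in for a row interface. */
    interface Row {
        int getRowNum();
    }

    /** Hypothetical concrete row. */
    static final class SimpleRow implements Row {
        private final int rowNum;

        SimpleRow(int rowNum) {
            this.rowNum = rowNum;
        }

        @Override
        public int getRowNum() {
            return rowNum;
        }
    }

    /** Hypothetical sheet whose rows live in a TreeMap keyed by row number. */
    static final class SimpleSheet implements Iterable<Row> {
        private final SortedMap<Integer, SimpleRow> rows = new TreeMap<>();

        void createRow(int rowNum) {
            rows.put(rowNum, new SimpleRow(rowNum));
        }

        @Override
        @SuppressWarnings("unchecked")
        public Iterator<Row> iterator() {
            return (Iterator<Row>) (Iterator<? extends Row>) rows.values().iterator();
        }

        // Delegate to the backing collection so size and splitting behaviour carry over.
        @Override
        @SuppressWarnings("unchecked")
        public Spliterator<Row> spliterator() {
            return (Spliterator<Row>) (Spliterator<? extends Row>) rows.values().spliterator();
        }
    }

    private static SimpleSheet sheetWithRows(int count) {
        SimpleSheet sheet = new SimpleSheet();
        for (int i = 0; i < count; i++) {
            sheet.createRow(i);
        }
        return sheet;
    }

    /** Whether the delegated spliterator reports SIZED. */
    public static boolean isSized(int rowCount) {
        return sheetWithRows(rowCount).spliterator().hasCharacteristics(Spliterator.SIZED);
    }

    /** Size estimate reported by the delegated spliterator. */
    public static long estimate(int rowCount) {
        return sheetWithRows(rowCount).spliterator().estimateSize();
    }

    /** Sums the row numbers using a parallel stream over the delegated spliterator. */
    public static long parallelRowNumSum(int rowCount) {
        return StreamSupport.stream(sheetWithRows(rowCount).spliterator(), true)
                .mapToLong(Row::getRowNum)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println("sized:    " + isSized(100));
        System.out.println("estimate: " + estimate(100));
        System.out.println("sum:      " + parallelRowNumSum(100));
    }
}
```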

How

For classes that expose an Iterator factory method with a backing Collection/Iterable, I simply delegated the Spliterator factory method to the underlying Collection/Iterable.
For classes that expose an Iterator factory method but do not have a backing Collection/Iterable, but have a fixed size, I customized the Spliterator factory method to return a sized Spliterator using Spliterators#spliterator(Iterator, long, int).
Classes that expose an Iterator factory method but do not have a fixed size remain unchanged.
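
As an illustration of the second case (fixed size, no backing Collection), here is a minimal sketch. FixedSizeSource is a hypothetical class with indexed access only, not taken from POI. It wraps its Iterator with Spliterators#spliterator(Iterator, long, int), passing the known size.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;

public class SizedSpliteratorDemo {

    /** Hypothetical fixed-size source exposing only size() and get(int). */
    static final class FixedSizeSource implements Iterable<String> {
        private final String[] values;

        FixedSizeSource(String... values) {
            this.values = values.clone();
        }

        int size() {
            return values.length;
        }

        String get(int index) {
            return values[index];
        }

        @Override
        public Iterator<String> iterator() {
            return new Iterator<String>() {
                private int next = 0;

                @Override
                public boolean hasNext() {
                    return next < size();
                }

                @Override
                public String next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    return get(next++);
                }
            };
        }

        // Passing the known size makes the returned spliterator SIZED and SUBSIZED.
        @Override
        public Spliterator<String> spliterator() {
            return Spliterators.spliterator(iterator(), size(), Spliterator.ORDERED);
        }
    }

    /** Size estimate of the sized spliterator for a three-element source. */
    public static long estimate() {
        return new FixedSizeSource("a", "b", "c").spliterator().estimateSize();
    }

    /** Whether the sized spliterator reports both SIZED and SUBSIZED. */
    public static boolean isSizedAndSubsized() {
        return new FixedSizeSource("a", "b", "c").spliterator()
                .hasCharacteristics(Spliterator.SIZED | Spliterator.SUBSIZED);
    }

    public static void main(String[] args) {
        System.out.println("estimate: " + estimate());
        System.out.println("sized and subsized: " + isSizedAndSubsized());
    }
}
```

The spliterator still pulls elements from the Iterator, so splitting copies elements into batches. Knowing the size only lets the framework plan those batches better.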

I also took the liberty to perform the following refactors:

Move iterator() implementation in child classes to Workbook/Sheet/Row interface using default methods (the iterator() method is an alias of sheetIterator()/rowIterator()/cellIterator(), and is currently being duplicated in all child classes)
Add Iterable interface to IntMapper and XDDFTextParagraph (so that they can be iterated using an enhanced for loop)

None of the changes break source/binary compatibility.

For anyone interested, the current workaround is to create a sized Spliterator using Sheet#getPhysicalNumberOfRows(), e.g.

Spliterator<Row> spliterator = Spliterators.spliterator(sheet.iterator(), sheet.getPhysicalNumberOfRows(), 0);
StreamSupport.stream(spliterator, true)
    .forEach(row -> {
        // ...
    });

This, however, is still less optimal than using the backing Collection's Spliterator.

pjfanning · 2021-12-22T12:53:23Z

poi-ooxml/src/main/java/org/apache/poi/openxml4j/opc/PackageRelationshipCollection.java

@@ -373,6 +373,14 @@ public PackageRelationshipCollection getRelationships(String typeFilter) {
        return relationshipsByID.values().iterator();
    }

+    /**
+     * Get this collection's spliterator.
+     */


could you add @since POI 5.2.0 on any new public methods?

poi-ooxml/src/main/java/org/apache/poi/xssf/streaming/SXSSFRow.java

poi-ooxml/src/main/java/org/apache/poi/xssf/streaming/SXSSFSheet.java

poi-ooxml/src/main/java/org/apache/poi/xssf/streaming/SXSSFWorkbook.java

poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFRow.java

pjfanning · 2021-12-22T14:20:04Z

poi/src/main/java/org/apache/poi/hssf/usermodel/HSSFWorkbook.java

-import java.util.Map;
-import java.util.NoSuchElementException;
-import java.util.Set;
+import java.util.*;


can you keep the explicit imports and remove the * import?

Sorry, this was automatically formatted by my IDE, will remove it

pjfanning · 2021-12-22T14:27:02Z

poi/src/main/java/org/apache/poi/ss/usermodel/Workbook.java

+     *
+     * @return a spliterator of the sheets.
+     */
+    default Spliterator<Sheet> sheetSpliterator() {


maybe, we only need spliterator() - I know we have 2 iterator methods but that is some legacy stuff - I don't think new stuff needs to add 2 separate methods

ok, will remove it, I agree it's better to have 1 method

pjfanning · 2021-12-22T15:19:43Z

@daniel-shuy generally looks good - but could you add some test coverage for some of the new methods? We don't necessarily need full coverage but some regression tests would be a pre-req for a merge.

pjfanning · 2021-12-22T17:15:52Z

poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFWorkbook.java

+     */
+    @Override
+    public Spliterator<Sheet> spliterator() {
+        return new SheetSpliterator();


Is this better than Spliterators.spliterator(sheets, Spliterator.ORDERED)?

It's actually the same, because SheetSpliterator is delegating to sheets.spliterator(), which is actually calling Spliterators.spliterator(this, Spliterator.ORDERED).

IMO it's still better to delegate to the backing Collection's spliterator(), so that if the backing Collection changes, there is no risk of the Spliterator's characteristics going out of sync.

I added SheetSpliterator to be consistent with sheetIterator(), which creates an instance of SheetIterator to delegate to sheets.iterator(). But now that I think about it, it's probably better to simply do a cast, saving an object instantiation, e.g.

@Override
@SuppressWarnings("unchecked")
public Spliterator<Sheet> spliterator() {
    return (Spliterator<Sheet>)(Spliterator<? extends Sheet>) sheets.spliterator();
}

I don't think you'll need to cast.

Unfortunately sheets.spliterator() returns Spliterator<XSSFSheet>

can you try omitting the cast and see if it compiles? I just added spliterators in another project and was surprised that I didn't need to cast but it worked without the cast - pjfanning/excel-streaming-reader@4ac5fb8

I tried, it doesn't work. Using Spliterators.spliterator(this, Spliterator.ORDERED) avoids having to cast because Spliterators.spliterator(Collection, int) takes in Collection<? extends T> and returns Spliterator<T>. If only Iterable<T>#spliterator() returned Spliterator<? extends T> instead of Spliterator<T> 🤦‍♂️

fair enough - use the cast
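
For readers following the generics point in this thread, here is a minimal self-contained sketch of the two variants. Sheet and ConcreteSheet are hypothetical stand-ins for the POI types. Spliterators.spliterator(Collection<? extends T>, int) lets the compiler infer T as the interface type, whereas calling spliterator() on a List<ConcreteSheet> needs the double cast.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;

public class SpliteratorVarianceDemo {

    /** Hypothetical stand-in for the interface type. */
    interface Sheet {
        String getName();
    }

    /** Hypothetical stand-in for the concrete type held in the backing list. */
    static final class ConcreteSheet implements Sheet {
        private final String name;

        ConcreteSheet(String name) {
            this.name = name;
        }

        @Override
        public String getName() {
            return name;
        }
    }

    // Would not compile: Spliterator<ConcreteSheet> is not a Spliterator<Sheet>.
    // static Spliterator<Sheet> direct(List<ConcreteSheet> sheets) {
    //     return sheets.spliterator();
    // }

    /** No cast needed: the parameter is Collection<? extends T>, so T is inferred as Sheet. */
    static Spliterator<Sheet> viaFactory(List<ConcreteSheet> sheets) {
        return Spliterators.spliterator(sheets, Spliterator.ORDERED);
    }

    /** Delegates to the list's own spliterator, at the cost of an unchecked cast. */
    @SuppressWarnings("unchecked")
    static Spliterator<Sheet> viaCast(List<ConcreteSheet> sheets) {
        return (Spliterator<Sheet>) (Spliterator<? extends Sheet>) sheets.spliterator();
    }

    private static List<ConcreteSheet> sheets(int count) {
        List<ConcreteSheet> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add(new ConcreteSheet("Sheet" + i));
        }
        return result;
    }

    /** Size estimate of the factory-based spliterator. */
    public static long factoryEstimate(int count) {
        return viaFactory(sheets(count)).estimateSize();
    }

    /** Size estimate of the cast-based spliterator. */
    public static long castEstimate(int count) {
        return viaCast(sheets(count)).estimateSize();
    }

    public static void main(String[] args) {
        System.out.println("factory estimate: " + factoryEstimate(3));
        System.out.println("cast estimate:    " + castEstimate(3));
    }
}
```

Both variants report the same size. The cast variant keeps whatever characteristics the backing list's own spliterator provides, which is the argument made above for preferring delegation.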

Iterable#spliterator() default implementation has poor splitting capabilities, is unsized, and does not report any spliterator characteristics

daniel-shuy added 4 commits December 22, 2021 19:13

Move iterator() implementation to Workbook interface

9f460df

Move iterator() implementation to Sheet interface

1ad6873

Move iterator() implementation to Row interface

9673d85

Add Iterable interface to IntMapper and XDDFTextParagraph

a690fe8