-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Customize Spliterator implementations for better parallelism #290
Conversation
@@ -373,6 +373,14 @@ public PackageRelationshipCollection getRelationships(String typeFilter) { | |||
return relationshipsByID.values().iterator(); | |||
} | |||
|
|||
/** | |||
* Get this collection's spliterator. | |||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add @since POI 5.2.0
on any new public methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
poi-ooxml/src/main/java/org/apache/poi/xssf/streaming/SXSSFWorkbook.java
Show resolved
Hide resolved
import java.util.Map; | ||
import java.util.NoSuchElementException; | ||
import java.util.Set; | ||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you keep the explicit imports and remove the * import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this was automatically formatted by my IDE, will remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* | ||
* @return a spliterator of the sheets. | ||
*/ | ||
default Spliterator<Sheet> sheetSpliterator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe, we only need spliterator() - I know we have 2 iterator methods but that is some legacy stuff - I don't think new stuff needs to add 2 separate methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will remove it, I agree its better to have 1 method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
d6ccaba
to
5640530
Compare
@daniel-shuy generally looks good - but could you add some test coverage for some of the new methods? We don't necessarily need full coverage but some regression tests would be a pre-req for a merge. |
5640530
to
6130a24
Compare
*/ | ||
@Override | ||
public Spliterator<Sheet> spliterator() { | ||
return new SheetSpliterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this better than Spliterators.spliterator(sheets, Spliterator.ORDERED)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its actually the same, because SheetSpliterator
is delegating to sheets.spliterator()
, which is actually calling Spliterators.spliterator(this, Spliterator.ORDERED)
.
IMO its still better to delegate to the backing Collection
's spliterator()
, so that if the backing Collection
changes, there is no risk of the Spliterator
's characteristics going out of sync.
I added SheetSpliterator
to be consistent with sheetIterator()
, which creates an instance of SheetIterator
to delegate sheets.iterator()
. But now that I think about it, its probably better to simply do a cast, reducing an object instantiation, eg.
@Override
@SuppressWarnings("unchecked")
public Spliterator<Sheet> spliterator() {
return (Spliterator<Sheet>)(Spliterator<? extends Sheet>) sheets.spliterator();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you'll need to cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately sheets.spliterator()
returns Spliterator<XSSFSheet>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you try omitting the cast and see if it compiles? I just added spliterators in another project and was suprised that I didn't need to cast but it worked without the cast - pjfanning/excel-streaming-reader@4ac5fb8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried, it doesn't work. Using Spliterators.spliterator(this, Spliterator.ORDERED)
avoids having to cast because Spliterators.spliterator(Collection, int) takes in Collection<? extends T>
and returns Spliterator<T>
. If only Iterable<T>#spliterator()
returned Spliterator<? extends T>
instead of Spliterator<T>
🤦♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough - use the cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Iterable#spliterator() default implementation has poor splitting capabilities, is unsized, and does not report any spliterator characteristics
6130a24
to
2077ac1
Compare
2077ac1
to
e367af3
Compare
Why
There are many classes in Apache POI that implements Iterable, but only implements Iterable#iterator() without overriding Iterable#spliterator().
The default implementation of Iterable#spliterator() is:
Spliterator#spliteratorUnknownSize(Iterator, int) returns a
Spliterator
with no initial size estimate, which has poor splitting capabilities. The default implementation has to do this because not allIterable
s have a fixed size.This is important when trying to convert the
Iterable
to a parallelStream
. For example, when trying to process Rows in a XSSFSheet (XSSFSheet
implementsIterable<Row>
) in parallel, the performance is terrible:The
XSSFSheet
'sIterator
is backed by a TreeMap#values():poi/poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFSheet.java
Line 95 in cf1354d
poi/poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFSheet.java
Lines 2049 to 2067 in cf1354d
TreeMap#values()
returns a Collection, and thus has a fixed size. ItsSpliterator
is hence sized and has good splitting capabilities, allowing it to be properly parallelized. Therefore an easy way to customizeXSSFSheet#spliterator()
is to simply delegate to the underlyingTreeMap#values()
, eg.How
Iterator
factory method with a backingCollection
/Iterable
, I simply delegated theSpliterator
factory method to the underlyingCollection
/Iterable
.Iterator
factory method but do not have a backingCollection
/Iterable
, but have a fixed size, I customized theSpliterator
factory method to return a sizedSpliterator
using Spliterators#spliterator(Iterator, long, int).Iterator
factory method but do not have a fixed size remain unchanged.I also took the liberty to perform the following refactors:
iterator()
implementation in child classes toWorkbook
/Sheet
/Row
interface usingdefault
methods (theiterator()
method is an alias ofsheetIterator()
/rowIterator()
/cellIterator()
, and is currently being duplicated in all child classes)Iterable
interface toIntMapper
andXDDFTextParagraph
(so that they can be iterated using an enhancedfor
loop)None of the changes break source/binary compatibility.
For anyone interested, the current workaround is create a sized
Spliterator
using Sheet#getPhysicalNumberOfRows(), eg.This however, is still less optimal than using the backing
Collection
'sSpliterator
.