Author Archives: Toke Eskildsen

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.

Beware the cursorMark, my son!

Efficient export of stored text content from multi-shard Solr setups using cursorMark for individual shards and merging the results externally. Continue reading

Posted in eskildsen, Hacking, open source, Performance, Solr | Leave a comment
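
The excerpt above describes paging each shard separately with cursorMark and merging the results outside of Solr. The full post is not reproduced on this page, so what follows is only a minimal sketch of that general idea and not the author's implementation. It assumes each shard is sorted on the same key (here the hypothetical field `id`) and that a caller-supplied `fetch_page` function (a made-up name) wraps the actual HTTP request, for example one sent to a single shard with `distrib=false`. The stop condition follows Solr's documented behaviour: paging ends when the returned `nextCursorMark` equals the `cursorMark` that was sent.

```python
import heapq
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical signature: fetch_page(cursor_mark) -> (docs, next_cursor_mark).
# A real version would issue a Solr request with cursorMark=<cursor_mark>
# and a sort clause that includes the uniqueKey field.
FetchPage = Callable[[str], Tuple[List[Dict], str]]


def iterate_shard(fetch_page: FetchPage) -> Iterator[Dict]:
    """Yield all documents from one shard by following cursorMark pages."""
    cursor = "*"  # Solr's initial cursorMark value
    while True:
        docs, next_cursor = fetch_page(cursor)
        yield from docs
        if next_cursor == cursor:  # unchanged cursor means no more results
            return
        cursor = next_cursor


def merge_shards(fetchers: List[FetchPage], key: str = "id") -> Iterator[Dict]:
    """Lazily merge per-shard streams that are each sorted on `key`."""
    streams = [iterate_shard(fetch) for fetch in fetchers]
    return heapq.merge(*streams, key=lambda doc: doc[key])
```

Because `heapq.merge` is lazy, only one pending document per shard is held in memory at a time, which is the property that makes an external merge attractive for large exports.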

Dumb-down at Indexing or Nested Data in the Solr Search Engine

Sigfrid Lundberg, Ph.D., Software Developer, Royal Danish Library, Copenhagen, Denmark. "Are passions, then, the Pagans of the soul? Reason alone baptized? alone ordain’d To touch things sacred?" (Edward Young, 1683-1765) Introduction The … Continue reading

Posted in sigge, Solr, usability | Leave a comment

Which type bug?

A light tale of bug hunting an Out Of Memory problem with SolrCloud. The setup and the problem At the Royal Danish Library we provide full text search for the Danish Netarchive. The heavy lifting is done in a single … Continue reading

Posted in eskildsen, Solr | Leave a comment

Touching encouraged (an ongoing story)

Ongoing experiments with a large touch screen providing access to cultural heritage material. Continue reading

Posted in eskildsen, Visualization | Leave a comment

DocValues jump tables in Lucene/Solr 8

Lucene/Solr 8 is about to be released. Among a lot of other things it brings LUCENE-8585, written by yours truly with a heap of help from Adrien Grand. LUCENE-8585 introduces jump-tables for DocValues, is all about performance and brings speed-ups … Continue reading

Posted in eskildsen, Hacking, Low-level, Lucene, Performance, Solr, Uncategorized | 7 Comments

Faster DocValues in Lucene/Solr 7+

This is a fairly technical post explaining LUCENE-8374 and its implications for Lucene, Solr and (qualified guess) Elasticsearch search and retrieval speed. It is primarily relevant for people with indexes of 100M+ documents. Teaser We have a Solr setup for … Continue reading

Posted in eskildsen, Hacking, Low-level, Lucene, Performance, Solr | 1 Comment

juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives: The New York Public Library has a very visually pleasing presentation of public domain digitizations, but with a … Continue reading

Posted in Uncategorized | Leave a comment

70TB, 16b docs, 4 machines, 1 SolrCloud

At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr and we recently caught up with the present. Time for a status update. The focus is performance and logistics, … Continue reading

Posted in Hacking, Low-level, Performance, Solr, Statsbiblioteket, Uncategorized | 6 Comments

CDX musings

This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry. There is currently some debate on CDX, used by the Wayback Engine, Open Wayback and other web archive oriented tools, such as … Continue reading

Posted in Uncategorized | Leave a comment

Faster grouping, take 1

A failed attempt at speeding up grouping in Solr, with an idea for the next attempt. Grouping at a Statsbiblioteket project We have 100M+ articles from 10M+ pages belonging to 700K editions of 170 newspapers in a single Solr shard. It … Continue reading

Posted in Uncategorized | Leave a comment