Ken Krugler

Nevada City, California, United States
1K followers 500+ connections

About

President of Scale Unlimited. Design, development and training for big data processing…

Experience & Education

  • Scale Unlimited

Volunteer Experience

  • Contributor

    Stack Overflow

    - Present 14 years 8 months

    Education

    I answer questions about Flink, Pinot, Lucene/Solr and Cascading. See https://stackoverflow.com/users/231762/kkrugler?tab=answers&sort=newest

  • Organizer, Teacher

    Girls Who Code

    - 2 years 6 months

    Education

    Helped start the Girls Who Code club of Nevada County, taught the first session, and filled in for other teachers during subsequent sessions.

  • Volunteer Teacher

    Bitney College Prep

    - 1 year 6 months

    Education

    I taught 20 high school students how to program using Python.

  • Volunteer Teacher

    Seven Hills School

    - 4 months

    Education

    I taught computer programming to middle-schoolers.

  • Board Member

    Bear Yuba Land Trust

    - 5 years 1 month

    Environment

Publications

  • Building a scalable focused web crawler with Flink

    Flink Forward SF 2018

    Is it possible to build an efficient, focused web crawler using Flink? That was the question that led to the creation of the flink-crawler open source project. In this talk I’ll discuss how we use Flink’s support for AsyncFunctions and iterations to create a scalable web crawler that continuously and efficiently performs a focused web crawl with no additional infrastructure. I’ll also discuss some of the testing and debugging challenges encountered when using features such as AsyncFunctions and iterations.

  • Faster Workflows, Faster

    ApacheCon Big Data NA 2016

    Slides from my talk at ApacheCon Big Data 2016 in Vancouver. I described how we're defining complex ETL workflows using the Cascading API, then running them on Flink (using AWS Elastic MapReduce).

  • Fuzzy Entity Matching

    Cassandra Summit 2014

    I discuss in depth a real-world use case for combining Hadoop, Cassandra & Solr to solve the problem of quickly matching a target person (entity) against a large corpus of hundreds of millions of potential matches.

  • Similarity at Scale

    Hadoop Summit 2014

    In this talk I describe use cases for both batch & real-time similarity, and discuss my experience using Hadoop and Solr to generate high-quality results at scale for several different clients. I cover entity resolution (people & places), false identity detection (fraud), real-time recommendation systems and automatic document linking. These are all based on past projects for real customers. Techniques I discuss are feature extraction from text, distributed SimHash, and Solr-based real-time similarity scoring.

  • Suicide Risk Prediction using Social Media and Cassandra

    Cassandra Summit 2013

    I describe a portion of an early-phase project that uses social media data (tweets, Facebook posts, etc.) from service personnel to predict suicide rates. There’s a lot of motivation to provide better data for military psychologists, since more military personnel wind up taking their own lives than are killed in the line of duty. By combining social media data that is voluntarily provided by personnel with a predictive analytics system, we can provide assessments that help mental health workers focus their time and energy on the most at-risk individuals. This project uses Cassandra as the scalable storage system for this social media data, which is then analyzed in a distributed environment using Hadoop. The project also uses the Solr search support from DataStax Enterprise to provide ways for users to dig into the underlying data, which is critical when understanding the assigned risk levels.

  • Faster, Cheaper, Better - Replacing Oracle with Hadoop and Solr

    Hadoop Summit 2012

    This talk is a distillation of experience with clients, where we use Hadoop to do off-line pre-processing of data, which then lets us use Solr as a NoSQL solution that provides faster query processing on less hardware, while adding additional search & faceting functionality.

  • A Very Short History of Big Data

    BigDataCamp 2011

    My lightning talk from the BigDataCamp in Washington, DC.

  • A (very) short intro to Hadoop

    BigDataCamp 2011

    A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington, DC. Some of this content is also covered in the various big data classes we offer via on-site training (see http://www.scaleunlimited.com/training/).

  • Thinking at Scale with Hadoop

    SDForum SAM SIG

    Presentation I gave at the SDForum SAM SIG (Software Architecture & Modeling) meeting. This talk provides a brief introduction to Map-Reduce & Hadoop, then discusses challenges of implementing complex data processing using low-level Map-Reduce support, and a number of solutions.

  • Elastic Web Mining

    ACM Data Mining Unconference

    PDF version (with notes) of my talk at the ACM Data Mining Unconference. How to use an open source stack (Hadoop, Cascading, Bixo) in EC2 for cost-effective, scalable and reliable web mining.


Patents

  • Static Code Scoring

    Filed US 12/231242

    A method of ranking source code search results, using static factors derived from file attributes, context, activity and link (usage) graph analysis.

Projects

  • Flink Web Crawler

    - Present

    A continuous scalable web crawler built on top of Flink and crawler-commons, with bits of code borrowed from bixo.

  • Suicide prediction from social media activity

    - Present

    Use social media (Facebook, Twitter, etc.) activity to predict which military personnel have the highest risk of suicide.

    Uses Gigya to collect social media activity, stores it in Cassandra, and then applies multiple predictive analytics models via Hadoop/Cascading to calculate risk.

  • Display advertising analytics

    - Present

    Provide back-end support for an analytics web site that helps advertisers and publishers optimize display advertising.

    Process millions of crawled pages each day with Hadoop/Cascading, to build an OLAP (online analytical processing) platform using Solr indexes. Apply classification and clustering algorithms to enhance results with IAB codes, similar advertisers, recommended publishers, etc.

  • Focused crawl/index for market analytics

    - Present

    Provide back-end support for a market research platform that is used by analysts to define & create brand and topic-tracking reports.

    System uses a focused crawler (with SVM classifiers) to find and extract content, particularly conversations about products on the web. The pipeline then does key term extraction, additional classification and clustering, and finally builds topic-specific search indexes that incorporate pipeline results for augmented search functionality.

  • Near real time web page similarity

    Find pages in a 20M+ corpus that were similar to an arbitrary target page. Do it in a few milliseconds, and support hundreds of requests/second on a single server.

    This used a Cascading/Hadoop workflow to extract key terms from the corpus, and built a Solr index and "corpus map" for the near-real-time similarity engine.

Languages

  • Japanese

Organizations

  • Nevada County Tech Connection

    Member

    - Present

    Supporting, connecting and showcasing the technology and digital media ecosystem as a thriving entity in Nevada County by facilitating events, educational and training opportunities, collaborative efforts and streamlined communication among businesses, talent, education providers, workforce development agencies, and the community at large.

  • The Apache Software Foundation

    Member

    - Present

    I'm a committer on the Tika content extraction project, focusing on HTML parsing, character encodings and language detection.
