Nevada City, California, United States
Volunteer Experience
-
Contributor
Stack Overflow
- Present 14 years 9 months
Education
I answer questions about Flink, Pinot, Lucene/Solr and Cascading. See https://stackoverflow.com/users/231762/kkrugler?tab=answers&sort=newest
-
Organizer, Teacher
Girls Who Code
- 2 years 6 months
Education
Helped start the Girls Who Code club of Nevada County, taught the first session, and filled in for other teachers during subsequent sessions.
-
Volunteer Teacher
Bitney College Prep
- 1 year 6 months
Education
I taught 20 high school students how to program using Python.
-
Volunteer Teacher
Seven Hills School
- 4 months
Education
I taught computer programming to middle-schoolers.
Publications
-
Building a scalable focused web crawler with Flink
Flink Forward SF 2018
Is it possible to build an efficient, focused web crawler using Flink? That was the question that led to the creation of the flink-crawler open source project. In this talk I’ll discuss how we use Flink’s support for AsyncFunctions and iterations to create a scalable web crawler that continuously and efficiently performs a focused web crawl with no additional infrastructure. I’ll also discuss some of the testing and debugging challenges encountered when using features such as AsyncFunctions and iterations.
-
Faster Workflows, Faster
ApacheCon Big Data NA 2016
Slides from my talk at ApacheCon Big Data 2016 in Vancouver. I described how we're defining complex ETL workflows using the Cascading API, then running them on Flink (using AWS Elastic MapReduce).
-
Fuzzy Entity Matching
Cassandra Summit 2014
I discuss in depth a real-world use case for combining Hadoop, Cassandra & Solr to solve the problem of quickly matching a target person (entity) against a large corpus of hundreds of millions of potential matches.
-
Similarity at Scale
Hadoop Summit 2014
In this talk I describe use cases for both batch & real-time similarity, and discuss my experience using Hadoop and Solr to generate high quality results at scale for several different clients. I cover entity resolution (people & places), false identity detection (fraud), real-time recommendation systems and automatic document linking. These are all based on past projects for real customers. Techniques I discuss are feature extraction from text, distributed SimHash, and Solr-based real time similarity scoring.
-
Suicide Risk Prediction using Social Media and Cassandra
Cassandra Summit 2013
I describe a portion of an early-phase project that uses social media data (tweets, Facebook posts, etc.) from service personnel to predict suicide rates. There’s a lot of motivation to provide better data for military psychologists, since more military personnel take their own lives than are killed in the line of duty. By analyzing social media data that is voluntarily provided by personnel, plus a predictive analytics system, we can provide assessments that help mental health workers focus their time and energy on the most at-risk individuals. This project uses Cassandra as the scalable storage system for this social media data, which is then analyzed in a distributed environment using Hadoop. The project also uses the Solr search support from DataStax Enterprise to provide ways for users to dig into the underlying data, which is critical when understanding the assigned risk levels.
-
Faster, Cheaper, Better - Replacing Oracle with Hadoop and Solr
Hadoop Summit 2012
This talk is a distillation of experience with clients, where we use Hadoop to do off-line pre-processing of data, which then lets us use Solr as a NoSQL solution that provides faster query processing on less hardware, while adding search & faceting functionality.
-
A Very Short History of Big Data
BigDataCamp 2011
My lightning talk from the BigDataCamp in Washington, DC.
-
A (very) short intro to Hadoop
BigDataCamp 2011
A very short introduction to Hadoop, from the talk I gave at the BigDataCamp held in Washington, DC. Some of this content is also covered in the various big data classes we offer via on-site training (see http://www.scaleunlimited.com/training/).
-
Thinking at Scale with Hadoop
SDForum SAM SIG
Presentation I gave at the SDForum SAM SIG (Software Architecture & Modeling) meeting. This talk provides a brief introduction to Map-Reduce & Hadoop, then discusses challenges of implementing complex data processing using low-level Map-Reduce support, and a number of solutions.
-
Elastic Web Mining
ACM Data Mining Unconference
PDF version (with notes) of my talk at the ACM Data Mining Unconference. How to use an open source stack (Hadoop, Cascading, Bixo) in EC2 for cost effective, scalable and reliable web mining.
Patents
-
Static Code Scoring
Filed US 12/231242
A method of ranking source code search results, using static factors derived from file attributes, context, activity and link (usage) graph analysis.
Projects
-
Flink Web Crawler
- Present
A continuous scalable web crawler built on top of Flink and crawler-commons, with bits of code borrowed from bixo.
-
Suicide prediction from social media activity
- Present
Use social media (Facebook, Twitter, etc.) activity to predict which military personnel have the highest risk of suicide.
Uses Gigya to collect social media activity, stores it in Cassandra, and then applies multiple predictive analytics models via Hadoop/Cascading to calculate risk.
-
Display advertising analytics
- Present
Provide back-end support for an analytics web site that helps advertisers and publishers optimize display advertising.
Process millions of crawled pages each day with Hadoop/Cascading, to build an OLAP (online analytical processing) platform using Solr indexes. Apply classification and clustering algorithms to enhance results with IAB codes, similar advertisers, recommended publishers, etc.
-
Focused crawl/index for market analytics
- Present
Provide back-end support for a market research platform that is used by analysts to define & create brand and topic-tracking reports.
System uses a focused crawler (with SVM classifiers) to find and extract content, particularly conversations about products on the web. The pipeline then does key term extraction, additional classification and clustering, and finally builds topic-specific search indexes that incorporate pipeline results for augmented search functionality.
-
Near real time web page similarity
-
Find pages in a 20M+ corpus that were similar to an arbitrary target page. Do it in a few milliseconds, and support hundreds of requests/second on a single server.
This used a Cascading/Hadoop workflow to extract key terms from the corpus, and built a Solr index and "corpus map" for the near-real time similarity engine.
Languages
-
Japanese
-
Organizations
-
Nevada County Tech Connection
Member
- Present
Supporting, connecting and showcasing the technology and digital media eco-system as a thriving entity in Nevada County by facilitating events, educational and training opportunities, collaborative efforts and streamlined communication amongst Businesses, Talent, Education providers, Workforce development agencies, and the Community at large.
-
The Apache Software Foundation
Member
- Present
I'm a committer on the Tika content extraction project, focusing on HTML parsing, character encodings and language detection.