Julien Nioche

Greater Bristol Area, United Kingdom
615 followers · 500+ connections

About

My expertise is in document engineering with a strong focus on open source tools. I have…

Experience & Education

  • DigitalPebble Ltd

Volunteer Experience

  • Volunteer

    Bristol Wood Recycling Project

    1 year 1 month

    Environment

    I helped at the BWRP with collecting timber from building sites, as well as making bespoke furniture and preparing the timber for reuse (sanding, planing, sawing, thicknessing, gluing) and finishing.
    The BWRP is a fantastic cooperative. I have made loads of friends there and am glad to support them when I can.

Publications

  • MediaCampaign — A multimodal semantic analysis system for advertisement campaign detection

    Conference: Content-Based Multimedia Indexing (CBMI 2008)

    MediaCampaign's scope is the discovery and inter-relation of advertisements and campaigns, i.e. relating advertisements that semantically belong together, across different countries and different media. The project's main goal is to automate, to a large degree, the detection and tracking of advertisement campaigns on television, on the Internet and in the press. For this purpose we introduce a first prototype of a fully integrated semantic analysis system, based on an ontology, which automatically detects new creatives and campaigns by utilizing a multimodal analysis system and a framework for the resolution of semantic identity.

  • Using feature structures as representation format for corpora exploration

    Corpus Linguistics

    In this paper we report on the use of feature structures to represent the linguistic information of a corpus. This approach has been adopted in TyPTex, a project which aims at providing a generic architecture for corpora profiling. After a brief overview of the TyPTex project, we show that corpora exploration requires manipulating linguistic features in order to obtain a required level of linguistic information, or changing the set of features to get a new point of view on the data. We show that the feature-structure formalism can help with the building and management of linguistic features through meta-rules based on unification. Finally, we provide an example of marking which uses a mixed approach, combining the projection of information from a static lexicon with contextual marking via meta-rules. Results tend to show that the use of feature structures can improve the coverage and reliability of the marking.
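
    To illustrate the unification operation the abstract relies on: two feature structures unify when their shared features agree, and the result carries the union of their features. A minimal sketch in Java, assuming flat attribute-value structures; this illustrates the general mechanism only, not TyPTex's actual formalism.

        import java.util.HashMap;
        import java.util.Map;

        // Toy unification of flat feature structures: succeeds with the union
        // of the features, or fails (null) on a value clash.
        public class UnifySketch {

            static Map<String, String> unify(Map<String, String> a, Map<String, String> b) {
                Map<String, String> result = new HashMap<>(a);
                for (Map.Entry<String, String> e : b.entrySet()) {
                    String existing = result.putIfAbsent(e.getKey(), e.getValue());
                    if (existing != null && !existing.equals(e.getValue())) {
                        return null; // clash, e.g. num:sing vs num:plur
                    }
                }
                return result;
            }

            public static void main(String[] args) {
                Map<String, String> m1 = Map.of("cat", "noun", "num", "sing");
                Map<String, String> m2 = Map.of("num", "sing", "case", "nom");
                System.out.println(unify(m1, m2)); // {cat=noun, num=sing, case=nom} (order may vary)
                System.out.println(unify(m1, Map.of("num", "plur"))); // null
            }
        }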

  • Estimation of English and non-English language use on the WWW

    Computing Research Repository (CoRR)

    The World Wide Web has grown so big, in such an anarchic fashion, that it is difficult to describe. One of the evident intrinsic characteristics of the World Wide Web is its multilinguality. Here, we present a technique for estimating the size of a language-specific corpus given the frequency of commonly occurring words in the corpus. We apply this technique to estimating the number of words available through Web browsers for given languages. Comparing data from 1996 to data from 1999 and 2000, we calculate the growth of a number of European languages on the Web. As expected, non-English languages are growing at a faster pace than English, though the position of English is still dominant.
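
    The estimation at the heart of the technique fits in a few lines: if a common word is known to account for a fixed share of running text in a language, dividing the number of its occurrences reported by a web index by that share estimates the number of words indexed for the language. A sketch with invented figures (neither the frequencies nor the counts come from the paper):

        // Worked example with made-up numbers: estimate corpus size as
        // count(w) / relative_frequency(w), averaged over several common words.
        public class CorpusSizeSketch {
            public static void main(String[] args) {
                double[] relFreq = {0.010, 0.006, 0.004};   // assumed per-word relative frequencies
                long[] hits = {5_100_000_000L, 2_950_000_000L, 2_050_000_000L}; // assumed index counts
                double sum = 0;
                for (int i = 0; i < relFreq.length; i++) {
                    sum += hits[i] / relFreq[i];            // per-word size estimate
                }
                System.out.printf("estimated corpus size: %.3g words%n", sum / relFreq.length);
                // prints roughly 5.0e+11 with these invented numbers
            }
        }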

  • TyPTex: Inductive typological text classification by multivariate statistical analysis for NLP systems tuning/evaluation

    Maria Gavrilidou, George Carayannis, Stella Markantonatou, Stelios Piperidis, Gregory Stainhaouer (eds), Second International Conference on Language Resources and Evaluation (LREC 2000), pp. 141-148, 2000

    The increasing use of methods in natural language processing (NLP) which are based on huge corpora requires that the lexical, morpho-syntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associated tools for text calibration or "profiling" within the ELRA benchmark called "Contribution to the construction of contemporary French corpora", based on multivariate analysis of linguistic features. We have integrated these tools within a modular architecture based on a generic model, allowing us on the one hand to annotate the corpus flexibly with the output of NLP and statistical tools, and on the other hand to retrace the results of these tools through the annotation layers back to the primary textual data. This allows us to justify our interpretations.

  • The BNC Parsed with RASP4UIMA.

    Conference: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco

    We have integrated the RASP system with the UIMA framework (RASP4UIMA) and used this to parse the XML-encoded version of the British National Corpus (BNC). All original annotation is preserved, and parsing information, mainly in the form of grammatical relations, is added in an XML format. A few specific adaptations of the system to give better results with the BNC are discussed briefly. The RASP4UIMA system is publicly available and can be used to parse other corpora or document collections, and the final parsed version of the BNC will be deposited with the Oxford Text Archive.

Projects

  • URLFrontier

    Discovering content on the web is possible thanks to web crawlers. Luckily, there are many excellent open-source solutions for this; however, most of them have their own way of storing and accessing the information about the URLs.

    The aim of the URL Frontier project is to develop a crawler- and language-neutral API for the operations that web crawlers carry out when communicating with a web frontier, e.g. getting the next URLs to crawl, updating the information about URLs already processed, changing the crawl rate for a particular hostname, getting the list of active hosts, getting statistics, etc. Such an API can be used by a variety of web crawlers, regardless of whether they are implemented in Java like StormCrawler and Heritrix or in Python like Scrapy.
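
    To give an idea of the shape of such an API, here is a minimal sketch in Java. The interface and its method names are illustrative assumptions, not the project's actual gRPC definitions; the toy in-memory implementation merely exercises the idea of a frontier handing out and accepting URLs.

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.Queue;

        // Illustrative sketch only: names and signatures are assumptions,
        // not URL Frontier's actual gRPC API.
        interface Frontier {
            List<String> getURLs(int max);            // next URLs due to be fetched
            void putURLs(List<String> discovered);    // report newly discovered URLs
            long countQueued();                       // basic statistics
        }

        // Toy in-memory frontier, just to exercise the interface.
        class MemoryFrontier implements Frontier {
            private final Queue<String> queue = new ArrayDeque<>();

            public List<String> getURLs(int max) {
                List<String> batch = new ArrayList<>();
                while (batch.size() < max && !queue.isEmpty()) batch.add(queue.poll());
                return batch;
            }
            public void putURLs(List<String> discovered) { queue.addAll(discovered); }
            public long countQueued() { return queue.size(); }
        }

        class FrontierDemo {
            public static void main(String[] args) {
                Frontier frontier = new MemoryFrontier();
                frontier.putURLs(List.of("https://example.com/", "https://example.org/"));
                System.out.println(frontier.getURLs(1));    // [https://example.com/]
                System.out.println(frontier.countQueued()); // 1
            }
        }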

    The outcomes of the project are to:

    - design the API with gRPC, provide Java stubs for the API and instructions on how to achieve the same for other languages
    - deliver a robust reference implementation of the URL Frontier service
    - implement a command-line client for basic interactions with a service
    - provide a test suite to check that any implementation of the API behaves as expected

    One of the objectives of URL Frontier is to involve as many actors in the web crawling community as possible and to get real users to give continuous feedback on our proposals.

    Please use the project mailing list or Discussions section for questions, comments or suggestions.

    There are many ways to get involved if you want to.

    This project is funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.

  • StormCrawler

    - Present

    StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is under Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.

    The aim of StormCrawler is to help build web crawlers that are:

    - scalable
    - resilient
    - low latency
    - easy to extend
    - polite yet efficient

    StormCrawler is perfectly suited to use cases where the URLs to fetch and parse come as streams, but it is also an appropriate solution for large-scale recursive crawls, particularly where low latency is required. The project is used in production by several companies and is actively developed and maintained.
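
    To give a flavour of what building on Apache Storm involves, here is a minimal, self-contained topology sketch. UrlSpout and FetchBolt are invented placeholders rather than StormCrawler's actual components; only the Storm wiring (TopologyBuilder, groupings, local submission) is the framework's real API.

        import java.util.Map;
        import org.apache.storm.Config;
        import org.apache.storm.LocalCluster;
        import org.apache.storm.spout.SpoutOutputCollector;
        import org.apache.storm.task.OutputCollector;
        import org.apache.storm.task.TopologyContext;
        import org.apache.storm.topology.OutputFieldsDeclarer;
        import org.apache.storm.topology.TopologyBuilder;
        import org.apache.storm.topology.base.BaseRichBolt;
        import org.apache.storm.topology.base.BaseRichSpout;
        import org.apache.storm.tuple.Fields;
        import org.apache.storm.tuple.Tuple;
        import org.apache.storm.tuple.Values;

        public class CrawlTopologySketch {

            // Placeholder spout: emits a fixed list of URLs as a stream.
            public static class UrlSpout extends BaseRichSpout {
                private SpoutOutputCollector collector;
                private final String[] urls = {"https://example.com/", "https://example.org/"};
                private int next = 0;

                public void open(Map<String, Object> conf, TopologyContext ctx,
                                 SpoutOutputCollector collector) {
                    this.collector = collector;
                }
                public void nextTuple() {
                    if (next < urls.length) collector.emit(new Values(urls[next++]));
                }
                public void declareOutputFields(OutputFieldsDeclarer declarer) {
                    declarer.declare(new Fields("url"));
                }
            }

            // Placeholder bolt: stands in for fetching; it only logs and acks.
            public static class FetchBolt extends BaseRichBolt {
                private OutputCollector collector;

                public void prepare(Map<String, Object> conf, TopologyContext ctx,
                                    OutputCollector collector) {
                    this.collector = collector;
                }
                public void execute(Tuple tuple) {
                    System.out.println("would fetch " + tuple.getStringByField("url"));
                    collector.ack(tuple);
                }
                public void declareOutputFields(OutputFieldsDeclarer declarer) {}
            }

            public static void main(String[] args) throws Exception {
                TopologyBuilder builder = new TopologyBuilder();
                builder.setSpout("urls", new UrlSpout());
                // Grouping on the "url" field keeps a given key on the same bolt
                // instance; a real crawler would group on hostname for politeness.
                builder.setBolt("fetch", new FetchBolt(), 2)
                       .fieldsGrouping("urls", new Fields("url"));
                try (LocalCluster cluster = new LocalCluster()) {
                    cluster.submitTopology("crawl-sketch", new Config(),
                                           builder.createTopology());
                    Thread.sleep(5000); // let the local run produce some output
                }
            }
        }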

  • Behemoth

    -

    Behemoth was an open source platform for large-scale document processing based on Apache Hadoop.

    It consisted of a simple annotation-based implementation of a document and a number of modules operating on these documents. One of the main aims of Behemoth was to simplify the deployment of document analysers on a large scale, but also to provide reusable modules for:

    - ingesting from common data sources (WARC, Nutch, etc.)
    - text processing (Tika, UIMA, GATE, language identification)
    - generating output for external tools (Solr, Mahout)

    Its modular architecture simplified the development of custom annotators based on MapReduce.

    Behemoth did not implement any NLP or machine learning components as such but served as 'large-scale glueware' for existing resources. Being Hadoop-based, it benefited from all of Hadoop's features, namely scalability, fault tolerance and, most notably, the backing of a thriving open source community.
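
    To illustrate the idea of deploying a document analyser as a MapReduce job, here is a minimal sketch of a Hadoop mapper that receives one document per call and emits it back with an extra annotation. Using Text for the document is an assumption made for brevity; it stands in for Behemoth's actual annotated-document type.

        import java.io.IOException;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Sketch of a document annotator as a Hadoop mapper. Text stands in
        // for Behemoth's annotated-document type, which is not reproduced here.
        public class AnnotatingMapper extends Mapper<Text, Text, Text, Text> {

            private final Text annotated = new Text();

            @Override
            protected void map(Text docId, Text docText, Context context)
                    throws IOException, InterruptedException {
                // A real module would run Tika/UIMA/GATE here; this placeholder
                // simply appends a trivial 'annotation' to the document.
                annotated.set(docText + "\t[tokens=" + docText.toString().split("\\s+").length + "]");
                context.write(docId, annotated);
            }
        }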

Languages

  • French
  • English
  • Russian

Organizations

  • The Apache Software Foundation

    Member

    - Present
  • Boavizta

    Member

    - Present
  • The Apache Software Foundation

    Member

    -
