Julien Nioche

Greater Bristol Area, United Kingdom
615 followers · 500+ connections

About

My expertise is in document engineering with a strong focus on open source tools. I have…

Experience & Education

  • DigitalPebble Ltd

Volunteer Experience

  • Volunteer

    Bristol Wood Recycling Project

    1 year 1 month

    Environment

    I helped at the BWRP with collecting timber from building sites, as well as making bespoke furniture and preparing the timber for reuse (sanding, planing, sawing, thicknessing, gluing) and finishing.
    The BWRP is a fantastic cooperative. I have made loads of friends there and am glad to support them when I can.

Publications

  • MediaCampaign — A multimodal semantic analysis system for advertisement campaign detection

    Conference: Content-Based Multimedia Indexing (CBMI 2008)

    MediaCampaign's scope is the discovery and inter-relation of advertisements and campaigns, i.e. relating advertisements that semantically belong together, across different countries and different media. The project's main goal is to automate, to a large degree, the detection and tracking of advertisement campaigns on television, on the Internet and in the press. For this purpose we introduce a first prototype of a fully integrated semantic analysis system, based on an ontology, which automatically detects new creatives and campaigns by utilizing a multimodal analysis system and a framework for the resolution of semantic identity.

  • Using feature structures as representation format for corpora exploration

    Corpus Linguistics

    In this paper we report on the use of feature structures to represent the linguistic information of a corpus. This approach has been adopted in TyPTex, a project which aims at providing a generic architecture for corpora profiling. After a brief overview of the TyPTex project, we show that corpora exploration requires manipulating linguistic features in order to obtain a required level of linguistic information, or changing the set of features to get a new point of view on the data. We show that the feature-structure formalism can help with the building and management of linguistic features through meta-rules based on unification. Finally, we provide an example of marking which uses a mixed approach, combining the projection of information from a static lexicon with contextual marking via meta-rules. Results tend to show that the use of feature structures can improve the coverage and reliability of the marking.
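
    To illustrate the unification operation the abstract relies on: two feature structures unify when their shared features agree, and the result carries the union of their features. A minimal sketch in Java, assuming flat attribute-value structures; this illustrates the general mechanism only, not TyPTex's actual formalism.

        import java.util.HashMap;
        import java.util.Map;

        // Toy unification of flat feature structures: succeeds with the union
        // of the features, or fails (null) on a value clash.
        public class UnifySketch {

            static Map<String, String> unify(Map<String, String> a, Map<String, String> b) {
                Map<String, String> result = new HashMap<>(a);
                for (Map.Entry<String, String> e : b.entrySet()) {
                    String existing = result.putIfAbsent(e.getKey(), e.getValue());
                    if (existing != null && !existing.equals(e.getValue())) {
                        return null; // clash, e.g. num:sing vs num:plur
                    }
                }
                return result;
            }

            public static void main(String[] args) {
                Map<String, String> m1 = Map.of("cat", "noun", "num", "sing");
                Map<String, String> m2 = Map.of("num", "sing", "case", "nom");
                System.out.println(unify(m1, m2)); // {cat=noun, num=sing, case=nom} (order may vary)
                System.out.println(unify(m1, Map.of("num", "plur"))); // null
            }
        }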

  • Estimation of English and non-English language use on the WWW

    Computing Research Repository (CoRR)

    The World Wide Web has grown so big, in such an anarchic fashion, that it is difficult to describe. One of the evident intrinsic characteristics of the World Wide Web is its multilinguality. Here, we present a technique for estimating the size of a language-specific corpus given the frequency of commonly occurring words in the corpus. We apply this technique to estimating the number of words available through Web browsers for given languages. Comparing data from 1996 to data from 1999 and 2000, we calculate the growth of a number of European languages on the Web. As expected, non-English languages are growing at a faster pace than English, though the position of English is still dominant.
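
    The estimation at the heart of the technique fits in a few lines: if a common word is known to account for a fixed share of running text in a language, dividing the number of its occurrences reported by a web index by that share estimates the number of words indexed for the language. A sketch with invented figures (neither the frequencies nor the counts come from the paper):

        // Worked example with made-up numbers: estimate corpus size as
        // count(w) / relative_frequency(w), averaged over several common words.
        public class CorpusSizeSketch {
            public static void main(String[] args) {
                double[] relFreq = {0.010, 0.006, 0.004};   // assumed per-word relative frequencies
                long[] hits = {5_100_000_000L, 2_950_000_000L, 2_050_000_000L}; // assumed index counts
                double sum = 0;
                for (int i = 0; i < relFreq.length; i++) {
                    sum += hits[i] / relFreq[i];            // per-word size estimate
                }
                System.out.printf("estimated corpus size: %.3g words%n", sum / relFreq.length);
                // prints roughly 5.0e+11 with these invented numbers
            }
        }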

  • TyPTex: Inductive typological text classification by multivariate statistical analysis for NLP systems tuning/evaluation

    Maria Gavrilidou, George Carayannis, Stella Markantonatou, Stelios Piperidis, Gregory Stainhaouer (eds), Second International Conference on Language Resources and Evaluation (LREC 2000), pp. 141-148, 2000

    The increasing use of methods in natural language processing (NLP) which are based on huge corpora requires that the lexical, morpho-syntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associated tools for text calibration or "profiling" within the ELRA benchmark called "Contribution to the construction of contemporary French corpora", based on multivariate analysis of linguistic features. We have integrated these tools within a modular architecture based on a generic model, allowing us on the one hand to annotate the corpus flexibly with the output of NLP and statistical tools, and on the other hand to retrace the results of these tools through the annotation layers back to the primary textual data. This allows us to justify our interpretations.

  • The BNC Parsed with RASP4UIMA.

    Conference: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco

    We have integrated the RASP system with the UIMA framework (RASP4UIMA) and used this to parse the XML-encoded version of the British National Corpus (BNC). All original annotation is preserved, and parsing information, mainly in the form of grammatical relations, is added in an XML format. A few specific adaptations of the system to give better results with the BNC are discussed briefly. The RASP4UIMA system is publicly available and can be used to parse other corpora or document collections, and the final parsed version of the BNC will be deposited with the Oxford Text Archive.

Projects

  • URLFrontier

    Discovering content on the web is possible thanks to web crawlers. Luckily, there are many excellent open-source solutions for this; however, most of them have their own way of storing and accessing the information about the URLs.

    The aim of the URL Frontier project is to develop a crawler- and language-neutral API for the operations that web crawlers carry out when communicating with a web frontier, e.g. getting the next URLs to crawl, updating the information about URLs already processed, changing the crawl rate for a particular hostname, getting the list of active hosts, getting statistics, etc. Such an API can be used by a variety of web crawlers, regardless of whether they are implemented in Java like StormCrawler and Heritrix or in Python like Scrapy.
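
    To give an idea of the shape of such an API, here is a minimal sketch in Java. The interface and its method names are illustrative assumptions, not the project's actual gRPC definitions; the toy in-memory implementation merely exercises the idea of a frontier handing out and accepting URLs.

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.Queue;

        // Illustrative sketch only: names and signatures are assumptions,
        // not URL Frontier's actual gRPC API.
        interface Frontier {
            List<String> getURLs(int max);            // next URLs due to be fetched
            void putURLs(List<String> discovered);    // report newly discovered URLs
            long countQueued();                       // basic statistics
        }

        // Toy in-memory frontier, just to exercise the interface.
        class MemoryFrontier implements Frontier {
            private final Queue<String> queue = new ArrayDeque<>();

            public List<String> getURLs(int max) {
                List<String> batch = new ArrayList<>();
                while (batch.size() < max && !queue.isEmpty()) batch.add(queue.poll());
                return batch;
            }
            public void putURLs(List<String> discovered) { queue.addAll(discovered); }
            public long countQueued() { return queue.size(); }
        }

        class FrontierDemo {
            public static void main(String[] args) {
                Frontier frontier = new MemoryFrontier();
                frontier.putURLs(List.of("https://example.com/", "https://example.org/"));
                System.out.println(frontier.getURLs(1));    // [https://example.com/]
                System.out.println(frontier.countQueued()); // 1
            }
        }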

    The outcomes of the project are to:

    - design the API with gRPC, provide Java stubs for the API and instructions on how to achieve the same for other languages
    - deliver a robust reference implementation of the URL Frontier service
    - implement a command-line client for basic interactions with a service
    - provide a test suite to check that any implementation of the API behaves as expected

    One of the objectives of URL Frontier is to involve as many actors in the web crawling community as possible and to get real users to give continuous feedback on our proposals.

    Please use the project mailing list or Discussions section for questions, comments or suggestions.

    There are many ways to get involved if you want to.

    This project is funded through the NGI0 Discovery Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 825322.

  • StormCrawler

    - Present

    StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is under Apache License v2 and consists of a collection of reusable resources and components, written mostly in Java.

    The aim of StormCrawler is to help build web crawlers that are:

    - scalable
    - resilient
    - low latency
    - easy to extend
    - polite yet efficient

    StormCrawler is perfectly suited to use cases where the URLs to fetch and parse come as streams, but it is also an appropriate solution for large-scale recursive crawls, particularly where low latency is required. The project is used in production by several companies and is actively developed and maintained.
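
    To give a flavour of what building on Apache Storm involves, here is a minimal, self-contained topology sketch. UrlSpout and FetchBolt are invented placeholders rather than StormCrawler's actual components; only the Storm wiring (TopologyBuilder, groupings, local submission) is the framework's real API.

        import java.util.Map;
        import org.apache.storm.Config;
        import org.apache.storm.LocalCluster;
        import org.apache.storm.spout.SpoutOutputCollector;
        import org.apache.storm.task.OutputCollector;
        import org.apache.storm.task.TopologyContext;
        import org.apache.storm.topology.OutputFieldsDeclarer;
        import org.apache.storm.topology.TopologyBuilder;
        import org.apache.storm.topology.base.BaseRichBolt;
        import org.apache.storm.topology.base.BaseRichSpout;
        import org.apache.storm.tuple.Fields;
        import org.apache.storm.tuple.Tuple;
        import org.apache.storm.tuple.Values;

        public class CrawlTopologySketch {

            // Placeholder spout: emits a fixed list of URLs as a stream.
            public static class UrlSpout extends BaseRichSpout {
                private SpoutOutputCollector collector;
                private final String[] urls = {"https://example.com/", "https://example.org/"};
                private int next = 0;

                public void open(Map<String, Object> conf, TopologyContext ctx,
                                 SpoutOutputCollector collector) {
                    this.collector = collector;
                }
                public void nextTuple() {
                    if (next < urls.length) collector.emit(new Values(urls[next++]));
                }
                public void declareOutputFields(OutputFieldsDeclarer declarer) {
                    declarer.declare(new Fields("url"));
                }
            }

            // Placeholder bolt: stands in for fetching; it only logs and acks.
            public static class FetchBolt extends BaseRichBolt {
                private OutputCollector collector;

                public void prepare(Map<String, Object> conf, TopologyContext ctx,
                                    OutputCollector collector) {
                    this.collector = collector;
                }
                public void execute(Tuple tuple) {
                    System.out.println("would fetch " + tuple.getStringByField("url"));
                    collector.ack(tuple);
                }
                public void declareOutputFields(OutputFieldsDeclarer declarer) {}
            }

            public static void main(String[] args) throws Exception {
                TopologyBuilder builder = new TopologyBuilder();
                builder.setSpout("urls", new UrlSpout());
                // Grouping on the "url" field keeps a given key on the same bolt
                // instance; a real crawler would group on hostname for politeness.
                builder.setBolt("fetch", new FetchBolt(), 2)
                       .fieldsGrouping("urls", new Fields("url"));
                try (LocalCluster cluster = new LocalCluster()) {
                    cluster.submitTopology("crawl-sketch", new Config(),
                                           builder.createTopology());
                    Thread.sleep(5000); // let the local run produce some output
                }
            }
        }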

  • Behemoth

    -

    Behemoth was an open source platform for large-scale document processing based on Apache Hadoop.

    It consisted of a simple annotation-based implementation of a document and a number of modules operating on these documents. One of the main aims of Behemoth was to simplify the deployment of document analysers on a large scale, but also to provide reusable modules for:

    - ingesting from common data sources (WARC, Nutch, etc.)
    - text processing (Tika, UIMA, GATE, language identification)
    - generating output for external tools (Solr, Mahout)

    Its modular architecture simplified the development of custom annotators based on MapReduce.

    Behemoth did not implement any NLP or machine learning components as such but served as 'large-scale glueware' for existing resources. Being Hadoop-based, it benefited from all of Hadoop's features, namely scalability, fault tolerance and, most notably, the backing of a thriving open source community.
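
    To illustrate the idea of deploying a document analyser as a MapReduce job, here is a minimal sketch of a Hadoop mapper that receives one document per call and emits it back with an extra annotation. Using Text for the document is an assumption made for brevity; it stands in for Behemoth's actual annotated-document type.

        import java.io.IOException;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Sketch of a document annotator as a Hadoop mapper. Text stands in
        // for Behemoth's annotated-document type, which is not reproduced here.
        public class AnnotatingMapper extends Mapper<Text, Text, Text, Text> {

            private final Text annotated = new Text();

            @Override
            protected void map(Text docId, Text docText, Context context)
                    throws IOException, InterruptedException {
                // A real module would run Tika/UIMA/GATE here; this placeholder
                // simply appends a trivial 'annotation' to the document.
                annotated.set(docText + "\t[tokens=" + docText.toString().split("\\s+").length + "]");
                context.write(docId, annotated);
            }
        }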

Languages

  • French
  • English
  • Russian

Organizations

  • The Apache Software Foundation

    Member

    - Present
  • Boavizta

    Member

    - Present
  • The Apache Software Foundation

    Member

    -
