Session: Big Data
Theme: CODE
Room: Auditorium
Date: 31 Oct. 2014, from 09:00 to 15:35
Track leader(s): Ori Pekelman (Founder, Constellation Matrix)
Big ideas on Big Data: data technologies are changing fast and reshaping the technology landscape in a big way, and open source is a huge part of that. This track aims to give a broad overview of the current state of the art in big data infrastructures, analytics, search, machine learning and real time, and to open future perspectives: how do you handle big data over a decade? If you have a big data strategy, these talks may give you a new perspective; if you don't, you might very well learn where you are supposed to start.
Presentations
09:00 - Algebird: algebra for efficient big data processing – Abstract algebra for data mining
Duration: 35 minutes
Speaker(s): Sam Bessalah (Software Engineer, Independent)
Algebird is an abstract algebra library for Scala developed at Twitter and released under the ASL 2.0 license. It supports algebraic structures such as semigroups, monoids, groups, rings and fields, as well as the standard functional constructs like monads. More interesting, though, are the probabilistic data structures and the accompanying monoids that come out of the box.
I'll talk a bit about Algebird in general and how it eases building large-scale analytics systems with MapReduce or in a stream-processing context.
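To make the abstract-algebra angle concrete, here is a minimal sketch, in Python rather than Scala/Algebird, of the monoid idea the talk builds on: a value type with an identity element and an associative combine operation. Because combine is associative, partial results computed on different machines (the map side) can be merged in any grouping (the reduce side). The `Monoid` class and `merge_counts` helper are illustrative, not Algebird's API.

```python
# A monoid: an identity element plus an associative combine operation.
# Associativity is what lets MapReduce and stream processors merge
# partial aggregates in any order or grouping.
from functools import reduce


class Monoid:
    """A set of values with an identity element and an associative combine."""

    def __init__(self, zero, plus):
        self.zero = zero  # identity: plus(zero, x) == x
        self.plus = plus  # associative: plus(a, plus(b, c)) == plus(plus(a, b), c)

    def sum(self, values):
        return reduce(self.plus, values, self.zero)


def merge_counts(a, b):
    """Combine two word-count dictionaries, as per-shard MapReduce output."""
    out = dict(a)
    for key, count in b.items():
        out[key] = out.get(key, 0) + count
    return out


count_monoid = Monoid({}, merge_counts)

# Partial counts from two shards, merged into a global result.
shard1 = {"big": 2, "data": 1}
shard2 = {"data": 3, "open": 1}
total = count_monoid.sum([shard1, shard2])
```

Algebird packages many such monoids, including ones for probabilistic structures like HyperLogLog and Count-Min Sketch, so approximate aggregates can be merged the same way exact ones are.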
09:55 - Context Awareness
Duration: 35 minutes
Speaker(s): Rand Hindi (CEO, Snips)
As our lives become more and more connected, there is a growing need for more intelligent technology that would be capable of understanding what we want without requiring constant interaction. This concept, called "context awareness", is the key to our hyper-connected future. From NEST to Google Now and IFTTT, in this talk we will go through some of the most successful use cases of context awareness, and explain some of the technology behind the pocket brain we are currently building at Snips.
11:00 - Apache Kafka – a distributed publish-subscribe messaging system
Duration: 20 minutes
Speaker(s): Charly Clairmont (CTO, Altic)
Apache Kafka is a distributed publish-subscribe messaging system developed at LinkedIn that has seen rapid adoption by numerous companies. It was created with performance, availability and scalability in mind, and serves as the messaging backbone at LinkedIn.
This talk is about understanding how Apache Kafka is built and how it works. We use it in a project for a customer that provides a call-center solution. Kafka was the best fit for collecting all the traces from their software; thanks to Kafka, this company can now deliver more real-time analyses. Kafka transforms the way data is collected and shared within an organisation, and we will describe how we used it in that project.
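Kafka's core abstraction can be sketched in a few lines: a partitioned, append-only log in which each consumer tracks its own read offset. The toy classes below are an in-process illustration of that model only, not the real Kafka client API; actual clients talk to a broker cluster and handle replication, batching and failover.

```python
# Toy sketch of Kafka's log/offset model (illustrative, single-process).


class Log:
    """One partition: an append-only sequence of messages."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset assigned to the new message


class Consumer:
    """Reads from a log at its own pace by remembering an offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.messages[self.offset:]
        self.offset = len(self.log.messages)  # "commit" by advancing the offset
        return batch


log = Log()
log.append({"call_id": 1, "event": "ringing"})
log.append({"call_id": 1, "event": "answered"})

fast = Consumer(log)
slow = Consumer(log)
first_batch = fast.poll()           # fast consumer reads both messages

log.append({"call_id": 1, "event": "hangup"})
late_batch = slow.poll()            # slow consumer still sees everything
```

Because messages stay in the log after delivery, consumers are decoupled from producers: a slow analytics job and a fast real-time one can read the same stream independently, which is exactly what makes Kafka useful for collecting traces from many systems.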
11:20 - Data encoding and Metadata for Streams
Duration: 30 minutes
Speaker(s): Jonathan Winandy (Founder, Primatice)
Streaming is a fast-rising approach for real-time distributed systems. This session will show how better data encoding and metadata management make streaming both flexible and long-lasting.
The session will start with a brief summary of streaming, then show simple techniques, and their advantages, for moving away from the classic schemaless JSON on the wire without compromising flexibility.
11:50 - Next Open Source Big Data Suite – a new low-level approach to Big Data
Duration: 30 minutes
Speaker(s): Emmanuel Keller (CEO/CTO, OpenSearchServer)
For the last two years, the OpenSearchServer team has been working on a new open source software suite dedicated to Big Data.
This new low-level implementation focuses on performance, scalability and simplicity.
Discover a new way to accelerate your Big Data project.
Search engines, NoSQL databases, MapReduce, distributed file systems: all of these concepts are part of the Big Data ecosystem.
Low-level implementations provide better performance and scalability while reducing the cost of server infrastructure.
After two years of hard work, the OpenSearchServer team is proud to unveil its new open source Big Data software suite.
13:30 - State Of the Art in Machine Learning
Duration: 35 minutes
Speaker(s): Olivier Grisel (Software Engineer, Inria)
A broad overview of which technologies are usable now and which are just around the corner.
14:15 - Take back control of your web tracking – go further by doing it yourself
Duration: 35 minutes
Speaker(s): Clément Stenac (CTO, Dataiku)
Tracking the actions of your users on your website is nowadays so fundamental that most people … don’t do it anymore, instead relying on SaaS products and dashboards. However, these services often only provide aggregated high-level views and keep the raw data.
In this talk, we’ll first see how using raw tracking data can help you go from "number of page views" to a real understanding of your usage patterns and what kind of data and technologies you need for that.
We’ll then have a look at different architectures and challenges for web tracking and highlight the need for a dedicated and open tracking infrastructure.
- Are Apache logs "web tracking data"?
- How do you reconstruct user sessions?
- Are cookies good for your (web) health?
- What are the specific challenges of mobile tracking?
In the last part, we'll introduce WT1, an open source web tracker that solves these challenges and doesn't hide your data. Anyone can deploy WT1 to take back control of their own web tracking data and build awesome data-driven services.
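One of the questions above, reconstructing user sessions, can be sketched with a common heuristic (an assumption on my part, not WT1's specific algorithm): group raw events by visitor cookie, sort by time, and start a new session whenever the gap between consecutive events exceeds an inactivity timeout, conventionally 30 minutes.

```python
# Session reconstruction from raw tracking events using an
# inactivity-timeout heuristic (30 minutes here).
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that close a session


def sessionize(events):
    """events: iterable of (cookie_id, unix_timestamp, url) tuples.
    Returns {cookie_id: [session, ...]}, each session a list of events."""
    by_visitor = defaultdict(list)
    for ev in events:
        by_visitor[ev[0]].append(ev)

    sessions = {}
    for cookie, evs in by_visitor.items():
        evs.sort(key=lambda e: e[1])  # order by timestamp
        current, finished = [evs[0]], []
        for prev, ev in zip(evs, evs[1:]):
            if ev[1] - prev[1] > SESSION_TIMEOUT:
                finished.append(current)  # gap too long: close the session
                current = []
            current.append(ev)
        finished.append(current)
        sessions[cookie] = finished
    return sessions


events = [
    ("c1", 1000, "/home"),
    ("c1", 1100, "/pricing"),
    ("c1", 1100 + 3600, "/home"),  # returns an hour later: a new session
    ("c2", 1050, "/blog"),
]
result = sessionize(events)
```

This is precisely the kind of computation that aggregated SaaS dashboards do for you, and that you can only tune (the timeout, cross-device merging, bot filtering) when you hold the raw data yourself.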
15:00 - Real time energy data analysis with Apache Storm
Duration: 35 minutes
Speaker(s): Simon Maby (Software Architect, Octo Technology)
Is Storm a good candidate for the continuous processing of probe data from smart metering? That is the question Octo Technology and EDF R&D tried to answer through a POC meant to assess the tool's suitability for processing 1,700 million readings in under an hour. We will discuss the architecture chosen for this project as well as our lessons learned on the following points:
- The range of possibilities and the development complexity
- The options for machine learning, in particular integration with R
- Feedback on the Trident framework
- Performance on a commodity-hardware cluster of about ten nodes
- Industrialisation, robustness and fault tolerance.
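Storm's programming model can be sketched very compactly: spouts emit tuples, bolts consume and transform them, and a topology wires them together. The single-process Python toy below illustrates that flow for the metering use case only; names like `meter_spout` and `SumBolt` are invented, and real Storm distributes the topology over a cluster with ack-based fault tolerance.

```python
# Toy, single-process sketch of the Storm spout/bolt model.


def meter_spout():
    """Spout: emits smart-meter readings as (meter_id, kilowatt_hours)."""
    readings = [("m1", 1.2), ("m2", 0.7), ("m1", 0.9), ("m2", 1.1)]
    for reading in readings:
        yield reading


class SumBolt:
    """Bolt: keeps a running total of consumption per meter."""

    def __init__(self):
        self.totals = {}

    def execute(self, tup):
        meter, kwh = tup
        self.totals[meter] = self.totals.get(meter, 0.0) + kwh


# The "topology": stream the spout's tuples into the bolt.
bolt = SumBolt()
for tup in meter_spout():
    bolt.execute(tup)
```

The POC questions above map onto this picture: parallelism comes from running many bolt instances partitioned by meter id, Trident layers exactly-once stateful operations on top of raw bolts, and fault tolerance comes from replaying unacknowledged tuples.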