Apache: Big Data North America 2017 will be held at the InterContinental Miami in Miami, Florida.

Register now for the event taking place May 16-18, 2017. 

Sunday, May 14
 

9:00am

Apache Traffic Server and Traffic Control Summit (separate RSVP and Registration Required)
The Apache Traffic Server and Traffic Control Summit is a two-day event taking place just prior to ApacheCon North America. Further details, including the schedule, can be found on the Apache Traffic Server Wiki page.

Registration and a $150 fee are required for this Summit.

Sunday May 14, 2017 9:00am - 5:00pm
Alhambra / Escorial
 
Monday, May 15
 

7:00am

Morning Run
Please meet in the InterContinental Miami Lobby at 7am.  For any questions, contact: jfclere@gmail.com.

Monday May 15, 2017 7:00am - 8:00am
InterContinental Miami Lobby

9:00am

Apache Traffic Server and Traffic Control Summit (separate RSVP and Registration Required)
The Apache Traffic Server and Traffic Control Summit is a two-day event taking place just prior to ApacheCon North America. Further details, including the schedule, can be found on the Apache Traffic Server Wiki page.

Registration and a $150 fee are required for this Summit.

Monday May 15, 2017 9:00am - 5:00pm
Alhambra / Escorial

9:30am

BarCampApache
BarCampApache is a BarCamp facilitated by a group of people involved in the Apache Software Foundation (ASF). All topics are welcome, however! As the ASF is helping to organize, there will be a lot of people around who know a lot about Apache projects, communities, and technologies, so quite a few sessions are normally proposed in those areas. It's not exclusively Apache though, so everyone should come and talk about fun new ideas, projects, and technologies. BarCampApache will be a dynamic get-together open to the public. Like other unconferences, the schedule will be determined by the participants, Apache and non-Apache alike! We strongly encourage lots of people to come along and share their knowledge and ideas. We want it to be a great day of sharing for everyone, not just those at the event. Everyone coming in for the conference is encouraged to come early, as it will be a great day for all.

Monday May 15, 2017 9:30am - 3:00pm
Rafael
  • Experience Level Any
 
Tuesday, May 16
 

7:00am

Morning Run
Please meet in the InterContinental Miami Lobby at 7am.  For any questions, contact: jfclere@gmail.com.

Tuesday May 16, 2017 7:00am - 8:00am
InterContinental Miami Lobby

8:00am

Breakfast
Tuesday May 16, 2017 8:00am - 9:00am
Mezzanine

8:00am

Sponsor Showcase
Tuesday May 16, 2017 8:00am - 12:55pm
Mezzanine

8:00am

Registration
Tuesday May 16, 2017 8:00am - 6:00pm
Mezzanine

9:00am

Keynote: State of the Feather - Sam Ruby, President, Apache Software Foundation
Speakers

Sam Ruby

President, Apache Software Foundation
Sam Ruby is a prominent software developer who has made significant contributions to many of the Apache Software Foundation's open source software projects, and to the standardization of web feeds via his involvement with the Atom web feed standard and the feedvalidator.org web service. He is the co-chair of the W3C's HTML Working Group. He currently holds a Senior Technical Staff Member position in the Emerging Technologies Group of...


Tuesday May 16, 2017 9:00am - 9:20am
Versailles Ballroom

9:25am

Keynote: Alan Gates, Co-founder, Hortonworks
Speakers

Alan Gates

Co-founder, Hortonworks
Alan is a founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan has done extensive work in Hive, including adding ACID transactions. Alan has a BS in Mathematics from Oregon State University and an MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a book from O'Reilly Press.


Tuesday May 16, 2017 9:25am - 9:45am
Versailles Ballroom

10:10am

Keynote: Sandra Matz, Computational Social Scientist
Speakers

Sandra Matz

Computational Social Scientist
Sandra Matz is currently enrolled as a PhD student at the Department of Psychology. After spending a year at the University of Cambridge in 2011/2012, she graduated from the University of Freiburg (Germany) with a 1st Class honours degree in Psychology (BSc) in 2013. Sandra is funded by the German National Academic Foundation, which is Germany's largest and most prestigious funding body. Combining a strong background in methods and statistics...


Tuesday May 16, 2017 10:10am - 10:30am
Versailles Ballroom

10:30am

Coffee Break
Tuesday May 16, 2017 10:30am - 11:05am
Mezzanine

11:05am

Support Apache Cassandra in Production - Anuj Wadehra & Amit Singh, Ericsson
One of the prime challenges in using an open source database like Apache Cassandra is building effective support for production deployments. In their presentation, Anuj Wadehra and Amit Singh, who work as Cassandra designers at Ericsson, will explain the challenges associated with an open source distributed database such as Apache Cassandra, operational best practices, some common issues you can expect in production, and how to overcome them.
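For a flavor of the client-side practices such a session typically covers, here is a minimal sketch using the DataStax Java driver 3.x from Scala (contact point, keyspace, and table are placeholders):

```scala
import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}
import scala.collection.JavaConverters._

object CassandraReadExample {
  def main(args: Array[String]): Unit = {
    // Contact point and keyspace stand in for a real deployment.
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("prod_keyspace")

    // Production reads often pin an explicit consistency level
    // instead of relying on the driver default.
    val stmt = new SimpleStatement("SELECT * FROM users WHERE id = ?", "u123")
      .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)

    session.execute(stmt).asScala.foreach(row => println(row.getString("id")))
    cluster.close()
  }
}
```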

Speakers

Amit Singh

Amit Singh is a Cassandra developer with 7 years of IT experience. He is an O'Reilly Certified Apache Cassandra Developer. He is an Apache Cassandra contributor and has contributed a few patches.

Anuj Wadehra

Architect, Ericsson
Anuj Wadehra is an Apache Cassandra enthusiast with around 10 years of IT experience. Currently, he works as a Cassandra Architect with Ericsson R&D. He is an active contributor on Apache Cassandra mailing lists. He has designed and implemented multiple distributed, fault tolerant, scalable and HA software systems. Anuj Wadehra presented his proposal "Life Saviour" at the Workshop: Smarter and Digital Delhi, 2016. His proposal was shortlisted in...


Tuesday May 16, 2017 11:05am - 11:55am
Windsor

11:05am

Apache​ ​Mahout:​ ​An​ ​Extendable​ ​Machine​ ​Learning​ ​Framework​ ​for​ ​Spark​ ​and​ ​Flink - Trevor Grant, Market6
A serious issue when developing distributed machine learning algorithms is the lack of people who understand the mathematics, distributed data, AND have free time. Further, most distributed engines have APIs that were not designed to be mathematically expressive, implementations are hard to follow, and another qualified person must review them. The Mahout project has spent two years building modular system bindings for distributed engines such as Apache Spark and Apache Flink, native solvers to enable CPU/GPU acceleration, an abstracted R-like Scala DSL for tensor algebra on distributed matrices, and a consistent API to implement distributed algorithms. This creates an extendable and new-contributor friendly framework for machine learning. We'll also discuss the project vision for creating a CRAN-like repository of user-contributed algorithms and how we are evangelizing this vision.
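As an illustration of the R-like DSL the abstract mentions, a minimal Samsara sketch (assuming a Mahout distributed context created via the Spark bindings; the matrix values are arbitrary):

```scala
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// Assumes an implicit DistributedContext, e.g. one created with
// mahoutSparkContext(...) from the Spark bindings.
def gramMatrix(implicit ctx: DistributedContext): Unit = {
  val inCoreA = dense((1, 2, 3), (3, 4, 5), (5, 6, 7))
  val drmA = drmParallelize(inCoreA, numPartitions = 2)

  // R-like tensor algebra on a distributed matrix: A' %*% A
  val drmAtA = drmA.t %*% drmA
  println(drmAtA.collect)
}
```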


Speakers

Trevor Grant

R&D Data Scientist, Market6
Trevor Grant is a PMC member on the Apache Mahout project, and a contributor on the Apache Streams (incubating), Apache Zeppelin, and Apache Flink projects. By day he is an Open Source Technical Evangelist at IBM. In former roles he called himself a data scientist, but the term is so overused these days. He holds an MS in Applied Math and an MBA from Illinois State University. Trevor is an organizer of the Chicago Apache Flink Meetup and the newly formed...


Tuesday May 16, 2017 11:05am - 11:55am
Balmoral

11:05am

Starting with Apache Spark, Best Practices and Learning from the Field - Felix Cheung, Microsoft
Apache Spark is one of the most popular Big Data platforms. In this talk we will give a quick introduction to some of the high-level concepts in Spark and its various modules: SQL, Streaming, ML, Graph and Structured Streaming.

Then we will go through some of the current Best Practices to operationalize Spark for better performance in production, and tips to detect and avoid some of the most common issues.

And lastly we will explore how some enterprises are building solutions with Spark.
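For newcomers, a minimal Spark 2.x session in Scala with a couple of commonly tuned options (paths and values are illustrative, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

object SparkIntro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("intro")
      // Two settings that frequently come up when tuning jobs; the
      // right values depend on data volume and cluster size.
      .config("spark.sql.shuffle.partitions", "200")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val df = spark.read.json("events.json") // placeholder input path
    df.cache()                              // reused across several queries
    df.groupBy("eventType").count().show()
    spark.stop()
  }
}
```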

Speakers

Felix Cheung

Principal Engineer, Microsoft
Felix Cheung is a Committer of Apache Spark, a PMC member/Committer of Apache Zeppelin, and a PPMC member/Committer of Apache MXNet (incubating). He has been active in the Big Data space for 3+ years. He is a co-organizer of the Seattle Spark Meetup, has presented several times, and was a teaching assistant for the very popular edX Introduction to Big Data with Apache Spark and Scalable Machine Learning MOOCs in the summer of 2015. He has presented at previous...


Tuesday May 16, 2017 11:05am - 11:55am
Trianon

11:05am

Recent Improvements on Parquet Support for Hive - Cheng Xu, Intel & Chao Sun, Cloudera
Apache Hive is a popular SQL engine for big data in the Hadoop ecosystem. In Hive, data can be stored in different formats, including columnar storage formats such as Apache ORC and Apache Parquet. With columnar formats, data can be stored and processed much more efficiently, and only the necessary columns need to be accessed.

In this talk we’ll focus on some recent work we’ve done to improve Parquet support for Hive performance. In particular, we’ll discuss the nested column pruning optimization, which enables skipping unnecessary data when reading data of nested types, as well as Parquet vectorization support, which offers a much better alternative than the current row-by-row execution engine for Parquet. We’ll also talk about configuration options that help to achieve optimal performance for these features, and provide some benchmark results we’ve done for these new features.
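As a small illustration of the kind of configuration discussed here, a sketch that enables Hive's vectorized execution over a Parquet table via JDBC (connection URL, table, and columns are placeholders; consult the talk and the Hive documentation for the full set of relevant options):

```scala
import java.sql.DriverManager

object HiveParquetExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()

    // Turn on vectorized execution, one of the features the talk covers.
    stmt.execute("SET hive.vectorized.execution.enabled=true")

    // With a columnar format like Parquet, only the referenced
    // columns need to be read from disk.
    val rs = stmt.executeQuery(
      "SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
    while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getLong(2)}")
    conn.close()
  }
}
```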

Speakers

Chao Sun

Chao Sun is currently a Software Engineer at Cloudera, Inc. He has been working on the Hive on Spark project since joining the company in mid-2014. Prior to that, he was a PhD student in Computer Science at UW-Milwaukee, focusing on type systems and mechanized proofs.

Cheng Xu

Senior Software Engineer, Intel
I am a software engineer at Intel, working on the Apache Hive, Apache Parquet, and Apache Spark projects. I am a committer on the Apache Hive project. Currently I am focused on Spark authorization, especially in the Spark SQL component, and on performance improvements in the Apache Parquet project.


Tuesday May 16, 2017 11:05am - 11:55am
Alhambra

11:05am

eBay Real-time Business Insight with Streaming Engine Built on Apache Kylin - Ken Wang, eBay

Real-time data insight is becoming more important for trend capturing and just-in-time decision making. eBay, as one of the world's largest and most vibrant marketplaces, relies on real-time data analysis in multiple domains to run the business, such as user info protection, promotion prediction, and site performance detection and monitoring.

In this session Ken will introduce a new zero-latency streaming OLAP engine built on Apache Kylin and explain how it serves eBay's real-time data analysis business. The new Kylin streaming engine uses column-based storage and indexes, as well as in-memory query techniques, to make real-time data visible with no latency. The new streaming engine will also provide exactly-once delivery semantics to ensure data quality when used together with Apache Kafka.


Speakers

Ken Wang

Ken has worked at eBay as a senior architect for more than 9 years, focusing on data platform infrastructure such as real-time streaming, MOLAP on Hadoop, and SQL on Hadoop.


Tuesday May 16, 2017 11:05am - 11:55am
Biscayne

12:05pm

Cassandra on ARMv8 - A Comparison with x86 and Other Processor Platforms - Manish Singh, MityLytics
In this talk we present our results from evaluating Cassandra on ARMv8-based servers in the context of building real-time analytics platforms and apps. A platform built for real-time analytics is part of an ecosystem which typically consists of a Kafka-based ingestion engine and a Spark stream-processing engine in addition to Cassandra. We use apps from several benchmark suites to compare our results to x86 platforms and GPU systems, which have recently become quite popular. Our studies focus not just on performance but also on a cost-benefit analysis.

Speakers

Manish Singh

Manish is CTO and co-founder of MityLytics, which develops products to help customers make the transition to Big Data platforms and to continue to grow and tune their Big Data analytics platforms and apps using MityLytics software. He has built, deployed and maintained massively distributed and parallel systems at several companies, namely Citrix-NetScaler, Lucent-Ascend, and Ericsson-Redback. Most recently Manish was at GoGrid - a Cloud Infrastructure...


Tuesday May 16, 2017 12:05pm - 12:55pm
Windsor

12:05pm

Apache Hivemall: Scalable Machine Learning Library for Apache Hive/Spark/Pig - Makoto Yui, Treasure Data, Inc. & Takeshi Yamamuro, NTT
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig. It provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through Hive UDFs/UDAFs/UDTFs, and is very easy to use because every machine learning step is done within HiveQL.

Hivemall entered the Apache Incubator on September 13, 2016, and the project plans its initial Apache release in Q1 2017. In this talk, Makoto Yui will give a walk-through of the features, usage, and future roadmap of Apache Hivemall, and Takeshi Yamamuro will introduce Hivemall on Apache Spark in depth.

We consider this talk particularly interesting and relevant to people already familiar with Apache Hive and/or Apache Spark and working on big data analytics.
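To make the "every step in HiveQL" idea concrete, a sketch of training a logistic regression model with Hivemall's train_logregr UDTF, issued here through a Hive-enabled Spark session (table and column names are hypothetical, and the Hivemall functions must be registered first):

```scala
import org.apache.spark.sql.SparkSession

object HivemallSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // train_logregr emits (feature, weight) pairs; averaging across
    // trainer tasks yields the final model, stored as a plain table.
    spark.sql(
      """CREATE TABLE lr_model AS
        |SELECT feature, AVG(weight) AS weight
        |FROM (
        |  SELECT train_logregr(features, label) AS (feature, weight)
        |  FROM training_samples
        |) t
        |GROUP BY feature""".stripMargin)
  }
}
```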

Speakers

Takeshi Yamamuro

Takeshi Yamamuro is a Research Engineer at NTT, a telecommunications company in Japan, working on database backends and SIMD/GPU-aware algorithms. He is a contributor to Hivemall. He worked on porting Hivemall functions to Apache Spark and developing a Parameter Mixing module that runs on Apache Hadoop YARN.

Makoto Yui

Research Engineer, Treasure Data, Inc.
Makoto Yui is a Research Engineer at Treasure Data, Inc., a Hadoop-as-a-Service startup. He is leading the development of Apache Hivemall, an open source library for scalable machine learning on Apache Hive, Apache Spark, and Apache Pig. He holds a PhD in computer science from NAIST. Find his profile at http://myui.github.io/


Tuesday May 16, 2017 12:05pm - 12:55pm
Balmoral

12:05pm

Profiling Spark Applications - Jayesh Thakrar, Conversant
Are you interested in harnessing and analyzing the data that drives the Spark Web UI? Are you keen to use that data to tune your applications or understand fluctuations in runtime of your production applications? Do you want to understand the efficiency of your Spark executors and system resources?

This presentation will help you do that and more, by walking through the wealth of data in Spark application events. This data can be used as a foundation for a Spark profiler and advisor that analyzes application events in batch or real-time.
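The events behind the Spark Web UI are available programmatically as well; a minimal sketch of tapping them with a custom SparkListener (the metrics printed are an arbitrary selection):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object TaskMetricsListener {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("profiler-demo").getOrCreate()

    // The same events that drive the Spark Web UI are delivered to
    // registered listeners and can feed a custom profiler/advisor.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null)
          println(s"stage=${taskEnd.stageId} runtime=${m.executorRunTime}ms " +
            s"gc=${m.jvmGCTime}ms shuffleRead=${m.shuffleReadMetrics.totalBytesRead}B")
      }
    })

    spark.range(1000000L).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```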

Speakers

Jayesh Thakrar

Officially, Jayesh Thakrar is a Sr. Data Engineer at Conversant (http://www.conversantmedia.com/). But in reality he is a data geek who gets to build and play with large data systems consisting of Hadoop, HBase, Ambari, Flume and Kafka. To rest after a good day's work, he uses OpenTSDB to keep an eye on all the systems.


Tuesday May 16, 2017 12:05pm - 12:55pm
Trianon

12:05pm

Even Faster: When Presto Meets Parquet @ Uber - Zhenxiao Luo, Uber
As Uber continues to grow, our big data systems need to grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. We put Presto into production in 2016; it now serves ~100K queries per day at Uber and has become a key component for interactive SQL queries on big data. In this presentation we will talk about our experiences and engineering efforts. We start with a general introduction to Hadoop infrastructure and analytics at Uber, followed by a brief introduction to Presto, the interactive SQL engine for big data. We will focus on how we built the new Parquet reader for Presto and its key techniques: columnar reads, lazy reads, and nested column pruning. We will show performance improvements and Uber's use cases. Finally, we would like to share our ongoing work.
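For orientation, interactive queries like the ones described reach Presto through standard clients; a sketch using the Presto JDBC driver (host, catalog, and table are placeholders):

```scala
import java.sql.DriverManager
import java.util.Properties

object PrestoQuery {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty("user", "analyst")
    // The URL path selects catalog and schema: here the Hive catalog.
    val conn = DriverManager.getConnection(
      "jdbc:presto://presto-coordinator:8080/hive/default", props)
    val rs = conn.createStatement().executeQuery(
      "SELECT city, COUNT(*) AS trips FROM trips_parquet GROUP BY city")
    while (rs.next()) println(s"${rs.getString("city")}: ${rs.getLong("trips")}")
    conn.close()
  }
}
```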

Speakers

Zhenxiao Luo

Uber
Zhenxiao Luo is a software engineer at Uber. He leads interactive SQL engine projects for Hadoop, specifically, Presto and Parquet. Before joining Uber, he led the development and operations of Presto at Netflix. Zhenxiao has big data experience at Facebook, Cloudera, and Vertica on Hadoop-related projects. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.


Tuesday May 16, 2017 12:05pm - 12:55pm
Alhambra
  • Experience Level Any

12:55pm

Lunch (Attendees on Own)
Tuesday May 16, 2017 12:55pm - 2:30pm
TBA

2:30pm

Online and Offline Analytics on Cassandra in eBay - Yi Liu & DongQian Liu, eBay
eBay is one of the largest e-commerce companies in the world, providing C2C and B2C sales services via the Internet. We use Cassandra to store large tables for online query. To reduce the Cassandra load, we do offline analytics of Cassandra tables: we dump SSTables to HDFS and transform them to Hadoop file formats. In this session, we introduce how we built a high-performance, cross-datacenter Cassandra cluster for online query. For offline analytics, we introduce how we implemented a splittable input format for SSTables and the transformation to Hadoop file formats. We also introduce how we use a bulk loader tool to load data from Hadoop into Cassandra quickly.

Speakers

Yi Liu

Architect, eBay
Yi Liu (刘轶) has been a committer and PMC member of Apache Hadoop for years. Currently he is lead architect for Paid IM (Internet Marketing) at eBay, where he leads the architecture design for ads, marketing data, and the experimentation platform, using Hadoop, Spark, Kafka, Cassandra and other open source projects to build these platforms. Before joining eBay, he worked at Intel for 6 years as an architect for big data infrastructure, where he led Hadoop...


Tuesday May 16, 2017 2:30pm - 3:20pm
Windsor

2:30pm

Spark SQL + Pig-Latin: Combine Query Language and Data Flow Language for Data Science - Jeff Zhang, Hortonworks
Data science is a very broad field involving lots of techniques and knowledge, but overall we can split it into two steps: data munging and data analysis. SQL is intrinsically well suited to data analysis, but it is not good at data munging. For data munging in the Spark ecosystem people have lots of options, like the RDD API or the DataSet API, but the learning curve for these APIs is a little steep. We provide an alternative: Pig Latin. Pig Latin is a data flow language which is very suitable for data munging and easy to learn; it was originally designed for the MapReduce engine. We made it support the Spark engine and share the same SparkContext with Spark SQL, so that data can be shared between Spark and Pig.

In this talk, I will describe how we integrated Pig Latin with Spark SQL and demonstrate how it can help your team get actionable insight from data.
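For a feel of the data flow style, a sketch of an embedded Pig Latin munging step (PigServer is Pig's embedding API; the "spark" execution type assumes a Pig build with Spark support, and paths/schema are placeholders):

```scala
import org.apache.pig.PigServer

object PigOnSparkSketch {
  def main(args: Array[String]): Unit = {
    // "spark" as the execution engine assumes Pig 0.17+ with Spark support.
    val pig = new PigServer("spark")
    pig.registerQuery("raw = LOAD 'events.tsv' AS (user:chararray, amount:double);")
    pig.registerQuery("clean = FILTER raw BY amount > 0;")
    pig.registerQuery("by_user = FOREACH (GROUP clean BY user) " +
      "GENERATE group AS user, SUM(clean.amount) AS total;")
    pig.store("by_user", "totals_out") // munged data, ready for SQL analysis
    pig.shutdown()
  }
}
```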

Speakers

Jeff Zhang

Jeff has 8 years of experience in the big data industry. He started using Hadoop in 2009 and is an Apache Pig/Tez committer (Tez PMC). His past experience is not only in big data infrastructure, but also in how to leverage these big data tools to get insight. He has spoken several times at big data conferences like Hadoop Summit and Strata + Hadoop World. Now he works at Hortonworks as a member of technical staff. Hortonworks is a leading innovator in the...


Tuesday May 16, 2017 2:30pm - 3:20pm
Alhambra

2:30pm

What It Takes to Process a Trillion Events a Day: Case-Studies in Scaling Stream Processing at LinkedIn - Jagadish Venkatraman, LinkedIn
In this talk, we will present practical case-studies of large scale stream processing applications at LinkedIn. Example applications discussed will include:
  • LinkedIn’s real-time communication platform that delivers relevant content at massive scale to our 450M members. 
  • The LinkedIn feed that processes billions of events each day, and keeps track of what members viewed on their news feed. 
We will present the hard scalability problems we had to solve in each of these applications and the techniques used to address them. Problems include scaling ingestion of events, partitioned processing, highly performant data access, and efficient remote I/O. We will explain how we leveraged and improved Apache Samza in addressing these problems and how we scaled to process over a trillion events every day.
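For readers new to Samza, a minimal sketch of its per-message processing unit (the event schema and filtering logic are invented for illustration):

```scala
import org.apache.samza.system.IncomingMessageEnvelope
import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

// A Samza StreamTask: process() is invoked once per incoming event,
// and one task instance runs per input partition, which is how
// processing scales out across a partitioned stream.
class PageViewCounterTask extends StreamTask {
  private var pageViews = 0L

  override def process(envelope: IncomingMessageEnvelope,
                       collector: MessageCollector,
                       coordinator: TaskCoordinator): Unit = {
    val event = String.valueOf(envelope.getMessage)
    if (event.contains("\"type\":\"page_view\"")) pageViews += 1
    // A real job would emit results to an output stream or a changelog.
  }
}
```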

Speakers

Jagadish Venkatraman

Jagadish Venkatraman is an Apache Samza committer and a Senior Software Engineer in the Streams Infrastructure group at LinkedIn. He has been working on building, scaling and improving Apache Samza at LinkedIn. He has four years of experience working on practical problems at the intersection of large scale data, stream processing and storage infrastructure. Jagadish was a grad student and a research assistant at Stanford University where he...


Tuesday May 16, 2017 2:30pm - 3:20pm
Biscayne

2:30pm

Sponsor Showcase
Tuesday May 16, 2017 2:30pm - 7:00pm
Mezzanine

3:30pm

Real-World Tales of Repair with Apache Cassandra - Alexander Dejanovski, TheLastPickle
Distributed databases inevitably have to deal with entropy. Within Apache Cassandra, the anti-entropy process initiated via CLI tools is the way to ensure consistency of data on disk. Over the many years of the Apache Cassandra project it has also been one of the biggest operator pain points: without a solid repair process in place, there is no guarantee that deleted data will not come back to life, or that data is fully distributed to replicas when nodes fail.

In this talk Alexander Dejanovski, Consultant at The Last Pickle, will explain how Anti-Entropy works and why it should be run on your cluster. He will discuss the different types of repair and their effect on data consistency. He will also introduce tools such as Cassandra Reaper and the range repair script to manage scheduling and running repairs in the most efficient way.

Speakers

Alexander Dejanovski

Consultant, Apache Cassandra @TheLastPickle. Alexander has been working as a software developer since 1998, mainly for Chronopost, where he led the effort to build a Cassandra-based architecture and migrate critical services to it from traditional RDBMS. He is involved in the Cassandra community through the development of a JDBC wrapper for the DataStax Java Driver. He co-hosts a French podcast called "Big Data Hebdo" and has been...


Tuesday May 16, 2017 3:30pm - 4:20pm
Windsor

3:30pm

Large Scale Processing of Unstructured Text - Suneel Marthi, Red Hat
Natural Language Processing (NLP) practitioners often have to analyze large corpora of unstructured documents, which is often tedious. Python tools like NLTK help with this to a certain extent but do not scale to very large data sets and cannot be plugged into a distributed, scalable framework like Apache Flink. The Apache OpenNLP library is a popular machine learning toolkit for processing unstructured text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron based machine learning. The audience will come away with a better understanding of how the various OpenNLP components can help process large corpora of unstructured text.
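A minimal sketch of two of the OpenNLP components named above, tokenization and part-of-speech tagging (the .bin files are the standard pre-trained English models distributed for OpenNLP):

```scala
import java.io.FileInputStream
import opennlp.tools.postag.{POSModel, POSTaggerME}
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}

object OpenNlpSketch {
  def main(args: Array[String]): Unit = {
    // en-token.bin / en-pos-maxent.bin are pre-trained English models.
    val tokenizer = new TokenizerME(
      new TokenizerModel(new FileInputStream("en-token.bin")))
    val tagger = new POSTaggerME(
      new POSModel(new FileInputStream("en-pos-maxent.bin")))

    val tokens = tokenizer.tokenize("Apache OpenNLP processes unstructured text.")
    val tags = tagger.tag(tokens)
    tokens.zip(tags).foreach { case (tok, tag) => println(s"$tok/$tag") }
  }
}
```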

Speakers

Suneel Marthi

Principal Engineer, Red Hat
Suneel Marthi is a member of the Apache Software Foundation and a PMC member on Apache Mahout, Apache OpenNLP and Apache Pirk. He has previously presented at Apache Big Data, Hadoop Summit Europe and Flink Forward.


Tuesday May 16, 2017 3:30pm - 4:20pm
Balmoral

3:30pm

Writing Apache Spark Applications Using Apache Bahir - Luciano Resende & Leucir Marin, IBM
Big Data is all about being able to access and process data in various formats, and from various sources. Apache Bahir provides extensions to distributed analytics platforms, giving them access to different data sources. In this talk, we will introduce you to Apache Bahir and the various connectors that are available for Apache Spark and Apache Flink. We will also go over the details of how to build, test and deploy a Spark application using the MQTT data source for the new Apache Spark Structured Streaming functionality.
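A sketch of reading the Bahir MQTT source from Structured Streaming (the provider class name follows the Bahir documentation; the broker URL and topic are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object MqttStructuredStreaming {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mqtt-demo").getOrCreate()

    // Bahir's MQTT source provider for Structured Streaming;
    // the load() argument is the broker URL.
    val lines = spark.readStream
      .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
      .option("topic", "sensors/temperature")
      .load("tcp://broker.example.com:1883")

    val query = lines.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```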

Speakers

Leucir Marin

Sr. Software Engineer, IBM

Luciano Resende

Architect, Spark Technology Center, IBM
Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at the ASF for over 10 years; he is a member of the ASF and is currently contributing to various big data related Apache projects, including Spark, Zeppelin, and Bahir. Luciano is the project chair for Apache Bahir and also spends time mentoring newly created Apache Incubator projects. At IBM, he contributed to several IBM big data offerings, including...


Tuesday May 16, 2017 3:30pm - 4:20pm
Trianon

3:30pm

Leveraging Smart Meter Data for Electric Utilities: Comparison of Spark SQL with Hive - Yusuke Furuyama, Hitachi
Hitachi has focused on social innovation business. It has constantly evolved to create sustainable business products and solutions to enhance the quality of life across the globe. Now we are leveraging smart meter data for electric utilities. To meet their needs, we compared the performance of batch processing for aggregating smart meter data using Hadoop (MapReduce), Spark 1.6, and Spark 2.0, varying parameters such as the amount of input data, the processing logic, and the input file format. In this session, we report the results of these performance tests.

Speakers

Yusuke Furuyama

Yusuke Furuyama is a solution engineer at Hitachi. His team drives the utilization of the Hadoop ecosystem, and he works on offering and co-creating progressive Hadoop solutions for customers building enterprise systems. He is now focusing on Apache Spark and Apache HBase.


Tuesday May 16, 2017 3:30pm - 4:20pm
Alhambra
  • Experience Level Any

3:30pm

The Continuing Story Of Batching To Streaming Analytics At Optimizely - Michael Borsuk, Optimizely
At Optimizely we track billions of user events, such as page views, clicks and custom events, on a daily basis to provide our customers with immediate access to key analytics and business insights. Because of this, we are constantly innovating on our data ingestion pipeline. Over the course of development we have moved from batch data ingestion, to streaming, to a hybrid or "lambda" approach, and back to full streaming again. I will present the technical details and challenges in developing this system, which includes use of Apache Samza, Flume, Kafka, HBase and Hadoop, as well as some of the lessons learned along the way.

This talk will summarize the story we described in this blog post and present where we have gone since: http://highscalability.com/blog/2016/11/16/the-story-of-batching-to-streaming-analytics-at-optimizely.html

Speakers

Michael Borsuk

Mike Borsuk is a software engineer with 12 years of experience building software and hardware products. His focuses have been on pragmatic development of scalable services and distributed systems, efficient mobile products, and application monitoring and measurement. He currently works as a Senior Engineer on the distributed systems team at Optimizely. Previously Mike has spoken at the CSUN Assistive Technology Conference while working for...


Tuesday May 16, 2017 3:30pm - 4:20pm
Biscayne

4:20pm

Coffee Break
Tuesday May 16, 2017 4:20pm - 4:40pm
Mezzanine

4:40pm

Cassandra Persistence for Online Systems, What Actually Works - John Sumsion, FamilySearch
In a project to port FamilySearch's billion-person tree from Oracle to Cassandra in AWS, a novel consistency model emerged. Many of the initial design assumptions ended up working well; however, some surprising errors occurred, which forced some adjustments.

In this presentation, John will review what worked and what didn't in developing a system that achieved low latencies while allowing live updates.

The presentation will include Cassandra schema details, data consistency mechanisms, and specific solutions to the data consistency problems encountered.

Speakers

John Sumsion

Principal Software Engineer, FamilySearch
John Sumsion is an experienced Software Engineer who has played key roles in making big-data projects that actually work. Much of John's experience has been gained in building several progressively better implementations of the billion-person tree for FamilySearch. John enjoys using and contributing as much as possible in FOSS. John has spoken on several topics in formal and informal venues: Cassandra Summit, NoSQL Matters, internal...


Tuesday May 16, 2017 4:40pm - 5:30pm
Windsor

4:40pm

HydraR: An R-Based Scalable Machine Learning Framework - Alok Singh, IBM Spark Technology Center
R is the de facto standard for data analysis and statistics. We introduce HydraR, an open source project integrating Apache SystemML and Apache SparkR/Spark features so that the R community can benefit from a scalable machine learning framework. HydraR is a client-side library written in R, built on top of R, SparkR and SystemML, and it allows one to create custom scalable machine learning algorithms in addition to the canned algorithms.

In this talk, we will provide a technical overview of HydraR (an R package), its API, the supported canned algorithms, and its integration with Spark and SystemML. We will walk through a small example of creating a custom algorithm and give a demo. We will share our experience of using HydraR and its variants with IBM clients. The talk will conclude with pointers to how the audience can try out HydraR and a discussion of potential areas of community collaboration.

Speakers

Alok Singh

Alok Singh is a Principal Engineer at the IBM Spark Technology Center, where he leads the HydraR project. He has built and architected multiple analytical frameworks and implemented machine learning algorithms. His interest is in creating Big Data and scalable machine learning software and algorithms, and he has presented on those topics at various internal and external conferences.


Tuesday May 16, 2017 4:40pm - 5:30pm
Balmoral

4:40pm

Creating a Recommender System with ElasticSearch & Apache Spark - Alvaro Santos Andres, Ericsson
Recommender systems have changed the way companies and people interact with each other. Does your organisation need a 360° view of its customers? Today it is possible to recommend the right products to customers or potential customers - for example, a film based on their previous interests, or a new accessory that fits their model of smartphone.

The technology behind recommender systems has evolved significantly over the past 20 years and with the explosion of Big Data technologies, there are tools that can create very powerful recommender systems. This introduction will explain how Recommender Systems work, describing their main functionalities, and providing some basic algorithms frequently used in such systems. We will look at how to create a Recommender System using technologies like Apache Spark and ElasticSearch.
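As a concrete starting point, a sketch of collaborative filtering with Spark ML's ALS (assumes Spark 2.2+ for recommendForAllUsers; the ratings file and the Elasticsearch indexing step shown in the comment are illustrative):

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object RecommenderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("recommender").getOrCreate()

    // Placeholder data set with (userId, itemId, rating) columns.
    val ratings = spark.read
      .option("header", "true").option("inferSchema", "true")
      .csv("ratings.csv")

    val model = new ALS()
      .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
      .setRank(10).setMaxIter(10).setRegParam(0.1)
      .fit(ratings)

    val recs = model.recommendForAllUsers(5)
    // With the elasticsearch-hadoop connector on the classpath, the
    // result could be indexed for low-latency serving, e.g.:
    //   import org.elasticsearch.spark.sql._
    //   recs.saveToEs("recommendations/user")
    recs.show(5)
  }
}
```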

Speakers

Alvaro Santos Andres

Big Data Solution Architect, Ericsson
Big Data Software Architect with more than 10 years of experience. For the last 3 years I have focused 100% of my time on Big Data projects, in which I have developed several personalization services used by millions of users, giving them a better experience, as well as company data transformations. Born with Java, I am now a great lover of Scala and functional programming languages. If we are speaking about Big Data, I would say Spark and the Hadoop Apache...


Tuesday May 16, 2017 4:40pm - 5:30pm
Trianon

4:40pm

Efficient Columnar Storage with Apache Parquet - Ranganathan Balashanmugam, ThoughtWorks
Apache Parquet brings the advantages of compressed, efficient columnar data representation to any project in the Hadoop ecosystem. Apache Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested namespaces. Apache Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Apache Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented. This talk highlights the internal implementation of Apache Parquet.
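To see the columnar ideas from the outside, a small Spark sketch that writes nested records to Parquet and reads back only two columns (paths and schema are illustrative; the shredding described above happens inside the Parquet writer):

```scala
import org.apache.spark.sql.SparkSession

object ParquetNestedSketch {
  case class Address(city: String, zip: String)
  case class User(id: Long, name: String, address: Address)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()
    import spark.implicits._

    val users = Seq(User(1L, "Ada", Address("Miami", "33131"))).toDS()

    // Nested structures are shredded into columns (Dremel-style);
    // a compression codec is chosen at write time.
    users.write.option("compression", "snappy").parquet("users.parquet")

    // Reading back only id and the nested city column: a columnar
    // reader touches just those columns, not whole records.
    spark.read.parquet("users.parquet").select($"id", $"address.city").show()
  }
}
```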

Speakers

Ranganathan Balashanmugam

Head of Engineering - India, Aconex
Ranganathan has nearly twelve years of experience developing awesome products and loves to work on the full stack - from front end to back end and scaling. He is Head of Engineering - India at Aconex and prior to that was a Technology Lead at ThoughtWorks. He is a Microsoft MVP for Data Platform 2016. He runs one of the top technology meetups in Hyderabad - the Hyderabad Scalability Meetup. He is very interested in exploring Big Data technologies and a...


Tuesday May 16, 2017 4:40pm - 5:30pm
Alhambra

4:40pm

From Batch to Streaming ET(L) with Apache Apex - Thomas Weise, Atrato.io
Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale.

This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks.
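A skeleton of how an Apex application is expressed as a DAG of operators (the toy generator/printer operators below stand in for real connectors such as the Kafka input and file/database output operators from the Apex Malhar library):

```scala
import com.datatorrent.api.{DAG, DefaultInputPort, DefaultOutputPort,
  InputOperator, StreamingApplication}
import com.datatorrent.common.util.BaseOperator
import org.apache.hadoop.conf.Configuration

// Toy source: emits an increasing sequence of numbers.
class NumberGenerator extends BaseOperator with InputOperator {
  @transient val out = new DefaultOutputPort[java.lang.Long]()
  private var n = 0L
  override def emitTuples(): Unit = { out.emit(n); n += 1 }
}

// Toy sink: prints whatever arrives on its input port.
class Printer extends BaseOperator {
  @transient val in = new DefaultInputPort[java.lang.Long]() {
    override def process(t: java.lang.Long): Unit = println(s"got $t")
  }
}

// The application wires operators into a DAG; Apex handles the
// deployment, partitioning and fault tolerance on YARN.
class EtlApplication extends StreamingApplication {
  override def populateDAG(dag: DAG, conf: Configuration): Unit = {
    val gen = dag.addOperator("gen", new NumberGenerator)
    val printer = dag.addOperator("printer", new Printer)
    dag.addStream("numbers", gen.out, printer.in)
  }
}
```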

Speakers

Thomas Weise

CTO, Atrato.io
Thomas is the Apache Apex PMC Chair and CTO at Atrato. Prior to founding Atrato he was an Architect at DataTorrent and led the development of Apex from the beginning of the project. Before that he was a member of the Hadoop team at Yahoo! and contributed to several big data ecosystem projects. Thomas has developed distributed systems since 1997 and frequently speaks at international big data conferences like Hadoop Summit and ApacheCon and at...


Tuesday May 16, 2017 4:40pm - 5:30pm
Biscayne

6:00pm

PGP Key Signing: Expanding the Web of Trust
Why participate in the key signing? Among other things, all Apache releases are PGP-signed; but a key with no signatures attesting to its authenticity isn't very useful. Bring your key (which you will have emailed to our special address at apachecon-keysigning@apache.org) and sign. You will need a pen and some form of identification.

Please see the wiki page for more information:
http://wiki.apache.org/apachecon/PgpKeySigning

Tuesday May 16, 2017 6:00pm - 7:00pm
Mezzanine
 
Wednesday, May 17
 

7:00am

Morning Run
Please meet in the InterContinental Miami Lobby at 7am.  For any questions, contact: jfclere@gmail.com.

Wednesday May 17, 2017 7:00am - 8:00am
InterContinental Miami Lobby

8:00am

Breakfast
Wednesday May 17, 2017 8:00am - 9:00am
Mezzanine

8:00am

Registration
Wednesday May 17, 2017 8:00am - 6:00pm
Mezzanine

9:00am

Keynote to be announced
Wednesday May 17, 2017 9:00am - 9:20am
Versailles Ballroom

9:25am

Keynote Panel Discussion: How to Succeed in IoT 2.0 - Abhi Arunachalam, Battery Ventures; Sudip Chakrabarti, Lightspeed Venture Partners; James Pace, Runtime; Roman Shaposhnik, Pivotal
Moderators

Roman Shaposhnik

Director of Open Source, Pivotal Inc.
Roman Shaposhnik is Director of Open Source at Pivotal Inc. and VP of Technology for ODPi at the Linux Foundation. He is a committer on Apache Hadoop, co-creator of Apache Bigtop, and contributor to various other Hadoop ecosystem projects. He is also an ASF member and a former Chair of the Apache Incubator. In his copious free time he managed to co-author "Practical Graph Analytics with Apache Giraph," and he posts to Twitter as @rhatr. Roman has...

Speakers

Abhi Arunachalam

Abhi Arunachalam is an investor at Battery Ventures. He focuses on early and growth stage investments in sectors such as security, big-data analytics and AI. He has 12+ years of technology and investment experience. Abhi is currently involved in Battery's investments in InfluxData, Fungible, JFrog, Expel & Jask. Before joining Battery, Abhi was a Partner at Intel Capital. At Intel Capital, he helped lead investments in category defining...

Sudip Chakrabarti

Partner, Lightspeed Venture Partners
Sudip is a partner at Lightspeed Venture Partners where he focuses on enterprise and infrastructure software investments. Prior to joining Lightspeed, Sudip was a partner at Andreessen Horowitz where he invested in and worked with companies such as Actifio, Alluxio, Cumulus Networks, Databricks, DigitalOcean, Forward Networks, Mesosphere, Samsara, etc. He started his venture career at Osage University Partners where he invested in Menlo Security...

James Pace

CEO, Runtime
James is CEO and Co-Founder of Runtime: an early stage company providing significant contributions to open source for the IoT and embedded community. Apache Mynewt, a project under the Apache Software Foundation, provides an OS and development framework for embedded developers everywhere! James has held a number of roles relevant to scaling and managing the IoT: at Silver Spring Networks, a company enabling deployment and management of 23 million...


Wednesday May 17, 2017 9:25am - 9:50am
Versailles Ballroom

9:50am

Coffee Break
Wednesday May 17, 2017 9:50am - 10:15am
Mezzanine

9:50am

Sponsor Showcase
Wednesday May 17, 2017 9:50am - 1:05pm
Mezzanine

10:15am

Using Apache Beam for Batch, Streaming, and Everything in Between - Frances Perry & Dan Halperin, Google
Apache Beam is a unified programming model capable of expressing a wide variety of both traditional batch and complex streaming use cases. By neatly separating properties of the data from run-time characteristics, Beam enables users to easily tune requirements around completeness and latency and run the same pipeline across multiple runtime environments. In addition, Beam's model enables cutting edge optimizations, like dynamic work rebalancing and autoscaling, giving those runtimes the ability to be highly efficient.

This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in its powerful programming model. We'll include detailed, concrete examples of how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios.
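A compact word-count pipeline in the Beam model, using the Beam 2.x Java SDK from Scala (file paths are placeholders; by default this executes on the direct runner, and the same code can target other runners via pipeline options):

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.{Count, DoFn, ParDo}
import org.apache.beam.sdk.values.KV

object BeamWordCount {
  // Splits each line into words.
  class ExtractWords extends DoFn[String, String] {
    @ProcessElement
    def process(c: DoFn[String, String]#ProcessContext): Unit =
      c.element().toLowerCase.split("\\W+").filter(_.nonEmpty).foreach(c.output)
  }
  // Formats (word, count) pairs for text output.
  class FormatCounts extends DoFn[KV[String, java.lang.Long], String] {
    @ProcessElement
    def process(c: DoFn[KV[String, java.lang.Long], String]#ProcessContext): Unit =
      c.output(s"${c.element().getKey}: ${c.element().getValue}")
  }

  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).create()
    val p = Pipeline.create(options)
    p.apply(TextIO.read().from("input.txt"))
      .apply(ParDo.of(new ExtractWords))
      .apply(Count.perElement[String]())
      .apply(ParDo.of(new FormatCounts))
      .apply(TextIO.write().to("counts"))
    p.run().waitUntilFinish()
  }
}
```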

Speakers

Dan Halperin

Google
Dan Halperin is a PMC member of Apache Beam. He has worked on Beam and Google Cloud Dataflow for 2 years. Previously, he was the director of research for scalable data analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in oceanography, astronomy, medical informatics, and the life sciences.

Frances Perry

Software Engineer, Google
Frances Perry is a PMC member of Apache Beam and an engineer at Google who loves making big data processing easy, intuitive, and efficient. After many years working on Google's internal data processing stack, she joined the Cloud Dataflow team to make this technology available to cloud customers. She's been involved in Apache Beam since its inception.


Wednesday May 17, 2017 10:15am - 11:05am
Balmoral

10:15am

TensorFlow on YARN - Zhankun Tang, Intel
Deep learning is quickly emerging as one of the most important and promising big data applications. TensorFlow, with its well-designed APIs, rich libraries, and strong community, is arguably the most popular open source deep learning library. Although the majority of current TensorFlow use cases are in standalone mode, organizations are facing the pressing need to run TensorFlow jobs reusing existing big data infrastructure. In other words, in the coming year we envision TensorFlow being productionized at massive scale.

In this talk we will present TOY, our project to natively support TensorFlow in Apache Hadoop YARN. TOY greatly simplifies the process of submitting TensorFlow programs and provisioning TensorFlow clusters dynamically. Its architecture is also flexible enough to allow future extensions such as checkpoint-restart for long-running training jobs, and integration with TensorBoard.

Speakers

Zhankun Tang

Zhankun Tang joined Intel in 2013 as a software engineer. He is now focusing on Apache Hadoop YARN, machine learning, and related areas.


Wednesday May 17, 2017 10:15am - 11:05am
Alhambra

10:15am

Dataservices: Processing Big Data the Microservice Way - Tobias Polzer, QAware GmbH
We see a big data processing pattern emerging that uses the microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we'll introduce Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring), and some real-life architectures.

Speakers

Wednesday May 17, 2017 10:15am - 11:05am
Windsor

10:15am

ZStream: Building a Real-Time Transactional Streaming Storage Over Apache DistributedLog - Kai S., Sequenced
Systems and infrastructure that aid and facilitate real-time data delivery, consumption and analysis play a crucial role in streaming data processing. ZStream is one such piece of real-time infrastructure: a real-time transactional streaming storage system designed for storing fast-ingested real-time data. ZStream uses Apache DistributedLog as the transaction log for both data and metadata. By leveraging Apache DistributedLog, ZStream is able to deliver millions of events in milliseconds.

In this presentation, Kai will offer an overview of ZStream and its features. He will also share the thoughts and lessons on how ZStream uses Apache DistributedLog to build a real-time streaming storage.

Speakers
KS

Kai S.

Kai S. is the cofounder of Sequenced, a company focused on real-time data infrastructure. He is actively involved in and contributing to several open source projects, including Apache BookKeeper and Apache DistributedLog (incubating).


Wednesday May 17, 2017 10:15am - 11:05am
Biscayne
  • Experience Level Any

10:15am

Evolution of an Apache Spark Architecture for Processing Game Data - Nick Afshartous, Warner Brothers Interactive Entertainment (WBIE)
We discuss lessons learned from our first production deployment of a Spark Streaming pipeline for processing game data. Deployment is to the AWS Cloud, where we use managed services (i.e. EMR, S3 and Redshift). However, having downstream dependencies with outages and unpredictable response latencies can pose significant challenges. To address this, we evolved the architecture by separating data processing from post-processing tasks (i.e. copying data into Redshift). Post-processing tasks are sent downstream from Spark to a task executor that was built using Akka Streams and Reactive Kafka. The end result is a loosely coupled architecture where the Spark Streaming job is a firehose to S3 and is fault-tolerant when Redshift is unavailable.
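A sketch of the "firehose" stage of such an architecture: a Kafka direct stream whose micro-batches land in S3, with all Redshift work deferred to a separate component (topic, bucket, and batch interval are invented for illustration):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FirehoseToS3 {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("firehose"), Seconds(60))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "firehose")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("game-events"), kafkaParams))

    // Firehose: persist each micro-batch to S3. Post-processing (e.g.
    // Redshift copies) is handled by a separate, decoupled executor.
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty)
        rdd.saveAsTextFile(s"s3a://game-events-bucket/raw/${time.milliseconds}")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```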

Speakers

Nick Afshartous

Tech Director, Warner Brothers Interactive
Nick Afshartous is a Tech Director at Warner Brothers Interactive Entertainment (WBIE), where he leads the Analytics Core Platform team. Using Apache Spark, he's helping to build WBIE's next generation real-time analytics platform for processing game data. He's passionate about big data and functional programming, and he contributes to the Reactive Kafka project.


Wednesday May 17, 2017 10:15am - 11:05am
Trianon
  • Experience Level Any

11:15am

Apache Beam: Integrating the Big Data Ecosystem Up, Down, and Sideways - Davor Bonaci, Google & Jean-Baptiste Onofré, ASF
The world of Big Data involves an ever increasing field of players, from storage systems to processing engines and distributed programming models. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a standard for expressing both batch and streaming data processing pipelines in a variety of languages across a variety of platforms and engines. In this talk, we will show how Beam gives users the flexibility to choose the best environment for their needs and read data from any storage system; allows any Big Data API to execute in multiple environments; allows any processing engine to support multiple domain-specific user communities; and allows any storage system to read/write data at massive scale. In a way, Apache Beam is the glue that connects the Big Data ecosystem together; it enables "anything to run anywhere".

Speakers

Davor Bonaci

Davor Bonaci serves as chair of the Apache Beam Project Management Committee and has been regularly committing code to the project since its inception. He works as a Senior Software Engineer at Google. Before Beam, Davor worked on its predecessor, Google Cloud Dataflow, since its beginnings, most recently leading the development of the Dataflow SDK for Java.

Jean-Baptiste Onofré

Apache Software Foundation
JB is a PMC member for Apache Beam. He is a long-tenured Apache member, serving as a PMC member/committer for about 15 projects that range from integration to big data.


Wednesday May 17, 2017 11:15am - 12:05pm
Balmoral

11:15am

Data Profiling in Apache Calcite - Julian Hyde, Hortonworks
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.

Wednesday May 17, 2017 11:15am - 12:05pm
Windsor

11:15am

Streaming Processing with Apache Apex - Sandeep Deshmukh & Bhupesh Chawda, DataTorrent
Apache Apex is a next generation Hadoop (YARN) native, data-in-motion platform that is being used by customers for both streaming and batch processing. Common use cases include data ingestion, streaming analytics, ETL, database off-loads, alerts and monitoring, machine model scoring, etc. Apache Apex separates operational logic from business logic, which enables developers to concentrate on business logic, reducing time to market as well as total cost of ownership. In this tutorial, we will introduce you to Apache Apex and walk through the development of a real-world application demonstrating stream processing. Attendees will also go through some advanced capabilities like dynamic scalability and run-time updates of application properties. By the end of the session, attendees will be able to write applications that cater to their own use cases.

Speakers

Bhupesh Chawda

Software Engineer, DataTorrent Software India Pvt. Ltd.
Bhupesh Chawda is a Software Engineer at DataTorrent Software India Pvt. Ltd. He is also a committer on the Apache Apex project under the Apache Software Foundation. His current interests include big data and distributed systems, stream processing and machine learning. He has experience delivering talks at international conferences like EDBT (2013) and ACM IKDD CODS (2016). He holds an M.Tech. from IIT Bombay, India and a BE from University of...

Sandeep Deshmukh

Dr Sandeep Deshmukh completed his PhD at IIT Bombay and has been working in the Big Data and Hadoop ecosystem for 7+ years. He has executed complex projects in different domains in a distributed computing environment. He loves teaching and interacting with people and has conducted numerous workshops on Hadoop and Apache Apex. He is also a trainer for Apache Apex. Currently Sandeep is a committer for Apache Apex. In the past he has worked as Asst...


Wednesday May 17, 2017 11:15am - 12:05pm
Biscayne

11:15am

Fast Cars, Big Data - How Apache Can Help Formula 1 - Carol McDonald, MapR Technologies
Modern race cars produce a lot of data, all of it in real time. In this presentation I will show you how data can be generated and used by various applications in the car, on the track, or at team headquarters. The demonstration will show how to move data using messaging systems like Apache Kafka, process the data using Apache Spark and Flink, and use various storage techniques: distributed file systems, HBase. This presentation is a great opportunity to see how to build a "near real time big data application" with Apache projects. The code from this talk will be made available as open source.
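As a taste of the ingestion side, a sketch of publishing one telemetry reading to Kafka (the topic name, fields, and keying-by-car are invented for illustration):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TelemetryProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // One JSON document per sensor sample; keying by car id keeps
    // per-car ordering for downstream Spark/Flink consumers.
    val reading = """{"carId":"44","speedKph":312.4,"rpm":11800}"""
    producer.send(new ProducerRecord("telemetry", "44", reading))
    producer.close()
  }
}
```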

Speakers

Carol McDonald

Solutions Architect, MapR Technologies
Carol McDonald is a solutions architect at MapR focusing on big data, Apache HBase, Apache Drill, Apache Spark, and machine learning in healthcare, finance, and telecom. Previously, Carol worked as a Technology Evangelist for Sun and as an architect/developer on a large health information exchange, a large loan application for a leading bank, pharmaceutical applications for Roche, telecom applications for HP, OSI messaging applications...


Wednesday May 17, 2017 11:15am - 12:05pm
Trianon

12:15pm

Concrete Big Data Use Cases Implemented with Apache Beam - Jean-Baptiste Onofré, Apache Software Foundation
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch or streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, Apache Apex, Apache Gearpump, and Google Cloud Dataflow.

This talk will show you how to use Beam Java SDK to implement concrete use cases like batch analytics, streaming data ingestion or fraud detection.

Speakers

Jean-Baptiste Onofré

Apache Software Foundation
JB is a PMC member for Apache Beam. He is a long-tenured Apache member, serving as a PMC member/committer for about 15 projects that range from integration to big data.


Wednesday May 17, 2017 12:15pm - 1:05pm
Balmoral

12:15pm

Extending Apache Mahout to Support Deep Learning on GPUs - Suneel Marthi
Data scientists love tools like R and Scikit-Learn, as they offer a convenient and familiar syntax for analysis tasks. However, these systems are limited to operating serially on data sets that can fit on a single node and do not allow for distributed execution. Mahout-Samsara is a linear algebra environment that offers both an easy-to-use Scala DSL and efficient distributed execution for linear algebra operations. Data scientists transitioning from R to Mahout can use the Samsara DSL for large-scale data sets with familiar R-like semantics. Machine learning and deep learning algorithms built with the Mahout-Samsara DSL are automatically parallelized and optimized to execute on distributed processing engines accelerated natively by CUDA, OpenCL and OpenMP. ML practitioners will come away from this talk with a better understanding of how Samsara's linear algebra environment works.

Speakers

Suneel Marthi

Principal Engineer, Red Hat
Suneel Marthi is a member of the Apache Software Foundation and a PMC member on Apache Mahout, Apache OpenNLP and Apache Pirk. He has previously presented at Apache Big Data, Hadoop Summit Europe and Flink Forward.


Wednesday May 17, 2017 12:15pm - 1:05pm
Alhambra

12:15pm

Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy - Stuart Pook, Criteo
Hadoop has become a critical part of Criteo's operations. What started out as a proof of concept has turned into two in-house bare-metal clusters of over 2200 nodes. Hadoop contains the data required for billing and, perhaps even more importantly, the data used to create the machine learning models, computed every 6 hours by Hadoop, that participate in real time bidding for online advertising. Two clusters do not necessarily mean a redundant system, so Criteo must plan for any of the disasters that can destroy a cluster. This talk describes how Criteo built its second cluster in a new datacenter and how to do it better next time. How a small team is able to run and expand these clusters is explained. More importantly the talk describes how a redundant data and compute solution at this scale must function, what Criteo has already done to create this solution and what remains undone.

Speakers
avatar for Stuart Pook

Stuart Pook

Senior DevOps Engineer, Criteo
Stuart loves storage (130 PB at Criteo) and is part of Criteo's Lake team that runs some small and two rather large Hadoop clusters. He also loves automation with Chef because configuring more than 2200 Hadoop nodes by hand is just too slow. Before discovering Hadoop he developed user interfaces and databases for biotech companies. Stuart has presented at ACM CHI 2000, Devoxx 2016, the NABD Conference 2016, Hadoop Summit Tokyo 2016, Apache Big...


Wednesday May 17, 2017 12:15pm - 1:05pm
Windsor

12:15pm

SQL and Streaming Systems - Atri Sharma, Microsoft
The talk will focus on how SQL is useful on streaming systems and how SQL can help make streaming analytics faster and better. It will also cover how to use Calcite to implement SQL on streaming systems, along with some use cases.

Speakers

Atri Sharma

Software Engineer, Azure Data Lake, Microsoft
An Apache Apex committer, he is engaged in designing and implementing next generation features and performing reviews. A learning PostgreSQL hacker, he is currently engaged in various aspects of Postgres. He has been an active contributor, implementing ordered set functions and grouping sets in PostgreSQL and improving sort, hash join, and OLAP performance. He is also a committer for Apache HAWQ and Apache MADlib.


Wednesday May 17, 2017 12:15pm - 1:05pm
Biscayne

12:15pm

Actionable Insights with Apache Apex - Devendra Tagare, DataTorrent Inc.
In this talk I will cover how Apache Apex is used to deliver actionable insights in real time for ad tech. The talk includes a reference Apex architecture to provide dimensional aggregates at TB scale for billions of events per day; the reference architecture covers concepts around Apex, Kafka and dimensional compute. Real-time streaming problems and challenges will also be covered, along with some operational aspects of a streaming system.

Speakers

Devendra Tagare

I am a data platform engineer & Apache committer focused on: solutions architecture for low-latency, highly scalable data streaming systems; rapid prototyping for real-world streaming use cases; backend engineering for Apache Apex & DataTorrent; the end-to-end Big Data stack from ingestion and aggregation to analytics; and platform-as-a-service implementations for Big Data streaming. Data processing...


Wednesday May 17, 2017 12:15pm - 1:05pm
Trianon

1:05pm

Lunch (Attendees on Own)
Wednesday May 17, 2017 1:05pm - 2:30pm
TBA

2:30pm

Nexmark, a Unified Framework to Evaluate Big Data Processing Systems with Apache Beam - Ismael Mejia & Etienne Chauchot, Talend
Big Data processing in real time is on the rise at Apache, with projects like Apache Spark, Apache Flink or Apache Apex. However, at this moment we don't have a unified framework to evaluate the correctness and the performance of these systems. Apache Beam implements a unified model to write both batch and streaming jobs with a single API and execute them independently on any of the supported platforms (runners), which makes Beam an ideal candidate to support an evaluation framework.

In this talk we will present Nexmark, a benchmark framework to evaluate queries over data streams. An implementation of Nexmark was donated by Google as part of the Apache Beam incubation process. Nexmark bridges the gap for evaluating data processing frameworks, but also serves as a rich integration test to evaluate the correct implementation of both the Beam runners and the new features of the Beam SDK.

Speakers

Etienne Chauchot

Etienne has been working in software engineering and architecture for more than 13 years in domains such as retail and finance. He has been focusing on Big Data for a few years, on technologies such as Apache Cassandra, Elasticsearch or Apache Spark. He is an Open Source fan and now works at Talend France, where he contributes to Apache projects such as Apache Beam.

Ismael Mejia

Open Source Software Engineer, Talend
Ismaël Mejía is an Apache Beam committer and a software engineer at Talend. He loves to tackle complex problems and build simple and elegant solutions. His main area of focus is distributed systems (Big Data and Cloud). He has been working on web services and large scale systems since 2004 and has worked actively on open source projects since 2014.


Wednesday May 17, 2017 2:30pm - 3:20pm
Balmoral
  • Experience Level Any

2:30pm

Biophotonics Using Apache PredictionIO, Spark and Deep Learning - Prajod Vettiyattil, Wipro Technologies
Biophotonics is the study of microscopic life, like biological cells, using optical methods. It has applications in medicine, agriculture and environmental sciences. In this session we will see how Deep Learning and Big Data software can help analyze images captured using tools like the high-end microscopes used in biophotonics, thus accelerating medical research. Medical research labs and diagnostic centers use high-end microscopes and deep human knowledge to observe living cells and perform life-cycle analysis on them. These workflows involve time-consuming, iterative and complex processes. This session will explain the application of deep learning to automatically detect microscopic cells from samples of digital images and provide automatic classification, which will be of immense help for medical diagnostics. The solution uses PredictionIO, Spark, OpenCV and Deeplearning4j.

Speakers

Prajod Vettiyattil

Architect, Wipro Technologies
Prajod is a Senior Architect in the open source solutions group of Wipro Technologies, responsible for research and solution development in the area of Big Data and Analytics. His current work involves analyzing image and video content using machine learning, to solve hard problems. He has presented at multiple open source conferences (ApacheCon BigData EU 2016, OSI Days, GIDS, JUDCon, WSO2Con). He has also written articles on technology, in...


Wednesday May 17, 2017 2:30pm - 3:20pm
Alhambra

2:30pm

Khermes: An Open-Source and Distributed Data Generator for Apache Kafka - Alberto Rodriguez & Emilio Ambrosio, Stratio
Today, companies and organisations with large amounts of data are increasingly faced with the need to produce user-defined data for different types of data stores, or to understand how their systems will perform under a heavy data load. We have created Khermes, an open-source distributed data generator, to simplify this process. Using Apache Kafka, Khermes can generate large amounts of user-defined data that can be stored anywhere. It can also be used as a “stress tool” to measure the performance of systems in a heavy-load environment: users can increase the strain on their Apache Kafka clusters and monitor their performance. Through use cases and demos, you will discover Khermes' features and how it works.

Speakers

Emilio Ambrosio

Software Engineer, Stratio
As a Software Engineer at Stratio, I have participated in different cutting-edge projects and some modules included within Stratio's platform, particularly those related to real-time streaming and data ingestion, based on Apache Spark Streaming and Apache Flume respectively.

Alberto Rodriguez

Software Architect, Stratio
Working as a Big Data Architect at Stratio, Alberto Rodriguez has been involved in the inception and evolution of several modules included within Stratio's platform, especially those related to data visualization, real-time streaming and complex event processing. He is also a proud committer on the Apache MetaModel project. Speaking experience: Apache Big Data Europe 2015, Big Data Spain 2015 and several meetups.


Wednesday May 17, 2017 2:30pm - 3:20pm
Windsor

2:30pm

Building Streaming Data Pipelines with Stateful Operations - Chandni Singh, Simplifi.it
There are a few streaming platforms which provide the exactly-once processing guarantee. This is done by checkpointing the state of the functional units (operators) that make up the streaming pipeline. Many real-world big data pipelines are typically composed of operators which maintain a large, ever-growing state. However, periodically checkpointing the state of these operators is only practical when their state is small. To solve this problem, I created Managed State for the Apache Apex project, which is an incrementally checkpointed key-value data structure. Additionally, the community has developed a layer on top of Managed State (Spillable Data Structures), which allows us to incrementally checkpoint a variety of common data structures. This presentation will cover the challenges of implementing fault-tolerant incremental checkpointing in Managed State.
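
As a conceptual illustration only (not the actual Apex Managed State code), an incrementally checkpointed key-value state can track dirty keys and persist just the delta at each checkpoint:

    import scala.collection.mutable

    // Illustrative only, not the Apex Managed State implementation: rather
    // than serializing the whole ever-growing map, checkpoint just the keys
    // mutated since the previous checkpoint, as a delta.
    class IncrementallyCheckpointedState[K, V] {
      private val state = mutable.Map.empty[K, V]
      private val dirty = mutable.Set.empty[K]

      def put(key: K, value: V): Unit = { state(key) = value; dirty += key }
      def get(key: K): Option[V] = state.get(key)

      // Returns the delta to persist for this checkpoint window and resets it;
      // recovery replays the base snapshot plus all deltas in order.
      def checkpoint(): Map[K, V] = {
        val delta = dirty.iterator.map(k => k -> state(k)).toMap
        dirty.clear()
        delta
      }
    }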

Speakers

Chandni Singh

I’m a software engineer who likes to build distributed frameworks and applications that are fault-tolerant and scalable. I am a PMC member and committer of the Apache Apex project, have worked with a few other distributed platforms, and have co-founded a company which creates big data micro-services.


Wednesday May 17, 2017 2:30pm - 3:20pm
Biscayne

2:30pm

New Development in HBase - Zhihong Yu, Apache HBase PMC
W-TinyLFU records the frequency in a counting sketch, ages periodically by halving the counters, and orders entries by SLRU. An entry is discarded by comparing the frequency of the new arrival (candidate) to the SLRU's victim, and keeping the one with the highest frequency. This allows the operations to be performed in O(1) time and, through the use of a compact sketch, a much larger history is retained beyond the current working set. On a variety of real-world traces the policy had near-optimal hit rates.
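
A minimal, illustrative sketch of that admission policy (not the HBase implementation; sketch width, depth and aging period are arbitrary):

    import scala.util.hashing.MurmurHash3

    // Illustrative TinyLFU-style admission filter: a count-min sketch records
    // access frequencies, all counters are halved periodically (aging), and a
    // new arrival is admitted only if it is more frequent than the SLRU victim.
    class TinyLfuSketch(width: Int, depth: Int, resetAfter: Int) {
      private val counters = Array.ofDim[Int](depth, width)
      private var additions = 0

      private def slot(key: String, row: Int): Int =
        (MurmurHash3.stringHash(key, row) & Int.MaxValue) % width

      def record(key: String): Unit = {
        for (row <- 0 until depth) counters(row)(slot(key, row)) += 1
        additions += 1
        if (additions >= resetAfter) {          // periodic aging by halving
          for (row <- 0 until depth; col <- 0 until width) counters(row)(col) /= 2
          additions = 0
        }
      }

      // Count-min estimate: the minimum across rows bounds over-counting.
      def frequency(key: String): Int =
        (0 until depth).map(row => counters(row)(slot(key, row))).min

      // Keep whichever of candidate and victim has the higher frequency.
      def admit(candidate: String, victim: String): Boolean =
        frequency(candidate) > frequency(victim)
    }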

Backup / Restore is a standard feature for RDBMSs. HBase adds support for backup / restore through a series of phases: HBASE-7912 (phase 1) and HBASE-14123 (phase 2). The technical approach for implementing backup / restore will be covered, along with typical command-line usage.

Speakers

Zhihong Yu

Staff Engineer, Hortonworks
I have been an Apache HBase PMC member for five and a half years. I am also a committer for Apache Slider and Apache Bahir, and I contribute to Apache Phoenix and Apache Spark. I have presented at the past three ApacheCon NA events.


Wednesday May 17, 2017 2:30pm - 3:20pm
Trianon

2:30pm

Sponsor Showcase
Wednesday May 17, 2017 2:30pm - 4:40pm
Mezzanine

3:30pm

Deep Learning: Using VFIO to Leverage Virtual Machine GPUs - Bram Steurtewagen, Ghent University
Deep Learning (Neural Networks) is definitely the hot topic in the Big Data world. To speed up computations in this field, GPUs are being leveraged more and more. However, enterprise-grade GPUs with VM support are expensive and require extensive licensing. For this reason, and respecting the spirit of using "commodity hardware" for Big Data purposes, the speaker proposes leveraging PCIe passthrough and VFIO to attach GPUs directly to a virtual machine. This allows us to utilise the GPU's computational capacity as if it were physically attached to the host and enables us to run multiple accelerated VMs on a single node, without the use of enterprise-grade GPUs.

Utilising this technique, we do not lose any significant performance compared to a bare-metal approach, as shown with a Spark-based elephas demo.

Speakers

Bram Steurtewagen

Ghent University
Bram Steurtewagen received his M.Sc. degree in Commercial Engineering (2013) and his M.Sc. degree in Marketing Analytics (2014) from Ghent University in Belgium. Since then, he has been pursuing a PhD in Marketing Analytics at the Faculty of Economics and Business Administration of Ghent University. He is currently employed at Klarrio, an IoT analytics company. His interests lie mainly in predictive and prescriptive analytics in the...


Wednesday May 17, 2017 3:30pm - 4:20pm
Alhambra

3:30pm

Log-Island on the Rocks! Realtime Pattern Mining at Scale - Thomas Bailet, Hurence
LogIsland is an event mining platform based on Spark and Kafka that handles huge amounts of log files (https://github.com/Hurence/log-island). This framework alleviates the burden of deploying and managing complex stream processing applications at big data scale. It works especially well in conjunction with Apache NiFi, which can be used to route the raw data into a Kafka topic; LogIsland then streams all the data into its distributed processors (parsers, complex analysers, aggregators, alerters), which generate events into other Kafka topics for further asynchronous processing. The strength of the solution is the high throughput of events within a few nodes and the ability to write complex distributed processing plugins in a few lines. The presentation will show the framework at work on a Hadoop cluster with a stream search percolator and an outlier detection processor.

Speakers

Thomas Bailet

I've been into 3D realtime rendering, Java EE information systems, and Big Data & machine learning projects for more than 15 years now (https://www.linkedin.com/in/thomas-bailet-95aa598). I've published a book on software architecture (http://www.editions-eni.fr/livres/architecture-logicielle-pour-une-approche-organisationnelle-fonctionnelle-et-technique-2e-edition/.0e9). I'm currently the lead designer of https://github.com/Hurence/logisland, a...


Wednesday May 17, 2017 3:30pm - 4:20pm
Windsor

3:30pm

The Rise of Real-Time: Apache DistributedLog and Its Stream Store - Sijie Guo, Twitter
Data growth is exponential and organizations are producing it in a myriad of formats. Instead of storing and processing the data at some regular cadence, many in the industry are realizing the benefits of real-time data analytics via stream processing. The move from batch processing to streaming architectures is a revolution in how companies use data. But what is the state of storage for real-time applications, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Sijie Guo will describe Apache DistributedLog, a high-throughput and low-latency replicated stream store; discuss the challenges of building a stream store for real-time applications; and explore the future of Apache DistributedLog and the big data ecosystem.

Speakers

Sijie Guo

Twitter
Currently works at Twitter on DistributedLog/BookKeeper. Apache BookKeeper PMC Chair. Previously worked at Yahoo! on its push notification system.


Wednesday May 17, 2017 3:30pm - 4:20pm
Biscayne
  • Experience Level Any

3:30pm

Genetic Algorithms in All Their Shapes and Forms - Julien Sebrien, Geneticio Expertise
We will talk about genetic algorithms: evolutionary algorithms inspired by the process of natural selection. Genetic algorithms are used to generate solutions to optimization and search problems by relying on bio-inspired operators, following this process:
  • Randomly generate a population of individuals
  • Evaluation 
  • Termination checks 
then, iteratively: 
  • Selection
  • Crossover
  • Mutation
  • Evaluation 
  • Termination checks
Genetic algorithm behavior will be illustrated by playful use cases such as ToBeOrNotToBe and Smart Rockets; a compact sketch of this loop follows.
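
Here is a compact, self-contained sketch of that loop, evolving a hypothetical target string in the spirit of the ToBeOrNotToBe demo; population size, selection pressure and mutation rate are arbitrary choices:

    import scala.util.Random

    // Compact sketch of the loop above: evolve a hypothetical target string.
    object SimpleGa extends App {
      val target   = "to be or not to be"
      val alphabet = "abcdefghijklmnopqrstuvwxyz "
      val rng      = new Random()

      def randomIndividual: String =
        Seq.fill(target.length)(alphabet(rng.nextInt(alphabet.length))).mkString

      def fitness(s: String): Int = s.zip(target).count { case (a, b) => a == b }

      def crossover(a: String, b: String): String = {
        val cut = rng.nextInt(target.length)
        a.take(cut) + b.drop(cut)
      }

      def mutate(s: String): String =
        s.map(c => if (rng.nextDouble() < 0.02) alphabet(rng.nextInt(alphabet.length)) else c)

      var population = Seq.fill(200)(randomIndividual)            // random generation
      var generation = 0
      while (!population.contains(target) && generation < 1000) { // termination check
        val parents = population.sortBy(s => -fitness(s)).take(50) // evaluation + selection
        population = Seq.fill(200) {
          mutate(crossover(parents(rng.nextInt(50)), parents(rng.nextInt(50)))) // crossover + mutation
        }
        generation += 1
      }
      println(s"generation $generation, best: ${population.maxBy(fitness)}")
    }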

Speakers

Julien Sebrien

Julien is an experienced consultant who works on challenging development projects for top financial clients and startups. Julien also likes to work with open source technologies such as Cassandra, Spark or Elasticsearch, and has a strong interest in artificial intelligence. Julien co-founded Geneticio, which distributes an innovative solution implementing genetic algorithms.


Wednesday May 17, 2017 3:30pm - 4:20pm
Trianon

4:20pm

Coffee Break
Wednesday May 17, 2017 4:20pm - 4:40pm
Mezzanine

4:40pm

Expanding Apache Zeppelin into Your Cluster - Jongyoul Lee, ZEPL
Apache Zeppelin is one of the tools that help users enrich their analysis with beautiful visualization without any additional work. Until now, however, it has had critical issues for use in production environments: Apache Zeppelin runs on a single server only, which means a single point of failure, and users suffer from a shortage of resources because everything runs on a single machine. Apache Zeppelin has worked to overcome this inconvenience and now supports launching your jobs in a cluster. You no longer have to worry about resources when you run many jobs on Apache Zeppelin; by using your cluster, one instance is enough for all your colleagues. This talk has two parts. The first describes how Apache Zeppelin launches interpreters in a cluster and what happens internally. The second introduces Helium plugins that support third-party visualizations and how to install them.

Speakers

Jongyoul Lee

Software Development Engineer, ZEPL
I'm a member of the PMC of Apache Zeppelin and work at ZEPL. In Apache Zeppelin, I focus on stabilizing Apache Zeppelin for use at production level, developing some enterprise features and enhancing Apache Spark/JDBC features. Personally, I'm really interested in distributed and fault-tolerant systems. I spoke four times last year, including as a main speaker at an Apache Kylin meetup in Shenzhen, China and Apache Zeppelin...


Wednesday May 17, 2017 4:40pm - 5:30pm
Balmoral

4:40pm

Leveraging the GPU on Spark - Josef Adersberger, QAware GmbH
GPUs are a great source of computing power but are not yet accessible from Apache Spark. We present an RDD implementation we've open-sourced to leverage GPU computing power with Spark. We'll share the experiences we gained along the way implementing the RDD, and a real-world application using it: What's the best way to bridge from Java to GPU code (OpenCL or CUDA)? From an architectural perspective, what's the best way to integrate a GPU processing facility into Spark? How much faster are typical Spark actions when using the GPU, and which Spark actions are best processed on a GPU? The talk covers Java-to-GPU bridges, the best way to integrate GPU processing into Spark, and a performance evaluation.
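
As one possible shape of such an integration (a sketch, not the open-sourced RDD itself), mapPartitions offers a natural batching point for handing a whole partition to a GPU bridge; gpuMultiply below is a plain JVM stub standing in for a CUDA/OpenCL call:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of one natural integration point: mapPartitions hands us a whole
    // batch, amortizing host-to-device copies. gpuMultiply is a JVM stub; a
    // real implementation would call CUDA/OpenCL through a Java bridge (JNI).
    object GpuOffloadSketch extends App {
      val sc = new SparkContext(
        new SparkConf().setAppName("gpu-offload").setMaster("local[*]"))

      // Hypothetical native kernel: multiply every element by a constant.
      def gpuMultiply(batch: Array[Double], factor: Double): Array[Double] =
        batch.map(_ * factor) // stub; imagine a device kernel here

      val data = sc.parallelize(1 to 1000000).map(_.toDouble)
      val scaled = data.mapPartitions { it =>
        val batch = it.toArray           // one host buffer per partition
        gpuMultiply(batch, 2.0).iterator // one device round-trip per partition
      }
      println(scaled.sum())
    }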

Wednesday May 17, 2017 4:40pm - 5:30pm
Alhambra

4:40pm

Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures - Hiroyuki Adachi & Hiroshi Yamaguchi, Yahoo Japan Corporation
We present how we automated rolling upgrades for our production Hadoop cluster without data loss or job failures. Apache Ambari can perform a rolling upgrade; however, it does not account for data loss or effects on running jobs. Therefore, we decided to customize it for our environment and created upgrade procedures with more secure checks. First, we made a custom service for Ambari which operates functions such as NameNode failover and load-balancer in/out. Second, we used Ansible, a configuration management tool, to control the upgrade tasks. It automates calling Ambari APIs, including the custom service functions, checking cluster statuses (e.g., missing blocks), and running service check jobs while upgrading each component. Consequently, we achieved automatic rolling upgrades, reduced operating costs and minimized inconvenience to users.

Speakers

Hiroyuki Adachi

Hiroyuki Adachi is in charge of DevOps for Hadoop at Yahoo! JAPAN.

Hiroshi Yamaguchi

Hiroshi Yamaguchi is in charge of DevOps for Hadoop at Yahoo! JAPAN.


Wednesday May 17, 2017 4:40pm - 5:30pm
Windsor

4:40pm

Apache Tika: What's New with 2.0? - Nick Burch, Quanticate
Apache Tika detects and extracts metadata and text from a huge range of file formats and types. From search to Big Data, single file to internet scale, if you've got files, Tika can help you get out useful information! Apache Tika has been around for nearly 10 years now, and with the passage of all that time, plus the new 2.0 release, a lot has changed. Not only has there been a huge increase in the number of supported formats, but the ways of using Tika have expanded, and some of the philosophies on the best way to handle things have altered with experience. Tika has gained support for a wide range of programming languages too, and more recently, Big Data-scale support. Whether you're an old hand with Tika looking to know what's hot or different with 2.0, or someone new looking to learn more about the power of Tika, this talk will have something in it for you!
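
For newcomers, a minimal sketch of Tika's facade API; the input file name is hypothetical:

    import java.io.File
    import org.apache.tika.Tika

    // Minimal sketch of Tika's facade API: detect a file's media type and
    // extract its text, whatever the underlying format turns out to be.
    object TikaSketch extends App {
      val tika = new Tika()
      val file = new File("report.pdf")            // hypothetical input file
      println(tika.detect(file))                   // e.g. "application/pdf"
      println(tika.parseToString(file).take(200))  // first 200 chars of text
    }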

Speakers

Nick Burch

CTO, Apache Software Foundation
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. Nick is CTO at Quanticate, a Clinical Research Organisation (CRO) with a strong focus on data and statistics. Nick has spoken at most ApacheCons since 2007, as well as many...


Wednesday May 17, 2017 4:40pm - 5:30pm
Biscayne

5:40pm

Helium makes Zeppelin Fly! - Moon Soo Lee, NFLabs
Apache Zeppelin is an interactive data analytics environment for distributed computing systems. It integrates many different data processing frameworks like Apache Spark and provides a beautiful interactive web-based interface, data visualization and a collaborative work environment to make your data science lifecycle more fun and enjoyable.

Since 0.7.0, Zeppelin has had a framework called 'Helium' with two new pluggable components: Visualization and Spell. Visualization extends the built-in visualizations, and Spell provides a lightweight way to extend the interpreter and display system in Zeppelin.

In this talk we'll see how Visualizations and Spells can be created and used. The Zeppelin community also provides a Helium online registry, leveraging the NPM package registry, for publishing Visualizations and Spells. We'll take a look at how the community manages the online registry service and how to publish packages to it.

Speakers

Moon Soo Lee

Moon Soo Lee is a creator of Apache Zeppelin and a co-founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and building a healthy business around it.


Wednesday May 17, 2017 5:40pm - 6:30pm
Balmoral
  • Experience Level Any

5:40pm

TensorFlow in the Wild: From Cucumber Farmer to Global Insurance Firm - Kazunori Sato, Google
One of the largest global insurance firms recently introduced TensorFlow, the open source library from Google for machine intelligence, to classify car drivers who have a high likelihood of major accidents with a deep neural network. The model provides 2x higher accuracy compared with the existing random forest model, giving them the possibility of lowering insurance prices significantly. Also, a cucumber farmer in Japan has been using TensorFlow to build a hand-made sorter that classifies cucumbers into 9 classes based on their length, shape and color. In this session, we'll look at how TensorFlow democratizes the power of machine intelligence and is changing the world, through many different real-world use cases of the technology.

Speakers

Kazunori Sato

Staff Developer Advocate, Google Inc
Kaz Sato is Staff Developer Advocate at Cloud Platform team, Google Inc. He leads the developer advocacy team for Machine Learning and Data Analytics products, such as TensorFlow, Cloud ML, and BigQuery. Speaking at major events including Google I/O 2016, Hadoop Summit 2016, Strata+Hadoop World 2016 San Jose and NYC, ODSC East/West 2016, Google Next 2015 NYC and Tel Aviv. Kaz also has been leading and supporting developer communities for Google...


Wednesday May 17, 2017 5:40pm - 6:30pm
Alhambra

5:40pm

Routing Trillion Messages Per Day @Twitter - Lohit Vijayarenu & Gary Steelman, Twitter
Twitter collects more than a trillion messages per day. These messages are grouped into hundreds of categories with different properties. Messages are routed, based on category, to various nodes in a cluster until they reach the storage systems serving analytics and streaming workloads. The scale of messages with different delivery guarantees poses unique challenges at Twitter.

Twitter’s log collection framework has been built on Scribe over the years. Message delivery guarantees, priority and multiplexing add complexity to routing. Additionally, Twitter's scale introduces unique challenges for management of the logging framework. In this talk we discuss the challenges we face and our effort to improve our logging framework using Apache Flume. Apache Flume, with its pluggable architecture, provides many building blocks to implement various features of our collection framework.

Speakers

Gary Steelman

Gary Steelman is a Software Engineer who has been working on Hadoop and related projects at Twitter. He has a master's degree from the University of Texas at Dallas, specializing in intelligent systems, AI and machine learning.

Lohit VijayaRenu

Software Engineer, Twitter
Lohit VijayaRenu is a Software Engineer on the Twitter Hadoop team. He has a master's degree from Stony Brook University and has worked on Hadoop and related projects at Yahoo!, MapR and Twitter.


Wednesday May 17, 2017 5:40pm - 6:30pm
Biscayne
 
Thursday, May 18
 

7:00am

Morning Run
Thursday May 18, 2017 7:00am - 8:00am
InterContinental Miami Lobby

8:00am

Breakfast
Thursday May 18, 2017 8:00am - 9:00am
Mezzanine

8:00am

Sponsor Showcase
Thursday May 18, 2017 8:00am - 11:20am
Mezzanine

8:00am

Registration
Thursday May 18, 2017 8:00am - 4:30pm
Mezzanine

9:00am

Venturing into Large Hadoop Clusters - Varun Saxena & Naganarasimha Garla, Huawei Technologies
Hadoop clusters are continuously becoming larger, with several thousand machines running thousands of jobs concurrently on 1000-1500 queues divided among different tenants, crunching higher volumes of data than ever before. Hence, maintaining good performance of such large clusters, ensuring fast recovery times, upgrading them and debugging them becomes a major challenge. With larger clusters, enterprises expect even more efficient cluster utilisation. The fact that jobs are in turn executed as part of a workflow adds to the complexity. As time progresses, clusters will become even larger, i.e. have several tens of thousands of machines.

In this talk, we plan to share issues we came across while handling large clusters and the optimizations we had to make to resolve them. We will also talk about a few upcoming features in Hadoop which aim to overcome challenges posed by clusters at gigantic scale.

Speakers

Naganarasimha Garla

System Architect, Huawei Technologies Pvt Ltd
I am a Big Data enthusiast and have been developing Big Data Hadoop applications and platforms for five years, with 12 years of experience as a Java software developer. I have been actively contributing to Hadoop YARN and MapReduce for 2.5 years and am currently an Apache Hadoop Committer. Further details: http://people.apache.org/~naganarasimha_gr/ & http://in.linkedin.com/in/naganarasimha-garla-a620297 .

Varun Saxena

Senior Technical Leader, Huawei Technologies
I am currently working as a Senior Tech Lead in Huawei's Hadoop Team which provides big data solutions to multiple product lines in Huawei and contributes to Hadoop community. I am also an Apache Hadoop Committer and have been contributing to YARN for almost 2.5 years. Overall, I have 8 years of experience developing fault tolerant, distributed systems.


Thursday May 18, 2017 9:00am - 9:50am
Alhambra

9:00am

Java 9 Support in Apache Hadoop - Akira Ajisaka, NTT DATA
Java 9 is the next major version and will be GA in July 2017, so it's very important for Apache Hadoop to support Java 9 early. Hadoop has many downstream projects, and its support makes it easier for those projects to support Java 9 as well. Java 9 has more incompatible changes than any earlier release. For example, Project Coin (JEP 213) banned '_' as an identifier, and the Hadoop Web UI is affected. In this session, Akira will introduce the incompatible changes and what we need to do to support Java 9 in Hadoop. Classpath isolation is also an important issue for Hadoop: Hadoop has many dependencies, and developers who write applications running on Hadoop need to be careful to avoid classpath conflicts. Java 9's Jigsaw feature is expected to solve this 'jar hell' problem, but Hadoop does not use the feature for now. Akira will also introduce how the Hadoop community solves the problem without Jigsaw.

Speakers

Akira Ajisaka

Software Engineer, NTT DATA Corporation
Akira Ajisaka is a software engineer working at NTT DATA, Japan. He belongs to the OSS Professional Services team, deploying and operating Hadoop clusters for customers. He sometimes troubleshoots them by investigating source code and creating patches to fix problems. He is an Apache Hadoop committer and PMC member, and he is involved in various components of Hadoop, improving usability and supportability. He wrote a blog post about activities...


Thursday May 18, 2017 9:00am - 9:50am
Balmoral

9:00am

Challenges of Monitoring Distributed Systems - Nenad Bozic, SmartCat
Back in the day, you had a single machine and could scroll through a single log file to figure out what was going on. In the Big Data world you need to combine a lot of logs to do the same. Data is coming in huge volumes at high speed, so picking out the important information and getting rid of noise becomes a real challenge. There is a need for a centralized monitoring platform which will aid the engineers operating the systems and serve the right information at the right time. This talk will focus on the monitoring stack we like to use, including Riemann, InfluxDB, ELK and Grafana, with Cassandra as the example distributed system. The problem is separated into two domains, metric collection and log collection, and we will finish with an example of how you can combine both to pinpoint issues.

Speakers

Nenad Bozic

Co-Founder & Senior Consultant, SmartCat
Big Data enthusiast and Apache Cassandra fan. DataStax MVP for Apache Cassandra for 2017. A craftsman with more than 10 years of experience; an all-rounder, but when he does backend coding (mostly in Java) he feels right at home. Strong believer in the balance between good technical skills and soft skills. Striving for knowledge is his main drive, which is why he enjoys learning new tools and languages, blogging, working on open source, presenting at...


Thursday May 18, 2017 9:00am - 9:50am
Biscayne

9:00am

Apache Rya – A Scalable RDF Triple Store - Adina Crainiceanu, US Naval Academy
Apache Rya (incubating) is a scalable database management system designed for storing and searching very large Resource Description Framework (RDF) data. In its most basic form, RDF data is a triple. Due to its flexibility, RDF is the current standard for storing many different types of information. With the explosive increases in the size of available data, scalable solutions are needed to efficiently store and query very large RDF graphs within big data architectures. Apache Rya is an RDF triple store built on top of Apache Accumulo. We introduce storage methods, indexing schemes, query optimization, and query evaluation techniques that allow Rya to scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL.
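
To make the data model concrete, here is an illustrative triple and a SPARQL query over it, sketched as Scala strings; the subjects, predicates and objects are hypothetical, and Rya would evaluate such queries over its Accumulo-backed indexes:

    // An RDF statement is a (subject, predicate, object) triple.
    val statement = ("<http://example.org/alice>",
                     "<http://xmlns.com/foaf/0.1/knows>",
                     "<http://example.org/bob>")

    // SPARQL pattern-matches triples; this asks for everyone alice knows.
    val sparql =
      """SELECT ?friend WHERE {
        |  <http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> ?friend .
        |}""".stripMargin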

Speakers

Adina Crainiceanu

Adina Crainiceanu is an Associate Professor in the Computer Science Department at the US Naval Academy. She received her Ph.D. in Computer Science from Cornell University. She has conducted database and distributed systems related research for more than 15 years, and has published papers in premiere database conferences and journals. She gave numerous presentations at conferences and professional meetings. She is one of the founders of Rya and...


Thursday May 18, 2017 9:00am - 9:50am
Trianon

10:00am

A Funny Thing Happened on the Way to Full Text Search: I Shook my Search Engine and Analytics Fell Out! - Patrick Hoeffel, Polaris Alpha
Search engines are not just for text anymore. Apache Solr has become a powerful Business Intelligence and Analytics tool, answering a much broader array of questions than was possible in the past. We'll explore use cases that you may not have realized Solr could address, such as graph traversal and machine learning through text classification. We'll also discuss faceting, the key BI and analytics differentiator, and how that one feature can transform your analytics landscape. Then we'll look at Solr's new Parallel SQL interface, which allows you to use Tableau and other traditional BI tools right out of the box to perform analysis tasks that never could have been possible before with a full-text index. During the talk we demonstrate facets, plus how you can use the SQL interface to set up a simple alerting engine right within Solr so that you can be productive right away.
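
As a flavor of the Parallel SQL interface, a minimal sketch that posts a statement to Solr's /sql handler; the collection and field names are hypothetical, and a local Solr with the handler enabled is assumed:

    import java.net.{HttpURLConnection, URL, URLEncoder}
    import scala.io.Source

    // Sketch: POST an SQL statement to Solr's /sql handler (Solr 6+).
    // Collection "products" and field "category" are hypothetical.
    object SolrSqlSketch extends App {
      val body = "stmt=" + URLEncoder.encode(
        "SELECT category, COUNT(*) AS cnt FROM products GROUP BY category", "UTF-8")
      val conn = new URL("http://localhost:8983/solr/products/sql")
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setDoOutput(true)
      conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded")
      conn.getOutputStream.write(body.getBytes("UTF-8"))
      println(Source.fromInputStream(conn.getInputStream).mkString)
    }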

Speakers

Patrick Hoeffel

Senior Software Engineer, Polaris Alpha
Patrick is a Senior Software Engineer at Polaris Alpha. A veteran of commercial software solutions for over 25 years, Patrick has been involved in products ranging from online services to early Internet startups to enterprise applications to military intelligence. He has consulted in the US, Europe and the Middle East, and has developed code at all levels of the application stack. The one constant across all this experience is his belief in the primacy...


Thursday May 18, 2017 10:00am - 10:50am
Alhambra

10:00am

Hadoop Cluster Governance - Vimal Sharma, Hortonworks
Apache Atlas is the one-stop solution for data governance and metadata management on enterprise Hadoop clusters. Atlas has a scalable and extensible architecture which can plug into many Hadoop components to manage their metadata in a central repository. Vimal Sharma will review the challenges associated with managing large datasets on Hadoop clusters and demonstrate how Atlas solves the problem. He will focus on the cross-component lineage tracking capability of Apache Atlas and discuss its upcoming features and roadmap.

Speakers

Vimal Sharma

Software Engineer, Hortonworks
Vimal Sharma is an Apache Atlas Committer at Hortonworks. Vimal graduated from IIT Kanpur with a B.Tech in Computer Science. Vimal is highly passionate about the Hadoop stack and has previously worked on scaling backend systems at WalmartLabs using Spark and Kafka. Vimal regularly speaks at company events on topics like Apache Atlas, Apache Spark and Apache Sqoop. Apache ID: svimal2106@apache.org


Thursday May 18, 2017 10:00am - 10:50am
Balmoral

10:00am

One-Click Production Deployment of Tensorflow AI and Spark ML Models Using 100% Open Source Jupyter Notebook, Kubernetes, and NetflixOSS - Chris Fregly, PipelineIO
In this completely demo-based talk, Chris Fregly from PipelineIO will demo the latest 100% open source research in high-scale, fault-tolerant, distributed model training, testing, and serving using Tensorflow, Spark ML, Jupyter Notebook, Docker, Kubernetes, and NetflixOSS Microservices. This talk will discuss the trade-offs of mutable vs. immutable model deployments, on-the-fly JVM byte-code generation, global request batching, microservice circuit breakers, and dynamic cluster scaling - all from within a Jupyter notebook. All code and docker images are available from Github and DockerHub at http://pipeline.io.

Speakers

Chris Fregly

Research Scientist, PipelineAI
Chris Fregly is a Research Scientist at PipelineIO, a machine learning and artificial intelligence startup in San Francisco. Chris is an Apache Spark Contributor, Netflix Open Source Committer, founder of the Advanced Spark and TensorFlow Meetup, author of the upcoming book Advanced Spark, and creator of the upcoming O'Reilly video series Deploying and Scaling Distributed TensorFlow in Production. Previously, Chris was a...


Thursday May 18, 2017 10:00am - 10:50am
Biscayne

10:00am

Apache Ignite SQL Grid: Hot Blend of Traditional SQL and Swift Data Grid - Denis Magda, GridGain Systems Inc
In-memory data grids bring exceptional performance and scalability gains to applications built on top of them. Applications can achieve 10x performance improvements and become easily scalable and fault-tolerant thanks to the unique data grid architecture. However, because of this particular architecture, a majority of data grids have to sacrifice traditional SQL support, requiring application developers to completely rewrite their SQL-based code against data grid specific APIs.

This, however, is not true for all data grids. In this presentation, Denis will introduce Apache Ignite SQL Grid component that combines the best of two worlds - performance and scalability of data grids and traditional ANSI-99 SQL support of relational databases. Moreover, Denis will take an existing application that works with a relational database and will show how to run it on top of Ignite.

Speakers

Denis Magda

Product Manager, GridGain Systems Inc
Denis is an expert in distributed systems and platforms who developed his skills by consistently contributing to the Apache Ignite In-Memory Data Fabric and helping GridGain In-Memory Data Fabric customers build distributed and fault-tolerant solutions on top of their platform. Before joining GridGain and becoming a part of the Apache Ignite community, Denis worked for Oracle Inc., where he led the Java ME Embedded Porting Team, helping Java open new...


Thursday May 18, 2017 10:00am - 10:50am
Windsor

10:00am

Big Data Analytics Using Apache (Py)Spark For Analyzing IPO Tweets - Dirk Van den Poel, Ghent University
In this talk, we share our experience in researching and practicing Business Analytics with a strong emphasis on descriptive and predictive analytics. We discuss the usefulness of these open-source analytics platforms by means of a real-life case study in Finance and Marketing: Analyzing the interaction between tweets and the success of an initial public offering (IPO), and the post-IPO price evolution. Moreover, we build a predictive model to determine whether a tweet will be retweeted. We present our findings using a series of platforms ranging from (1) dedicated Apache Spark clusters using Python Zeppelin Notebooks to (2) Databricks’ cloud platform.

Speakers

Dirk Van den Poel

Professor of Data Analytics, Ghent University
Dirk Van den Poel (PhD) is Senior Full Professor of Data Analytics/Big Data at Ghent University, Belgium. He teaches courses such as Statistical Computing, Big Data, Predictive and Prescriptive Analytics. He co-authored 80+ international peer-reviewed publications in journals such as Journal of Statistical Software, Journal of Applied Econometrics, European Journal of Operational Research. He co-founded the first predictive analytics master...


Thursday May 18, 2017 10:00am - 10:50am
Trianon

10:50am

Coffee Break
Thursday May 18, 2017 10:50am - 11:20am
Mezzanine

11:20am

General Durable Object and Native Computing Model for Apache Big Data Platforms - Gang Wang, Intel
The real challenge of JVM-based, high-performance, real-time streaming and massive data processing is how to remove the major bottlenecks as a whole; local or small optimizations don't work in most cases due to intrinsic problems in fitting the hardware platform to software abstraction layers and patterns. The Mnemonic project proposes higher-level abstract models to address these problems as a whole. Its concepts aim at a new standard for big data platforms that leverages the full advantages of the latest server platforms to resolve bottlenecks such as SerDe/marshalling and garbage collection (GC) performance issues, the viewpoint difference between memory space and storage space, massive object caching, object sharing across clusters, and kernel caching issues. The project is also looking at optimizing large-scale neural network architectures on top of that.

Speakers

Gang Wang

Software Engineer with 12+ years of industrial development experience. He is one of the major developers of the Apache Mnemonic (incubating) project.


Thursday May 18, 2017 11:20am - 12:10pm
Alhambra

11:20am

Cluster Continuous Delivery with Oozie - Clay Baenziger, Bloomberg
Deploying software to secure, clustered Hadoop environments is a challenge. In particular, one must distribute keytabs, user identities and cluster configuration to build systems like Jenkins, to say nothing of network access to clusters. At Bloomberg, we ensure our clusters are defined via configuration management and can be automatically configured and operated. Application (HBase, Spark) deployment is a key part of this as well!

We have extended Oozie to provide deployment mechanisms for Git and plans for Maven artifacts allowing us to specify all cluster configuration including software deployed to that cluster. Often this consists of an Oozie workflow to deploy software, allowing for deployments to happen as the permissioned role account and not as a superuser.

Clay will walk through the process of these deployments and the code necessary to make these first-class Oozie actions.

Speakers

Clay Baenziger

Clay Baenziger is an architect for the Hadoop Infrastructure Team at Bloomberg. Clay comes from a diverse background in systems infrastructure and analytics. At Sun Microsystems, his team built out an automated bare-metal Solaris deployment tool for Solaris engineering labs, and later his contributions were core to the OpenSolaris Automated Installer. Providing a good introduction to Hadoop, his team at Opera Solutions built out a financial...


Thursday May 18, 2017 11:20am - 12:10pm
Balmoral

11:20am

Introduction to Cluster Management Framework and Metrics in Apache Solr - Anshum Gupta, IBM Watson
Cluster management APIs have been steadily added to recent versions of Apache Solr to make designing monitoring systems for Solr easier. However, those APIs have always required an advanced level of knowledge of the pre-checks and the APIs themselves. The cluster management framework in Solr is aimed at making cluster management easier. The combination of metrics reporting, triggers, and recipes allows users to configure actions based on triggers comprising metrics or changes to the cluster state, e.g. auto-addition of replicas to achieve a desired replication factor when a new node is added to the SolrCloud cluster. In this presentation, I will provide an overview of the cluster management framework, how it works, and its components, i.e. metrics, triggers, and recipes. I will also talk about ways to extend those components to suit specific use cases.

Speakers

Anshum Gupta

Sr. Software Engineer, IBM Watson
Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks and also the co-creator of AWS CloudSearch, the first search-as-a-service offering by AWS. He has spoken at multiple international conferences, including Apache Big Data...


Thursday May 18, 2017 11:20am - 12:10pm
Biscayne

11:20am

From Open Data to Open Information - Thomas Vanhove, Qrama
Smart cities gather massive amounts of data from IoT sensors all over the city and from external data sources provided by city services. These data sets are often made available to the public as open data sets, but while the data is openly available, using it for practical use cases still requires infrastructure.

Thomas Vanhove will present the City of Things, a smart city project in the city of Antwerp (Belgium), and how people gain access to open data and can run their own analysis with the Tengu platform. Tengu provides the functionality to create custom big data frameworks through automated installation, configuration and integration of big data technologies for storage and analysis. In the City of Things project this not only allows users access to open data but to infrastructure and analysis as well.

Speakers

Thomas Vanhove

Co-founder - CEO, Qrama
Thomas obtained his master's degree in Computer Science from Ghent University, Belgium in July 2012. In August 2012, he started his PhD at the Information Technology department, researching the means for reaching truly dynamic storage and polyglot persistence so as to increase application performance. He is expected to defend in October 2017. During his PhD he developed an initial version of the Tengu platform, a toolset that aids researchers and...


Thursday May 18, 2017 11:20am - 12:10pm
Trianon

12:20pm

Construct a Sharable GPU Farm for Data Scientists - Layne Peng, EMC
With the development of machine learning algorithms, GPUs are winning the favor of data scientists. But the high cost of GPU devices and the low utilization caused by static allocation are heavy burdens, both financially and in management terms, when introducing GPUs to a data science team. In this presentation, we will introduce our latest research on enabling GPU virtualization, chaining GPUs into one shared logical instance based on an intelligent queue model. In this model, the logical server can present a GPU service to one or more clients that represents GPUs local to the data center, GPUs in the cloud, or some hybrid combination of local and remote GPUs executing the client application. The allocation of GPU resources is intelligently controlled based on attributes of the task, running concurrently where possible on a GPU or pre-empted to manage higher-priority activity.

Speakers

Layne Peng

EMC
Principal Technologist and Architect at EMC, leading Cloud Management & Orchestration and Converged Infrastructure initiatives in the EMC Office of the CTO, China. Holds thirteen patents related to cloud, SDDC and big data, and is one of the authors of the book Big Data Strategy, Technology and Application.


Thursday May 18, 2017 12:20pm - 1:10pm
Alhambra

12:20pm

MOHA: Many-Task Computing Framework on Hadoop - Soonwook Hwang, Korea Institute of Science and Technology Information
In this talk, we present the design and implementation of the MOHA (MTC on Hadoop) framework, which can effectively combine Many-Task Computing (MTC) technologies with the Big Data platform Hadoop to enable richer data analytics workflows in the ecosystem. MTC is a computing paradigm that can consist of, e.g., millions of small tasks where each task communicates through files, resulting in another type of data-intensive workload. MOHA is developed as a YARN application so that it can transparently co-host existing MTC applications with other Big Data processing frameworks in a single Hadoop cluster. MOHA can substantially reduce the overall execution time of many-task processing with a minimal amount of resources compared to an existing Hadoop YARN application, by effectively exploiting open-source distributed message queues (Apache ActiveMQ, Kafka) and a streamlined task dispatching mechanism.

Speakers

Soonwook Hwang

Principal Researcher, KISTI
Dr. Soonwook Hwang is a principal researcher at Korea Institute of Science and Technology Information (KISTI), where he is responsible for the research and development of enabling technologies for the realization of cyber infrastructure for Korea. KISTI is running the biggest national supercomputing facility in Korea, providing expertise as well as computational resources aimed to enable scientific discovery for Korean scientists and engineers...


Thursday May 18, 2017 12:20pm - 1:10pm
Balmoral

12:20pm

Distributed Resource Scheduling Frameworks, Is There a Clear Winner? - Naganarasimha Garla & Varun Saxena, Huawei Technologies
Coming from the Hadoop world, we were aware only of YARN as a distributed resource scheduling framework, but of late we have come across several other scheduling frameworks such as Mesos and Kubernetes. It is challenging to pick the right scheduling framework for an enterprise, as superficially they all look the same. In this presentation, we provide an overview of the architectures of prominent scheduling frameworks and then compare them functionally. We also present which suits better in which scenarios, with a brief overview of the community activity around these projects.

Speakers

Naganarasimha Garla

System Architect, Huawei Technologies Pvt Ltd
I am a Big Data enthusiast and have been developing Big Data Hadoop applications and platforms for five years, with 12 years of experience as a Java software developer. I have been actively contributing to Hadoop YARN and MapReduce for 2.5 years and am currently an Apache Hadoop Committer. Further details: http://people.apache.org/~naganarasimha_gr/ & http://in.linkedin.com/in/naganarasimha-garla-a620297 .

Varun Saxena

Senior Technical Leader, Huawei Technologies
I am currently working as a Senior Tech Lead in Huawei's Hadoop Team which provides big data solutions to multiple product lines in Huawei and contributes to Hadoop community. I am also an Apache Hadoop Committer and have been contributing to YARN for almost 2.5 years. Overall, I have 8 years of experience developing fault tolerant, distributed systems.


Thursday May 18, 2017 12:20pm - 1:10pm
Biscayne

12:20pm

Presto - Swiss Army SQL Knife on Hadoop - Marek Gawiński & Dariusz Eliasz, Allegro Group
Waiting for Hive queries to finish teaches your analysts patience and respect for technology. Unfortunately, that is not what they expect, and not what you get paid for. Interactive SQL on Hadoop has been the Holy Grail within the Hadoop community and for our analysts at Allegro, the biggest e-commerce platform in central-eastern Europe. We read several benchmark papers on alternatives to Hive and ran benchmarks of our own, but they did not answer the question of which one to choose, or whether it is worth adding a Hive alternative to the existing stack. Some technologies performed better with Parquet, others with ORC. None of the benchmarks consider user experience, new technology adoption within an existing stack, or productivity of query development. In this talk we present how we ended up with Presto, and our tips and tricks to hack it.

Speakers

Dariusz Eliasz

Senior Data Platform Engineer, Allegro
Mainly interested in big data platform architecture and data governance. Enthusiast of scalable distributed solutions, processing large amounts of data and continuous improvement.

Marek Gawiński

Senior Data Platform Engineer, Allegro Group Sp. z o.o.
For six years Marek has been in the Infrastructure and Services Maintenance Team, where he takes care of technical support for the scrum teams and maintenance of multiple services included in the Allegro Group's portfolio. He is now developing big data solutions. Passionate about web technologies and open source, he is now a Senior Data Platform Engineer and deals with the Hadoop ecosystem in Allegro Group. His responsibilities include maintenance of private and public...


Thursday May 18, 2017 12:20pm - 1:10pm
Windsor

12:20pm

ING CoreIntel: On The Bank Secret Service - Krzysztof Adamski, ING
Security is at the core of every bank activity. ING set an ambitious goal: to have insight into overall network data activity. The purpose is to quickly recognize and neutralize unwelcome guests such as malware and viruses, and to prevent data leakage or track down misconfigured software components. Since the inception of the CoreIntel project we knew we were going to face the challenges of capturing, storing and processing vast amounts of data of various types from all over the world. In our session we would like to share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop and Elasticsearch to help us achieve these goals: why choosing a good data format matters, why dealing with Elasticsearch is a love-hate relationship for us, and how we managed to implement persistence in an OpenShift cluster.

Thursday May 18, 2017 12:20pm - 1:10pm
Trianon

1:10pm

Lunch (Attendees on Own)
Thursday May 18, 2017 1:10pm - 2:40pm
TBA

2:40pm

Performance Benchmarking in Open-Source at Amazon EMR - Stephen Tak Lon Wu, Amazon AWS EMR
Amazon EMR is a cloud-based service that allows companies, research centers and academic divisions to leverage managed clusters at massive scale. In order to maintain and achieve performance in the open-source world of big data processing, Amazon EMR built an automatic performance benchmarking pipeline to help validate each new release before it ships. Why do we need this performance benchmarking pipeline? Open source communities move fast; innovations and implementations often need multiple iterations in order to work effectively at massive scale. Amazon EMR aims to provide a stable service; historical performance metrics help us preview and capture issues in each product before releasing to the market, while Amazon EMR follows open source releases closely.

Speakers

Stephen Tak Lon Wu

Tak Lon (Stephen) Wu is a software development engineer at Amazon EMR. Before joining the company, he was working toward his PhD at Indiana University and reached candidacy in late 2015. His research interests are big data application analysis, MapReduce, data mining and performance benchmarking. At Amazon EMR, Stephen is a member of the EMR application team that builds releases and contributes patches internally and externally to open...


Thursday May 18, 2017 2:40pm - 3:30pm
Alhambra

2:40pm

Transactions in Hadoop - Andreas Neumann, Cask
In the age of NoSQL, big data storage engines such as HBase have given up ACID semantics of traditional relational databases, in exchange for high scalability and availability. However, it turns out that in practice, many applications require consistency guarantees to protect data from concurrent modification in a massively parallel environment. In the past few years, several transaction engines have been proposed as add-ons to HBase: Three different engines, namely Omid, Tephra, and Trafodion were open-sourced within the Apache ecosystem alone. In this talk, Andreas Neumann will introduce and compare the different approaches from various perspectives including scalability, efficiency, operability and portability, and make recommendations pertaining to different use cases.

Speakers

Andreas Neumann

Cask
Andreas Neumann develops big data software at Cask, and has formerly done so at places that are known for massive scale. He was the chief architect for Hadoop at Yahoo! and also for the foundational content management system that Yahoo! built on Hadoop. Previously he was a research engineer at Yahoo! and a search architect at IBM. Andreas holds a doctoral degree in computer science for his work on querying XML documents.


Thursday May 18, 2017 2:40pm - 3:30pm
Balmoral

2:40pm

Lessons Learned with Spark & Cassandra - Matthias Niehoff, codecentric AG
We built multiple applications based on Apache Cassandra and Apache Spark. During these projects we encountered a number of challenges and problems with both technologies, as well as with the Spark-Cassandra-Connector. In this talk we want to outline a few of those problems and our actions to solve them. Furthermore, we want to give best practices which turned out to be useful in our projects; see the sketch after the list. Topics include, but are not limited to:
  • Cassandra Bucketing
  • Spark Partitioning
  • Efficient Queries
  • Spark Join With Cassandra Table
  • Spark Data Locality
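
A minimal sketch of the listed join topic, assuming the DataStax Spark-Cassandra-Connector is on the classpath; the keyspace, table and host are hypothetical:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of joining an RDD of keys directly against a Cassandra table:
    // only the matching partitions are fetched, avoiding a full table scan.
    object CassandraJoinSketch extends App {
      val conf = new SparkConf()
        .setAppName("cassandra-join")
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val sc = new SparkContext(conf)

      val userIds = sc.parallelize(Seq(Tuple1("alice"), Tuple1("bob")))
      val rows = userIds.joinWithCassandraTable("shop", "orders_by_user")
      rows.collect().foreach { case (key, row) => println(s"$key -> $row") }
    }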

Speakers

Matthias Niehoff

IT Consultant, codecentric AG
Matthias works as an IT Consultant at codecentric AG in Germany. His focus is on big data & streaming applications with Apache Cassandra & Apache Spark. Yet he does not lose track of other tools in the area of big data. Matthias shares his experiences at conferences, meetups and user groups.


Thursday May 18, 2017 2:40pm - 3:30pm
Biscayne

2:40pm

Scala + SQL = Union of Two Equals in Spark - Jayesh Thakrar, Conversant
Spark's capabilities as a better and faster Hadoop, as a distributed Scala platform, and as an interactive, batch and streaming environment are quite well known. But its prowess as a multilingual platform has not received sufficient spotlight. Traditionally, RDBMS environments needed to glue together set-oriented SQL with specialized row-level procedural languages (e.g. PL/SQL), or use APIs in non-SQL languages, e.g. JDBC. In Spark, however, the confluence of Scala and SQL is one of two equals: both are set- or collection-oriented, but each has its own unique strengths. This presentation will illustrate, with background and examples, how to exploit this fusion of Scala and SQL in a way that takes advantage of both their strengths and boosts productivity.
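
A minimal sketch of that fusion, with hypothetical data: the same SparkSession moves between set-oriented SQL and collection-oriented Scala operations:

    import org.apache.spark.sql.SparkSession

    // The same SparkSession moves between set-oriented SQL and
    // collection-oriented Scala; data and names are hypothetical.
    object ScalaSqlUnion extends App {
      val spark = SparkSession.builder
        .appName("scala-sql").master("local[*]").getOrCreate()
      import spark.implicits._

      val clicks = Seq(("alice", 3), ("bob", 5), ("alice", 2)).toDF("user", "clicks")
      clicks.createOrReplaceTempView("clicks")

      // Set-oriented SQL for the aggregation...
      val totals = spark.sql("SELECT user, SUM(clicks) AS total FROM clicks GROUP BY user")

      // ...then Scala collection semantics on the very same data.
      totals.as[(String, Long)].filter(_._2 > 4).collect().foreach(println)
    }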

Speakers

Jayesh Thakrar

Officially, Jayesh Thakrar is a Sr. Data Engineer at Conversant (http://www.conversantmedia.com/). But in reality he is a data geek who gets to build and play with large data systems consisting of Hadoop, HBase, Ambari, Flume and Kafka. To rest after a good day's work, he uses OpenTSDB to keep an eye on all the systems.


Thursday May 18, 2017 2:40pm - 3:30pm
Windsor

2:40pm

Secure, UI-Driven Spark/Flink/Kafka-as-a-Service - Jim Dowling, Royal Institute of Technology
Since June 2016, SICS Swedish ICT has provided Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service to researchers in Sweden. We have developed a UI-driven multi-tenant platform (Apache v2 licensed) in which researchers securely develop and run their applications. Applications can be either deployed as jobs (batch or streaming) or written and run directly from Notebooks in Apache Zeppelin. All applications are run on YARN within a security framework built on project-based multi-tenancy. A project is simply a grouping of users and datasets. Datasets are first-class entities that can be securely shared between projects. Our platform also introduces a necessary condition for elasticity: pricing. Application execution time in YARN is metered and charged to projects, that also have HDFS quotas for disk usage. We also support project-specific Kafka topics that can also be securely shared.

Speakers
Jim Dowling

Jim Dowling is an Associate Professor at KTH Royal Institute of Technology in Stockholm as well as a Senior Researcher at SICS Swedish ICT. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). He is the lead architect of Hops, a next-generation, open-source distribution of Hadoop with support for distributed metadata. He is a regular speaker at Hadoop industry events.


Thursday May 18, 2017 2:40pm - 3:30pm
Trianon

2:40pm

Sponsor Showcase
Thursday May 18, 2017 2:40pm - 4:40pm
Mezzanine

3:40pm

Streamline Hadoop DevOps with Apache Ambari - Alejandro Fernandez, Hortonworks
Apache Ambari has become an indispensable tool for operating Hadoop clusters ranging from 20 to 2,000 nodes. Ambari’s knowledge of the Hadoop stack allows it to deploy a cluster within minutes and manage the entire lifecycle: scaling, security, upgrades, and more. The speaker will discuss central features such as deploying clusters with Blueprints, adding custom services, scaling the number of hosts, adding high availability, securing with MIT Kerberos, upgrading the Hadoop stack with features like Rolling & Express Upgrade, and using the REST API to automate workflows. For users and data scientists, Ambari provides LDAP sync, role-based access control to handle user permissions, and a framework to host Ambari Views. Lastly, he will cover how to monitor the health of the cluster via Alerts and troubleshoot using LogSearch and the Ambari Metrics System integrated with the Grafana UI.
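To give a flavor of the REST automation mentioned above, here is a minimal Scala sketch of a single lifecycle call, asking Ambari to start a service; the host, cluster name, and credentials are placeholders:

    import java.net.{HttpURLConnection, URL}
    import java.util.Base64

    object AmbariRestSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder endpoint and credentials for a hypothetical cluster.
        val auth = Base64.getEncoder.encodeToString("admin:admin".getBytes("UTF-8"))
        val url = new URL(
          "http://ambari.example.com:8080/api/v1/clusters/mycluster/services/HDFS")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("PUT")
        conn.setRequestProperty("Authorization", s"Basic $auth")
        conn.setRequestProperty("X-Requested-By", "ambari") // required on mutating calls
        conn.setDoOutput(true)

        // Request the STARTED state for the service; "INSTALLED" would stop it.
        val body =
          """{"RequestInfo":{"context":"Start HDFS via REST"},
            |"Body":{"ServiceInfo":{"state":"STARTED"}}}""".stripMargin
        conn.getOutputStream.write(body.getBytes("UTF-8"))
        println(s"Ambari responded: ${conn.getResponseCode}")
        conn.disconnect()
      }
    }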

Speakers
Alejandro Fernandez

Alejandro Fernandez has been a PMC member of the Apache Ambari project since 2014 and is a software engineer at Hortonworks. He has made significant code contributions to Apache Ambari, has organized and participated in hackathons, and has been a speaker at Hadoop Summit in San Jose and Melbourne and at the Global Big Data Conference. He graduated from Carnegie Mellon University, where he got his Bachelor of Science in Computer Science and additional... Read More →


Thursday May 18, 2017 3:40pm - 4:30pm
Balmoral

3:40pm

A Practical Approach to Using Graph Databases and Analytics - Greg Jordan, Graph Story
While graph databases have become a standard for social networking and recommendation engines, the practical use of graphs in areas beyond consumer applications is growing. In this presentation, supported by use cases, we will explore how graph databases can be applied to other domains, such as logistics and healthcare, and look at where graphs can leverage other data systems. The presentation will also cover the role of graphs in going beyond predictive analytics to providing prescriptive analytics.

Speakers
Greg Jordan

CEO, Graph Story
Greg Jordan is the Founder & CEO of Graph Story, author of Practical Neo4j, and has over 15 years of programming experience in various languages with a focus on data analytics and mobile projects. Greg is an avid speaker and writer on the topic of graph databases and has been working with them since 2011. Greg holds two Master's degrees and is a Ph.D. candidate at the University of Memphis.


Thursday May 18, 2017 3:40pm - 4:30pm
Biscayne

3:40pm

A Smarter Pig - Eli Levine, Salesforce & Julian Hyde, Hortonworks
What if Apache Pig had a SQL front-end and query optimizer? What if Apache Calcite was able to use Pig and MapReduce to run queries? In this project, we aimed to answer both questions by adding a Pig adapter for Calcite. In this talk, we describe Calcite's adapter framework, how we used it to write a Pig adapter, and how you can use this SQL interface to Pig for interactive and long-running queries.
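Calcite adapters are typically consumed through a JSON model handed to Calcite's JDBC driver, which then exposes whatever the adapter maps as SQL tables. The Scala sketch below shows that general pattern; the schema-factory class name and operand are assumptions made for illustration, so consult the Pig adapter's documentation for the exact spelling:

    import java.sql.DriverManager

    object CalcitePigSketch {
      def main(args: Array[String]): Unit = {
        // An inline Calcite model that plugs in an adapter's schema factory.
        // Factory class and operand are assumed for this sketch.
        val model =
          """inline:{
            |  "version": "1.0",
            |  "defaultSchema": "PIG",
            |  "schemas": [{
            |    "name": "PIG",
            |    "type": "custom",
            |    "factory": "org.apache.calcite.adapter.pig.PigSchemaFactory",
            |    "operand": {"dataFile": "/tmp/orders.tbl"}
            |  }]
            |}""".stripMargin

        // Calcite plans the SQL and hands execution to the adapter.
        val conn = DriverManager.getConnection(s"jdbc:calcite:model=$model")
        val rs = conn.createStatement().executeQuery(
          "SELECT COUNT(*) FROM \"orders\"")
        while (rs.next()) println(rs.getLong(1))
        conn.close()
      }
    }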

Speakers
Julian Hyde

Architect, Hortonworks
Julian Hyde is an expert in query optimization, in-memory analytics, and streaming. He was the initial developer of Apache Calcite and is a PMC member of Drill, Kylin and Eagle. He is an architect at Hortonworks.
Eli Levine

Architect, Salesforce
Eli Levine is an architect at Salesforce building large scale storage and compute systems. He is a PMC member of Apache Phoenix.


Thursday May 18, 2017 3:40pm - 4:30pm
Windsor

3:40pm

Applying Apache Big Data Stack for Science-Centric Use Cases - Suresh Marru, Indiana University
This talk will discuss the adaptation of Apache big data technologies to analyze large, self-describing, structured scientific data sets. We will present initial results for the problem of analyzing petabytes of weather forecasting simulation data produced as part of the National Oceanic and Atmospheric Administration's annual Hazardous Weather Testbed. The challenge is to enable weather researchers to perform investigative queries over the full forecast simulation outputs to find the signatures of severe weather phenomena like tornadogenesis. Given the size of the data and the complexity of weather phenomena, these data sets are candidates for exploration by machine learning techniques that can identify heretofore unknown relationships in the dozens of weather parameters generated by the simulations, guiding researchers toward developing new scientific models.

Speakers
Suresh Marru

Member, Apache Software Foundation
Suresh Marru is a Member of the Apache Software Foundation and the current PMC chair of the Apache Airavata project. He is the deputy director of the Science Gateways Research Center at Indiana University. Suresh focuses on research topics at the intersection of application domain science and computational and distributed systems, and has authored or co-authored over 75 peer-reviewed conference papers and journal articles in these areas. He gets his... Read More →


Thursday May 18, 2017 3:40pm - 4:30pm
Trianon

4:40pm

Docker on Hadoop - Daniel Templeton, Cloudera, Inc.
Apache Hadoop is a powerful platform for processing large volumes of structured, semi-structured, and unstructured data. Docker is an exciting technology for containerizing workloads. Combining the two can solve a number of issues for big data practitioners. In this talk, Daniel Templeton will walk the audience through the current level of Docker support in Hadoop, where it falls short, and how best to take advantage of it. Daniel will also cover the ongoing community work, including its impact and expected availability.
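For orientation, and not as the speaker's material: YARN's Docker integration is opt-in per application, selected through container environment variables. A minimal Scala sketch that drives the stock distributed-shell example this way, with a hypothetical jar path:

    import scala.sys.process._

    object DockerOnYarnSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical path to the distributed-shell jar. The two
        // YARN_CONTAINER_RUNTIME_* variables ask YARN's LinuxContainerExecutor
        // to launch each container inside the named Docker image.
        val dsJar =
          "/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell.jar"
        val cmd = Seq(
          "yarn", "jar", dsJar,
          "-jar", dsJar,
          "-shell_command", "cat /etc/os-release",
          "-shell_env", "YARN_CONTAINER_RUNTIME_TYPE=docker",
          "-shell_env", "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7")
        println(cmd.!!) // submit through the YARN client and print its output
      }
    }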

Speakers
Daniel Templeton

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today Daniel works on the YARN development team at Cloudera, focused on the resource manager, fair scheduler, and Docker support, and is a Hadoop committer. Daniel has spoken at numerous conferences, including JavaOne, Hadoop Summit, and Strata+Hadoop World.


Thursday May 18, 2017 4:40pm - 5:30pm
Balmoral

4:40pm

Multi-Model Big Data Platform for Complex Real Estate Analytics - Karthik Karuppaiya, Ten-X
Building an online real-estate marketplace is an extremely complex, high-touch business. The data the business deals with ranges from scanned PDFs and complex Excel spreadsheets to transactional RDBMSs and clickstream data. Data engineering at Ten-X has spent the last couple of years building a highly effective multi-model data platform that brings all of this data together and analyzes it to help the business make better decisions and move faster. In this talk we will cover how our data platform evolved, including the technology choices we made and why we made them. Our data lake is built as a multi-model platform on top of technologies including Hadoop, JanusGraph, Spark, Hive, Cassandra, and HBase. We will also introduce some of the complex pattern-matching algorithms and natural language processing techniques we have implemented on our platform.

Speakers
Karthik Karuppaiya

Sr. Engineering Manager, Data and Analytics, Ten-X
Karthik leads the data engineering team at Ten-X and has been working on Hadoop and NoSQL technologies since 2010. He is currently helping to build the next-generation data platform for Ten-X using Hadoop, Kafka, JanusGraph, Spark, and Cassandra. Prior to Ten-X, he led the big data engineering team at Symantec and helped build their multi-petabyte-scale real-time and batch analytics platform, which brought in 70TB of new data every day. Presented in multiple... Read More →


Thursday May 18, 2017 4:40pm - 5:30pm
Windsor
  • Experience Level Any

4:40pm

Advertising on Google and Traffic Experimentation Platform in eBay - Yi Liu & Martin Zhang, eBay
eBay is one of the largest e-commerce companies in the world, providing C2C and B2C sales services via the Internet. eBay has more than 400 million users (160 million active) and more than 1 billion items for sale on its site. We built an advertising and experimentation platform for search networks such as Google and Bing, based on Hadoop, Spark, Kafka, etc. In this session, we introduce our advertising and experimentation platform and show how the experimentation platform supports A/B testing and running different science models.
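The abstract does not describe the assignment mechanism, but as a generic illustration of the deterministic traffic splitting that A/B platforms commonly use (explicitly not eBay's implementation), a sketch in Scala:

    import scala.util.hashing.MurmurHash3

    object AbBucketSketch {
      // Stable 0-99 bucket per user: the same id always hashes to the same
      // bucket, so assignments survive restarts and repeated page views.
      def bucket(userId: String): Int = {
        val h = MurmurHash3.stringHash(userId)
        ((h % 100) + 100) % 100 // normalize to a non-negative bucket
      }

      def variant(userId: String): String =
        if (bucket(userId) < 10) "treatment" // 10% of traffic
        else "control"                       // remaining 90%

      def main(args: Array[String]): Unit =
        Seq("user-1", "user-2", "user-3").foreach(u => println(s"$u -> ${variant(u)}"))
    }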

Speakers
Yi Liu

Architect, ebay
Yi Liu (刘轶) has been a committer and PMC member of Apache Hadoop for years. He is currently the lead architect for Paid IM (Internet Marketing) at eBay, where he leads the architecture design for the ads, marketing data, and experimentation platforms, using Hadoop, Spark, Kafka, Cassandra, and other open source projects to build them. Before joining eBay, he worked at Intel for 6 years as an architect for big data infrastructure, where he led Hadoop... Read More →


Thursday May 18, 2017 4:40pm - 5:30pm
Trianon