Apache: Big Data North America 2017 will be held at the InterContinental Miami in Miami, Florida.

Register now for the event taking place May 16-18, 2017. 


Sunday, May 14
 

9:00am

Apache Traffic Server and Traffic Control Summit (separate RSVP and Registration Required)
The Apache Traffic Server and Traffic Control Summit is a two-day event taking place just prior to ApacheCon North America. Further details, including the schedule, can be found on the Apache Traffic Server wiki page.

Registration and a $150 fee are required for this Summit.

Sunday May 14, 2017 9:00am - 5:00pm
Alhambra / Escorial
 
Monday, May 15
 

7:00am

Morning Run
Please meet in the InterContinental Miami Lobby at 7am.  For any questions, contact: jfclere@gmail.com.

Monday May 15, 2017 7:00am - 8:00am
InterContinental Miami Lobby

9:00am

Apache Traffic Server and Traffic Control Summit (separate RSVP and Registration Required)
The Apache Traffic Server and Traffic Control Summit is a two-day event taking place just prior to ApacheCon North America. Further details, including the schedule, can be found on the Apache Traffic Server wiki page.

Registration and a $150 fee are required for this Summit.

Monday May 15, 2017 9:00am - 5:00pm
Alhambra / Escorial

9:30am

BarCampApache
BarCampApache is a BarCamp facilitated by a group of people involved in the Apache Software Foundation (ASF). Because the ASF is helping to organize, there will be many people around who know a lot about Apache projects, communities, and technologies, so quite a few sessions are typically proposed in those areas. It is not exclusively Apache, though: all topics are welcome, and everyone should come to talk about fun new ideas, projects, and technologies.

BarCampApache will be a dynamic get-together open to the public. Like other unconferences, the schedule will be determined by the participants, Apache and non-Apache alike. We strongly encourage people to come along and share their knowledge and ideas; we want it to be a great day of sharing for everyone, not just those at the event. Everyone coming in for the conference is encouraged to come early, as it will be a great day for all.

(Please note: while BarCampApache is free to attend, you will need to register for Apache: Big Data if you wish to attend the conference sessions.)

Monday May 15, 2017 9:30am - 3:00pm
Rafael
  • Experience Level Any

4:00pm

Pre-registration Open
Monday May 15, 2017 4:00pm - 6:00pm
Mezzanine
 
Tuesday, May 16
 

7:00am

Morning Run
Please meet in the InterContinental Miami Lobby at 7am.  For any questions, contact: jfclere@gmail.com.

Tuesday May 16, 2017 7:00am - 8:00am
InterContinental Miami Lobby

8:00am

Breakfast
Tuesday May 16, 2017 8:00am - 9:00am
Ballroom Foyer

8:00am

Sponsor Showcase
Tuesday May 16, 2017 8:00am - 12:55pm
Ballroom Foyer

8:00am

Registration
Tuesday May 16, 2017 8:00am - 6:00pm
Mezzanine

9:00am

Keynote: State of the Feather - Sam Ruby, President, Apache Software Foundation
Speakers
avatar for Sam Ruby

Sam Ruby

President, Apache Software Foundation
Sam Ruby is a prominent software developer who has made significant contributions to many of the Apache Software Foundation's open source software projects, and to the standardization of web feeds via his involvement with the Atom web feed standard and the feedvalidator.org web service. He is the co-chair of the... Read More →


Tuesday May 16, 2017 9:00am - 9:20am
Versailles Ballroom

9:25am

Keynote: Training Our Team in the Apache Way - Alan Gates, Co-founder, Hortonworks

Hortonworks contributes to a number of Apache projects. When we started, we depended on our many experienced Apache community members to train their fellow Hortonworkers in the Apache Way. However, as we grew, we found that training "by osmosis" was no longer sufficient. So we have instituted training for our teams in what Apache is, how it works, their responsibilities as part of Apache and how that meshes with their responsibilities as Hortonworkers, as well as a practical list of best practices and what to avoid. This talk will share some thoughts on the need for this training, give an overview of the content, talk about the results we have seen, and discuss how we are now working to roll this out beyond engineering into the rest of the company.


Speakers
avatar for Alan Gates

Alan Gates

Co-founder and Architect, Hortonworks
Alan is a founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan has done extensive work in Hive, including adding ACID transactions. Alan has a BS in Mathematics fro... Read More →


Tuesday May 16, 2017 9:25am - 9:45am
Versailles Ballroom

9:50am

Keynote: Apache CouchDB: A Tale of Community, Cooperation and Code - Adam Kocoloski, Fellow and CTO for the Watson Data Platform, IBM

There is no shortage of reasons why an open source project can stagnate. Yet despite confronting many of these challenges, Apache CouchDB has been resilient in the nearly 10 years since becoming an Apache Software Foundation project, to the point where today its codebase and community are about as strong as they've ever been. The constant thread throughout the life of the project has been the consistent support of the ASF and IBM.

Adam Kocoloski, CTO of IBM Watson Data Platform, co-founder of Cloudant and PMC member for CouchDB, shares his perspective on what IBM finds so valuable about the Apache Software Foundation, through the lens of projects like CouchDB, Apache Spark, Apache Edgent and Apache OpenWhisk.


Speakers
avatar for Adam Kocoloski

Adam Kocoloski

Fellow and CTO for the Watson Data Platform organization, IBM
Adam is an IBM Fellow and CTO for the Watson Data Platform organization. He joined IBM in 2014 via the acquisition of Cloudant, where he built a highly available, scalable database and drove the development of the systems required to offer the database as a service. Adam's record... Read More →


Tuesday May 16, 2017 9:50am - 10:05am
Versailles Ballroom

10:10am

Keynote: Digital Psychometrics and its Future Effects on Technology - Sandra Matz, Computational Social Scientist

The importance of digital psychometrics – that is, the assessment of psychological characteristics via digital footprints – was highlighted recently in the context of Trump’s unexpected victory in the U.S. presidential election. According to international media reports, Trump’s campaign used detailed psychological profiles of 220 million US citizens to target them with more than 175,000 different versions of personalized ads that catered to their values and preferences. In line with the public debate around the effectiveness as well as broader implications of such predictive technologies, this talk focuses on the following three questions: (1) How does digital psychometrics work? (2) What are the potential benefits and dangers of digital psychometrics? (3) And finally, what does the future of digital psychometrics hold and how will it affect technology?


Speakers
avatar for Sandra Matz

Sandra Matz

Computational Social Scientist, University of Cambridge
Sandra Matz is currently enrolled as a PhD student in the Department of Psychology. After spending a year at the University of Cambridge in 2011/2012, she graduated from the University of Freiburg (Germany) with a 1st Class honours degree in Psychology (BSc) in 2013. Sandra is funded by the German National Academic Foundation, which is Germany’s largest and most prestigious funding body. Combining a strong background in methods and statistics with an interest in real-life business needs, her research applies fundamental psychological theory to business contexts. Working with companies around the world, she currently investigates the potential of using predictions of individual differences (mainly personality) from digital footprints in digital marketing and recruitment. In June 2014 Sandra was awarded the... Read More →


Tuesday May 16, 2017 10:10am - 10:30am
Versailles Ballroom

10:30am

Coffee Break
Tuesday May 16, 2017 10:30am - 11:05am
Ballroom Foyer

11:05am

OODT 2.0: The Future Of Distributed Data Management - Tom Barber, Meteorite Consulting
OODT, originally developed by NASA, provides distributed data management. In this talk we will look at the history of OODT and what is coming in OODT 2.0 to provide a more modern infrastructure to manage your data and metadata.

OODT 2.0 will offer much improved big data connectivity, workflow processing and deployment techniques, allowing for easier distribution and scaling of the platform. We will run through a sample deployment and show how beneficial using OODT to process your incoming data can be.

Speakers
avatar for Tom Barber

Tom Barber

Technical Director, Spicule LTD
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently... Read More →


Tuesday May 16, 2017 11:05am - 11:55am
Alhambra

11:05am

Support Apache Cassandra in Production - Anuj Wadehra, Ericsson
One of the prime challenges in using an open source database like Apache Cassandra is building effective support for production deployments. In this presentation, Anuj Wadehra, who currently works as a Cassandra designer at Ericsson, will explain the challenges associated with an open source distributed database such as Apache Cassandra, operational best practices, some common issues you can expect in production, and how to overcome them.

Speakers
avatar for Anuj Wadehra

Anuj Wadehra

Architect, Ericsson
Anuj Wadehra is an Apache Cassandra enthusiast with around 10 years of IT experience. Currently, he works as an Architect and Cassandra SME with Ericsson R&D. He is an active contributor on Apache Cassandra mailing lists. He has designed and implemented multiple distributed, faul... Read More →


Tuesday May 16, 2017 11:05am - 11:55am
Windsor

11:05am

Apache Mahout: An Extendable Machine Learning Framework for Spark and Flink - Trevor Grant, IBM
A serious issue when developing distributed machine learning algorithms is the lack of people who understand the mathematics, distributed data, AND have free time. Further, most distributed engines have APIs that were not designed to be mathematically expressive, so implementations are hard to follow and another qualified person must review them. The Mahout project has spent two years building modular system bindings for distributed engines such as Apache Spark and Apache Flink, native solvers to enable CPU/GPU acceleration, an abstracted R-like Scala DSL for tensor algebra on distributed matrices, and a consistent API to implement distributed algorithms. This creates an extendable and new-contributor-friendly framework for machine learning. We'll also discuss the project vision for creating a CRAN-like repository of user-contributed algorithms and how we are evangelizing this vision.


Speakers
TG

Trevor Grant

IBM
Trevor Grant is a PMC member on the Apache Mahout project, and a contributor to the Apache Streams (incubating), Apache Zeppelin, and Apache Flink projects. By day he is an Open Source Technical Evangelist at IBM. In former roles he called himself a data scientist, but the term is so ov... Read More →


Tuesday May 16, 2017 11:05am - 11:55am
Balmoral

11:05am

Starting with Apache Spark, Best Practices and Learning from the Field - Felix Cheung, Microsoft
Apache Spark is one of the most popular Big Data platforms. In this talk we will give a quick introduction to some of the high-level concepts in Spark and its various modules: SQL, Streaming, ML, Graph and Structured Streaming.

Then we will go through some of the current Best Practices to operationalize Spark for better performance in production, and tips to detect and avoid some of the most common issues.

And lastly we will explore how some enterprises are building solutions with Spark.

Speakers
avatar for Felix Cheung

Felix Cheung

Principal Engineer, Microsoft
Felix Cheung is a Committer of Apache Spark, a PMC/Committer of Apache Zeppelin and PPMC/Committer of Apache MXNet (incubating). He has been active in the Big Data space for 3+ years, he is a co-organizer of the Seattle Spark Meetup, presented several times and he was a teaching... Read More →


Tuesday May 16, 2017 11:05am - 11:55am
Trianon

11:05am

eBay Real-time Business Insight with Streaming Engine Built on Apache Kylin - Ken Wang, eBay

Real-time data insight is becoming more important for trend capturing and just-in-time decision making. As one of the world's largest and most vibrant marketplaces, eBay relies on real-time data analysis in multiple domains to run the business, such as user info protection, promotion prediction, and site performance detection and monitoring.

In this session Ken will introduce a new zero-latency streaming OLAP engine built on Apache Kylin and explain how it serves eBay's real-time data analysis business. The new Kylin streaming engine uses column-based storage and indexes, as well as in-memory query techniques, to make real-time data visible with no latency. The new streaming engine will also provide exactly-once delivery semantics to ensure data quality when used together with Apache Kafka.
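The column-based storage and indexing the abstract describes can be illustrated with a toy sketch (pure Python, not Kylin code; the class and field names here are invented for illustration): events are stored one list per column, with a per-dimension inverted index, so an aggregation over a filtered dimension touches only the matching rows rather than scanning whole events.

```python
from collections import defaultdict

class ColumnarSegment:
    """Toy in-memory columnar store: one list per column plus a
    per-column inverted index mapping value -> row ids."""
    def __init__(self, columns):
        self.columns = {c: [] for c in columns}
        self.index = defaultdict(lambda: defaultdict(list))

    def append(self, row):
        rid = len(next(iter(self.columns.values())))
        for col, val in row.items():
            self.columns[col].append(val)
            self.index[col][val].append(rid)

    def sum_where(self, measure, dim, value):
        # Visit only the row ids the inverted index says match.
        col = self.columns[measure]
        return sum(col[rid] for rid in self.index[dim][value])

seg = ColumnarSegment(["site", "latency_ms"])
seg.append({"site": "us", "latency_ms": 120})
seg.append({"site": "de", "latency_ms": 80})
seg.append({"site": "us", "latency_ms": 60})
print(seg.sum_where("latency_ms", "site", "us"))  # 180
```

The real engine adds segment management, exactly-once ingestion from Kafka, and distributed query, but the data-layout intuition is the same.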


Speakers
MW

Mingming Wang

eBay
Ken (Mingming) Wang has worked at eBay as a senior architect for more than 9 years, focusing on data platform infrastructure: real-time streaming, MOLAP on Hadoop, SQL on Hadoop, etc.


Tuesday May 16, 2017 11:05am - 11:55am
Biscayne

12:05pm

Cassandra on ARMv8 - A Comparison with x86 and Other Processor Platforms - Manish Singh, MityLytics
In this talk we present our results from evaluating Cassandra on ARMv8-based servers in the context of building real-time analytics platforms and apps. A platform built for real-time analytics is part of an ecosystem which typically consists of a Kafka-based ingestion engine and a Spark stream-processing engine in addition to Cassandra. We use apps from several benchmark suites to compare our results to x86 platforms and GPU systems, which have recently become quite popular. Our studies focus not just on performance but also on a cost-benefit analysis.

Speakers
avatar for Manish Singh

Manish Singh

CTO, Co-founder, MityLytics
Manish is CTO and co-founder of MityLytics which develops products to help customers make the transition to Big Data platforms and to continue to grow and tune their Big Data analytics platforms and apps using MityLytics software. He has built, deployed and maintained massively d... Read More →


Tuesday May 16, 2017 12:05pm - 12:55pm
Windsor

12:05pm

Apache Hivemall: Scalable Machine Learning Library for Apache Hive/Spark/Pig - Makoto Yui, Treasure Data, Inc. & Takeshi Yamamuro, NTT
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig. Apache Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive and is very easy to use as every machine learning step is done within HiveQL.

Hivemall entered the Apache Incubator on September 13, 2016, and the project plans its initial Apache release in Q1 2017. In this talk, Makoto Yui will give a walk-through of the features, usage, and future roadmap of Apache Hivemall, and Takeshi Yamamuro will introduce Hivemall on Apache Spark in depth.

This talk should be particularly interesting and relevant to people already familiar with Apache Hive and/or Apache Spark who are working on big data analytics.

Speakers
avatar for Takeshi Yamamuro

Takeshi Yamamuro

NTT
Takeshi Yamamuro is a Research Engineer of NTT, a telecommunication company in Japan, working on Database backends and SIMD/GPU-aware algorithms. He is a contributor of Hivemall. He worked on porting Hivemall functions to Apache Spark and developing a Parameter Mixing module that... Read More →
avatar for Makoto Yui

Makoto Yui

Treasure Data, Inc.
Makoto Yui is a Research Engineer at Treasure Data, Inc., a Hadoop-as-a-Service startup. He is leading the development of Apache Hivemall, an open source library for scalable machine learning on Apache Hive, Apache Spark, and Apache Pig. He holds a PhD degree in computer science f... Read More →


Tuesday May 16, 2017 12:05pm - 12:55pm
Balmoral

12:05pm

Profiling Spark Applications - Jayesh Thakrar, Conversant
Are you interested in harnessing and analyzing the data that drives the Spark Web UI? Are you keen to use that data to tune your applications or understand fluctuations in runtime of your production applications? Do you want to understand the efficiency of your Spark executors and system resources?

This presentation will help you do that and more by walking through the wealth of data in Spark application events. This data can serve as the foundation for a Spark profiler and advisor that analyzes application events in batch or real time.
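Spark persists application events as an event log of one JSON object per line, which is the raw material for this kind of profiling. A minimal sketch of aggregating per-stage task time from such a log might look like the following (field names follow Spark's `SparkListenerTaskEnd` events, but treat the exact schema as an assumption to verify against your Spark version):

```python
import json
from collections import defaultdict

def stage_task_durations(event_log_lines):
    """Sum task run times per stage from Spark-style event-log lines."""
    totals = defaultdict(lambda: {"tasks": 0, "ms": 0})
    for line in event_log_lines:
        ev = json.loads(line)
        if ev.get("Event") != "SparkListenerTaskEnd":
            continue  # skip job/stage/executor events in this sketch
        info = ev["Task Info"]
        stage = ev["Stage ID"]
        totals[stage]["tasks"] += 1
        totals[stage]["ms"] += info["Finish Time"] - info["Launch Time"]
    return dict(totals)

# Tiny fabricated log for illustration.
log = [
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 0,
                "Task Info": {"Launch Time": 1000, "Finish Time": 1250}}),
    json.dumps({"Event": "SparkListenerTaskEnd", "Stage ID": 0,
                "Task Info": {"Launch Time": 1010, "Finish Time": 1400}}),
    json.dumps({"Event": "SparkListenerJobEnd", "Job ID": 0}),
]
print(stage_task_durations(log))  # {0: {'tasks': 2, 'ms': 640}}
```

A real profiler would also join in executor metrics, shuffle read/write sizes, and GC time from the same events.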

Speakers
avatar for Jayesh Thakrar

Jayesh Thakrar

Sr. Software Engineer, Conversant
Jayesh Thakrar is a Sr. Data Engineer at Conversant (http://www.conversantmedia.com/). He is a data geek who gets to build and play with large data systems consisting of Hadoop, Spark, HBase, Cassandra, Flume and Kafka. To rest after a good day's work, he uses OpenTSDB with 500... Read More →



Tuesday May 16, 2017 12:05pm - 12:55pm
Trianon

12:05pm

Even Faster: When Presto Meets Parquet @ Uber - Zhenxiao Luo, Uber
As Uber continues to grow, our big data systems need to grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. We put Presto into production in 2016. Presto now serves ~100K queries per day at Uber and has become a key component for interactive SQL queries on big data. In this presentation we would like to share our experiences and engineering efforts. We start with a general introduction to Hadoop infrastructure and analytics at Uber, followed by a brief introduction to Presto, the interactive SQL engine for big data. We will focus on how we built the new Parquet reader for Presto and its detailed techniques: columnar reads, lazy reads, and nested column pruning. We will show performance improvements and Uber's use cases. Finally, we would like to share our ongoing work.
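The lazy-read idea can be sketched roughly as follows (illustrative Python, not Presto's Java implementation; the class and counters are invented): materialize the filter column first, then fetch cells from the other columns only for rows that pass the predicate, so selective queries touch far fewer values.

```python
class LazyColumnarReader:
    """Toy columnar reader that counts cell reads to show the savings
    from reading the filter column before any projected columns."""
    def __init__(self, column_data):
        self.column_data = column_data  # column name -> list of values
        self.reads = 0                  # cells actually materialized

    def read_cell(self, col, rid):
        self.reads += 1
        return self.column_data[col][rid]

    def select(self, wanted_cols, filter_col, predicate):
        nrows = len(self.column_data[filter_col])
        # Pass 1: read only the filter column.
        hits = [r for r in range(nrows)
                if predicate(self.read_cell(filter_col, r))]
        # Pass 2: read other columns only for matching rows.
        return [{c: self.read_cell(c, r) for c in wanted_cols} for r in hits]

reader = LazyColumnarReader({
    "city": ["SF", "NY", "SF", "LA"],
    "fare": [10, 20, 30, 40],
    "tip":  [1, 2, 3, 4],
})
rows = reader.select(["fare"], "city", lambda c: c == "SF")
print(rows)          # [{'fare': 10}, {'fare': 30}]
print(reader.reads)  # 6: 4 filter cells + 2 fare cells, tip never read
```

Nested column pruning applies the same principle inside struct-typed columns: only the referenced subfields are decoded.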

Speakers
ZL

Zhenxiao Luo

Uber
Zhenxiao Luo is a software engineer at Uber. He leads interactive SQL engine projects for Hadoop, specifically, Presto and Parquet. Before joining Uber, he led the development and operations of Presto at Netflix. Zhenxiao has big data experience at Facebook, Cloudera, and Vertica... Read More →


Tuesday May 16, 2017 12:05pm - 12:55pm
Alhambra
  • Experience Level Any

12:05pm

Continuous Applications with Apache Spark 2.0 - Peyman Mohajerian, Databricks

Most streaming engines focus on performing computations on a stream: for example, one can map a stream to run a function on each record, reduce it to aggregate events by time, etc. However, as we worked with users, we found that virtually no use case of streaming engines only involved performing computations on a stream. Instead, stream processing happens as part of a larger application, which we’ll call a continuous application.

Online machine learning and serving real-time data are examples that show streaming computations are part of larger applications that include serving, storage, or batch jobs. Unfortunately, in current systems, streaming computations run on their own, in an engine focused just on streaming. This leaves developers responsible for the complex tasks of interacting with external systems (e.g., managing transactions) and making their results consistent with the rest of the application (e.g., batch jobs). This is what we'd like to solve with continuous applications.
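One common technique for keeping streaming output consistent with the rest of an application is an idempotent sink keyed by batch id, so a batch replayed after a failure is a no-op. A minimal sketch (invented names, not Spark's actual sink API):

```python
class IdempotentSink:
    """Toy sink that records the last committed batch id per key,
    so replaying a batch does not apply its updates twice."""
    def __init__(self):
        self.state = {}  # key -> (accumulated value, last batch id)

    def commit(self, batch_id, updates):
        for key, delta in updates.items():
            value, last = self.state.get(key, (0, -1))
            if batch_id <= last:
                continue  # already applied: replay is a no-op
            self.state[key] = (value + delta, batch_id)

sink = IdempotentSink()
sink.commit(0, {"clicks": 5})
sink.commit(1, {"clicks": 3})
sink.commit(1, {"clicks": 3})   # replayed after a crash; ignored
print(sink.state["clicks"][0])  # 8
```

Engines that offer end-to-end exactly-once semantics effectively push this bookkeeping into the sink so application code does not have to.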


Speakers
PM

Peyman Mohajerian

Databricks
Peyman is a Solution Architect at Databricks in the Southern California region. Prior to Databricks he had numerous consulting roles working for MapR and Teradata as a Big Data Engineer in the areas of data architecture, analytic and data science. Prior to Teradata at Fox Filmed... Read More →


Tuesday May 16, 2017 12:05pm - 12:55pm
Biscayne

12:55pm

Lunch (Attendees on Own)
Tuesday May 16, 2017 12:55pm - 2:30pm
TBA

2:30pm

Online and Offline Analytics on Cassandra in eBay - DongQian Liu, eBay
eBay is one of the largest e-commerce companies in the world, providing C2C and B2C sales services via the Internet. We use Cassandra to store large tables for online queries. To reduce the Cassandra load, we do offline analytics on Cassandra tables: we dump SSTables to HDFS and transform them into Hadoop file formats. In this session, we introduce how we build a high-performance, cross-datacenter Cassandra cluster for online queries; for offline analytics, we introduce how we implement a splittable input format for SSTables and transform them into Hadoop file formats. We also introduce how we use a bulk loader tool to load data from Hadoop into Cassandra quickly.

Speakers

Tuesday May 16, 2017 2:30pm - 3:20pm
Windsor

2:30pm

Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning - Evans Ye, Yahoo

Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and a number of Hadoop components must be verified to work perfectly together. In this presentation, we'll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. The talk covers the containerized CI framework, the technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.


Speakers
avatar for Evans Ye

Evans Ye

Sr. Software Engineer, Yahoo!
Evans Ye is currently PMC chair of Apache Bigtop. He works at Yahoo Taiwan developing e-commerce data solutions. He loves to code, automate things, and develop big data applications. Aside from engineering, he is also an enthusiast in giving talks to share software innovations a... Read More →


Tuesday May 16, 2017 2:30pm - 3:20pm
Balmoral

2:30pm

Writing Apache Spark Applications Using Apache Bahir - Luciano Resende & Leucir Marin, IBM
Big Data is all about being able to access and process data in various formats, and from various sources. Apache Bahir provides extensions to distributed analytic platforms, giving them access to different data sources. In this talk, we will introduce you to Apache Bahir and the various connectors available for Apache Spark and Apache Flink. We will also go over the details of how to build, test, and deploy a Spark application using the MQTT data source for the new Apache Spark Structured Streaming functionality.

Speakers
LM

Leucir Marin

Sr. Software Engineer, IBM
avatar for Luciano Resende

Luciano Resende

Architect, Spark Technology Center, IBM
Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at The ASF for over 10 years, he is a member of ASF and is currently contributing to various big data related Apache projects including Spark, Zeppelin, Bahir. Luciano is the project chair f... Read More →


Tuesday May 16, 2017 2:30pm - 3:20pm
Trianon

2:30pm

Spark SQL + Pig-Latin: Combine Query Language and Data Flow Language for Data Science - Jeff Zhang, Hortonworks
Data science is a very broad field involving many techniques and much knowledge, but overall we can split it into two steps: data munging and data analysis. SQL is intrinsically well suited to data analysis, but it is not good at data munging. For data munging in the Spark ecosystem, people have many options, like the RDD API or the Dataset API, but the learning curve for these APIs is a little steep. We provide an alternative: Pig Latin. Pig Latin is a data flow language which is very suitable for data munging and easy to learn; originally it was designed for the MapReduce engine. We made it support the Spark engine and share the same SparkContext with Spark SQL, so that data can be shared between Spark and Pig.

In this talk, I will describe how we integrated Pig Latin with Spark SQL and demonstrate how it can help your team get actionable insight from data.

Speakers
JZ

Jeff Zhang

Jeff has 8 years of experience in the big data industry. He started using Hadoop in 2009 and is an Apache Pig/Tez committer (Tez PMC). His past experience is not only in big data infrastructure, but also in how to leverage these big data tools to get insight. He has spoken several times... Read More →


Tuesday May 16, 2017 2:30pm - 3:20pm
Alhambra

2:30pm

What It Takes to Process a Trillion Events a Day: Case-Studies in Scaling Stream Processing at LinkedIn - Jagadish Venkatraman, LinkedIn
In this talk, we will present practical case-studies of large scale stream processing applications at LinkedIn. Example applications discussed will include:
  • LinkedIn’s real-time communication platform that delivers relevant content at massive scale to our 450M members. 
  • The LinkedIn feed that processes billions of events each day, and keeps track of what members viewed on their news feed. 
We will present the hard scalability problems we had to solve in each of these applications and the techniques used to address them. Problems include scaling ingestion of events, partitioned processing, highly performant data access and performing efficient remote I/O. We will explain how we leveraged and improved Apache Samza in addressing these problems and how we scaled to process over a trillion events every day.
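The partitioned-processing technique mentioned above can be sketched in a few lines (illustrative Python; Samza itself partitions work via Kafka topic partitions, and the member ids here are invented): events are hash-partitioned by key, so all state for a given member lives with exactly one processor.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4

def partition_for(key):
    """Stable hash partitioning: the same member id always lands on
    the same partition, keeping per-key state local to one task."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# One counter per partition, standing in for per-task local state.
state = [Counter() for _ in range(NUM_PARTITIONS)]

events = [("member1", "view"), ("member2", "view"), ("member1", "view")]
for member, _ in events:
    state[partition_for(member)][member] += 1

total = sum((c for c in state), Counter())
print(total["member1"])  # 2
```

Because each partition owns its keys, processors never contend for shared state, which is what makes scaling to a trillion events a matter of adding partitions and tasks.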

Speakers
avatar for Jagadish Venkatraman

Jagadish Venkatraman

Jagadish Venkatraman is an Apache Samza committer and a Senior Software Engineer in the Streams Infrastructure group at LinkedIn. He has been working on building, scaling and improving Apache Samza at LinkedIn. He has four years of experience working on practical problems at the... Read More →


Tuesday May 16, 2017 2:30pm - 3:20pm
Biscayne

2:30pm

Sponsor Showcase
Tuesday May 16, 2017 2:30pm - 7:00pm
Ballroom Foyer

3:30pm

Real-World Tales of Repair with Apache Cassandra - Alexander Dejanovski, TheLastPickle
Distributed databases inevitably have to deal with entropy. Within Apache Cassandra, the anti-entropy process initiated via CLI tools is the way of ensuring consistency of data on disk. Over the many years of the Apache Cassandra project, it has also been one of the biggest operator pain points. Without a solid repair process in place, there is no guarantee that deleted data will not come back to life, or that data is fully distributed to replicas when nodes fail.

In this talk Alexander Dejanovski, Consultant at The Last Pickle, will explain how Anti-Entropy works and why it should be run on your cluster. He will discuss the different types of repair and their effect on data consistency. He will also introduce tools such as Cassandra Reaper and the range repair script to manage scheduling and running repairs in the most efficient way.
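Under the hood, anti-entropy repair in Cassandra has replicas exchange Merkle trees built over their data ranges and stream only the divergent ranges. A toy single-level sketch of that comparison (illustrative Python with invented replica data, not Cassandra code):

```python
import hashlib

def _bucket(key, num_buckets):
    # Stable key -> bucket assignment, standing in for token ranges.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets

def merkle_leaves(replica, num_buckets=4):
    """Hash each bucket of rows; a stand-in for the Merkle tree
    leaves replicas exchange during repair."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in replica.items():
        buckets[_bucket(key, num_buckets)].append((key, value))
    return [hashlib.sha256(repr(sorted(b)).encode()).hexdigest()
            for b in buckets]

def divergent_buckets(a, b, num_buckets=4):
    """Only buckets whose hashes differ need their rows streamed
    between replicas, not the whole dataset."""
    la, lb = merkle_leaves(a, num_buckets), merkle_leaves(b, num_buckets)
    return [i for i in range(num_buckets) if la[i] != lb[i]]

r1 = {"alice": 1, "bob": 2, "carol": 3}
r2 = dict(r1, bob=99)                  # one stale value on replica 2
print(divergent_buckets(r1, r1))       # []
print(len(divergent_buckets(r1, r2)))  # 1
```

Real repair builds multi-level trees per token range so mismatches can be narrowed down further before any data is streamed.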

Speakers
AD

Alexander Dejanovski

The Last Pickle
Consultant, Apache Cassandra, @TheLastPickle | Alexander has been working as a software developer since 1998, mainly for Chronopost, where he led the effort to build a Cassandra-based architecture and migrate critical services to it from a traditional RDBMS. He is invol... Read More →


Tuesday May 16, 2017 3:30pm - 4:20pm
Windsor

3:30pm

Understanding Apache MXNet - Dominic Divakaruni, Amazon Web Services

Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding, and recommendation engines. Apache MXNet is a deep learning framework that allows you to define, train, and deploy deep neural networks on a wide array of devices, from cloud infrastructure to mobile devices. It is fast, highly scalable, supports a flexible programming model and multiple languages. This session offers an introduction to Apache MXNet, its benefits and how to quickly get started using it.


Speakers
avatar for Dominic Divakaruni

Dominic Divakaruni

Dominic Divakaruni is a product manager at AWS and contributes to Apache MXNet. He has over 13 years of experience creating products and building solutions that delight customers. Dominic has experience in implementing and growing massive scale solutions using open source projects including OpenStack and ECOMP. Dominic has a... Read More →


Tuesday May 16, 2017 3:30pm - 4:20pm
Balmoral

3:30pm

Creating a Recommender System with ElasticSearch & Apache Spark - Alvaro Santos Andres, Ericsson
Recommender systems have changed the way companies and people interact with each other. Does your organisation need a 360° view of its customers? Today it is possible to recommend the right products to customers or potential customers: for example, a film based on their previous interests, or a new accessory that fits their model of smartphone.

The technology behind recommender systems has evolved significantly over the past 20 years and with the explosion of Big Data technologies, there are tools that can create very powerful recommender systems. This introduction will explain how Recommender Systems work, describing their main functionalities, and providing some basic algorithms frequently used in such systems. We will look at how to create a Recommender System using technologies like Apache Spark and ElasticSearch.
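As a concrete example of one of the basic algorithms such systems use, here is a tiny item-based collaborative-filtering sketch (pure Python with invented ratings; a production system would compute similarities at scale in Spark and serve them from Elasticsearch):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def recommend(user_ratings, all_items, target_user, top_n=1):
    """Score unseen items by similarity to items the user rated."""
    seen = user_ratings[target_user]
    scores = {}
    for item in all_items:
        if item in seen:
            continue
        # Item vector: which users rated it, and how.
        iv = {u: r[item] for u, r in user_ratings.items() if item in r}
        score = 0.0
        for liked, rating in seen.items():
            lv = {u: r[liked] for u, r in user_ratings.items() if liked in r}
            score += rating * cosine(iv, lv)
        scores[item] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "ana": {"matrix": 5, "inception": 4},
    "ben": {"matrix": 5, "inception": 5, "up": 1},
    "cat": {"up": 5, "frozen": 4},
}
print(recommend(ratings, ["matrix", "inception", "up", "frozen"], "ana"))
```

At scale the same pairwise-similarity idea is typically replaced or supplemented by matrix factorization (e.g., ALS in Spark MLlib), with the resulting item vectors indexed for fast lookup.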

Speakers
avatar for Alvaro Santos Andres

Alvaro Santos Andres

Big Data Solution Architect, Ericsson
Big Data software architect with more than 10 years of experience. For the last 3 years I have focused 100% of my time on Big Data projects, in which I have developed several personalization services used by millions of users, giving them a better experience, and Company Data transforma... Read More →


Tuesday May 16, 2017 3:30pm - 4:20pm
Trianon

3:30pm

Leveraging Smart Meter Data for Electric Utilities: Comparison of Spark SQL with Hive - Yusuke Furuyama, Hitachi
Hitachi has focused on social innovation business, constantly evolving to create sustainable business products and solutions that enhance the quality of life across the globe. We are now leveraging smart meter data for electric utilities. To meet their needs, we compared the performance of batch processing for aggregating data from smart meters using Hadoop (MapReduce), Spark 1.6, and Spark 2.0, while varying several parameters (the amount of input data, the logic of processing, input file format, etc.). In this session, we report the results of the performance tests above.

Speakers
avatar for Yusuke Furuyama

Yusuke Furuyama

Hitachi, Ltd.
Yusuke Furuyama is a solution engineer at Hitachi. His team drives the utilization of Hadoop ecosystem and he is working on offering and co-creating progressive Hadoop solutions to customers who are going to build enterprise system. Now he is focusing on Apache Spark and Apache H... Read More →


Tuesday May 16, 2017 3:30pm - 4:20pm
Alhambra
  • Experience Level Any

3:30pm

The Continuing Story Of Batching To Streaming Analytics At Optimizely - Michael Borsuk, Optimizely
At Optimizely we track billions of user events, such as page views, clicks and custom events, on a daily basis to provide our customers with immediate access to key analytics and business insights. Because of this, we are constantly innovating on our data ingestion pipeline. Over the course of development we have moved from batch data ingestion processes, to streaming, to a hybrid or "lambda" approach and back to full streaming again. I will present the technical details and challenges in developing this system, which includes use of Apache Samza, Flume, Kafka, HBase and Hadoop, as well as some of the lessons learned along the way.

This talk will summarize the story we described in this blog post and present where we have gone since: http://highscalability.com/blog/2016/11/16/the-story-of-batching-to-streaming-analytics-at-optimizely.html

Speakers
MB

Michael Borsuk

Distributed Systems Software Engineer, Optimizely
Mike Borsuk is a software engineer with 12 years experience building software and hardware products. His focuses have been on pragmatic development of scalable services and distributed systems, efficient mobile products as well as application monitoring and measurement. He curren... Read More →


Tuesday May 16, 2017 3:30pm - 4:20pm
Biscayne

4:20pm

Coffee Break
Tuesday May 16, 2017 4:20pm - 4:40pm
Ballroom Foyer

4:40pm

Cassandra Persistence for Online Systems, What Actually Works - John Sumsion, FamilySearch
In a project to port FamilySearch's billion-person tree from Oracle to Cassandra in AWS, a novel consistency model emerged. Many of the initial design assumptions ended up working well. However, some surprising errors occurred, which forced some adjustments.

In this presentation, John will review what worked and what didn't in developing a system that achieved both low latency and live updates.

The presentation will include: Cassandra schema details, data consistency mechanisms, and specific solutions to the data consistency problems encountered.

Speakers
avatar for John Sumsion

John Sumsion

Principal Software Engineer, FamilySearch
John Sumsion is an experienced Software Engineer who has played key roles in making big-data projects that actually work. Much of John's experience has been gained in building several progressively better implementations of the billion-person tree for FamilySearch. John enjoys us... Read More →


Tuesday May 16, 2017 4:40pm - 5:30pm
Windsor

4:40pm

R4ML: A R Bridge to Apache SystemML and SparkR - Alok Singh, IBM Spark Technology Center

R is the de facto standard for statistics and data analysis. In this talk, we introduce R4ML, a new open-source R package from IBM. R4ML provides a bridge between R and Apache SystemML, allowing R scripts to invoke custom algorithms developed in SystemML's R-like domain-specific language. This capability also provides a bridge to the algorithm scripts that ship with Apache SystemML, effectively adding a new library of prebuilt scalable algorithms for R on Apache Spark. R4ML integrates seamlessly with SparkR, so data scientists can use the best features of SparkR and SystemML together in the same script. In addition, the R4ML package provides a number of useful new R functions that simplify common data cleaning and statistical analysis tasks.

Our talk will begin with an overview of the R4ML package, its API, the supported canned algorithms, and its integration with Spark and SystemML. We will walk through a small example of creating a custom algorithm and give a demo. We will share our experiences using R4ML with IBM clients. The talk will conclude with pointers on how the audience can try out R4ML and a discussion of potential areas of community collaboration.


Speakers
AS

Alok Singh

IBM
Alok Singh is a Principal Engineer at the IBM Spark Technology Center, where he leads the HydraR project. He has built and architected multiple analytical frameworks and implemented machine learning algorithms. His interest is in creating Big Data and scalable machine learning so... Read More →


Tuesday May 16, 2017 4:40pm - 5:30pm
Balmoral

4:40pm

Efficient Columnar Storage with Apache Parquet - Ranganathan Balashanmugam, ThoughtWorks
Apache Parquet brings the advantages of compressed, efficient columnar data representation to any project in the Hadoop ecosystem. Apache Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested namespaces. Apache Parquet is built to support very efficient compression and encoding schemes, and multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Apache Parquet allows compression schemes to be specified at a per-column level and is future-proofed to allow new encodings to be added as they are invented and implemented. This talk highlights the internal implementation of Apache Parquet.
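One of the per-column encodings Parquet can apply, run-length encoding, is easy to sketch. The example below is a simplified illustration of the idea, not Parquet's actual implementation:

```python
def rle_encode(column):
    """Run-length encode one column: a simplified stand-in for the per-column
    encodings Parquet chooses from (RLE, dictionary encoding, bit-packing)."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# A columnar layout groups identical neighbours together, so a low-cardinality
# column like this compresses far better than it would row-by-row.
countries = ["US", "US", "US", "FR", "FR", "US"]
print(rle_encode(countries))  # [['US', 3], ['FR', 2], ['US', 1]]
```

Because Parquet chooses encodings per column, a sorted or repetitive column gets runs like these while a high-entropy column can fall back to plain or dictionary encoding.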

Speakers
avatar for Ranganathan Balashanmugam

Ranganathan Balashanmugam

Head of Engineering - India, Aconex
Ranganathan has nearly twelve years of experience developing awesome products and loves to work across the full stack, from front end to back end and scale. He is Head of Engineering - India at Aconex and was previously a Technology Lead at ThoughtWorks. He is a Microsoft MVP for Dat... Read More →



Tuesday May 16, 2017 4:40pm - 5:30pm
Alhambra

4:40pm

From Batch to Streaming ET(L) with Apache Apex - Thomas Weise, Atrato.io
Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale.

This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks.

Speakers
avatar for Thomas Weise

Thomas Weise

CTO, Atrato.io
Thomas is Apache Apex PMC Chair and CTO at Atrato. Prior to founding Atrato he was an architect at DataTorrent and led the development of Apex from the beginning of the project. Before that he was a member of the Hadoop team at Yahoo! and contributed to several of the big data ecosys... Read More →


Tuesday May 16, 2017 4:40pm - 5:30pm
Biscayne

4:40pm

Apache Airavata: A General Purpose Distributed Task Execution Framework - Suresh Marru & Marcus Christie, Indiana University
The talk will focus on building a multi-tenanted, elastically scalable, fault-tolerant Platform as a Service using various Apache Projects. Using Apache Airavata as a case study, the talk will discuss hands-on experiences of building a distributed microservices based software system for managing the remote execution of data analysis and scientific applications on computing clouds, supercomputers, clusters, and computational grids. The talk will introduce several architectural challenges as well as opportunities to leverage and collaborate with other Apache projects. We summarize best practices that we have learned for managing multiple components with complicated state in elastic virtual machine and container environments.

Speakers
avatar for Suresh Marru

Suresh Marru

Member, Indiana University
Suresh Marru is a Member of the Apache Software Foundation and is the current PMC chair of the Apache Airavata project. He is the deputy director of Science Gateways Research Center at Indiana University. Suresh focuses on research topics at the intersection of application domain... Read More →


Tuesday May 16, 2017 4:40pm - 5:30pm
Trianon

6:00pm

PGP Key Signing: Expanding the Web of Trust
Why participate in the key signing? Among other things, all Apache releases are PGP-signed; but a key with no signatures attesting to its authenticity isn't very useful. Bring your key (which you will have emailed to our special address at apachecon-keysigning@apache.org) and sign. You will need a pen and some form of identification.

Please see the wiki page for more information:
http://wiki.apache.org/apachecon/PgpKeySigning

Tuesday May 16, 2017 6:00pm - 7:00pm
Ballroom Foyer
 
Wednesday, May 17
 

7:00am

Morning Run
Please meet in the InterContinental Miami Lobby at 7am.  For any questions, contact: jfclere@gmail.com.

Wednesday May 17, 2017 7:00am - 8:00am
InterContinental Miami Lobby

8:00am

Breakfast
Wednesday May 17, 2017 8:00am - 9:00am
Ballroom Foyer

8:00am

Registration
Wednesday May 17, 2017 8:00am - 6:00pm
Mezzanine

9:00am

Keynote Panel Discussion: How to Succeed in IoT 2.0 - Abhi Arunachalam, Battery Ventures; Sudip Chakrabarti, Lightspeed Venture Partners; James Pace, Runtime; Roman Shaposhnik, Pivotal
Moderators
RS

Roman Shaposhnik

Director of Open Source, Pivotal
Roman Shaposhnik is a Director of Open Source at Pivotal Inc and VP of Technology for ODPi at Linux Foundation. He is a committer on Apache Hadoop, co-creator of Apache Bigtop and contributor to various other Hadoop ecosystem projects. He is also an ASF member and a former Chair... Read More →

Speakers
AA

Abhi Arunachalam

Vice President, Battery Ventures
Abhi Arunachalam is an investor at Battery Ventures. He focuses on early and growth stage investments in sectors such as security, big-data analytics and AI. He has 12+ years of technology and investment experience. Abhi is currently involved in Battery’s investments in InfluxData, Fungible, JFrog, Expel... Read More →
avatar for Sudip Chakrabarti

Sudip Chakrabarti

Partner, Lightspeed Venture Partners
Sudip is a partner at Lightspeed Venture Partners where he focuses on enterprise and infrastructure software investments. Prior to joining Lightspeed, Sudip was a partner at Andreessen Horowitz where he invested in and worked with companies such as Actifio, Alluxio, Cumulus Networks, Databricks, DigitalOcean, Forward Networks, Mesosphere, and Samsara. He started his venture career at Osage University Partners where he invested in Menlo Security, Infinio, and Skytree. Earlier in his career, Sudip co-founded two startups and developed circuit simulation software. Sudip has a PhD in Computer Engineering from Georgia Tech, an MBA from Wharton and a B.Tech from IIT Kharagpur. Outside of work and family, Sudip is a huge sports fan and a diehard cricket fanatic. For Sudip's random musings on enterprise technology check out his blog... Read More →
JP

James Pace

CEO, Runtime
James is CEO and Co-Founder of Runtime: an early stage company providing significant contributions to open source for the IoT and embedded community. Apache Mynewt, a project under the Apache Software Foundation, provides an OS and development framework for embedded developers ev... Read More →


Wednesday May 17, 2017 9:00am - 9:45am
Versailles Ballroom

9:45am

Coffee Break
Wednesday May 17, 2017 9:45am - 10:15am
Ballroom Foyer

9:45am

Sponsor Showcase
Wednesday May 17, 2017 9:45am - 1:05pm
Ballroom Foyer

10:15am

Using Apache Beam for Batch, Streaming, and Everything in Between - Dan Halperin, Google
Apache Beam is a unified programming model capable of expressing a wide variety of both traditional batch and complex streaming use cases. By neatly separating properties of the data from run-time characteristics, Beam enables users to easily tune requirements around completeness and latency and run the same pipeline across multiple runtime environments. In addition, Beam's model enables cutting edge optimizations, like dynamic work rebalancing and autoscaling, giving those runtimes the ability to be highly efficient.

This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in its powerful programming model. We'll include detailed, concrete examples of how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios.
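The fixed event-time windowing at the heart of Beam's unified model can be sketched in a few lines of plain Python (a conceptual toy, not the Beam SDK; a Beam pipeline would express the same grouping with a windowed `GroupByKey`):

```python
from collections import defaultdict

def fixed_windows(events, size_secs):
    """Count events per fixed event-time window. Keying results by the window's
    start time is what lets the same logic run over bounded (batch) or
    unbounded (streaming) input."""
    wins = defaultdict(int)
    for event_time, _value in events:
        start = (event_time // size_secs) * size_secs
        wins[start] += 1
    return dict(wins)

# Toy events as (event_time_seconds, payload) pairs.
clicks = [(3, "a"), (7, "b"), (12, "c"), (14, "d")]
print(fixed_windows(clicks, 10))  # {0: 2, 10: 2}
```

Separating "what is computed" (counts per window) from "when results are emitted" (triggers, completeness) is exactly the property the talk describes; the computation above stays the same whichever runner executes it.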

Speakers
DH

Daniel Halperin

Google
Dan Halperin is a PMC member of Apache Beam. He has worked on Beam and Google Cloud Dataflow for 2 years. Previously, he was the director of research for scalable data analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in... Read More →


Wednesday May 17, 2017 10:15am - 11:05am
Balmoral

10:15am

Dataservices: Processing Big Data the Microservice Way - Josef Adersberger, QAware GmbH
We see a big data processing pattern emerging that uses the microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we'll introduce Dataservices: their basic concepts, the technology typically in use (such as Kubernetes, Kafka, Cassandra and Spring), and some real-life architectures.

Speakers
avatar for Josef Adersberger

Josef Adersberger

CTO, QAware GmbH
Josef Adersberger has been a software engineering fanatic for over 10 years. He studied computer science in Rosenheim and Munich and holds a doctoral degree in software engineering. He is co-founder and CTO of QAware, a German software development company. He is a lecturer at sev... Read More →



Wednesday May 17, 2017 10:15am - 11:05am
Windsor

10:15am

The Rise of Real-Time: Apache DistributedLog and Its Stream Store - Sijie Guo, Twitter
Data volumes are growing exponentially, and organizations produce data in a myriad of formats. Instead of storing and processing data at some regular cadence, many in the industry are realizing the benefits of real-time data analytics via stream processing. The move from batch processing to streaming architectures is a revolution in how companies use data. But what is the state of storage for real-time applications, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Sijie Guo will describe Apache DistributedLog, a high-throughput, low-latency replicated stream store, discuss the challenges of building a stream store for real-time applications, and explore the future of Apache DistributedLog and the big data ecosystem.

Speakers
SG

Sijie Guo

Twitter
Currently works at Twitter on DistributedLog/BookKeeper. Apache BookKeeper PMC Chair. Previously worked at Yahoo! on its push notification system.


Wednesday May 17, 2017 10:15am - 11:05am
Biscayne
  • Experience Level Any

10:15am

Evolution of an Apache Spark Architecture for Processing Game Data - Nick Afshartous, Warner Brothers Interactive Entertainment (WBIE)
We discuss lessons learned from our first production deployment of a Spark Streaming pipeline for processing game data. Deployment is to the AWS Cloud, where we use managed services (i.e. EMR, S3 and Redshift). However, downstream dependencies with outages and unpredictable response latencies can pose significant challenges. To address this, we evolved the architecture by separating data processing from post-processing tasks (i.e. copying data into Redshift). Post-processing tasks are sent downstream from Spark to a task executor built using Akka Streams and Reactive Kafka. The end result is a loosely coupled architecture where the Spark Streaming job is a firehose to S3 and is fault-tolerant when Redshift is unavailable.

Speakers
avatar for Nick Afshartous

Nick Afshartous

Tech Director, Warner Brothers Interactive
Nick Afshartous is a Tech Director at Warner Brothers Interactive Entertainment (WBIE) where he leads the Analytics Core Platform team.   Using Apache Spark, he's helping to build WBIE's next generation real-time analytics platform for processing game data. He's passionate abou... Read More →


Wednesday May 17, 2017 10:15am - 11:05am
Trianon
  • Experience Level Any

11:15am

Apache Beam: Integrating the Big Data Ecosystem Up, Down, and Sideways - Davor Bonaci, Google & Jean-Baptiste Onofré, ASF
The world of Big Data involves an ever increasing field of players, from storage systems to processing engines and distributed programming models. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a standard for expressing both batch and streaming data processing pipelines in a variety of languages across a variety of platforms and engines. In this talk, we will show how Beam gives users the flexibility to choose the best environment for their needs and read data from any storage system; allows any Big Data API to execute in multiple environments; allows any processing engines to support multiple domain-specific user communities; and allows any storage system to read/write process data at massive scale. In a way, Apache Beam is a glue that connects the Big Data ecosystem together; it enables “anything to run anywhere”.

Speakers
DB

Davor Bonaci

Google Inc.
Davor Bonaci is serving as chair of the Apache Beam Project Management Committee, and has been regularly committing code to the project since its inception. He works as a Senior Software Engineer at Google. Before Beam, Davor worked on its predecessor, Google Cl... Read More →
JO

Jean-Baptiste Onofré

Talend
JB is PMC member for Apache Beam. He is a long-tenured Apache member, serving on PMC/committer for about 15 projects that range from integration to big data.


Wednesday May 17, 2017 11:15am - 12:05pm
Balmoral

11:15am

Developer on the Rise: Blurring the Line Between the Developer and the Data Scientist with PixieDust - David Taieb, IBM

Ready to dip your toe into data science? Yes? But where and how do you start? Well, we have an answer: Notebooks and PixieDust! PixieDust is a new open-source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like auto-visualization of Spark DataFrames, real-time Spark job progress monitoring directly from the notebook, seamless integration with cloud services, and automated local installation of Python and Scala kernels running with Spark. And if you prefer working with a Scala notebook, no problem! PixieDust can also run on a Scala kernel; imagine being able to visualize your favorite Python chart engines from a Scala notebook!


Speakers
avatar for David Taieb

David Taieb

STSM, Watson Data Platform Developer Advocacy team, IBM
David Taieb is the STSM for the Watson Data Platform Developer Advocacy team at IBM, leading a team of avid technologists with the mission of educating developers on the art of possible with Watson Cognitive Services, Big Data Analytics and Cloud technologies. He’s passionate abo... Read More →


Wednesday May 17, 2017 11:15am - 12:05pm
Biscayne

11:15am

Data Profiling in Apache Calcite - Julian Hyde, Hortonworks
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
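One of the patterns such a profiler looks for, a functional dependency between columns, can be checked with a single scan over the data. The sketch below uses invented data and is not Calcite's implementation:

```python
def holds_fd(rows, a, b):
    """Return True if column a functionally determines column b, i.e. every
    value of a maps to exactly one value of b across all rows."""
    seen = {}
    for row in rows:
        # setdefault records the first b seen for this a; any later mismatch
        # is a counterexample to the dependency a -> b.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True

# zip -> city holds; city -> zip does not (NYC spans two zip codes).
rows = [
    {"zip": "94105", "city": "SF"},
    {"zip": "94105", "city": "SF"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "10002", "city": "NYC"},
]
print(holds_fd(rows, "zip", "city"), holds_fd(rows, "city", "zip"))  # True False
```

Discovered dependencies like these give the optimizer much better cardinality estimates for joins and aggregates than independent per-column statistics would.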

Wednesday May 17, 2017 11:15am - 12:05pm
Windsor

11:15am

Fast Cars, Big Data - How Apache Can Help Formula 1 - Carol McDonald, MapR Technologies
Modern race cars produce a lot of data, all in real time. In this presentation I will show how data can be generated and used by various applications in the car, on the track, or at team headquarters. The demonstration will show how to move data using messaging systems like Apache Kafka, process the data using Apache Spark and Flink, and use various storage techniques: distributed file systems and HBase. This presentation is a great opportunity to see how to build a "near real-time" big data application with Apache projects. The code from this talk will be made available as open source.

Speakers
avatar for Carol McDonald

Carol McDonald

Solutions Architect, MapR Technologies
Carol McDonald is a solutions architect at MapR focusing on big data, Apache HBase, Apache Drill, Apache Spark, and machine learning in healthcare, finance, and telecom. Previously, Carol worked as a Technology Evangelist for Sun, an architect/developer on: a large health informa... Read More →


Wednesday May 17, 2017 11:15am - 12:05pm
Trianon

12:15pm

Concrete Big Data Use Cases Implemented with Apache Beam - Jean-Baptiste Onofré, Apache Software Foundation
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines. The same Beam pipelines work in batch or streaming, and on a variety of open source and private cloud big data processing backends including Apache Flink, Apache Spark, Apache Apex, Apache Gearpump, and Google Cloud Dataflow.

This talk will show you how to use Beam Java SDK to implement concrete use cases like batch analytics, streaming data ingestion or fraud detection.

Speakers
JO

Jean-Baptiste Onofré

Talend
JB is PMC member for Apache Beam. He is a long-tenured Apache member, serving on PMC/committer for about 15 projects that range from integration to big data.


Wednesday May 17, 2017 12:15pm - 1:05pm
Balmoral

12:15pm

Standards-Compliant Cloud Orchestration with Apache AriaTosca - Tal Liron, GigaSpaces

Cloud orchestration is no longer a wild-west of proprietary solutions. The enterprise and NFV industries are moving towards standards compliance with efforts focused on the OASIS TOSCA standard, which offers a policy-driven YAML-based language to design flexible and extensible cloud topologies, comprising compute nodes (VMs and containers), VNFs (Virtual Network Functions), as well as user-defined node types. This talk will introduce the Apache AriaTosca project, a compliant TOSCA parser and orchestrator. As well as being a fully functional orchestrator in itself, AriaTosca serves as a platform and SDK for building TOSCA-based solutions in Apache and beyond.


Speakers
avatar for Tal Liron

Tal Liron

GigaSpaces
Tal Liron is an AriaTosca committer and a long-time contributor to free and open source software. At GigaSpaces he works on AriaTosca and Cloudify, as well as integrating ARIA into the Linux Foundation's ONAP NFV orchestration project (a co... Read More →


Wednesday May 17, 2017 12:15pm - 1:05pm
Trianon

12:15pm

Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy - Stuart Pook, Criteo
Hadoop has become a critical part of Criteo's operations. What started out as a proof of concept has turned into two in-house bare-metal clusters of over 2200 nodes. Hadoop contains the data required for billing and, perhaps even more importantly, the data used to create the machine learning models, computed every 6 hours by Hadoop, that participate in real-time bidding for online advertising. Two clusters do not necessarily make a redundant system, so Criteo must plan for any of the disasters that could destroy a cluster. This talk describes how Criteo built its second cluster in a new datacenter, and how to do it better next time. It explains how a small team is able to run and expand these clusters. More importantly, the talk describes how a redundant data and compute solution must function at this scale, what Criteo has already done to create this solution, and what remains undone.

Speakers
avatar for Stuart Pook

Stuart Pook

Senior DevOps Engineer, Criteo
Stuart loves storage (130 PB at Criteo) and is part of Criteo's Lake team that runs some small and two rather large Hadoop clusters. He also loves automation with Chef because configuring more than 2200 Hadoop nodes by hand is just too slow. Before discovering Hadoop he develop... Read More →


Wednesday May 17, 2017 12:15pm - 1:05pm
Windsor

12:15pm

Building Streaming Data Pipelines with Stateful Operations - Chandni Singh, Simplifi.it
A few streaming platforms provide the exactly-once processing guarantee. This is done by checkpointing the state of the functional units (operators) that make up the streaming pipeline. Many real-world big data pipelines are composed of operators which maintain a large, ever-growing state. However, periodically checkpointing the full state of these operators is only practical when that state is small. To solve this problem, I created Managed State for the Apache Apex project, an incrementally checkpointed key-value data structure. Additionally, the community has developed a layer on top of Managed State (Spillable Data Structures), which allows us to incrementally checkpoint a variety of common data structures. This presentation will cover the challenges of implementing fault-tolerant incremental checkpointing in Managed State.
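The core idea of incremental checkpointing, snapshotting only the keys that changed since the last checkpoint, can be sketched in plain Python (a toy model without persistence or fault tolerance, not the actual Managed State code):

```python
class IncrementalKV:
    """Toy key-value operator state that checkpoints only the keys modified
    since the last snapshot, instead of the whole (potentially huge) state."""

    def __init__(self):
        self.state = {}        # full in-memory state
        self.dirty = set()     # keys changed since the last checkpoint
        self.checkpoints = []  # sequence of persisted deltas

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        """Persist and return only the delta; recovery replays all deltas in order."""
        delta = {k: self.state[k] for k in self.dirty}
        self.checkpoints.append(delta)
        self.dirty.clear()
        return delta

kv = IncrementalKV()
kv.put("a", 1)
kv.put("b", 2)
kv.checkpoint()          # first snapshot contains both keys
kv.put("b", 3)
print(kv.checkpoint())   # only the delta: {'b': 3}
```

The hard parts the talk covers start exactly where this toy stops: spilling the state to disk, making the deltas durable, and recovering a consistent snapshot after failure.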

Speakers
CS

Chandni Singh

Simplifi.it
I'm a software engineer who likes to build distributed frameworks and applications which are fault-tolerant and scalable. I am a PMC member and committer of the Apache Apex project, have worked with a few other distributed platforms, and have co-founded a company which creates big dat... Read More →



Wednesday May 17, 2017 12:15pm - 1:05pm
Biscayne

1:05pm

Lunch ( Attendees on Own)
Wednesday May 17, 2017 1:05pm - 2:30pm
TBA

2:30pm

Nexmark, a Unified Framework to Evaluate Big Data Processing Systems with Apache Beam - Ismael Mejia & Etienne Chauchot, Talend
Big Data processing in real time is on the rise at Apache, with projects like Apache Spark, Apache Flink and Apache Apex. However, we do not yet have a unified framework to evaluate the correctness and performance of these systems. Apache Beam implements a unified model for writing both batch and streaming jobs with a single API and executing them independently on any of the supported platforms (runners); this makes Beam an ideal candidate to support an evaluation framework.

In this talk we will present Nexmark, a benchmark framework to evaluate queries over data streams. An implementation of Nexmark was donated by Google as part of the Apache Beam incubation process. Nexmark bridges the gap for evaluating data processing frameworks, but also serves as a rich integration test to evaluate the correct implementation of both the Beam runners and the new features of the Beam SDK.

Speakers
avatar for Etienne Chauchot

Etienne Chauchot

Talend
Etienne has been working in software engineering and architecture for more than 13 years in domains such as retail or financial groups. He has been focusing on Big Data for a few years on technologies such as Apache Cassandra, ElasticSearch or Apache Spark. He is an Open Source f... Read More →
avatar for Ismael Mejia

Ismael Mejia

Open Source Software Engineer, Talend
Ismaël Mejía is an Apache Beam committer and a software engineer at Talend. He loves to tackle complex problems and build simple and elegant solutions. His main area of focus is distributed systems (Big Data and Cloud). He has been working on web services and large scale system... Read More →


Wednesday May 17, 2017 2:30pm - 3:20pm
Balmoral
  • Experience Level Any

2:30pm

Khermes: An Open-Source and Distributed Data Generator for Apache Kafka - Alberto Rodriguez & Emilio Ambrosio, Stratio
Today, companies and organisations with large amounts of data increasingly need to produce user-defined data for different types of data stores, or to understand how their systems will perform under a heavy data load. We created Khermes, an open-source distributed data generator, to simplify this process. Using Apache Kafka, Khermes can generate large amounts of user-defined data that can be stored anywhere. It can also be used as a stress tool to measure the performance of systems in a heavy-load environment: users can increase the strain on their Apache Kafka clusters and monitor their performance. Through use cases and demos, you will discover Khermes's features and how it works.
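The essence of a user-defined data generator can be sketched in a few lines of plain Python (the field templates here are invented for illustration, not Khermes's actual API; in Khermes the generated records would be pushed to Kafka):

```python
import random
import string

random.seed(42)  # deterministic output for the example

# A template maps each field name to a generator function; users define the
# shape of the synthetic records they need.
template = {
    "user_id": lambda: "".join(random.choices(string.ascii_lowercase, k=8)),
    "age": lambda: random.randint(18, 90),
}

def generate(n):
    """Produce n synthetic records from the template."""
    return [{field: gen() for field, gen in template.items()} for _ in range(n)]

batch = generate(3)
print(batch)
```

Scaling this idea out, with many distributed generator nodes producing records concurrently into a Kafka topic, is what turns a simple template like this into a load-testing tool.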

Speakers
avatar for Emilio Ambrosio

Emilio Ambrosio

Software Engineer, Stratio
As a Software Engineer at Stratio, I have participated in different cutting-edge projects and in some modules included within Stratio's platform, particularly those related to real-time streaming and data ingestion, based on Apache Spark Streaming and Apache Flume respectively.
avatar for Alberto Rodriguez

Alberto Rodriguez

Software Engineer, Stratio
Working as a Big Data Architect at Stratio, Alberto Rodriguez has been involved in the inception and evolution of some modules included within Stratio's platform, especially those related to data visualization, real-time, streaming and complex event processing. I am also proud... Read More →


Wednesday May 17, 2017 2:30pm - 3:20pm
Windsor

2:30pm

Routing Trillion Messages Per Day @Twitter - Lohit Vijayarenu & Gary Steelman, Twitter
Twitter collects more than a trillion messages per day. These messages are grouped into hundreds of categories with different properties. Messages are routed by category to various nodes in a cluster until they reach the storage systems serving analytics and streaming workloads. The scale of messages with different delivery guarantees poses unique challenges at Twitter.

Twitter's log collection framework has been built on Scribe over the years. Message delivery guarantees, priority, and multiplexing add complexity to routing. Additionally, Twitter's scale introduces unique challenges for managing the logging framework. In this talk we discuss the challenges we face and our effort to improve our logging framework using Apache Flume. Apache Flume, with its pluggable architecture, provides many building blocks for implementing the features our collection framework needs.
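The category-based routing described above can be sketched as a toy multiplexer in plain Python (category and sink names invented for illustration; in Flume, a multiplexing channel selector plays this role):

```python
# Each category fans out to one or more sinks; unknown categories fall back
# to a default sink. Real deployments attach per-route delivery guarantees.
routes = {
    "ads":      ["analytics"],
    "timeline": ["analytics", "streaming"],
}

def route(message):
    """Return (sink, payload) pairs for one (category, payload) message."""
    category, payload = message
    return [(sink, payload) for sink in routes.get(category, ["default"])]

print(route(("timeline", "tweet-123")))
# [('analytics', 'tweet-123'), ('streaming', 'tweet-123')]
```

The hard part at Twitter's scale is everything around this lookup: buffering, retries, and differing delivery guarantees per category, which is where Flume's pluggable sources, channels, and sinks come in.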

Speakers
GS

Gary Steelman

Twitter
Gary Steelman is a software engineer working on Hadoop and related projects at Twitter. He has a master's degree from the University of Texas at Dallas with a specialization in intelligent systems, AI, and machine learning.
LV

Lohit Vijayarenu

Software Engineer, Twitter
Lohit VijayaRenu is a software engineer on the Twitter Hadoop team. He has a master's degree from Stony Brook University and has worked on Hadoop and related projects at Yahoo!, MapR and Twitter.


Wednesday May 17, 2017 2:30pm - 3:20pm
Biscayne

2:30pm

HBase Backup and Restore - Zhihong Yu, Apache HBase PMC

Backup / Restore is a standard feature for RDBMSs. HBase adds support for Backup / Restore through a series of phases: HBASE-7912 (phase 1) and HBASE-14123 (phase 2). The technical approach to implementing backup / restore will be covered along with typical command-line usage.

Speakers
ZY

Zhihong Yu

Staff Engineer, Hortonworks
I have been an Apache HBase PMC member for five and a half years. I am also a committer for Apache Slider and Apache Bahir. I contribute to Apache Phoenix and Apache Spark. I have presented at the past 3 ApacheCon NA events.


Wednesday May 17, 2017 2:30pm - 3:20pm
Trianon

2:30pm

Sponsor Showcase
Wednesday May 17, 2017 2:30pm - 4:40pm
Ballroom Foyer

3:30pm

General Durable Object and Native Computing Model for Apache Big Data Platforms - Johnu George, Cisco

Most big data processing frameworks are JVM based. A major gap in such systems is efficiently mapping the software layers/patterns to the underlying hardware, especially for newer technologies like Non-Volatile Memory (NVM), and removing performance bottlenecks. The Apache Mnemonic project presents abstract models that help resolve memory bottlenecks, e.g. SerDe/marshalling, Garbage Collection (GC) performance issues, memory-storage mapping, massive object caching, object sharing across clusters and kernel caching issues. In this talk we present Mnemonic, its architecture, its programming models and their applications (including integrations with Apache Hadoop and Apache Spark).


Speakers
avatar for Johnu George

Johnu George

Senior Software Engineer, Cisco
Johnu is a senior software engineer with 6+ years of industry experience. His research interests include distributed systems and other big data technologies. He is one of the active contributors to the Apache Mnemonic project.



Wednesday May 17, 2017 3:30pm - 4:20pm
Windsor

3:30pm

Biophotonics Using Apache PredictionIO, Spark and Deep Learning - Prajod Vettiyattil, Wipro Technologies
Biophotonics is the study of microscopic life, like biological cells, using optical methods. It has applications in medicine, agriculture and environmental sciences. In this session we will see how Deep Learning and Big Data software can help analyze images captured in biophotonics using tools like high-end microscopes, thus accelerating medical research. Medical research labs and diagnostic centers use high-end microscopes and deep human knowledge to observe living cells and perform life-cycle analysis on them. These workflows involve time-consuming, iterative and complex processes. This session will explain the application of deep learning to automatically detect microscopic cells from samples of digital images and provide automatic classification, which will be of immense help for medical diagnostics. The solution uses PredictionIO, Spark, OpenCV and Deeplearning4j.

Speakers
avatar for Prajod Vettiyattil

Prajod Vettiyattil

Architect, Wipro
Prajod is a Senior Architect in the open source solutions group of Wipro Technologies, responsible for research and solution development in the area of Big Data and Analytics. His current work involves analyzing image and video content using machine learning, to solve hard proble...


Wednesday May 17, 2017 3:30pm - 4:20pm
Balmoral

3:30pm

Actionable Insights with Apache Apex - Devendra Tagare, DataTorrent Inc.
In this talk I will cover how Apache Apex is used to deliver actionable insights in real time for ad-tech. The talk includes a reference Apex architecture that provides dimensional aggregates at TB scale for billions of events per day. The reference architecture covers concepts around Apex, Kafka and dimensional compute. Real-time streaming problems and challenges will also be covered, along with some operational aspects of a streaming system.
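For readers new to dimensional compute, the aggregation itself can be sketched in a few lines. This is a toy cube over invented ad events, not the Apex operator: every combination of dimension keys gets its own rollup counter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical event stream: each event carries dimension values and a measure.
events = [
    {"advertiser": "acme", "geo": "US", "clicks": 3},
    {"advertiser": "acme", "geo": "EU", "clicks": 1},
    {"advertiser": "beta", "geo": "US", "clicks": 2},
]
DIMENSIONS = ["advertiser", "geo"]

def dimensional_aggregates(events, dims):
    """Aggregate clicks for every combination of dimension keys (a tiny cube)."""
    cube = Counter()
    for e in events:
        for r in range(len(dims) + 1):
            for combo in combinations(dims, r):
                key = tuple((d, e[d]) for d in combo)
                cube[key] += e["clicks"]
    return cube

cube = dimensional_aggregates(events, DIMENSIONS)
# cube[()] is the grand total; cube[(("advertiser", "acme"),)] a per-advertiser rollup.
```

A streaming engine does the same bookkeeping incrementally and in parallel, which is what makes it hard at billions of events per day.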

Speakers
avatar for Devendra Tagare

Devendra Tagare

Data Engineer, DATATORRENT INC
Hi, I am a data platform engineer & Apache committer focused on: solutions architecture for low-latency, high-scalability data streaming systems; rapid prototyping for real-world streaming use cases; backend engineering for Apache Apex & DataTorrent; and working on end to end...



Wednesday May 17, 2017 3:30pm - 4:20pm
Biscayne

3:30pm

Genetic Algorithms in All Their Shapes and Forms - Julien Sebrien, Geneticio Expertise
We will talk about genetic algorithms, which are inspired by the process of natural selection and belong to the larger class of evolutionary algorithms. Genetic algorithms are used to generate solutions to optimization and search problems by relying on bio-inspired operators, and follow this process:
  • Randomly generate a population of individuals
  • Evaluation 
  • Termination checks 
then, iteratively: 
  • Selection
  • Crossover
  • Mutation
  • Evaluation 
  • Termination checks
Genetic algorithm behavior will be illustrated by playful use cases, such as ToBeOrNotToBe or Smart Rockets.
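The loop above can be sketched in a few dozen lines. This is an illustration only (the speaker's examples may differ): it evolves a random string toward a target phrase, in the spirit of the ToBeOrNotToBe demo.

```python
import random

random.seed(42)
TARGET = "to be or not to be"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def fitness(ind):
    # Evaluation: number of characters matching the target
    return sum(a == b for a, b in zip(ind, TARGET))

def select(pop):
    # Selection: size-2 tournament
    return max(random.sample(pop, 2), key=fitness)

def crossover(p1, p2):
    # Crossover: single cut point
    cut = random.randrange(len(TARGET))
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.02):
    # Mutation: occasionally replace a character
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in ind)

# Randomly generate a population, then iterate selection/crossover/mutation
pop = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(200)]
for generation in range(500):
    if max(fitness(i) for i in pop) == len(TARGET):  # termination check
        break
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

best = max(pop, key=fitness)
```

Each bio-inspired operator is deliberately tiny here; real GAs vary selection pressure, crossover schemes and mutation rates per problem.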

Speakers
JS

Julien Sebrien

Julien is an experienced consultant who works on challenging development projects for top financial clients and startups. Julien also likes to work on open source technologies such as Cassandra, Spark or Elastic Search with a strong interest in artificial intelligence. Julien cof...


Wednesday May 17, 2017 3:30pm - 4:20pm
Trianon

4:20pm

Coffee Break
Wednesday May 17, 2017 4:20pm - 4:40pm
Ballroom Foyer

4:40pm

Expanding Apache Zeppelin into Your Cluster - Jongyoul Lee, ZEPL
Apache Zeppelin is one of the tools that help users enrich their analysis with beautiful visualization without any additional work. Until now, however, it has had critical issues for use in production environments: Apache Zeppelin runs on a single server only, which means a single point of failure, and users suffer from a shortage of resources because everything runs on one machine. Apache Zeppelin has worked to overcome this limitation and now supports launching your jobs in a cluster. You no longer have to worry about resources when you run many jobs on Apache Zeppelin, and by using your cluster, one instance is enough for all your colleagues. This talk has two parts. The first describes how Apache Zeppelin launches interpreters in a cluster and what happens internally. The second introduces the Helium plugin system for third-party visualizations and how to install them.

Speakers
avatar for Jongyoul Lee

Jongyoul Lee

Software Development Engineer, ZEPL
I'm a member of the Apache Zeppelin PMC and work at ZEPL. In Apache Zeppelin, I focus on stabilizing Apache Zeppelin for use at production level, developing some enterprise features and enhancing Apache Spark/JDBC features. Personally, I'm really interested in distributed and...


Wednesday May 17, 2017 4:40pm - 5:30pm
Balmoral

4:40pm

Leveraging the GPU on Spark - Tobias Polzer, QAware GmbH
GPUs are a great source of computing power, but they are not yet accessible from Apache Spark. We present an RDD implementation we've open-sourced to leverage GPU computing power with Spark. We'll share the experiences we gained along the way implementing the RDD, and a real-world application using it: What's the best way to bridge from Java to GPU code (OpenCL or CUDA)? From an architectural perspective, what's the best way to integrate a GPU processing facility into Spark? How much faster are typical Spark actions when using the GPU? Which Spark actions are best processed on a GPU? We cover Java-to-GPU bridges, the best way to integrate GPU processing into Spark, and a performance evaluation.

Speakers
TP

Tobias Polzer

Master's student, Friedrich-Alexander University Erlangen-Nuremberg/QAware


Wednesday May 17, 2017 4:40pm - 5:30pm
Trianon

4:40pm

Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures - Hiroyuki Adachi & Hiroshi Yamaguchi, Yahoo Japan Corporation
We present how we automated rolling upgrades for our production Hadoop cluster without data loss or job failures. Apache Ambari can perform a rolling upgrade; however, it does not consider data loss or effects on running jobs. Therefore, we decided to customize it for our environment and created upgrade procedures with more secure checking. First, we made a custom service for Ambari which operates functions such as NameNode failover and load balancer in/out. Second, we used Ansible, a configuration management tool, to control the upgrade tasks. It automates calling Ambari APIs, including the custom service functions, checking cluster statuses (e.g., missing blocks), and running service check jobs while upgrading each component. Consequently, we achieved automatic rolling upgrades, reduced operating costs and minimized inconvenience to users.
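The shape of such a pre-flight health check can be sketched as follows. The status keys here are hypothetical stand-ins for the kind of HDFS/Ambari metrics the abstract mentions; the speakers' actual Ansible/Ambari checks will differ.

```python
def safe_to_continue(status: dict) -> bool:
    """Gate the next rolling-upgrade step on cluster health (toy sketch)."""
    return (
        status.get("MissingBlocks", 1) == 0             # no data loss in flight
        and status.get("UnderReplicatedBlocks", 1) == 0  # replication caught up
        and status.get("LiveNodes", 0) >= status.get("MinLiveNodes", 1)
    )

healthy = {"MissingBlocks": 0, "UnderReplicatedBlocks": 0,
           "LiveNodes": 10, "MinLiveNodes": 9}
degraded = {"MissingBlocks": 2, "UnderReplicatedBlocks": 0,
            "LiveNodes": 10, "MinLiveNodes": 9}
ok = safe_to_continue(healthy)
```

An automation driver would poll a check like this between component upgrades and pause the rollout when it returns false.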

Speakers
avatar for Hiroyuki Adachi

Hiroyuki Adachi

Yahoo Japan Corporation
Hiroyuki Adachi is in charge of Hadoop DevOps at Yahoo! JAPAN.
avatar for Hiroshi Yamaguchi

Hiroshi Yamaguchi

Yahoo Japan Corporation
Hiroshi Yamaguchi is in charge of Hadoop DevOps at Yahoo! JAPAN.


Wednesday May 17, 2017 4:40pm - 5:30pm
Windsor

4:40pm

Streaming Processing with Apache Apex - Bhupesh Chawda, DataTorrent
Apache Apex is a next-generation Hadoop (YARN) native, data-in-motion platform that is being used by customers for both streaming and batch processing. Common use cases include data ingestion, streaming analytics, ETL, database off-loads, alerts and monitoring, machine model scoring, etc. Apache Apex separates operational logic from business logic, which enables developers to concentrate on business logic, reducing time to market as well as total cost of ownership. In this tutorial, we will introduce you to Apache Apex and walk through the development of a real-world application demonstrating stream processing. Attendees will also go through some advanced capabilities like dynamic scalability and run-time updates of application properties. By the end of the session, attendees will be able to write applications to cater to their own use cases.

Speakers
avatar for Bhupesh Chawda

Bhupesh Chawda

Software Engineer, DataTorrent Software India Pvt. Ltd.
Bhupesh Chawda is a Software Engineer at DataTorrent Software India Pvt. Ltd. He is also a committer on the Apache Apex project under the Apache Software Foundation. His current interests include big data and distributed systems, stream processing and machine learning. He has exp...


Wednesday May 17, 2017 4:40pm - 5:30pm
Biscayne

5:40pm

Helium makes Zeppelin Fly! - Moon Soo Lee, Ahyoung Ryu and Hoon Park, NFLabs
Apache Zeppelin is an interactive data analytics environment for computing systems. It integrates many different data processing frameworks like Apache Spark and provides a beautiful interactive web-based interface, data visualization and a collaborative work environment to make your data science lifecycle more fun and enjoyable.

Since 0.7.0, Zeppelin has a framework called 'Helium' with two new pluggable components: Visualization and Spell. Visualization extends the built-in visualizations, and Spell provides a lightweight way to extend the interpreter and display system in Zeppelin.

In this talk we'll see how visualizations and spells can be created and used. The Zeppelin community also provides a Helium online registry, leveraging the NPM package registry, for publishing Visualizations and Spells. We'll take a look at how the community manages the online registry service and how to publish packages to it.

Speakers
MS

Moon soo Lee

ZEPL, inc
Moon Soo Lee is a creator of Apache Zeppelin (incubating) and a Co-Founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and building a healthy business around it.


Wednesday May 17, 2017 5:40pm - 6:30pm
Balmoral
  • Experience Level Any

5:40pm

Construct a Sharable GPU Farm for Data Scientists - Layne Peng, EMC
With the development of machine learning algorithms, GPUs are winning the favor of data scientists. But the high cost of GPU devices and the low utilization caused by static allocation are heavy financial and management burdens when introducing GPUs to a data science team. In this presentation, we will introduce our latest research on how we enable GPU virtualization, chaining GPUs into one shared logical instance based on an intelligent queue model. In this model, the logical server can present a GPU service to one or more clients that represents GPUs local to the data center, GPUs in the cloud or some hybrid combination of local and remote GPUs executing the client application. The allocation of GPU resources is intelligently controlled based on attributes of the task, running concurrently where possible on a GPU or pre-empted to manage higher-priority activity.

Speakers
LP

Layne Peng

EMC
Principal Technologist and Architect at EMC, leading Cloud Management & Orchestration and Converged Infrastructure initiatives in the EMC Office of the CTO, China. Holds thirteen patents related to cloud, SDDC and big data. One of the authors of the book Big Data Strategy, Technology and Application.


Wednesday May 17, 2017 5:40pm - 6:30pm
Windsor

5:40pm

TensorFlow in the Wild: From Cucumber Farmer to Global Insurance Firm - Kazunori Sato, Google
One of the largest global insurance firms recently introduced TensorFlow, the open source library from Google for machine intelligence, to classify car drivers with a high likelihood of major accidents using a deep neural network. The model provides 2x higher accuracy compared with the existing random forest model, giving them the possibility of lowering insurance prices significantly. Also, a cucumber farmer in Japan has been using TensorFlow to build a hand-made sorter that classifies cucumbers into 9 classes based on length, shape and color. At this session, we'll look at how TensorFlow democratizes the power of machine intelligence and is changing the world, with many different real-world use cases of the technology.

Speakers
avatar for Kazunori Sato

Kazunori Sato

Staff Developer Advocate, Google Inc
Kaz Sato is a Staff Developer Advocate on the Cloud Platform team at Google Inc. He leads the developer advocacy team for Machine Learning and Data Analytics products, such as TensorFlow, Cloud ML, and BigQuery. He has spoken at major events including Google I/O 2016, Hadoop Summit 2016, Stra...


Wednesday May 17, 2017 5:40pm - 6:30pm
Biscayne

5:40pm

A Practical Approach to Using Graph Databases and Analytics - Greg Jordan, Graph Story
While graph databases have become a standard for social networking and recommendation engines, the practical use of graphs in other areas beyond consumer applications is growing. In this presentation, with the support of use cases, we will explore how graph databases can be applied to other domains, such as logistics and healthcare, as well as look at where graphs can leverage other data systems. The presentation will also cover the role of graphs in going beyond predictive analytics to providing prescriptive analytics.

Speakers
avatar for Greg Jordan

Greg Jordan

CEO, Graph Story
Greg Jordan is the Founder & CEO of Graph Story, author of Practical Neo4j and has over 15 years of programming experience in various languages with a focus on data analytics and mobile projects. Greg is an avid speaker and writer on the topic of graph databases and has been working...


Wednesday May 17, 2017 5:40pm - 6:30pm
Trianon

6:30pm

BoF: Apache Hivemall: Test it Out (Bring your laptop)
Bring your own laptop and try Hivemall on Apache Spark! 

Wednesday May 17, 2017 6:30pm - 7:15pm
Merrick II

6:30pm

BoF: Apache Mahout
  • Linear Algebra
  • Machine Learning
  • GPU Acceleration
  • Buzz Words

Wednesday May 17, 2017 6:30pm - 7:15pm
Escorial

6:30pm

BoF: Working with Downstream Packaging
  • Dependency Management/Convergence
  • Builds from SRC
  • RPMs, Debs, Docker, etc.

Wednesday May 17, 2017 6:30pm - 7:30pm
Sandringham
 
Thursday, May 18
 

7:00am

Morning Run
Thursday May 18, 2017 7:00am - 8:00am
InterContinental Miami Lobby

8:00am

Breakfast
Thursday May 18, 2017 8:00am - 9:00am
Ballroom Foyer

8:00am

Sponsor Showcase
Thursday May 18, 2017 8:00am - 11:20am
Ballroom Foyer

8:00am

Registration
Thursday May 18, 2017 8:00am - 4:30pm
Mezzanine

9:00am

Venturing into Large Hadoop Clusters - Varun Saxena & Naganarasimha Garla, Huawei Technologies
Hadoop clusters are continuously becoming larger, with several thousand machines running thousands of jobs concurrently on 1000-1500 queues divided by different tenants and crunching a higher volume of data than before. Hence, maintaining good performance of such large clusters, ensuring fast recovery times, upgrading them and debugging them becomes a major challenge. With larger clusters, enterprises expect even more efficient cluster utilisation. The fact that jobs are in turn executed as part of a workflow adds to the complexity. As time progresses, clusters will become even larger, i.e. have several tens of thousands of machines.

In this talk, we plan to share issues we came across while handling large clusters and the optimizations we had to make to resolve them. We will also talk about a few upcoming features in Hadoop which aim to overcome challenges posed by clusters at gigantic scale.

Speakers
NG

Naganarasimha Garla

System Architect, Huawei Technologies
I am a Big Data enthusiast and have been developing Big Data Hadoop applications and platforms for 5 years. I have 12 years of experience as a Java software developer. I have been actively contributing to Hadoop YARN and Map Reduce for 2.5 years and currently A...
VS

Varun Saxena

Senior Technical Leader, Huawei Technologies
I am currently working as a Senior Tech Lead in Huawei's Hadoop Team, which provides big data solutions to multiple product lines in Huawei and contributes to the Hadoop community. I am also an Apache Hadoop Committer and have been contributing to YARN for almost 2.5 years. Overall, I...


Thursday May 18, 2017 9:00am - 9:50am
Windsor

9:00am

Java 9 Support in Apache Hadoop - Akira Ajisaka, NTT DATA
Java 9 is the next major version, expected to be GA in July 2017, and it's very important for Apache Hadoop to support Java 9 early. Hadoop has many downstream projects, and supporting Java 9 in Hadoop makes it easier for those projects to support it as well. Java 9 has more incompatible changes than any earlier release. For example, Project Coin (JEP 213) banned '_' as an identifier, and the Hadoop Web UI is affected. In this session, Akira will introduce the incompatible changes and what we need to do to support Java 9 in Hadoop. Classpath isolation is also an important issue for Hadoop. Hadoop has many dependencies, and developers who write applications running on Hadoop need to be careful to avoid classpath conflicts. The Java 9 Jigsaw feature is expected to solve this 'jar hell' problem, but Hadoop does not use the feature for now. Akira will also introduce how the Hadoop community solves the problem without Jigsaw.

Speakers
avatar for Akira Ajisaka

Akira Ajisaka

Software Engineer, NTT DATA Corporation
Akira Ajisaka is a software engineer working at NTT DATA, Japan. He belongs to OSS Professional Services team and deploys and operates Hadoop clusters for customers. He sometimes troubleshoots them by investigating source code and creating patches to fix the problem. He is an Apa...



Thursday May 18, 2017 9:00am - 9:50am
Balmoral

9:00am

Challenges of Monitoring Distributed Systems - Nenad Bozic, SmartCat
Back in the day, you had a single machine and you could scroll through a single log file to figure out what was going on. In this Big Data world you need to combine a lot of logs to figure out what is going on. Data is coming in huge volumes and at high speed, so choosing important information and getting rid of noise becomes a real challenge. There is a need for a centralized monitoring platform which will aid the engineers operating the systems and serve the right information at the right time. This talk will focus on the monitoring stack we like to use, including Riemann, InfluxDB, ELK and Grafana, with Cassandra as the example of a distributed system. The problem will be separated into two domains, metric collection and log collection, and we will finish with an example of how you can combine both to pinpoint issues.
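One simple form of combining the two domains can be sketched as follows: given the timestamp of a metric spike, pull the log lines in a window around it. The timestamps and log lines here are invented for illustration; a real stack would query InfluxDB and Elasticsearch instead.

```python
from datetime import datetime, timedelta

def logs_near(spike_at, logs, window_s=60):
    """Return log lines within +/- window_s seconds of a metric spike."""
    lo = spike_at - timedelta(seconds=window_s)
    hi = spike_at + timedelta(seconds=window_s)
    return [line for ts, line in logs if lo <= ts <= hi]

# Invented data: a latency spike and two log entries.
spike = datetime(2017, 5, 18, 9, 15, 0)
logs = [
    (datetime(2017, 5, 18, 9, 14, 40), "WARN Dropping mutations: pool saturated"),
    (datetime(2017, 5, 18, 9, 20, 0), "INFO Compaction finished"),
]
suspects = logs_near(spike, logs)
```

The value of a centralized platform is doing exactly this correlation automatically, across many nodes, instead of by hand.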

Speakers
avatar for Nenad Bozic

Nenad Bozic

Co-Founder & Senior Consultant, SmartCat
Big Data enthusiast and Apache Cassandra fan. DataStax MVP for Apache Cassandra for 2017. Craftsman with more than 10 years of experience, all arounder but when he does backend coding (mostly in Java) he feels right at home. Strong believer in balance between good technical skill...


Thursday May 18, 2017 9:00am - 9:50am
Biscayne

9:00am

Apache Rya – A Scalable RDF Triple Store - Adina Crainiceanu, US Naval Academy
Apache Rya (incubating) is a scalable database management system designed for storing and searching very large Resource Description Framework (RDF) data. In its most basic form, RDF data is a triple. Due to its flexibility, RDF is the current standard for storing many different types of information. With the explosive increases in the size of available data, scalable solutions are needed to efficiently store and query very large RDF graphs within big data architectures. Apache Rya is an RDF triple store built on top of Apache Accumulo. We introduce storage methods, indexing schemes, query optimization, and query evaluation techniques that allow Rya to scale to billions of triples across multiple nodes, while providing fast and easy access to the data through conventional query mechanisms such as SPARQL.
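To make "RDF data is a triple" concrete, here is a toy in-memory triple store (illustration only, not Rya's implementation or API): data is (subject, predicate, object) triples, and queries are patterns where None is a wildcard, similar in spirit to a SPARQL basic graph pattern.

```python
# Invented triples for illustration.
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "navy"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None matches anything."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Who does alice know?  (SPARQL: SELECT ?o WHERE { :alice :knows ?o })
friends = [o for _, _, o in match(s="alice", p="knows")]
```

Rya's contribution is making this kind of pattern matching scale to billions of triples by indexing permutations of (s, p, o) in Accumulo.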

Speakers
AC

Adina Crainiceanu

U.S. Naval Academy
Adina Crainiceanu is an Associate Professor in the Computer Science Department at the US Naval Academy. She received her Ph.D. in Computer Science from Cornell University. She has conducted database and distributed systems related research for more than 15 years, and has publishe...


Thursday May 18, 2017 9:00am - 9:50am
Trianon

10:00am

A Funny Thing Happened on the Way to Full Text Search: I Shook my Search Engine and Analytics Fell Out! - Patrick Hoeffel, Polaris Alpha
Search engines are not just for text anymore. Apache Solr has become a powerful Business Intelligence and Analytics tool, answering a much broader array of questions than was possible in the past. We’ll explore use cases that you may not have realized Solr could address, such as Graph Traversal and Machine Learning through Text Classification. We’ll also discuss the key BI and Analytics differentiator - Faceting, and discuss how that one feature can transform your analytics landscape. Then we’ll look at Solr’s new Parallel SQL interface, which allows you to use Tableau and other traditional BI tools right out of the box to perform analysis tasks that never could have been possible before with a Full Text index. During the talk we demonstrate Facets, plus how you can use the SQL interface to set up a simple alerting engine right within Solr so that you can be productive right away.
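Faceting, the differentiator the abstract highlights, is conceptually just value counting over the documents matching a query. This toy sketch (invented documents; Solr does this inside the engine, at scale, over inverted indexes) shows what a facet response contains:

```python
from collections import Counter

# Invented documents standing in for an indexed collection.
docs = [
    {"id": 1, "text": "solr analytics", "category": "search", "year": 2016},
    {"id": 2, "text": "solr graph", "category": "search", "year": 2017},
    {"id": 3, "text": "hbase backup", "category": "storage", "year": 2017},
]

def facet(docs, query_term, field):
    """Count field values across the documents matching a query term."""
    hits = [d for d in docs if query_term in d["text"]]
    return Counter(d[field] for d in hits)

category_facets = facet(docs, "solr", "category")
```

The analytics payoff is that these counts come back alongside the search results themselves, so one query yields both hits and a distribution.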

Speakers
avatar for Patrick Hoeffel

Patrick Hoeffel

Senior Software Engineer, Polaris Alpha
Patrick is a Senior Software Engineer at Polaris Alpha. A veteran of commercial software solutions for over 25 years, Patrick has been involved in products ranging from online services to early Internet startups to enterprise applications to military intelligence. He has consulted t...


Thursday May 18, 2017 10:00am - 10:50am
Biscayne

10:00am

Hadoop Cluster Governance - Vimal Sharma, Hortonworks
Apache Atlas is the one stop solution for data governance and metadata management on enterprise Hadoop clusters. Atlas has a scalable and extensible architecture which can plug into many Hadoop components to manage their metadata in a central repository. Vimal Sharma will review the challenges associated with managing large datasets on Hadoop clusters and demonstrate how Atlas solves the problem. Vimal will focus on Cross Component lineage tracking capability of Apache Atlas. Vimal will discuss the upcoming features and roadmap of Apache Atlas.

Speakers
avatar for Vimal Sharma

Vimal Sharma

Software Engineer, Hortonworks
Vimal Sharma is an Apache Atlas Committer at Hortonworks. Vimal graduated from IIT Kanpur with a B.Tech in Computer Science. Vimal is highly passionate about the Hadoop stack and has previously worked on scaling backend systems at WalmartLabs using Spark and Kafka. Vimal regularly sp...



Thursday May 18, 2017 10:00am - 10:50am
Balmoral

10:00am

Apache Ignite SQL Grid: Hot Blend of Traditional SQL and Swift Data Grid - Denis Magda, GridGain Systems Inc
In-memory data grids bring exceptional performance and scalability gains to the applications built on top of them. Applications genuinely achieve 10x performance improvements and become easily scalable and fault-tolerant thanks to the unique data grid architecture. However, because of this particular architecture, a majority of data grids have to sacrifice traditional SQL support, requiring application developers to completely rewrite their SQL-based code against data-grid-specific APIs.

This, however, is not true for all data grids. In this presentation, Denis will introduce Apache Ignite SQL Grid component that combines the best of two worlds - performance and scalability of data grids and traditional ANSI-99 SQL support of relational databases. Moreover, Denis will take an existing application that works with a relational database and will show how to run it on top of Ignite.

Speakers
avatar for Denis Magda

Denis Magda

Product Manager, GridGain
Denis is an expert in distributed systems and platforms who developed his skills by consistently contributing to Apache Ignite In-Memory Data Fabric and helping GridGain In-Memory Data Fabric customers build distributed and fault-tolerant solutions on top of their platform. Befor...


Thursday May 18, 2017 10:00am - 10:50am
Windsor

10:00am

Big Data Analytics Using Apache (Py)Spark For Analyzing IPO Tweets - Dirk Van den Poel, Ghent University
In this talk, we share our experience in researching and practicing Business Analytics with a strong emphasis on descriptive and predictive analytics. We discuss the usefulness of these open-source analytics platforms by means of a real-life case study in Finance and Marketing: Analyzing the interaction between tweets and the success of an initial public offering (IPO), and the post-IPO price evolution. Moreover, we build a predictive model to determine whether a tweet will be retweeted. We present our findings using a series of platforms ranging from (1) dedicated Apache Spark clusters using Python Zeppelin Notebooks to (2) Databricks’ cloud platform.

Speakers
avatar for Dirk Van den Poel

Dirk Van den Poel

Professor of Data Analytics, Ghent University
Dirk Van den Poel (PhD) is Senior Full Professor of Data Analytics/Big Data at Ghent University, Belgium. He teaches courses such as Statistical Computing, Big Data, Predictive and Prescriptive Analytics. He co-authored 80+ international peer-reviewed publications in journals suc...


Thursday May 18, 2017 10:00am - 10:50am
Trianon

10:50am

Coffee Break
Thursday May 18, 2017 10:50am - 11:20am
Ballroom Foyer

11:20am

Cluster Continuous Delivery with Oozie - Clay Baenziger, Bloomberg
Deploying software to secure, clustered Hadoop environments is a challenge. In particular, one must distribute keytabs, user identities and cluster configuration to build systems like Jenkins, to say nothing of network access to clusters. At Bloomberg, we ensure our clusters are defined via configuration management and can be automatically configured and operated. Application (HBase, Spark) deployment is a key part of this as well!

We have extended Oozie to provide deployment mechanisms for Git, with plans for Maven artifacts, allowing us to specify all cluster configuration including the software deployed to that cluster. Often this consists of an Oozie workflow to deploy software, allowing deployments to happen as the permissioned role account and not as a superuser.

Clay will walk through the process of these deployments and the code necessary to make these first-class Oozie actions.

Speakers
CB

Clay Baenziger

Clay Baenziger is an architect for the Hadoop Infrastructure Team at Bloomberg. Clay comes from a diverse background in systems infrastructure and analytics. At Sun Microsystems, his team built out an automated bare-metal Solaris deployment tool for Solaris engineering labs and...


Thursday May 18, 2017 11:20am - 12:10pm
Balmoral

11:20am

Introduction to Cluster Management Framework and Metrics in Apache Solr - Anshum Gupta, IBM Watson
Cluster management APIs have been consistently added to recent versions of Apache Solr to make designing monitoring systems for Solr easier. However, those APIs have always required advanced knowledge of the pre-checks and of the APIs themselves. The cluster management framework in Solr is aimed at making cluster management easier. A combination of metrics reporting, triggers, and recipes allows users to configure actions based on triggers comprising metrics or changes to the cluster state, e.g. auto-addition of replicas to achieve a desired replication factor when a new node is added to the SolrCloud cluster. In this presentation, I will provide an overview of the cluster management framework, how it works, and its components, i.e. metrics, triggers, and recipes. I will also talk about ways to extend those components to suit specific use cases.
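The auto-add-replicas example can be pictured as a small rule a trigger might evaluate. This is a hypothetical sketch of the decision logic only, not Solr's framework or API:

```python
def replicas_to_add(current: dict, target_rf: int) -> dict:
    """Given collection -> live replica count, plan how many replicas
    each collection needs to reach the target replication factor."""
    return {c: target_rf - n for c, n in current.items() if n < target_rf}

# Hypothetical cluster state after a new node joins: 'logs' is under-replicated.
plan = replicas_to_add({"logs": 2, "products": 3}, target_rf=3)
```

In the real framework, a trigger fires on the node-added event and a recipe turns a plan like this into ADDREPLICA operations.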

Speakers
avatar for Anshum Gupta

Anshum Gupta

Sr. Software Engineer, IBM Watson
Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks a...


Thursday May 18, 2017 11:20am - 12:10pm
Biscayne

11:20am

Presto - Swiss Army SQL Knife on Hadoop - Marek Gawiński & Dariusz Eliasz, Allegro Group
Waiting for Hive queries to finish teaches your analysts patience and respect for technology. Unfortunately, that is not what they expect and not what you get paid for. Interactive SQL on Hadoop has been the Holy Grail within the Hadoop community and for our analysts at Allegro, the biggest e-commerce platform in central-eastern Europe. We have read several benchmark papers on alternatives to Hive and have run benchmarks of our own, but they did not answer the question of which one to choose and whether it is worth adding a Hive alternative to the existing stack. Some technologies performed better with Parquet, others with ORC. None of the benchmarks consider user experience, new technology adoption within an existing stack, or productivity of query development. In this talk we present how we ended up with Presto, and our tips and tricks to hack it.

Speakers
avatar for Dariusz Eliasz

Dariusz Eliasz

Senior Data Platform Engineer, Grupa Allegro Sp. z o.o.
Mainly interested in big data platform architecture and data governance. Enthusiast of scalable distributed solutions, processing large amounts of data and continuous improvement.
MG

Marek Gawiński

Senior Data Platform Engineer, Allegro Group Sp. z o.o.
For 6 years he has worked in the Infrastructure and Services Maintenance Team, where he takes care of technical support for the scrum teams and maintenance of multiple services in the Allegro Group's portfolio. He is now developing big data solutions. Passionate about web technologies an...


Thursday May 18, 2017 11:20am - 12:10pm
Windsor

11:20am

From Open Data to Open Information - Thomas Vanhove, Qrama
Smart cities gather massive amounts of data from IoT sensors all over the city and from external data sources provided by city services. These data sets are often made available to the public as open data sets, but while the data is openly available, using it for practical use cases still requires infrastructure.

Thomas Vanhove will present the City of Things, a smart city project in the city of Antwerp (Belgium), and how people gain access to open data and can run their own analysis with the Tengu platform. Tengu provides the functionality to create custom big data frameworks through automated installation, configuration and integration of big data technologies for storage and analysis. In the City of Things project this not only allows users access to open data but to infrastructure and analysis as well.

Speakers
avatar for Thomas Vanhove

Thomas Vanhove

Co-founder - CEO, Qrama
Thomas obtained his master's degree in Computer Science from Ghent University, Belgium in July 2012. In August 2012, he started his PhD at the Information Technology department, researching the means for reaching true dynamic storage and polyglot persistence as to increase applic... Read More →


Thursday May 18, 2017 11:20am - 12:10pm
Trianon

12:20pm

MOHA: Many-Task Computing Framework on Hadoop - Soonwook Hwang, Korea Institute of Science and Technology Information
In this talk, we present the design and implementation of the MOHA (MTC on Hadoop) framework, which effectively combines Many-Task Computing (MTC) technologies with the Hadoop Big Data platform to enable richer data analytics workflows in the ecosystem. MTC is a computing paradigm in which a workload can consist of, e.g., millions of small tasks, each communicating through files, resulting in another type of data-intensive workload. MOHA is developed as a YARN application so that it can transparently co-host existing MTC applications with other Big Data processing frameworks in a single Hadoop cluster. MOHA can substantially reduce the overall execution time of many-task processing with a minimal amount of resources compared to an existing Hadoop YARN application, by effectively exploiting open-source distributed message queues (Apache ActiveMQ, Kafka) and a streamlined task dispatching mechanism.
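The dispatching pattern the abstract describes, many small tasks pulled from a shared queue by pooled workers, can be sketched in plain Python. This is an illustrative standard-library sketch, not MOHA's actual ActiveMQ/Kafka-based code:

```python
import queue
import threading

def run_many_tasks(tasks, num_workers=4):
    """Dispatch many small tasks to a pool of workers via a shared queue.

    Mirrors the general Many-Task Computing shape (a queue feeding many
    short-lived tasks); a real deployment would use a distributed broker.
    """
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            task = q.get()
            if task is None:          # poison pill: shut this worker down
                q.task_done()
                return
            out = task()              # each "task" is a small callable
            with lock:
                results.append(out)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for task in tasks:
        q.put(task)
    for _ in threads:
        q.put(None)                   # one poison pill per worker
    q.join()                          # wait until every task is marked done
    for t in threads:
        t.join()
    return results
```

In an MTC system the callables would be replaced by file-communicating executables and the in-process queue by a broker such as ActiveMQ or Kafka.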

Speakers
avatar for Soonwook Hwang

Soonwook Hwang

Principal Researcher, KISTI
Dr. Soonwook Hwang is a principal researcher at Korea Institute of Science and Technology Information (KISTI), where he is responsible for the research and development of enabling technologies for the realization of cyber infrastructure for Korea. KISTI is running the biggest nat... Read More →


Thursday May 18, 2017 12:20pm - 1:10pm
Balmoral

12:20pm

Distributed Resource Scheduling Frameworks, Is There a Clear Winner? - Naganarasimha Garla & Varun Saxena, Huawei Technologies
Coming from the Hadoop world, we were aware only of YARN as a distributed resource scheduling framework, but of late we have come across several other scheduling frameworks such as Mesos and Kubernetes. It is challenging to pick the right scheduling framework for an enterprise, as superficially they all look the same. In this presentation, we provide an overview of the architectures of the prominent scheduling frameworks and then compare them functionally. We also discuss which framework suits which scenarios better, and give a brief overview of community activity around these projects.

Speakers
NG

Naganarasimha Garla

System Architect, Huawei Technologies
I am a Big Data enthusiast and have been developing Big Data Hadoop applications and platforms for 5 years. I have 12 years of experience as a Java software developer. I have been actively contributing to Hadoop YARN and MapReduce for 2.5 years and currently A... Read More →
VS

Varun Saxena

Senior Technical Leader, Huawei Technologies
I am currently working as a Senior Tech Lead on Huawei's Hadoop team, which provides big data solutions to multiple product lines in Huawei and contributes to the Hadoop community. I am also an Apache Hadoop committer and have been contributing to YARN for almost 2.5 years. Overall, I... Read More →


Thursday May 18, 2017 12:20pm - 1:10pm
Biscayne

12:20pm

ING CoreIntel: On The Bank Secret Service - Krzysztof Adamski, ING
Security is at the core of every bank's activity. ING set an ambitious goal: to have insight into overall network data activity. The purpose is to quickly recognize and neutralize unwelcome guests such as malware and viruses, and to prevent data leakage or track down misconfigured software components. Since the inception of the CoreIntel project, we knew we would face the challenges of capturing, storing and processing vast amounts of data of various types from all over the world. In this session we share our experience in building a scalable, distributed system architecture based on Kafka, Spark Streaming, Hadoop and Elasticsearch to help us achieve these goals: why choosing a good data format matters, why dealing with Elasticsearch is a love-hate relationship for us, and how we managed to implement persistence in an OpenShift cluster.

Thursday May 18, 2017 12:20pm - 1:10pm
Trianon

12:20pm

Podling Shark Tank - Jim Jagielski, Capital One; Sally Khudairi, ASF; Justin Mclean, Class Software; Roman Shaposhnik, Pivotal Inc.
Is it a panel? Is it a talk? It is a Podling Shark Tank! Back by popular demand with even sharkier judges! What is it, you ask? Well, it is just like the Shark Tank TV show (think speed dating between entrepreneurs and investors), but instead of Squirrel Boss and the Man Candle (don't forget to look those up!) you'll be hearing pitches for Apache Incubator projects. And instead of Mark Cuban and Kevin O'Leary, you'll be pitching to a panel of ASF elders (trying to convince them that your project is worthy of their esteemed attention and endorsement). There will be snark, there will be prizes, there will be reciting of the Apache Way creed. But most of all there will be fun. We guarantee it!

Moderators
RS

Roman Shaposhnik

Director of Open Source, Pivotal
Roman Shaposhnik is a Director of Open Source at Pivotal Inc and VP of Technology for ODPi at Linux Foundation. He is a committer on Apache Hadoop, co-creator of Apache Bigtop and contributor to various other Hadoop ecosystem projects. He is also an ASF member and a former Chair... Read More →

Speakers
avatar for Jim Jagielski

Jim Jagielski

Director, Apache Software Foundation
Jim is a well known and acknowledged expert and visionary in Open Source, an accomplished coder, and frequent engaging presenter on all things Open, Web and Cloud related. As a developer, he’s made substantial code contributions to just about every core technology behind the In... Read More →
avatar for Sally Khudairi

Sally Khudairi

VP Marketing & Publicity, The Apache Software Foundation
Sally Khudairi is Vice President of Marketing & Publicity at The Apache Software Foundation (ASF) where, in 2002, she was elected its first female and non-technical Member. She is responsible for elevating the ASF’s visibility, and counsels 350+ Apache projects and initiatives... Read More →
avatar for Justin Mclean

Justin Mclean

Founder, Class Software
Justin Mclean has more than 25 years experience in developing web based applications and is involved in the open source hardware movement. He runs his own consulting company Class Software and has spoken at numerous conferences in Australia and overseas including previous ApacheC... Read More →


Thursday May 18, 2017 12:20pm - 1:10pm
Windsor

1:10pm

Lunch (Attendees on Own)
Thursday May 18, 2017 1:10pm - 2:40pm
TBA

2:40pm

Transactions in Hadoop - Andreas Neumann, Cask
In the age of NoSQL, big data storage engines such as HBase have given up the ACID semantics of traditional relational databases in exchange for high scalability and availability. However, it turns out that in practice, many applications require consistency guarantees to protect data from concurrent modification in a massively parallel environment. In the past few years, several transaction engines have been proposed as add-ons to HBase: three different engines, namely Omid, Tephra, and Trafodion, were open-sourced within the Apache ecosystem alone. In this talk, Andreas Neumann will introduce and compare the different approaches from various perspectives including scalability, efficiency, operability and portability, and make recommendations pertaining to different use cases.
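The write-write conflict check that snapshot-isolation engines of this kind perform at commit time can be sketched as follows. This is a simplified, in-memory illustration; engines such as Omid and Tephra actually use a central timestamp oracle and persistent commit metadata, and the details differ per engine:

```python
class ConflictDetector:
    """Minimal sketch of a snapshot-isolation commit check of the kind
    layered on top of HBase by transaction engines (greatly simplified)."""

    def __init__(self):
        self.last_commit = {}   # key -> timestamp of its latest commit
        self.clock = 0          # stand-in for the timestamp oracle

    def begin(self):
        self.clock += 1
        return self.clock       # transaction's start (snapshot) timestamp

    def try_commit(self, start_ts, write_set):
        # Abort if any key we wrote was committed after our snapshot began.
        for key in write_set:
            if self.last_commit.get(key, 0) > start_ts:
                return None     # conflict: caller must abort and retry
        self.clock += 1
        commit_ts = self.clock
        for key in write_set:
            self.last_commit[key] = commit_ts
        return commit_ts
```

The core rule: a transaction aborts if any key in its write set was committed after the transaction's start timestamp, which preserves a consistent snapshot without row-level locks.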

Speakers
avatar for Andreas Neumann

Andreas Neumann

Cask
Andreas Neumann develops big data software at Cask, and has formerly done so at places that are known for massive scale. He was the chief architect for Hadoop at Yahoo! and also for the foundational content management system that Yahoo! built on Hadoop. Previously he was a resear... Read More →


Thursday May 18, 2017 2:40pm - 3:30pm
Balmoral

2:40pm

Lessons Learned with Spark & Cassandra - Matthias Niehoff, codecentric AG
We have built multiple applications based on Apache Cassandra and Apache Spark. During these projects we encountered a number of challenges and problems with both technologies, as well as with the Spark-Cassandra Connector. In this talk we outline a few of those problems and the actions we took to solve them. Furthermore, we share best practices that turned out to be useful in our projects. Topics include, but are not limited to:
  • Cassandra Bucketing
  • Spark Partitioning
  • Efficient Queries
  • Spark Join With Cassandra Table
  • Spark Data Locality
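As an illustration of the first topic: bucketing splits one ever-growing Cassandra partition into bounded time buckets by adding a bucket component to the partition key. A minimal sketch of deriving such a key (the 24-hour bucket width and key shape are illustrative assumptions, not values from the talk):

```python
from datetime import datetime

def bucketed_key(sensor_id, ts, bucket_hours=24):
    """Derive a compound partition key (sensor_id, bucket) so that one
    sensor's time series is spread over bounded-size partitions instead
    of a single partition that grows without limit."""
    # Timezone-free hour count: days-since-epoch * 24 plus the hour of day.
    hours = ts.toordinal() * 24 + ts.hour
    return (sensor_id, hours // bucket_hours)
```

Rows for the same sensor and day land in the same partition; a time-range query then fans out over a known, bounded set of buckets rather than scanning one huge partition.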

Speakers
avatar for Matthias Niehoff

Matthias Niehoff

IT Consultant, codecentric AG
Matthias works as an IT-Consultant at codecentric AG in Germany. His focus is on big data & streaming applications with Apache Cassandra & Apache Spark. Yet he does not lose track of other tools in the area of big data. Matthias shares his experiences on conferences, meetups and... Read More →


Thursday May 18, 2017 2:40pm - 3:30pm
Biscayne

2:40pm

Scala + SQL = Union of Two Equals in Spark - Jayesh Thakrar, Conversant
Spark's capabilities as a better and faster Hadoop, as a distributed Scala platform, and as an interactive, batch and streaming environment are quite well known. But its prowess as a multilingual platform has not received sufficient spotlight. Traditionally, RDBMS environments needed to glue together set-oriented SQL with specialized row-level procedural languages (e.g. PL/SQL), or use APIs in non-SQL languages, e.g. JDBC. In Spark, however, the confluence of Scala and SQL is a union of two equals, as both are set- or collection-oriented but have their own unique strengths. This presentation will illustrate, with background and examples, how to exploit this fusion of Scala and SQL in a way that takes advantage of both their strengths and boosts productivity.

Speakers
avatar for Jayesh Thakrar

Jayesh Thakrar

Sr. Software Engineer, Conversant
Jayesh Thakrar is a Sr. Data Engineer at Conversant (http://www.conversantmedia.com/). He is a data geek who gets to build and play with large data systems consisting of Hadoop, Spark, HBase, Cassandra, Flume and Kafka. To rest after a good day's work, he uses OpenTSDB with 500... Read More →



Thursday May 18, 2017 2:40pm - 3:30pm
Windsor

2:40pm

Secure, UI-Driven Spark/Flink/Kafka-as-a-Service - Jim Dowling, Royal Institute of Technology
Since June 2016, SICS Swedish ICT has provided Hadoop/Spark/Flink/Kafka/Zeppelin-as-a-service to researchers in Sweden. We have developed a UI-driven multi-tenant platform (Apache v2 licensed) in which researchers securely develop and run their applications. Applications can either be deployed as jobs (batch or streaming) or written and run directly from notebooks in Apache Zeppelin. All applications run on YARN within a security framework built on project-based multi-tenancy. A project is simply a grouping of users and datasets. Datasets are first-class entities that can be securely shared between projects. Our platform also introduces a necessary condition for elasticity: pricing. Application execution time in YARN is metered and charged to projects, which also have HDFS quotas for disk usage. We also support project-specific Kafka topics that can likewise be securely shared.

Speakers
avatar for Jim Dowling

Jim Dowling

Senior Researcher, SICS RISE / KTH
Jim Dowling is an Associate Professor at KTH Royal Institute of Technology in Stockholm as well as a Senior Researcher at SICS Swedish ICT. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). He is lead architect of... Read More →



Thursday May 18, 2017 2:40pm - 3:30pm
Trianon

2:40pm

Sponsor Showcase
Thursday May 18, 2017 2:40pm - 4:40pm
Ballroom Foyer

3:40pm

Performance Benchmarking in Open-Source at Amazon EMR - Stephen Tak Lon Wu, Amazon AWS EMR
Amazon EMR is a cloud-based service that allows companies, research centers and academic institutions to leverage managed clusters at massive scale. To maintain and achieve performance in the open-source world of big data processing, Amazon EMR built an automatic performance benchmarking pipeline to help validate each new release before it ships. Why is this pipeline needed? Open source communities move fast; innovations and implementations often need multiple iterations to work effectively at massive scale. Amazon EMR aims to provide a stable service; historical performance metrics help us preview and capture issues in each product before releasing it to the market, while Amazon EMR follows the open source releases closely.

Speakers
TW

Tak Lon (Stephen) Wu

Amazon EMR
Tak Lon (Stephen) Wu is a software development engineer at Amazon EMR. Before joining the company, he was working toward his PhD at Indiana University and achieved candidacy in late 2015. His research interests are big data application analysis, MapReduce, data mining and performa... Read More →


Thursday May 18, 2017 3:40pm - 4:30pm
Biscayne

3:40pm

Streamline Hadoop DevOps with Apache Ambari - Alejandro Fernandez, Hortonworks
Apache Ambari has become an indispensable tool for operating Hadoop clusters ranging from 20 to 2000 nodes. Ambari's knowledge of the Hadoop stack allows it to deploy a cluster within minutes and manage the entire lifecycle: scaling, security, upgrades, and more. The speaker will discuss central features like deploying clusters with Blueprints, adding custom services, scaling the number of hosts, adding High Availability, securing with MIT Kerberos, upgrading the Hadoop stack with features like Rolling & Express Upgrade, and using the REST API to automate workflows. For users and data scientists, Ambari provides LDAP sync, Role-Based Access Control to handle user permissions, and a framework to host Ambari Views. Lastly, he will cover how to monitor the health of the cluster via Alerts and troubleshoot using LogSearch and the Ambari Metrics System integrated with the Grafana UI.

Speakers
avatar for Alejandro Fernandez

Alejandro Fernandez

Staff Software Engineer, Hortonworks
Alejandro Fernandez has been a PMC member of the Apache Ambari project since 2014 and is a software engineer at Hortonworks. He has made significant code contributions to Apache Ambari, has organized and participated in hackathons, and has been a speaker at the Hadoop Summit in San Jos... Read More →


Thursday May 18, 2017 3:40pm - 4:30pm
Balmoral

3:40pm

One-Click Production Deployment of Tensorflow AI and Spark ML Models Using 100% Open Source Jupyter Notebook, Kubernetes, and NetflixOSS - Chris Fregly, PipelineIO
In this completely demo-based talk, Chris Fregly from PipelineIO will demo the latest 100% open source research in high-scale, fault-tolerant, distributed model training, testing, and serving using Tensorflow, Spark ML, Jupyter Notebook, Docker, Kubernetes, and NetflixOSS Microservices. This talk will discuss the trade-offs of mutable vs. immutable model deployments, on-the-fly JVM byte-code generation, global request batching, microservice circuit breakers, and dynamic cluster scaling, all from within a Jupyter notebook. All code and Docker images are available from GitHub and DockerHub at http://pipeline.io.

Thursday May 18, 2017 3:40pm - 4:30pm
Alhambra

3:40pm

A Smarter Pig - Eli Levine, Salesforce & Julian Hyde, Hortonworks
What if Apache Pig had a SQL front-end and query optimizer? What if Apache Calcite was able to use Pig and MapReduce to run queries? In this project, we aimed to answer both questions by adding a Pig adapter for Calcite. In this talk, we describe Calcite's adapter framework, how we used it to write a Pig adapter, and how you can use this SQL interface to Pig for interactive and long-running queries.

Speakers
avatar for Julian Hyde

Julian Hyde

Architect, Hortonworks
Julian Hyde is an expert in query optimization, in-memory analytics, and streaming. He was the initial developer of Apache Calcite and is a PMC member of Drill, Kylin and Eagle. He is an architect at Hortonworks.
avatar for Eli Levine

Eli Levine

Architect, Salesforce
Eli Levine is an architect at Salesforce building large scale storage and compute systems. He is a PMC member of Apache Phoenix.


Thursday May 18, 2017 3:40pm - 4:30pm
Windsor

3:40pm

Applying Apache Big Data Stack for Science-Centric Use Cases - Suresh Marru, Indiana University
This talk will discuss adaptation of Apache Big Data Technologies to analyze large, self-described, structured scientific data sets. We will present initial results for the problem of analyzing petabytes of weather forecasting simulation data produced as part of National Oceanic and Atmospheric Administration's annual Hazardous Weather Testbed. The challenge is to enable weather researchers to perform investigative queries over the full forecast simulation outputs to find the signatures for severe weather phenomena like tornadogenesis. Given the size of the data and the complexity of weather phenomena, these data sets are candidates for exploration by machine learning techniques that can identify heretofore unknown relationships in the dozens of weather parameters generated by the simulations, guiding researchers into developing new scientific models.

Speakers
avatar for Suresh Marru

Suresh Marru

Member, Indiana University
Suresh Marru is a Member of the Apache Software Foundation and is the current PMC chair of the Apache Airavata project. He is the deputy director of Science Gateways Research Center at Indiana University. Suresh focuses on research topics at the intersection of application domain... Read More →


Thursday May 18, 2017 3:40pm - 4:30pm
Trianon

4:40pm

Docker on Hadoop - Daniel Templeton, Cloudera, Inc.
Apache Hadoop is a powerful platform for processing large volumes of structured, semi-structured, and unstructured data. Docker is an exciting technology for containerizing workloads. Combining the two can solve a number of issues for big data practitioners. In this talk, Daniel Templeton will walk the audience through the current level of Docker support in Hadoop, where it falls short, and how best to take advantage of it. Daniel will also cover the ongoing community work, including its impact and expected availability.

Speakers
DT

Daniel Templeton

Cloudera, Inc.
Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today Daniel works on the YARN development team at Cloudera, focused on the resource manager, fair scheduler, and Docker support, and is a Hadoop committer. Dan... Read More →


Thursday May 18, 2017 4:40pm - 5:30pm
Balmoral

4:40pm

Multi-Model Big Data Platform for Complex Real Estate Analytics - Karthik Karuppaiya, Ten-X
Building an online real-estate marketplace is an extremely complex, high-touch business. The data the business deals with varies from scanned PDFs and complex Excel spreadsheets to transactional RDBMSes and clickstream data. Data engineering at Ten-X has spent the last couple of years building a highly effective multi-model data platform that brings all of this data together and analyzes it to help the business make better decisions and move faster. In this talk we will cover how our data platform evolved, including the technology choices we made and why we made them. Our data lake is built as a multi-model platform on top of technologies including Hadoop, JanusGraph, Spark, Hive, Cassandra and HBase. We will also introduce you to some of the complex pattern matching algorithms and Natural Language Processing techniques we have implemented on our platform.

Speakers
avatar for Karthik Karuppaiya

Karthik Karuppaiya

Sr. Engineering Manager, Data and Analytics, Ten-X
Leading the Data Engineering team at Ten-X. Have been working on Hadoop and NoSQL technologies since 2010. Currently helping to build the next generation Data Platform for Ten-X using Hadoop, Kafka, JanusGraph, Spark and Cassandra. Prior to Ten-X, I led the Big Data Engineering t... Read More →


Thursday May 18, 2017 4:40pm - 5:30pm
Windsor
  • Experience Level Any

4:40pm

Advertising on Google and Traffic Experimentation Platform in eBay - Martin Zhang, eBay
eBay is one of the largest e-commerce companies in the world, providing C2C and B2C sales services via the Internet. eBay has more than 400 million users (160 million active) and more than 1 billion items listed on the eBay site. We built an advertising and experimentation platform for search networks such as Google and Bing, based on Hadoop, Spark, Kafka, etc. In this session, we introduce our advertising and experimentation platform and show how the experimentation platform supports A/B testing and running different science models.
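A core building block of such an experimentation platform is deterministic assignment of users to A/B variants, so that a user sees the same variant on every request without any stored state. A generic hash-based sketch (not eBay's actual implementation; the variant names are illustrative):

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically assign a user to an experiment variant by hashing
    (experiment, user_id). The same user always lands in the same bucket,
    which gives consistent exposure across requests and sessions."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Salting the hash with the experiment name keeps bucket assignments independent across experiments, so running many experiments at once does not correlate their populations.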

Speakers

Thursday May 18, 2017 4:40pm - 5:30pm
Trianon

4:40pm

The Importance of Automation in Open Source - Ashish Thusoo, Qubole
Open source technologies permeate the enterprise. But while open source technologies are critical to everything from infrastructure to data initiatives, they often require a great deal of expertise, labor and resources to integrate into enterprise environments as they constantly evolve and emerge. As a result, automation is becoming paramount to the successful implementation of the latest open source projects. Platforms need more automation to reduce the effort and expertise it takes to integrate these constantly evolving open source projects. In this presentation, Ashish will discuss the use of automation to free up your data experts' time to effectively employ open source projects that are often not enterprise-ready without a lot of customization.

Speakers
AT

Ashish Thusoo

CEO and co-founder, Qubole
Ashish Thusoo is the CEO and co-founder of Qubole, a cloud-based provider of Hadoop services. Before co-founding Qubole, Ashish ran Facebook’s Data Infrastructure team; under his leadership the team built one of the largest data processing and analytics platforms in the world. As... Read More →


Thursday May 18, 2017 4:40pm - 5:30pm
Biscayne