Apache: Big Data North America 2017 will be held at the Intercontinental Miami in Miami, Florida. 

Register now for the event taking place May 16-18, 2017. 
Monday, May 15
 

9:30am EDT

BarCampApache
BarCampApache is a BarCamp facilitated by a group of people involved in the Apache Software Foundation (ASF). Because the ASF is helping to organize it, there will be plenty of people around who know a lot about Apache projects, communities, and technologies, so quite a few sessions are usually proposed in those areas. It isn't exclusively Apache, though, and all topics are welcome, so everyone should come and talk about fun new ideas, projects, and technologies.

BarCampApache will be a dynamic get-together open to the public. Like other unconferences, the schedule will be determined by the participants, Apache and non-Apache alike! We strongly encourage lots of people to come along and share their knowledge and ideas; we want it to be a great day of sharing for everyone, not just those at the event. Everyone coming in for the conference is encouraged to come early, as it will be a great day for all.

(Please note: While BarCampApache is free to attend, you will need to register for Apache: Big Data if you wish to attend the conference sessions.)

Monday May 15, 2017 9:30am - 3:00pm EDT
Rafael
  BarCamp
  • Experience Level Any
 
Tuesday, May 16
 

12:05pm EDT

Even Faster: When Presto Meets Parquet @ Uber - Zhenxiao Luo, Uber
As Uber continues to grow, our big data systems need to grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. We put Presto into production in 2016; it now serves roughly 100K queries per day at Uber and has become a key component for interactive SQL queries on big data. In this presentation we will share our experiences and engineering efforts. We start with a general introduction to Hadoop infrastructure and analytics at Uber, followed by a brief introduction to Presto, the interactive SQL engine for big data. We then focus on how we built the new Parquet reader for Presto and the techniques behind it: columnar reads, lazy reads, and nested column pruning. We will show the resulting performance improvements and Uber's use cases, and finally share our ongoing work.
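To make the columnar-read idea concrete, here is a minimal sketch in Python using PyArrow rather than Presto's Java reader; the file name and columns are assumptions for illustration only. Decoding just the columns a query touches, instead of the whole row, is the same principle that the lazy-read and nested-column-pruning work applies inside Presto.

    # Illustration of columnar reads against a Parquet file using PyArrow
    # (not Presto's reader; file and column names are hypothetical).
    import pyarrow.parquet as pq

    # Decode only the columns the query needs; other columns are skipped entirely.
    projected = pq.read_table("trips.parquet", columns=["city_id", "fare"])
    print(projected.schema)

    # A full-schema read, by contrast, decodes every column in the file.
    full = pq.read_table("trips.parquet")
    print(full.num_columns, "columns decoded vs", projected.num_columns)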

Speakers

Zhenxiao Luo

Sr. Staff Engineer, Twitter
Zhenxiao Luo leads the Interactive Query Engines team at Twitter, where he focuses on Druid, Presto, and Spark. Before joining Twitter, Zhenxiao ran the Interactive Analytics team at Uber. He has big data experience at Netflix, Facebook, Cloudera, and Vertica. Zhenxiao is a committer...


Tuesday May 16, 2017 12:05pm - 12:55pm EDT
Alhambra
  SQL
  • Experience Level Any

3:30pm EDT

Leveraging Smart Meter Data for Electric Utilities: Comparison of Spark SQL with Hive - Yusuke Furuyama, Hitachi
Hitachi has focused on its social innovation business and has constantly evolved to create sustainable products and solutions that enhance the quality of life across the globe. We are now leveraging smart meter data for electric utilities. To meet their needs, we compared the performance of batch processing for aggregating smart meter data using Hadoop (MapReduce), Spark 1.6, and Spark 2.0, while varying several parameters (the amount of input data, the processing logic, the input file format, etc.). In this session, we report the results of these performance tests.
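For context, an hourly smart-meter aggregation of the kind benchmarked here could be expressed in Spark 2.x roughly as in the sketch below; the input path, column names, and window size are illustrative assumptions, not Hitachi's actual workload.

    # A minimal PySpark 2.x sketch of an hourly smart-meter aggregation
    # (illustrative only; paths and column names are assumptions).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("meter-aggregation").getOrCreate()

    # Assumed schema: meter_id (string), ts (timestamp), kwh (double)
    readings = spark.read.parquet("s3a://example-bucket/meter-readings/")

    hourly = (readings
              .groupBy("meter_id", F.window("ts", "1 hour"))
              .agg(F.sum("kwh").alias("kwh_total")))

    hourly.write.mode("overwrite").parquet("s3a://example-bucket/meter-hourly/")

A comparable aggregation expressed as a Hive query over MapReduce is the baseline the session compares against.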

Speakers

Yusuke Furuyama

Senior Engineer, Hitachi, Ltd.
Yusuke Furuyama is a solutions engineer at Hitachi, responsible for leveraging OSS related to big data and AI and for offering progressive big data solutions to customers building enterprise systems. Currently, he is focusing on Python and Python libraries for...


Tuesday May 16, 2017 3:30pm - 4:20pm EDT
Alhambra
  SQL
  • Experience Level Any
 
Wednesday, May 17
 

10:15am EDT

The Rise of Real-Time: Apache DistributedLog and Its Stream Store - Sijie Guo, Twitter
Data growth is exponential, and organizations are producing data in a myriad of formats. Instead of storing and processing the data at some regular cadence, many in the industry are realizing the benefits of real-time data analytics via stream processing. The move from batch processing to streaming architectures is a revolution in how companies use data. But what is the state of storage for real-time applications, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Sijie Guo will describe Apache DistributedLog, a high-throughput, low-latency replicated stream store; discuss the challenges of building a stream store for real-time applications; and explore the future of Apache DistributedLog and the big data ecosystem.

Speakers

Sijie Guo

Twitter
Sijie currently works for Twitter on DistributedLog and BookKeeper and is the Apache BookKeeper PMC Chair. He previously worked for Yahoo! on its push notification system.


Wednesday May 17, 2017 10:15am - 11:05am EDT
Biscayne
  Streaming
  • Experience Level Any

10:15am EDT

Evolution of an Apache Spark Architecture for Processing Game Data - Nick Afshartous, Warner Brothers Interactive Entertainment (WBIE)
We discuss lessons learned from our first production deployment of a Spark Streaming pipeline for processing game data. We deploy to the AWS cloud, where we use managed services (i.e., EMR, S3, and Redshift). However, downstream dependencies with outages and unpredictable response latencies can pose significant challenges. To address this, we evolved the architecture by separating data processing from post-processing tasks (i.e., copying data into Redshift). Post-processing tasks are sent downstream from Spark to a task executor built using Akka Streams and Reactive Kafka. The end result is a loosely coupled architecture in which the Spark Streaming job acts as a firehose to S3 and remains fault-tolerant when Redshift is unavailable.
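The handoff described here can be sketched roughly as below. This is an illustration under assumed names, not WBIE's code: it shows only the Spark side landing each batch in S3 and publishing a lightweight task message to Kafka (using kafka-python for brevity), while the actual downstream task executor in the talk is built with Akka Streams and Reactive Kafka.

    # A minimal sketch of the decoupling pattern: the streaming job persists each
    # batch to S3, then enqueues a "post-process" task so the Redshift copy happens
    # outside the Spark job. Topic, bucket, and source are hypothetical.
    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from kafka import KafkaProducer  # kafka-python client

    sc = SparkContext(appName="game-events-firehose")
    ssc = StreamingContext(sc, 60)  # one-minute micro-batches

    events = ssc.socketTextStream("localhost", 9999)  # stand-in for the real event source

    def persist_and_enqueue(batch_time, rdd):
        if rdd.isEmpty():
            return
        path = "s3a://example-bucket/game-events/" + batch_time.strftime("%Y%m%d%H%M")
        rdd.saveAsTextFile(path)  # firehose: always land the raw batch in S3
        # Hand the Redshift load to a downstream task executor instead of blocking Spark.
        producer = KafkaProducer(bootstrap_servers="localhost:9092")
        producer.send("post-process-tasks",
                      json.dumps({"action": "copy_to_redshift", "s3_path": path}).encode())
        producer.flush()

    events.foreachRDD(persist_and_enqueue)
    ssc.start()
    ssc.awaitTermination()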

Speakers

Nick Afshartous

Tech Director, Warner Brothers Interactive
Nick Afshartous is a Tech Director at Warner Brothers Interactive Entertainment (WBIE) where he leads the Analytics Core Platform team. Using Apache Spark, he's helping to build WBIE's next-generation real-time analytics platform for processing game data. He's passionate about...


Wednesday May 17, 2017 10:15am - 11:05am EDT
Trianon
  Use Cases
  • Experience Level Any

2:30pm EDT

Nexmark, a Unified Framework to Evaluate Big Data Processing Systems with Apache Beam - Ismael Mejia & Etienne Chauchot, Talend
Big data processing in real time is on the rise at Apache, with projects like Apache Spark, Apache Flink, and Apache Apex. However, at the moment we don't have a unified framework to evaluate the correctness and performance of these systems. Apache Beam implements a unified model for writing both batch and streaming jobs with a single API and executing them independently on any of the supported platforms (runners); this makes Beam an ideal candidate to support an evaluation framework.

In this talk we will present Nexmark, a benchmark framework for evaluating queries over data streams. An implementation of Nexmark was donated by Google as part of the Apache Beam incubation process. Nexmark not only bridges the gap in evaluating data processing frameworks, but also serves as a rich integration test for the correct implementation of both the Beam runners and new features of the Beam SDK.
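As a rough illustration of the unified model that makes this possible, the sketch below builds a tiny pipeline with the Beam Python SDK and runs it on the local DirectRunner; the same graph could be handed to another runner unchanged. It is not a Nexmark query, and all names are assumptions for illustration.

    # A toy Beam pipeline illustrating the unified model (not Nexmark itself).
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run(runner_name="DirectRunner"):
        options = PipelineOptions(["--runner=%s" % runner_name])
        with beam.Pipeline(options=options) as p:
            (p
             | "CreateBids" >> beam.Create([("auction-1", 10), ("auction-2", 25), ("auction-1", 40)])
             | "MaxBidPerAuction" >> beam.CombinePerKey(max)  # stand-in for a Nexmark-style query
             | "Print" >> beam.Map(print))

    if __name__ == "__main__":
        run()  # run("FlinkRunner") or run("SparkRunner") would execute the same graph elsewhere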

Speakers

Etienne Chauchot

Talend
Etienne has been working in software engineering and architecture for more than 13 years, in domains such as retail and finance. He has been focusing on big data for a few years, on technologies such as Apache Cassandra, Elasticsearch, and Apache Spark. He is an open source fan...

Ismaël Mejía

Senior Cloud Advocate, Microsoft
Ismaël Mejía is a Senior Cloud Advocate at Microsoft working on the Azure Data and AI team. He has more than a decade of experience architecting systems for startups and financial companies. He has recently been focused on distributed data frameworks and is an active contributor...


Wednesday May 17, 2017 2:30pm - 3:20pm EDT
Balmoral
  Beam/Zeppelin
  • Experience Level Any

5:40pm EDT

Helium makes Zeppelin Fly! - Moon Soo Lee, Ahyoung Ryu and Hoon Park, NFLabs
Apache Zeppelin is an interactive data analytics environment for computing systems. It integrates many different data processing frameworks, such as Apache Spark, and provides a beautiful interactive web-based interface, data visualization, and a collaborative work environment to make your data science lifecycle more fun and enjoyable.

Since 0.7.0, Zeppelin has provided a framework called 'Helium' with two new pluggable component types: Visualization and Spell. Visualization extends the built-in visualizations, and Spell provides a lightweight way to extend the interpreter and display system in Zeppelin.

In this talk we'll see how Visualizations and Spells can be created and used. The Zeppelin community also provides a Helium online registry, leveraging the NPM package registry, for publishing Visualizations and Spells. We'll take a look at how the community manages the online registry service and how to publish a package to it.

Speakers

Moon Soo Lee

ZEPL, Inc.
Moon Soo Lee is the creator of Apache Zeppelin and a co-founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and building a healthy business around it.


Wednesday May 17, 2017 5:40pm - 6:30pm EDT
Balmoral
  Beam/Zeppelin
  • Experience Level Any
 
Thursday, May 18
 

4:40pm EDT

Multi-Model Big Data Platform for Complex Real Estate Analytics - Karthik Karuppaiya, Ten-X
Building an online real estate marketplace is an extremely complex, high-touch business. The data the business deals with ranges from scanned PDFs and complex Excel spreadsheets to transactional RDBMSs and clickstream data. Data engineering at Ten-X has spent the last couple of years building a highly effective multi-model data platform that brings all of this data together and analyzes it to help the business make better decisions and move faster. In this talk we will describe how our data platform evolved, including the technology choices we made and why we made them. Our data lake is built as a multi-model platform on top of technologies including Hadoop, JanusGraph, Spark, Hive, Cassandra, and HBase. We will also introduce some of the complex pattern-matching algorithms and natural language processing techniques we have implemented on our platform.

Speakers

Karthik Karuppaiya

Sr. Engineering Manager, Data and Analytics, Ten-X
Karthik leads the Data Engineering team at Ten-X and has been working on Hadoop and NoSQL technologies since 2010. He is currently helping to build the next-generation data platform for Ten-X using Hadoop, Kafka, JanusGraph, Spark, and Cassandra. Prior to Ten-X, he led the Big Data Engineering team...


Thursday May 18, 2017 4:40pm - 5:30pm EDT
Windsor
  SQL
  • Experience Level Any
 