Apache: Big Data North America 2017: Full Schedule

Apache: Big Data North America 2017 will be held at the Intercontinental Miami in Miami, Florida.

Register now for the event taking place May 16-18, 2017.

11:05am EDT

OODT 2.0: The Future Of Distributed Data Management - Tom Barber, Meteorite Consulting

OODT, originally developed by NASA, provides distributed data management. In this talk we will look at the history of OODT and what is coming in OODT 2.0 to provide a more modern infrastructure to manage your data and metadata.

OODT 2.0 will offer much-improved big data connectivity, workflow processing, and deployment techniques, allowing for easier distribution and scaling of the platform. We will run through a sample deployment and show how beneficial it can be to use OODT to process your incoming data.

Speakers

Tom Barber

Technical Director, Spicule LTD

Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals...

Tuesday May 16, 2017 11:05am - 11:55am EDT
Alhambra

Big Data

Experience Level Beginner

2:30pm EDT

Online and Offline Analytics on Cassandra in eBay - DongQian Liu, eBay

eBay is one of the largest e-commerce companies in the world, providing C2C and B2C sales services via the Internet. We use Cassandra to store large tables for online queries. To reduce the load on Cassandra, we run offline analytics on Cassandra tables: we dump SSTables to HDFS and transform them into Hadoop file formats. In this session, we describe how we built a high-performance, cross-datacenter Cassandra cluster for online queries. For offline analytics, we describe how we implemented a splittable input format for SSTables and how we transform them into Hadoop file formats. We also describe how we use the bulk loader tool to load data from Hadoop into Cassandra quickly.

Speakers

Dongqian Liu

eBay

Tuesday May 16, 2017 2:30pm - 3:20pm EDT
Windsor

Cassandra

Experience Level Beginner

3:30pm EDT

Real-World Tales of Repair with Apache Cassandra - Alexander Dejanovski, TheLastPickle

Distributed databases inevitably have to deal with entropy. Within Apache Cassandra, the Anti-Entropy process initiated via CLI tools is the way of ensuring consistency of data on disk. Over the many years of the Apache Cassandra project it has also been one of the biggest operator pain points. Without a solid repair process in place, you have no guarantee that deleted data will not come back to life, or that data is fully distributed to replicas when nodes fail.

In this talk Alexander Dejanovski, Consultant at The Last Pickle, will explain how Anti-Entropy works and why it should be run on your cluster. He will discuss the different types of repair and their effect on data consistency. He will also introduce tools such as Cassandra Reaper and the range repair script to manage scheduling and running repairs in the most efficient way.

Speakers

Alexander Dejanovski

Apache Cassandra Consultant, The Last Pickle

Consultant Apache Cassandra @TheLastPickle. Alexander has been working as a software developer since 1998, mainly for Chronopost. There he has been leading the effort to build a Cassandra-based architecture and migrate critical services to it from traditional RDBMS. He is involved in...

Tuesday May 16, 2017 3:30pm - 4:20pm EDT
Windsor

Cassandra

Experience Level Beginner

10:15am EDT

Using Apache Beam for Batch, Streaming, and Everything in Between - Dan Halperin, Google

Apache Beam is a unified programming model capable of expressing a wide variety of both traditional batch and complex streaming use cases. By neatly separating properties of the data from run-time characteristics, Beam enables users to easily tune requirements around completeness and latency and run the same pipeline across multiple runtime environments. In addition, Beam's model enables cutting-edge optimizations, like dynamic work rebalancing and autoscaling, giving those runtimes the ability to be highly efficient.

This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in its powerful programming model. We'll include detailed, concrete examples of how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios.

Speakers

Daniel Halperin

Google

Dan Halperin is a PMC member of Apache Beam. He has worked on Beam and Google Cloud Dataflow for 2 years. Previously, he was the director of research for scalable data analytics at the University of Washington eScience Institute, where he worked on scientific big data problems in...

Wednesday May 17, 2017 10:15am - 11:05am EDT
Balmoral

Beam/Zeppelin

Experience Level Beginner

11:15am EDT

Fast Cars, Big Data - How Apache Can Help Formula 1 - Carol McDonald, MapR Technologies

Modern race cars produce a lot of data, all of it in real time. In this presentation I will show you how data can be generated and used by various applications in the car, on the track, or at team headquarters. The demonstration will show how to move data using messaging systems like Apache Kafka, process the data using Apache Spark and Flink, and use various storage techniques: distributed file system, HBase. This presentation is a great opportunity to see how to build a "near real time big data application" with Apache projects. The code from this talk will be made available as open source.

Speakers

Carol McDonald

Solutions Architect, MapR

Carol McDonald is a solutions architect at MapR focusing on big data, Apache HBase, Apache Drill, Apache Spark, and machine learning in healthcare, finance, and telecom. Previously, Carol worked as a Technology Evangelist for Sun, an architect/developer on: a large health information...

Wednesday May 17, 2017 11:15am - 12:05pm EDT
Trianon

Use Cases

Experience Level Beginner

4:40pm EDT

Leveraging the GPU on Spark - Tobias Polzer, QAware GmbH

GPUs are a great source of computing power, but they are not yet accessible from Apache Spark. We present an RDD implementation we've open sourced to leverage GPU computing power with Spark. We'll share the experience we gained along the way implementing the RDD and building a real-world application that uses it: What's the best way to bridge from Java to GPU code (OpenCL or CUDA)? From an architectural perspective, what's the best way to integrate a GPU processing facility into Spark? How much faster are typical Spark actions when using the GPU? Which Spark actions are best processed on a GPU? Topics covered: Java-to-GPU bridges, the best way to integrate GPU processing into Spark, and performance evaluation.

Speakers

Tobias Polzer

Master's student, Friedrich-Alexander University Erlangen-Nuremberg/QAware

Wednesday May 17, 2017 4:40pm - 5:30pm EDT
Trianon

Deep Learning/GPU

Experience Level Beginner

5:40pm EDT

TensorFlow in the Wild: From Cucumber Farmer to Global Insurance Firm - Kazunori Sato, Google

One of the largest global insurance firms recently introduced TensorFlow, the open source library from Google for machine intelligence, to classify car drivers who have a high likelihood of major accidents using a deep neural network. The model provides 2x higher accuracy compared with the existing random forest model, which gives the firm the possibility of lowering the insurance price significantly. Also, a cucumber farmer in Japan has been using TensorFlow to build a hand-made sorter that classifies cucumbers into 9 classes based on their length, shape, and color. In this session, we'll look at how TensorFlow democratizes the power of machine intelligence and is changing the world, with many different real-world use cases of the technology.

Speakers

Kaz Sato

Developer Advocate, Google

Kaz Sato is a Staff Developer Advocate at Google Cloud for machine learning and AI products, such as TensorFlow, Cloud AI and BigQuery. Kaz has been invited as a speaker at major events including Google Cloud Next, Google I/O, NVIDIA GTC, etc. Also, authoring many GCP blog posts...

Wednesday May 17, 2017 5:40pm - 6:30pm EDT
Biscayne

Deep Learning/GPU

Experience Level Beginner

10:00am EDT

Hadoop Cluster Governance - Vimal Sharma, Hortonworks

Apache Atlas is a one-stop solution for data governance and metadata management on enterprise Hadoop clusters. Atlas has a scalable and extensible architecture which can plug into many Hadoop components to manage their metadata in a central repository. Vimal Sharma will review the challenges associated with managing large datasets on Hadoop clusters and demonstrate how Atlas addresses them. Vimal will focus on the cross-component lineage tracking capability of Apache Atlas and will also discuss its upcoming features and roadmap.

Speakers

Vimal Sharma

Software Engineer, Hortonworks

Vimal Sharma is an Apache Atlas PMC member and committer at Hortonworks. Vimal is highly passionate about the Hadoop stack and has previously worked on scaling backend systems at WalmartLabs using Spark and Kafka. Vimal was a speaker at ApacheCon BigData 2017 where he spoke on Metadata governance...

Atlas ApacheCon pdf

Thursday May 18, 2017 10:00am - 10:50am EDT
Balmoral

Hadoop

Experience Level Beginner

2:40pm EDT

Lessons Learned with Spark & Cassandra - Matthias Niehoff, codecentric AG

We built multiple applications based on Apache Cassandra and Apache Spark. During the project we encountered a number of challenges and problems with both technologies, as well as with the Spark-Cassandra-Connector. In this talk we want to outline a few of those problems and the actions we took to solve them. Furthermore, we want to share best practices which turned out to be useful in our projects. Topics include, but are not limited to:

Cassandra Bucketing
Spark Partitioning
Efficient Queries
Spark Join With Cassandra Table
Spark Data Locality

Speakers

Matthias Niehoff

IT Consultant, codecentric AG

Matthias works as an IT-Consultant at codecentric AG in Germany. His focus is on big data & streaming applications with Apache Cassandra & Apache Spark. Yet he does not lose track of other tools in the area of big data. Matthias shares his experiences at conferences, meetups and...

Thursday May 18, 2017 2:40pm - 3:30pm EDT
Biscayne

Ops

Experience Level Beginner

4:40pm EDT

Advertising on Google and Traffic Experimentation Platform in eBay - Martin Zhang, eBay

eBay is one of the largest e-commerce companies in the world, providing C2C and B2C sales services via the Internet. eBay has more than 400 million users (160 million active) and more than 1 billion items for sale on the eBay site. We built an advertising and experimentation platform for search networks, such as Google and Bing, based on Hadoop, Spark, Kafka, etc. In this session, we introduce our advertising and experimentation platform and explain how the experimentation platform supports A/B testing and running different science models.

Speakers

Yu Zhang

eBay

Thursday May 18, 2017 4:40pm - 5:30pm EDT
Trianon

Use Cases

Experience Level Beginner