Hortonworks contributes to a number of Apache projects. When we started, we depended on our many experienced Apache community members to train their fellow Hortonworkers in the Apache Way. As we grew, however, we found that training "by osmosis" was no longer sufficient. So we have instituted training for our teams in what Apache is, how it works, their responsibilities as part of Apache and how those mesh with their responsibilities as Hortonworkers, along with a practical list of best practices and pitfalls to avoid. This talk will share some thoughts on the need for this training, give an overview of the content, discuss the results we have seen, and describe how we are now working to roll this out beyond engineering into the rest of the company.
There is no shortage of reasons why an open source project can stagnate. Yet despite confronting many of these challenges, Apache CouchDB has been resilient in the nearly 10 years since becoming an Apache Software Foundation project, to the point where today, its codebase and community are about as strong as they’ve ever been. The constant thread throughout the life of the project has been the steady support of the ASF and IBM.
Adam Kocoloski, CTO of IBM Watson Data Platform, co-founder of Cloudant and PMC member for CouchDB, shares his perspective on what IBM finds so valuable about the Apache Software Foundation, through the lens of projects like CouchDB, Apache Spark, Apache Edgent and Apache OpenWhisk.
The importance of digital psychometrics – that is, the assessment of psychological characteristics via digital footprints – was highlighted recently in the context of Trump’s unexpected victory in the U.S. presidential election. According to international media reports, Trump’s campaign used detailed psychological profiles of 220 million US citizens to target them with more than 175,000 different versions of personalized ads that catered to their values and preferences. In line with the public debate around the effectiveness as well as the broader implications of such predictive technologies, this talk focuses on the following three questions: (1) How does digital psychometrics work? (2) What are the potential benefits and dangers of digital psychometrics? (3) And finally, what does the future of digital psychometrics hold and how will it affect technology?
Real-time data insight is becoming more important for trend capturing and just-in-time decision making. As one of the world’s largest and most vibrant marketplaces, eBay relies on real-time data analysis across multiple domains to run its business, such as user information protection, promotion prediction, and site performance detection and monitoring.
In this session Ken will introduce a new zero-latency streaming OLAP engine built on Apache Kylin and explain how it serves eBay's real-time data analysis business. The new Kylin streaming engine uses column-based storage and indexes as well as in-memory query techniques to make real-time data visible with no latency. The new streaming engine will also provide exactly-once delivery semantics to ensure data quality when used together with Apache Kafka.
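While the talk centers on the new streaming engine, applications typically reach Kylin through its standard REST SQL endpoint, and a minimal sketch of that gives a feel for the serving side. The host, credentials, project, and table names below come from Kylin's sample-cube setup and are assumptions for illustration, not details from the talk.

```python
import requests

# Minimal sketch of querying Kylin over its REST API.
# Host, credentials, project, and table are assumptions based on
# Kylin's sample cube; adjust for a real deployment.
KYLIN_API = "http://kylin-host:7070/kylin/api"
AUTH = ("ADMIN", "KYLIN")  # default sandbox credentials

payload = {
    "sql": "SELECT part_dt, COUNT(*) AS cnt "
           "FROM kylin_sales GROUP BY part_dt",
    "project": "learn_kylin",
    "limit": 50,
    "acceptPartial": False,
}
resp = requests.post(KYLIN_API + "/query", json=payload, auth=AUTH)
resp.raise_for_status()
for row in resp.json()["results"]:
    print(row)
```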
Most streaming engines focus on performing computations on a stream: for example, one can map a stream to run a function on each record, reduce it to aggregate events by time, etc. However, as we worked with users, we found that virtually no use case involved only performing computations on a stream. Instead, stream processing happens as part of a larger application, which we’ll call a continuous application.
Online machine learning and serving real-time data are examples showing that streaming computations are part of larger applications that include serving, storage, or batch jobs. Unfortunately, in current systems, streaming computations run on their own, in an engine focused just on streaming. This leaves developers responsible for the complex tasks of interacting with external systems (e.g., managing transactions) and making their results consistent with the rest of the application (e.g., batch jobs). This is what we’d like to solve with continuous applications.
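As a hedged illustration of the shape such a pipeline takes, here is a minimal Structured Streaming sketch in PySpark: a windowed aggregation over Spark's built-in rate source, continuously written to the console. The source and sink choices are assumptions for brevity; a real continuous application would read from a system like Kafka and publish to serving storage.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("continuous-app").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; it stands in
# here for a real event stream such as a Kafka topic.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Aggregate events into 10-second windows -- the "computation on a stream".
counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

# Continuously publish results; a real application would target a serving
# store instead of the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```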
Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platforms as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work perfectly together. In this presentation, we'll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. This talk covers the containerized CI framework, the technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.
Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding, and recommendation engines. Apache MXNet is a deep learning framework that allows you to define, train, and deploy deep neural networks on a wide array of devices, from cloud infrastructure to mobile devices. It is fast and highly scalable, and supports a flexible programming model and multiple languages. This session offers an introduction to Apache MXNet, its benefits, and how to quickly get started using it.
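As a taste of the programming model, here is a minimal sketch using MXNet's Gluon API: define a small network, then run one training step on random data standing in for a real dataset.

```python
import mxnet as mx
from mxnet import nd, autograd, gluon

# Define a small multilayer perceptron with the Gluon API.
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(64, activation="relu"))
    net.add(gluon.nn.Dense(10))
net.initialize(mx.init.Xavier())

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

# One training step on a random batch (a stand-in for real data).
data = nd.random.uniform(shape=(32, 784))
label = nd.random.randint(0, 10, shape=(32,))
with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(batch_size=32)
```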
R is the de facto standard for statistics and data analysis. In this talk, we introduce R4ML, a new open-source R package from IBM. R4ML provides a bridge between R and Apache SystemML, allowing R scripts to invoke custom algorithms developed in SystemML's R-like domain-specific language. This capability also provides a bridge to the algorithm scripts that ship with Apache SystemML, effectively adding a new library of prebuilt scalable algorithms for R on Apache Spark. R4ML integrates seamlessly with SparkR, so data scientists can use the best features of SparkR and SystemML together in the same script. In addition, the R4ML package provides a number of useful new R functions that simplify common data cleaning and statistical analysis tasks.
Our talk will begin with an overview of the R4ML package, its API, the canned algorithms it supports, and its integration with Spark and SystemML. We will walk through a small example of creating a custom algorithm and give a demo. We will share our experiences using R4ML technology with IBM clients. The talk will conclude with pointers on how the audience can try out R4ML and a discussion of potential areas of community collaboration.
Ready to dip your toe into data science? Yes? But where and how do you start? Well, we have an answer – Notebooks and PixieDust! PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like auto-visualization of Spark DataFrames, real-time Spark job progress monitoring directly from the notebook, seamless integration with cloud services, and automated local installation of Python and Scala kernels running with Spark. And if you prefer working with a Scala notebook – no problem! PixieDust can also run on a Scala kernel – imagine being able to visualize your favorite Python chart engines from a Scala notebook!
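To give a feel for how little code is involved, here is a minimal sketch, assuming a Jupyter notebook with PixieDust and PySpark installed; the toy data is a placeholder.

```python
import pixiedust
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A toy DataFrame standing in for real data.
df = spark.createDataFrame(
    [("2017-01", 120), ("2017-02", 90), ("2017-03", 150)],
    ["month", "sales"])

# display() is provided by PixieDust; in a notebook cell it renders an
# interactive auto-visualization of the DataFrame (table, charts, and more).
display(df)
```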
Cloud orchestration is no longer a wild-west of proprietary solutions. The enterprise and NFV industries are moving towards standards compliance with efforts focused on the OASIS TOSCA standard, which offers a policy-driven YAML-based language to design flexible and extensible cloud topologies, comprising compute nodes (VMs and containers), VNFs (Virtual Network Functions), as well as user-defined node types. This talk will introduce the Apache AriaTosca project, a compliant TOSCA parser and orchestrator. As well as being a fully functional orchestrator in itself, AriaTosca serves as a platform and SDK for building TOSCA-based solutions in Apache and beyond.
Most big data processing frameworks are JVM based. A big gap in such systems is efficiently mapping the software layers/patterns to the underlying hardware, especially for newer technologies like Non-Volatile Memory (NVM), and removing performance bottlenecks. The Apache Mnemonic project presents abstract models that help resolve memory bottlenecks, e.g. SerDe/marshalling overhead, Garbage Collection (GC) performance issues, memory-storage mapping, massive object caching, object sharing across clusters, and kernel caching issues. In this talk we present Mnemonic, its architecture, its programming models, and their applications (including integrations with Apache Hadoop and Apache Spark).