
Cassandra.Link

The best knowledge base on Apache Cassandra®

Helping platform leaders, architects, engineers, and operators build scalable real time data platforms.

7/9/2021

Reading time: 6 min

Next-Gen Data Movement Platform at PayPal

by Jay Sen

…using Apache Airflow scheduler and Apache Gobblin — a data integration framework open-sourced by LinkedIn.

As PayPal grows beyond 300 million users, we generate lots of data, both on our online (site) and offline (analytics) storage platforms. Data movement among those systems plays a critical role in enabling many of PayPal’s business use cases.

PayPal hosts a large installation of Hadoop and other analytics systems, holding hundreds of petabytes of data.

PayPal is one of the few companies in the software industry where almost every major type of storage system is in use, from traditional relational databases like MySQL, to analytics platforms like Hadoop, to specialized data stores like Aerospike, Elasticsearch, and Kafka.

Data that moves is alive and valuable. At rest, data is dead.

Data constantly needs to move around and get processed, analyzed, and organized to realize its value. Moreover, data producers and data consumers are not always in the same org: producers optimize for writes, while consumers optimize for reads. This inherently makes it challenging to enable data-driven decisions at a rapid pace. Data is tiny yet very critical when it is created, but by the time it needs to be read and analyzed, it has usually become a big data problem. This dichotomy in the data world is bridged by data movement platforms and teams.

A decade ago, data movement at PayPal was seen as an operations problem, where system admins or platform service providers built tools and utilities to facilitate data movement in and out of the systems. This blog on the Evolution of data movement platforms provides a good glimpse of how data movement systems evolved and where they are headed.

At PayPal, due to many legacy and one-off point solutions for data movement, we ended up with a complex and unmanageable ecosystem that added to the run-the-business (RTB) cost. Moreover, supporting new sources and targets cost more and hindered quick solutions for new business initiatives. We needed a data movement platform that could scale and cover a wide variety of storage ecosystems. The following diagram roughly depicts what we ended up with:

No product roadmap ever planned for this :) and yet it inevitably happened as part of our growth journey.

Also, as the amount of data being produced increased and consumers demanded ever more real-time experiences, we needed a much faster (throughput-wise), more efficient, and more reliable data movement platform to serve downstream business use cases. So we embarked on the journey to build the next generation of PayPal's data movement platform.

“If a solution is built for the most complex scenario, consider it built for the easier ones” — Enterprise data platform leadership.

RADD, the Risk Analytical Dynamic Datasets pipeline, is one of our most challenging and business-critical use cases: it enables the PayPal risk platform to make decisions on payment transactions. That made it a perfect candidate to prove the new platform on.

RADD data flow requirements:

[Diagram: data flow path of the business use case]

For us, building on top of open-source technology fits our beliefs: "don't reinvent the wheel" and "contribute back to the community". When we evaluated OSS frameworks against our requirements and proof-of-concept results, we felt that Apache Gobblin offered the most features and gave us the flexibility and room to build our next-gen data movement platform.
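For readers new to Gobblin, its unit of work is a job: a set of configuration properties that wires a source/extractor to optional converters, a writer, and a publisher. Below is a minimal, illustrative sketch of such a job definition being registered through a job-management API. The class names and the endpoint are hypothetical placeholders, not PayPal's actual setup (the CRUD APIs we built around Gobblin are described later in this post).

```python
# Illustrative sketch of a Gobblin job definition. The property keys are standard
# Gobblin job configuration; the class names and the endpoint are hypothetical
# placeholders, not PayPal's actual setup.
import requests

JOB_API = "https://gobblin-jobserver.example.internal/v1/jobs"  # hypothetical

job_definition = {
    "job.name": "radd_txn_movement",
    "job.group": "radd",
    "source.class": "com.example.gobblin.RaddTransactionSource",            # placeholder
    "converter.classes": "com.example.gobblin.MaskPiiConverter",            # placeholder
    "writer.builder.class": "com.example.gobblin.HdfsAvroWriterBuilder",    # placeholder
    "data.publisher.type": "com.example.gobblin.DatePartitionedPublisher",  # placeholder
}

# Register the job definition so it can later be started and stopped on demand.
resp = requests.post(JOB_API, json=job_definition, timeout=30)
resp.raise_for_status()
```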

An enterprise data movement platform requires many more components than just a reader and a writer. The following diagram shows what we built and how the components interact to provide an end-to-end solution, at a high level.

[Diagram: End-to-End Component Interaction for the Data Movement Platform]

Onboarding Service is a set of REST APIs built using PayPal's internal Java Spring framework (we call it Raptor). The onboarding service orchestrates the data pipeline. It interacts with various other services, such as the schema registry, Gobblin, and the Airflow APIs, to create an end-to-end data pipeline. An onboarding API call results in a DAG and its configs being deployed on Airflow for execution. The DAG can be triggered by the mechanism chosen at onboarding time, such as an upstream handshake, a cron-based interval, or an ad-hoc request. During every run, the DAG also fetches metadata so that it operates on the latest changes. A set of APIs to manage data pipeline lifecycles is also provided. Here is what the Swagger spec looks like.

DAG Service: The DAG service creates Airflow DAGs according to the requested configuration and template. Since Airflow does not provide a stable API interface for managing Airflow DAGs, we built our own as part of this service. The DAG it builds is primarily responsible for incorporating all the application-specific logic: allowed deviation per dataset, type of movement (ad-hoc or rollback), and so on. Once the DAG is deployed, Airflow handles the execution.

Apache Airflow: Airflow is a well-known workflow management and execution platform. We use Airflow to define and execute the data pipeline DAG. Airflow provides the runtime orchestration layer for the end-to-end movement. It also gives us the ability to embed data processing and handshaking capabilities, and it makes operational management of the data pipeline easier. Here is what the DAG looks like for one of the datasets.

Apache Gobblin is a highly scalable, distributed data integration framework that simplifies common aspects of data movement and integration and supports both streaming and batch movement. We use it as the core data mover, controlled and managed by Airflow. To achieve this architecture, we developed new components within Gobblin for better service integration: a job server to start and stop jobs from Airflow, CRUD APIs so the onboarding service can manage jobs, job metadata persistence in MySQL for better job management, SignalFX integration, and more. These additions make Apache Gobblin more generic for enterprise use cases, and we plan to contribute them back (ref: Gobblin Improvement Proposal 4).

With security being a top priority at PayPal, all platform components communicate over HTTPS (TLS 1.x) and keep data encrypted at rest. This blog on Secure & encrypted data movement across security zones describes how we achieved it.
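Putting these pieces together, here is a minimal, illustrative sketch of the kind of per-dataset DAG the DAG service might template out: it refreshes pipeline metadata at runtime and then asks the Gobblin job server to run the movement job, with every call going over HTTPS. The endpoints, dataset name, payload fields, and CA bundle path are hypothetical placeholders, not PayPal's actual APIs.

```python
# Illustrative sketch only: endpoints, dataset names, payload fields, and paths
# below are hypothetical placeholders, not PayPal's actual services.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

METADATA_URL = "https://metadata-svc.example.internal/v1/datasets/radd_txn"  # hypothetical
JOBSERVER_URL = "https://gobblin-jobserver.example.internal/v1/jobs"         # hypothetical
CA_BUNDLE = "/etc/pki/internal-ca-bundle.pem"  # hypothetical; all calls go over TLS


def fetch_latest_metadata(ti):
    """Fetch dataset metadata so the run operates on the latest configuration."""
    resp = requests.get(METADATA_URL, verify=CA_BUNDLE, timeout=30)
    resp.raise_for_status()
    ti.xcom_push(key="dataset_config", value=resp.json())


def start_gobblin_job(ti):
    """Ask the Gobblin job server to execute the movement job; Gobblin does not
    know why the data moves, it simply moves it when Airflow asks."""
    config = ti.xcom_pull(key="dataset_config")
    resp = requests.post(
        JOBSERVER_URL,
        json={"jobName": "radd_txn_movement", "config": config},
        verify=CA_BUNDLE,
        timeout=30,
    )
    resp.raise_for_status()


with DAG(
    dag_id="radd_txn_movement",            # one templated DAG per onboarded dataset
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",           # or an upstream handshake / ad-hoc trigger
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    get_metadata = PythonOperator(
        task_id="fetch_latest_metadata", python_callable=fetch_latest_metadata
    )
    run_gobblin = PythonOperator(
        task_id="start_gobblin_job", python_callable=start_gobblin_job
    )

    get_metadata >> run_gobblin
```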

Overall, many components come together here to form an end-to-end solution, and things can quickly get complex if we don't define clear roles, responsibilities, and design principles. We followed many architectural principles, but the set of guidelines below is the one that really works for us. These made implementation very clear for developers, expediting delivery while minimizing complexity:

- Each component acts as a micro-service that interacts over REST APIs, operating as a service provider.
- Each component operates on the latest configuration at runtime.
- Each component's responsibilities are clearly defined, with boundaries: Airflow does not make any data-movement-specific decisions, and Gobblin has no visibility into why it moved the data; it simply moves it when asked by Airflow.
- Metadata is centralized, and changes are directly visible to all components.
- All components provide visibility via metric store integration (InfluxDB).
- All components should support rolling deployments to incur zero downtime while deploying changes.

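As a small, illustrative sketch of the second and fifth guidelines, a component might refresh its configuration at the start of every run and publish run metrics to InfluxDB. The config endpoint, measurement name, and tags below are hypothetical placeholders, not the platform's actual integration.

```python
# Hedged sketch of two of the guidelines above: operate on the latest configuration
# at runtime, and publish visibility metrics to InfluxDB. Endpoints, measurement
# names, and tags are hypothetical placeholders.
import time

import requests
from influxdb import InfluxDBClient  # InfluxDB 1.x Python client

CONFIG_URL = "https://metadata-svc.example.internal/v1/pipelines/radd_txn/config"  # hypothetical

metrics = InfluxDBClient(host="influxdb.example.internal", port=8086, database="data_movement")


def move_data(config):
    """Placeholder for the component's actual work (e.g., one Gobblin job run)."""
    return 0


def run_once():
    # Guideline: always fetch and act on the latest configuration at runtime.
    config = requests.get(CONFIG_URL, timeout=30).json()

    started = time.time()
    rows_moved = move_data(config)

    # Guideline: every component publishes its own visibility metrics.
    metrics.write_points([{
        "measurement": "pipeline_run",
        "tags": {"dataset": config.get("dataset", "unknown"), "component": "mover"},
        "fields": {"rows_moved": rows_moved, "duration_s": time.time() - started},
    }])
```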
The journey toward #OnePlatform4PYPL has started, and we are going to keep solidifying the platform and contributing back to the open-source community. Next up is supporting PayPal's cloud journey… Exciting!

Related Articles

GitHub - andreia-negreira/Data_streaming_project: Data streaming project with robust end-to-end pipeline, combining tools such as Airflow, Kafka, Spark, Cassandra and containerized solution to easy deployment.

andreia-negreira

12/2/2023
