{"componentChunkName":"component---src-templates-article-single-page-js","path":"/post/github-embulk-embulk-embulk-pluggable-bulk-data-loader","result":{"pageContext":{"obj_id":"a4267048-a4b0-5ef8-a912-cba270ec62f0","node":{"content":"<div class=\"application-main\" data-commit-hovercards-enabled=\"\" data-discussion-hovercards-enabled=\"\" data-issue-and-pr-hovercards-enabled=\"\"><main id=\"js-repo-pjax-container\"><div id=\"repository-container-header\" class=\"pt-3 hide-full-screen c6\" data-turbo-replace=\"\"><div class=\"d-flex flex-wrap flex-justify-end mb-3 px-3 px-md-4 px-lg-5 c4\"><p> / <strong itemprop=\"name\" class=\"mr-2 flex-self-stretch\"><a data-pjax=\"#repo-content-pjax-container\" data-turbo-frame=\"repo-content-turbo-frame\" href=\"https://github.com/embulk/embulk\">embulk</a></strong> Public</p></div><div class=\"d-block d-md-none mb-2 px-3 px-md-4 px-lg-5\" id=\"responsive-meta-container\" data-turbo-replace=\"\"><p class=\"f4 mb-3\">Embulk: Pluggable Bulk Data Loader.</p><p><a title=\"https://www.embulk.org/\" role=\"link\" target=\"_blank\" class=\"text-bold\" rel=\"noopener noreferrer\" href=\"https://www.embulk.org/\">www.embulk.org/</a></p><h3 class=\"sr-only\">License</h3><details class=\"details-reset details-overlay details-overlay-dark lh-default color-fg-default d-inline mb-2\"><summary class=\"Link--muted\"> Apache-2.0 and 4 other licenses found</summary><p>\n</p><h3 class=\"Box-title\">Licenses found</h3>\n<a class=\"Link--primary no-underline\" aria-label=\"Apache-2.0 license\" href=\"https://github.com/embulk/embulk/blob/master/LICENSE\">\n<div class=\"Box-row 
Box-row--hover-gray border-top rounded-0\"><p>Apache-2.0</p>LICENSE</div>\n</a> <a class=\"Link--primary no-underline\" aria-label=\"Apache-2.0 license\" href=\"https://github.com/embulk/embulk/blob/master/LICENSE-jctools\">\n<div class=\"Box-row Box-row--hover-gray border-top rounded-0\"><p>Apache-2.0</p>LICENSE-jctools</div>\n</a> <a class=\"Link--primary no-underline\" aria-label=\"Unknown license\" href=\"https://github.com/embulk/embulk/blob/master/LICENSE-logback\">\n<div class=\"Box-row Box-row--hover-gray border-top rounded-0\"><p>Unknown</p>LICENSE-logback</div>\n</a> <a class=\"Link--primary no-underline\" aria-label=\"MIT license\" href=\"https://github.com/embulk/embulk/blob/master/LICENSE-slf4j\">\n<div class=\"Box-row Box-row--hover-gray border-top rounded-0\"><p>MIT</p>LICENSE-slf4j</div>\n</a> <a class=\"Link--primary no-underline\" aria-label=\"MIT license\" href=\"https://github.com/embulk/embulk/blob/master/LICENSE-slf4j-netty\">\n<div class=\"Box-row Box-row--hover-gray border-top rounded-0\"><p>MIT</p>LICENSE-slf4j-netty</div>\n</a></details><p><a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/embulk/embulk/stargazers\"> 1.7k stars</a> <a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/embulk/embulk/forks\"> 205 forks</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/embulk/embulk/activity\"> Activity</a></p></div></div>\n</main></div>","id":"a4267048-a4b0-5ef8-a912-cba270ec62f0","title":"GitHub - embulk/embulk: Embulk: Pluggable Bulk Data Loader.","origin_url":"https://github.com/embulk/embulk","url":"https://github.com/embulk/embulk","wallabag_created_at":"2023-12-01T02:40:41+00:00","published_at":null,"published_by":"['embulk']","reading_time":null,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/710726da5afbe5bac1b38fd385c4e7c1e62d1ee9a922a947bbc3f958828f9974/embulk/embulk","tags":["hive","elasticsearch","cassandra","data.processing","elastic","migration","mysql","etl","redis"],"description":"Embulk: Pluggable Bulk Data Loader.www.embulk.org/License Apache-2.0 and 4 other licenses found\nLicenses found\n\nApache-2.0LICENSE\n \nA..."},"relatedArticles":[{"content":"<p dir=\"auto\">Visual Flow is an ETL tool designed for effective data 
manipulation via a convenient and user-friendly interface. The tool has the following capabilities:</p><ul dir=\"auto\"><li>Can integrate data from heterogeneous sources:\n<ul dir=\"auto\"><li>AWS S3</li>\n<li>Cassandra</li>\n<li>Click House</li>\n<li>DB2</li>\n<li>Dataframe (for reading)</li>\n<li>Elastic Search</li>\n<li>IBM COS</li>\n<li>Kafka</li>\n<li>Local File</li>\n<li>MS SQL</li>\n<li>Mongo</li>\n<li>MySQL/Maria</li>\n<li>Oracle</li>\n<li>PostgreSQL</li>\n<li>Redis</li>\n<li>Redshift</li>\n</ul></li>\n<li>Leverage direct connectivity to enterprise applications as sources and targets</li>\n<li>Perform data processing and transformation</li>\n<li>Run custom code</li>\n<li>Leverage metadata for analysis and maintenance</li>\n</ul><p dir=\"auto\">Visual Flow application is divided into the following repositories:</p><p dir=\"auto\"><a href=\"https://github.com/ibagroup-eu/Visual-Flow/blob/main/CONTRIBUTING.md\">Check the official guide</a>.</p><p dir=\"auto\">Visual Flow is open-source software licensed under the <a href=\"https://github.com/ibagroup-eu/Visual-Flow/blob/main/LICENSE\">Apache-2.0 license</a>.</p>","id":"86395185-03f1-53f4-94a4-e3a2d2d45779","title":"GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository","origin_url":"https://github.com/ibagroup-eu/Visual-Flow","url":"https://github.com/ibagroup-eu/Visual-Flow","wallabag_created_at":"2024-12-02T13:34:31+00:00","published_at":null,"published_by":"['ibagroup-eu']","reading_time":null,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/9187fdecad3a37939c1971bcdec19ffed4090307ee508b009f47c7bcd49a7f8d/ibagroup-eu/Visual-Flow","tags":["mongo","nocode","elasticsearch","open.source","cassandra","data.pipeline","elastic","aws.s3","etl","low.code","postgres"],"description":"Visual Flow is an ETL tool designed for effective data manipulation via a convenient and user-friendly interface. 
The tool has the following capabilities:Can integrate data from heterogeneous sources:\nA..."},{"content":"<p dir=\"auto\"><a href=\"https://github.com/datastax/cql-proxy/actions/workflows/test.yml\"><img src=\"https://github.com/datastax/cql-proxy/actions/workflows/test.yml/badge.svg\" alt=\"GitHub Action\" class=\"c13\" referrerpolicy=\"no-referrer\" /></a> <a href=\"https://goreportcard.com/report/github.com/datastax/cql-proxy\" rel=\"nofollow\"><img src=\"https://camo.githubusercontent.com/e1c32ff51117d37ba38fd853bb54c63214d25a3a367d0de90a00a03124924acb/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f64617461737461782f63716c2d70726f7879\" alt=\"Go Report Card\" data-canonical-src=\"https://goreportcard.com/badge/github.com/datastax/cql-proxy\" class=\"c13\" referrerpolicy=\"no-referrer\" /></a></p><p dir=\"auto\"><a target=\"_blank\" rel=\"noopener noreferrer\" href=\"https://github.com/datastax/cql-proxy/blob/main/cql-proxy.png\"><img src=\"https://github.com/datastax/cql-proxy/raw/main/cql-proxy.png\" alt=\"cql-proxy\" class=\"c13\" referrerpolicy=\"no-referrer\" /></a></p><p dir=\"auto\"><code>cql-proxy</code> is designed to forward your application's CQL traffic to an appropriate database service. It listens on a local address and securely forwards that traffic.</p><p dir=\"auto\">The <code>cql-proxy</code> sidecar enables unsupported CQL drivers to work with <a href=\"https://astra.datastax.com/\" rel=\"nofollow\">DataStax Astra</a>. 
These drivers include both legacy DataStax <a href=\"https://docs.datastax.com/en/driver-matrix/doc/driver_matrix/common/driverMatrix.html\" rel=\"nofollow\">drivers</a> and community-maintained CQL drivers, such as the <a href=\"https://github.com/gocql/gocql\">gocql</a> driver and the <a href=\"https://github.com/scylladb/scylla-rust-driver\">rust-driver</a>.</p><p dir=\"auto\"><code>cql-proxy</code> also enables applications that are currently using <a href=\"https://cassandra.apache.org/\" rel=\"nofollow\">Apache Cassandra</a> or <a href=\"https://www.datastax.com/products/datastax-enterprise\" rel=\"nofollow\">DataStax Enterprise (DSE)</a> to use Astra without requiring any code changes. Your application just needs to be configured to use the proxy.</p><p dir=\"auto\">If you're building a new application using DataStax <a href=\"https://docs.datastax.com/en/driver-matrix/doc/driver_matrix/common/driverMatrix.html\" rel=\"nofollow\">drivers</a>, <code>cql-proxy</code> is not required, as the drivers can communicate directly with Astra. DataStax drivers have excellent support for Astra out of the box and are well documented in the <a href=\"https://docs.datastax.com/en/astra/docs/connecting-to-astra-databases-using-datastax-drivers.html\" rel=\"nofollow\">driver guide</a>.</p><p dir=\"auto\">Use the <code>-h</code> or <code>--help</code> flag to display a listing of all flags and their corresponding descriptions and environment variables (shown below as items starting with <code>$</code>):</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"$ ./cql-proxy -h Usage: cql-proxy Flags: -h, --help Show context-sensitive help. -b, --astra-bundle=STRING Path to secure connect bundle for an Astra database. Requires '--username' and '--password'. Ignored if using the token or contact points option ($ASTRA_BUNDLE). 
-t, --astra-token=STRING Token used to authenticate to an Astra database. Requires '--astra-database-id'. Ignored if using the bundle path or contact points option ($ASTRA_TOKEN). -i, --astra-database-id=STRING Database ID of the Astra database. Requires '--astra-token' ($ASTRA_DATABASE_ID) --astra-api-url=&quot;https://api.astra.datastax.com&quot; URL for the Astra API ($ASTRA_API_URL) --astra-timeout=10s Timeout for contacting Astra when retrieving the bundle and metadata ($ASTRA_TIMEOUT) -c, --contact-points=CONTACT-POINTS,... Contact points for cluster. Ignored if using the bundle path or token option ($CONTACT_POINTS). -u, --username=STRING Username to use for authentication ($USERNAME) -p, --password=STRING Password to use for authentication ($PASSWORD) -r, --port=9042 Default port to use when connecting to cluster ($PORT) -n, --protocol-version=&quot;v4&quot; Initial protocol version to use when connecting to the backend cluster (default: v4, options: v3, v4, v5, DSEv1, DSEv2) ($PROTOCOL_VERSION) -m, --max-protocol-version=&quot;v4&quot; Max protocol version supported by the backend cluster (default: v4, options: v3, v4, v5, DSEv1, DSEv2) ($MAX_PROTOCOL_VERSION) -a, --bind=&quot;:9042&quot; Address to use to bind server ($BIND) -f, --config=CONFIG YAML configuration file ($CONFIG_FILE) --debug Show debug logging ($DEBUG) --health-check Enable liveness and readiness checks ($HEALTH_CHECK) --http-bind=&quot;:8000&quot; Address to use to bind HTTP server used for health checks ($HTTP_BIND) --heartbeat-interval=30s Interval between performing heartbeats to the cluster ($HEARTBEAT_INTERVAL) --idle-timeout=60s Duration between successful heartbeats before a connection to the cluster is considered unresponsive and closed ($IDLE_TIMEOUT) --readiness-timeout=30s Duration the proxy is unable to connect to the backend cluster before it is considered not ready ($READINESS_TIMEOUT) --idempotent-graph If true it will treat all graph queries as idempotent by default and 
retry them automatically. It may be dangerous to retry some graph queries -- use with caution ($IDEMPOTENT_GRAPH). --num-conns=1 Number of connection to create to each node of the backend cluster ($NUM_CONNS) --proxy-cert-file=STRING Path to a PEM encoded certificate file with its intermediate certificate chain. This is used to encrypt traffic for proxy clients ($PROXY_CERT_FILE) --proxy-key-file=STRING Path to a PEM encoded private key file. This is used to encrypt traffic for proxy clients ($PROXY_KEY_FILE) --rpc-address=STRING Address to advertise in the 'system.local' table for 'rpc_address'. It must be set if configuring peer proxies ($RPC_ADDRESS) --data-center=STRING Data center to use in system tables ($DATA_CENTER) --tokens=TOKENS,... Tokens to use in the system tables. It's not recommended ($TOKENS)\"><pre>$ ./cql-proxy -h\nUsage: cql-proxy\nFlags:\n  -h, --help                                              Show context-sensitive help.\n  -b, --astra-bundle=STRING                               Path to secure connect bundle for an Astra database. Requires '--username' and '--password'. Ignored if using the\n                                                          token or contact points option ($ASTRA_BUNDLE).\n  -t, --astra-token=STRING                                Token used to authenticate to an Astra database. Requires '--astra-database-id'. Ignored if using the bundle path\n                                                          or contact points option ($ASTRA_TOKEN).\n  -i, --astra-database-id=STRING                          Database ID of the Astra database. Requires '--astra-token' ($ASTRA_DATABASE_ID)\n      --astra-api-url=\"https://api.astra.datastax.com\"    URL for the Astra API ($ASTRA_API_URL)\n      --astra-timeout=10s                                 Timeout for contacting Astra when retrieving the bundle and metadata ($ASTRA_TIMEOUT)\n  -c, --contact-points=CONTACT-POINTS,...                 Contact points for cluster. 
Ignored if using the bundle path or token option ($CONTACT_POINTS).\n  -u, --username=STRING                                   Username to use for authentication ($USERNAME)\n  -p, --password=STRING                                   Password to use for authentication ($PASSWORD)\n  -r, --port=9042                                         Default port to use when connecting to cluster ($PORT)\n  -n, --protocol-version=\"v4\"                             Initial protocol version to use when connecting to the backend cluster (default: v4, options: v3, v4, v5, DSEv1,\n                                                          DSEv2) ($PROTOCOL_VERSION)\n  -m, --max-protocol-version=\"v4\"                         Max protocol version supported by the backend cluster (default: v4, options: v3, v4, v5, DSEv1, DSEv2)\n                                                          ($MAX_PROTOCOL_VERSION)\n  -a, --bind=\":9042\"                                      Address to use to bind server ($BIND)\n  -f, --config=CONFIG                                     YAML configuration file ($CONFIG_FILE)\n      --debug                                             Show debug logging ($DEBUG)\n      --health-check                                      Enable liveness and readiness checks ($HEALTH_CHECK)\n      --http-bind=\":8000\"                                 Address to use to bind HTTP server used for health checks ($HTTP_BIND)\n      --heartbeat-interval=30s                            Interval between performing heartbeats to the cluster ($HEARTBEAT_INTERVAL)\n      --idle-timeout=60s                                  Duration between successful heartbeats before a connection to the cluster is considered unresponsive and closed\n                                                          ($IDLE_TIMEOUT)\n      --readiness-timeout=30s                             Duration the proxy is unable to connect to the backend cluster before it is considered not ready\n                                 
                         ($READINESS_TIMEOUT)\n      --idempotent-graph                                  If true it will treat all graph queries as idempotent by default and retry them automatically. It may be\n                                                          dangerous to retry some graph queries -- use with caution ($IDEMPOTENT_GRAPH).\n      --num-conns=1                                       Number of connection to create to each node of the backend cluster ($NUM_CONNS)\n      --proxy-cert-file=STRING                            Path to a PEM encoded certificate file with its intermediate certificate chain. This is used to encrypt traffic\n                                                          for proxy clients ($PROXY_CERT_FILE)\n      --proxy-key-file=STRING                             Path to a PEM encoded private key file. This is used to encrypt traffic for proxy clients ($PROXY_KEY_FILE)\n      --rpc-address=STRING                                Address to advertise in the 'system.local' table for 'rpc_address'. It must be set if configuring peer proxies\n                                                          ($RPC_ADDRESS)\n      --data-center=STRING                                Data center to use in system tables ($DATA_CENTER)\n      --tokens=TOKENS,...                                 Tokens to use in the system tables. It's not recommended ($TOKENS)</pre></div><p dir=\"auto\">To pass configuration to <code>cql-proxy</code>, either command-line flags, environment variables, or a configuration file can be used. 
Using the <code>docker</code> method as an example, the following samples show how the token and database ID are defined with each method.</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 --rm datastax/cql-proxy:v0.1.5 --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;\"><pre>docker run -p 9042:9042 \\\n  --rm datastax/cql-proxy:v0.1.5 \\\n  --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;</pre></div><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 --rm -e ASTRA_TOKEN=&lt;astra-token&gt; -e ASTRA_DATABASE_ID=&lt;astra-database-id&gt; datastax/cql-proxy:v0.1.5\"><pre>docker run -p 9042:9042 \\\n  --rm \\\n  -e ASTRA_TOKEN=&lt;astra-token&gt; -e ASTRA_DATABASE_ID=&lt;astra-database-id&gt; \\\n  datastax/cql-proxy:v0.1.5</pre></div><p dir=\"auto\">Proxy settings can also be passed using a configuration file with the <code>--config /path/to/proxy.yaml</code> flag. This can be mixed and matched with command-line flags and environment variables. 
Here are some example configuration files:</p><div class=\"highlight highlight-source-yaml notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"contact-points: - 127.0.0.1 username: cassandra password: cassandra port: 9042 bind: 127.0.0.1:9042 # ...\"><pre>contact-points:\n  - 127.0.0.1\nusername: cassandra\npassword: cassandra\nport: 9042\nbind: 127.0.0.1:9042\n# ...</pre></div><p dir=\"auto\">or with an Astra token:</p><div class=\"highlight highlight-source-yaml notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"astra-token: &lt;astra-token&gt; astra-database-id: &lt;astra-database-id&gt; bind: 127.0.0.1:9042 # ...\"><pre>astra-token: &lt;astra-token&gt;\nastra-database-id: &lt;astra-database-id&gt;\nbind: 127.0.0.1:9042\n# ...</pre></div><p dir=\"auto\">All configuration keys match their command-line flag counterparts, e.g. <code>--astra-bundle</code> is <code>astra-bundle:</code>, <code>--contact-points</code> is <code>contact-points:</code>, etc.</p><p dir=\"auto\">Multi-region failover with a DC-aware load balancing policy is the most useful case for a multiple-proxy setup.</p><p dir=\"auto\">When configuring <code>peers:</code>, you must set <code>--rpc-address</code> (or <code>rpc-address:</code> in the yaml) for each proxy, and it must match its corresponding <code>peers:</code> entry. Also, <code>peers:</code> is only available in the configuration file and cannot be set using a command-line flag.</p><p dir=\"auto\">Here's an example of configuring multi-region failover with two proxies. A proxy is started for each region of the cluster, connecting to it using that region's bundle. 
They all share a common configuration file that contains the full list of proxies.</p><p dir=\"auto\"><em>Note:</em> Only bundles are supported for multi-region setups.</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"cql-proxy --astra-bundle astra-region1-bundle.zip --username token --password &lt;astra-token&gt; --bind 127.0.0.1:9042 --rpc-address 127.0.0.1 --data-center dc-1 --config proxy.yaml\"><pre>cql-proxy --astra-bundle astra-region1-bundle.zip --username token --password &lt;astra-token&gt; \\\n  --bind 127.0.0.1:9042 --rpc-address 127.0.0.1 --data-center dc-1 --config proxy.yaml</pre></div><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"cql-proxy --astra-bundle astra-region2-bundle.zip --username token --password &lt;astra-token&gt; --bind 127.0.0.2:9042 --rpc-address 127.0.0.2 --data-center dc-2 --config proxy.yaml\"><pre>cql-proxy --astra-bundle astra-region2-bundle.zip --username token --password &lt;astra-token&gt; \\\n  --bind 127.0.0.2:9042 --rpc-address 127.0.0.2 --data-center dc-2 --config proxy.yaml</pre></div><p dir=\"auto\">The peers settings are configured using a yaml file. It's a good idea to explicitly provide the <code>--data-center</code> flag; otherwise, these values are pulled from the backend cluster, and you would need to look them up in the <code>system.local</code> and <code>system.peers</code> tables to properly set up the peers' <code>data-center:</code> values. 
Here's an example <code>proxy.yaml</code>:</p><div class=\"highlight highlight-source-yaml notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"peers: - rpc-address: 127.0.0.1 data-center: dc-1 - rpc-address: 127.0.0.2 data-center: dc-2\"><pre>peers:\n  - rpc-address: 127.0.0.1\n    data-center: dc-1\n  - rpc-address: 127.0.0.2\n    data-center: dc-2</pre></div><p dir=\"auto\"><em>Note:</em> It's okay for the <code>peers:</code> to contain entries for the current proxy itself because they'll just be omitted.</p><p dir=\"auto\">There are three methods for using <code>cql-proxy</code>:</p><ul dir=\"auto\"><li>Locally build and run <code>cql-proxy</code></li>\n<li>Run a docker image that has <code>cql-proxy</code> installed</li>\n<li>Use a Kubernetes container to run <code>cql-proxy</code></li>\n</ul><ol dir=\"auto\"><li>\n<p dir=\"auto\">Build <code>cql-proxy</code>.</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"go build\"><pre>go build</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Run with your desired database.</p>\n<ul dir=\"auto\"><li>\n<p dir=\"auto\"><a href=\"https://astra.datastax.com/\" rel=\"nofollow\">DataStax Astra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"./cql-proxy --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;\"><pre>./cql-proxy --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;</pre></div>\n<p dir=\"auto\">The <code>&lt;astra-token&gt;</code> can be generated using these <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html\" rel=\"nofollow\">instructions</a>. 
The proxy also supports using the <a href=\"https://docs.datastax.com/en/astra/docs/obtaining-database-credentials.html#_getting_your_secure_connect_bundle\" rel=\"nofollow\">Astra Secure Connect Bundle</a> along with a client ID and secret generated using these <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html\" rel=\"nofollow\">instructions</a>:</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"./cql-proxy --astra-bundle &lt;your-secure-connect-zip&gt; --username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;\"><pre>./cql-proxy --astra-bundle &lt;your-secure-connect-zip&gt; \\\n--username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;\n</pre></div>\n</li>\n<li>\n<p dir=\"auto\"><a href=\"https://cassandra.apache.org/\" rel=\"nofollow\">Apache Cassandra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"./cql-proxy --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]\"><pre>./cql-proxy --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]</pre></div>\n</li>\n</ul></li>\n</ol><ol dir=\"auto\"><li>\n<p dir=\"auto\">Run with your desired database.</p>\n<ul dir=\"auto\"><li>\n<p dir=\"auto\"><a href=\"https://astra.datastax.com/\" rel=\"nofollow\">DataStax Astra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 datastax/cql-proxy:v0.1.5 --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;\"><pre>docker run -p 9042:9042 \\\n  datastax/cql-proxy:v0.1.5 \\\n  --astra-token &lt;astra-token&gt; --astra-database-id 
&lt;astra-database-id&gt;</pre></div>\n<p dir=\"auto\">The <code>&lt;astra-token&gt;</code> can be generated using these <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html\" rel=\"nofollow\">instructions</a>. The proxy also supports using the <a href=\"https://docs.datastax.com/en/astra/docs/obtaining-database-credentials.html#_getting_your_secure_connect_bundle\" rel=\"nofollow\">Astra Secure Connect Bundle</a>, but it requires mounting the bundle to a volume in the container:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -v &lt;your-secure-connect-bundle.zip&gt;:/tmp/scb.zip -p 9042:9042 --rm datastax/cql-proxy:v0.1.5 --astra-bundle /tmp/scb.zip --username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;\"><pre>docker run -v &lt;your-secure-connect-bundle.zip&gt;:/tmp/scb.zip -p 9042:9042 \\\n--rm datastax/cql-proxy:v0.1.5 \\\n--astra-bundle /tmp/scb.zip --username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;</pre></div>\n</li>\n<li>\n<p dir=\"auto\"><a href=\"https://cassandra.apache.org/\" rel=\"nofollow\">Apache Cassandra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 datastax/cql-proxy:v0.1.5 --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]\"><pre>docker run -p 9042:9042 \\\n  datastax/cql-proxy:v0.1.5 \\\n  --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]</pre></div>\n</li>\n</ul></li>\n</ol><p dir=\"auto\">If you wish to have the docker image removed after you are done with it, add <code>--rm</code> before the image name <code>datastax/cql-proxy:v0.1.5</code>.</p><p dir=\"auto\">Using Kubernetes with 
<code>cql-proxy</code> requires a number of steps:</p><ol dir=\"auto\"><li>\n<p dir=\"auto\">Generate a token following the Astra <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html#_create_application_token\" rel=\"nofollow\">instructions</a>. This step will display your Client ID, Client Secret, and Token; make sure you download the information for the next steps. Store the secure bundle in <code>/tmp/scb.zip</code> to match the example below.</p>\n</li>\n<li>\n<p dir=\"auto\">Create <code>cql-proxy.yaml</code>. You'll need to add three sets of information: arguments, volume mounts, and volumes. A full example can be found <a href=\"https://github.com/datastax/cql-proxy/blob/main/k8s/cql-proxy.yml\">here</a>.</p>\n</li>\n</ol><ul dir=\"auto\"><li>\n<p dir=\"auto\">Arguments: Set the local bundle location, username, and password in the container arguments, using the client ID and client secret obtained in the previous step.</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"command: [&quot;./cql-proxy&quot;] args: [&quot;--astra-bundle=/tmp/scb.zip&quot;,&quot;--username=Client ID&quot;,&quot;--password=Client Secret&quot;]\"><pre>command: [\"./cql-proxy\"]\nargs: [\"--astra-bundle=/tmp/scb.zip\",\"--username=Client ID\",\"--password=Client Secret\"]\n</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Volume mounts: Change the <code>/tmp/</code> mount path as required.</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"volumeMounts: - name: my-cm-vol mountPath: /tmp/\"><pre>volumeMounts:\n  - name: my-cm-vol\n    mountPath: /tmp/\n</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Volume: Modify the <code>configMap</code> name as required. In this example, it is named <code>cql-proxy-configmap</code>. 
Use the same name for the <code>volumes</code> that you used for the <code>volumeMounts</code>.</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"volumes: - name: my-cm-vol configMap: name: cql-proxy-configmap\"><pre>volumes:\n  - name: my-cm-vol\n    configMap:\n      name: cql-proxy-configmap        \n</pre></div>\n</li>\n</ul><ol start=\"3\" dir=\"auto\"><li>\n<p dir=\"auto\">Create a configmap. Use the same secure bundle that was specified in the <code>cql-proxy.yaml</code>.</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl create configmap cql-proxy-configmap --from-file /tmp/scb.zip\"><pre>kubectl create configmap cql-proxy-configmap --from-file /tmp/scb.zip </pre></div>\n</li>\n<li>\n<p dir=\"auto\">Check the configmap that was created.</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl describe configmap cql-proxy-configmap Name: cql-proxy-configmap Namespace: default Labels: &lt;none&gt; Annotations: &lt;none&gt; Data ==== BinaryData ==== scb.zip: 12311 bytes\"><pre>kubectl describe configmap cql-proxy-configmap\n  Name:         cql-proxy-configmap\n  Namespace:    default\n  Labels:       &lt;none&gt;\n  Annotations:  &lt;none&gt;\n  Data\n  ====\n  BinaryData\n  ====\n  scb.zip: 12311 bytes</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Create a Kubernetes deployment with the YAML file you created:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl create -f cql-proxy.yaml\"><pre>kubectl create -f cql-proxy.yaml</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Check the logs:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" 
dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl logs &lt;deployment-name&gt;\"><pre>kubectl logs &lt;deployment-name&gt;</pre></div>\n</li>\n</ol><p dir=\"auto\">Drivers that use token-aware load balancing may print a warning or may not work when using cql-proxy. Because cql-proxy abstracts the backend cluster as a single endpoint this doesn't always work well with token-aware drivers that expect there to be at least \"replication factor\" number of nodes in the cluster. Many drivers print a warning (which can be ignored) and fallback to something like round-robin, but other drivers might fail with an error. For the drivers that fail with an error it is required that they disable token-aware or configure the round-robin load balancing policy.</p>","id":"17fac8a9-8b96-51ec-a7dd-dd3809bba528","title":"GitHub - datastax/cql-proxy: A client-side CQL proxy/sidecar.","origin_url":"https://github.com/datastax/cql-proxy","url":"https://github.com/datastax/cql-proxy","wallabag_created_at":"2024-11-01T17:26:01+00:00","published_at":null,"published_by":"['datastax']","reading_time":8,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/c2528e3426d98910ed27819e048b4c1081fab2ed2c7adbea6e6a3b1872deb30a/datastax/cql-proxy","tags":["migration","proxy","cassandra","cql"],"description":" cql-proxy is designed to forward your application's CQL traffic to an appropriate database service. It listens on a local address and securely forwards that traffic.The cql-proxy sidecar enables unsu..."},{"content":"<header>\n</header><p>Zero Downtime Migration (ZDM) Proxy is an open-source component developed in Go and based on client-server architecture. 
It enables you to migrate from one Apache Cassandra® cluster to another without downtime or code changes in the application client.</p><p>For details on ZDM Proxy, see <a href=\"https://github.com/datastax/zdm-proxy\" target=\"_blank\" rel=\"noopener noreferrer\">zdm-proxy GitHub</a>.</p><p>When using ZDM Proxy, the client connects to the proxy rather than to the source cluster. The proxy connects both to the source cluster and the target cluster. It sends read requests to the source cluster only, while write requests are forwarded to both clusters.</p><p>For details on how ZDM Proxy works, see <a href=\"https://docs.datastax.com/en/data-migration/introduction.html\" target=\"_blank\" rel=\"noopener noreferrer\">Introduction to Zero Downtime Migration</a>.</p><ul><li>Apache Cassandra instance to migrate to the Aiven platform (migration source)</li>\n<li>Aiven for Apache Cassandra service to migrate your external instance to (migration target)</li>\n<li><a href=\"https://aiven.io/docs/tools/cli\">Aiven CLI client installed</a></li>\n<li><code>cqlsh</code> <a href=\"https://cassandra.apache.org/doc/latest/cassandra/getting_started/installing.html\" target=\"_blank\" rel=\"noopener noreferrer\">installed</a></li>\n</ul><p><a href=\"https://aiven.io/docs/products/cassandra/howto/connect-cqlsh-cli\">Connect to your Aiven for Apache Cassandra service</a> using <code>cqlsh</code>, for example.</p><div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">cqlsh --ssl -u avnadmin -p YOUR_SECRET_PASSWORD cassandra-target-cluster-name.a.avns.net 12345<br /></pre></div><p>You can expect to receive output similar to the following:</p><div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">Connected to a1b2c3d4-1a2b-3c4d-5e6f-a1b2c3d4e5f6 at cassandra-target-cluster-name.a.avns.net:12345<br />[cqlsh 6.1.0 | Cassandra 4.0.11 | CQL spec 
3.4.5 | Native protocol v5]<br /></pre></div><p>In your target service, create the same keyspaces and tables you have in your source Apache Cassandra cluster.</p><div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">create keyspace KEYSPACE_NAME with replication ={'class':'SimpleStrategy', 'replication_factor':3};<br />create table KEYSPACE_NAME.TABLE_NAME (n_id int, value int, primary key (n_id));<br /></pre></div><p>Download the ZDM Proxy's binary from <a href=\"https://github.com/datastax/zdm-proxy/releases\" target=\"_blank\" rel=\"noopener noreferrer\">ZDM Proxy releases</a>.</p><div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">wget https://github.com/datastax/zdm-proxy/releases/download/v2.1.0/zdm-proxy-linux-amd64-v2.1.0.tgz<br />tar xf zdm-proxy-linux-amd64-v2.1.0.tgz<br /></pre></div><p>Check if the binary has been downloaded successfully using <code>ls</code> in the relevant directory. 
You can expect to receive output similar to the following:</p><div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">LICENSE  zdm-proxy-linux-amd64-v2.1.0.tgz  zdm-proxy-v2.1.0<br /></pre></div><ol><li>\n<p>Specify connection information by setting <code>ZDM_TARGET_*</code> and <code>ZDM_ORIGIN_*</code> environment variables using the <code>export</code> command.</p>\n<div class=\"theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary\"><p>note</p><div class=\"admonitionContent_BuS1\"><p><code>ORIGIN</code> refers to the source service.</p></div></div>\n</li>\n<li>\n<p>Run the binary.</p>\n</li>\n</ol><div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">export ZDM_ORIGIN_CONTACT_POINTS=localhost<br />export ZDM_ORIGIN_USERNAME=cassandra<br />export ZDM_ORIGIN_PASSWORD=cassandra<br />export ZDM_ORIGIN_PORT=1234<br />export ZDM_TARGET_CONTACT_POINTS=cassandra-target-cluster-name.a.avns.net<br />export ZDM_TARGET_USERNAME=avnadmin<br />export ZDM_TARGET_PASSWORD=YOUR_SECRET_PASSWORD<br />export ZDM_TARGET_PORT=12345<br />export ZDM_TARGET_TLS_SERVER_CA_PATH=\"/tmp/ca.pem\"<br />export ZDM_TARGET_ENABLE_HOST_ASSIGNMENT=false<br /># ZDM_ORIGIN_ENABLE_HOST_ASSIGNMENT=false  # (may be needed, see note)<br />./zdm-proxy-v2.1.0<br /></pre></div><div class=\"theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary\"><p>ENABLE_HOST_ASSIGNMENT</p><div class=\"admonitionContent_BuS1\"><p>Make sure you set the ZDM_TARGET_ENABLE_HOST_ASSIGNMENT variable to <code>false</code>. Otherwise, ZDM Proxy tries to connect to one of the internal addresses of the cluster nodes, which are unavailable from outside. If the same happens with your source cluster, set <code>ZDM_ORIGIN_ENABLE_HOST_ASSIGNMENT=false</code>.</p></div></div><p>To connect to ZDM Proxy, use, for example, <code>cqlsh</code>. 
Provide connection details and, if your source or target requires authentication, specify the target username and password.</p><p>For more details on using the credentials, see <a href=\"https://docs.datastax.com/en/data-migration/introduction.html\" target=\"_blank\" rel=\"noopener noreferrer\">Client application credentials</a>.</p><p>By default, ZDM Proxy listens on port 14002, which can be overridden.</p><ol><li>\n<p>Connect using ZDM Proxy.</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">cqlsh -u avnadmin -p YOUR_SECRET_PASSWORD localhost 14002<br /></pre></div>\n<p>You can expect to receive output similar to the following:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">Connected to CLUSTER_NAME at localhost:14002<br />[cqlsh 6.1.0 | Cassandra 4.1.3 | CQL spec 3.4.6 | Native protocol v4]<br /></pre></div>\n</li>\n<li>\n<p>Check data in the table.</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">select * from KEYSPACE_NAME.TABLE_NAME;<br /></pre></div>\n<p>You can expect to receive output similar to the following:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\"> n_id | value<br />------+-------<br />    1 |    42<br />    2 |    44<br />    3 |    46<br />(3 rows)<br /></pre></div>\n</li>\n<li>\n<p>Insert more data into the table to test how ZDM Proxy handles write requests.</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">insert into KEYSPACE_NAME.TABLE_NAME (n_id, value) values (4, 48);<br />insert into KEYSPACE_NAME.TABLE_NAME (n_id, value) values (5, 50);<br /></pre></div>\n</li>\n<li>\n<p>Check the data in the table again.</p>\n<div class=\"language-bash 
codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">select * from KEYSPACE_NAME.TABLE_NAME;<br /></pre></div>\n<p>You can expect to receive output similar to the following:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\"> n_id | value<br />------+-------<br />    5 |    50<br />    1 |    42<br />    2 |    44<br />    4 |    48<br />    3 |    46<br />(5 rows)<br /></pre></div>\n</li>\n</ol><ol><li>\n<p>Connect to the source:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">cqlsh localhost 1234<br /></pre></div>\n<p>You can expect to receive output similar to the following:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">Connected to SOURCE_CLUSTER_NAME at localhost:1234<br />[cqlsh 6.1.0 | Cassandra 4.1.3 | CQL spec 3.4.6 | Native protocol v5]<br /></pre></div>\n</li>\n<li>\n<p>Check data in the table:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">select * from KEYSPACE_NAME.TABLE_NAME;<br /></pre></div>\n<p>You can expect to receive output similar to the following:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\"> n_id | value<br />------+-------<br />    5 |    50<br />    1 |    42<br />    2 |    44<br />    4 |    48<br />    3 |    46<br />(5 rows)<br /></pre></div>\n<p>ZDM Proxy has forwarded both the write request and the read request to the source cluster. 
As a result, all the values are there: both the newly added ones (<code>50</code> and <code>48</code>) and the previously added ones (<code>42</code>, <code>44</code>, and <code>46</code>).</p>\n</li>\n</ol><ol><li>\n<p>Connect to the target service.</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">cqlsh --ssl -u avnadmin -p YOUR_SECRET_PASSWORD cassandra-target-cluster-name.a.avns.net 12345<br /></pre></div>\n<p>You can expect to receive output similar to the following:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">Connected to a1b2c3d4-1a2b-3c4d-5e6f-a1b2c3d4e5f6 at cassandra-target-cluster-name.a.avns.net:12345<br />[cqlsh 6.1.0 | Cassandra 4.0.11 | CQL spec 3.4.5 | Native protocol v5]<br /></pre></div>\n</li>\n<li>\n<p>Check data in the table.</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\">select * from KEYSPACE_NAME.TABLE_NAME;<br /></pre></div>\n<p>You can expect to receive output similar to the following:</p>\n<div class=\"language-bash codeBlockContainer_Ckt0 theme-code-block codeBlockContent_biex c3\"><pre class=\"codeBlockLines_e6Vv\"> n_id | value<br />------+-------<br />    5 |    50<br />    4 |    48<br />(2 rows)<br /></pre></div>\n<p><code>50</code> and <code>48</code> are in the target table since ZDM Proxy has forwarded the write request to the target service. 
<code>42</code>, <code>44</code>, and <code>46</code> are not there since ZDM Proxy has not sent the read request to the target service.</p>\n</li>\n</ol><ul><li><a href=\"https://github.com/datastax/zdm-proxy\" target=\"_blank\" rel=\"noopener noreferrer\">zdm-proxy GitHub</a></li>\n<li><a href=\"https://docs.datastax.com/en/data-migration/introduction.html\" target=\"_blank\" rel=\"noopener noreferrer\">Introduction to Zero Downtime Migration</a></li>\n<li><a href=\"https://github.com/datastax/zdm-proxy/releases\" target=\"_blank\" rel=\"noopener noreferrer\">ZDM Proxy releases</a></li>\n<li><a href=\"https://docs.datastax.com/en/data-migration/connect-clients-to-target.html\" target=\"_blank\" rel=\"noopener noreferrer\">Client application credentials</a></li>\n</ul>","id":"e63b0a0b-3b07-502b-823c-4c32121f1540","title":"Migrate to Aiven for Apache Cassandra® with no downtime | Aiven docs","origin_url":"https://aiven.io/docs/products/cassandra/howto/zdm-proxy","url":"https://aiven.io/docs/products/cassandra/howto/zdm-proxy","wallabag_created_at":"2024-11-01T17:25:08+00:00","published_at":null,"published_by":null,"reading_time":4,"domain_name":"aiven.io","preview_picture":"https://aiven.io/docs/images/site-preview.png","tags":["migration","proxy","cassandra","aiven"],"description":"\nZero Downtime Migration (ZDM) Proxy is an open-source component developed in Go and based on client-server architecture. It enables you to migrate from one Apache Cassandra® cluster to another withou..."},{"content":"<p dir=\"auto\">The ZDM Proxy is a client-server component written in Go that enables users to migrate with zero downtime from one Apache Cassandra® cluster to another (which may be an <a href=\"https://astra.datastax.com/\" rel=\"nofollow\">Astra</a> cluster) without requiring code changes in the application client.</p><p dir=\"auto\">The only change to the client is pointing it to the proxy rather than directly to the original cluster (Origin). 
In turn, the proxy connects to both Origin and Target clusters.</p><p dir=\"auto\">By default, the proxy will forward read requests only to the Origin cluster, though you can optionally configure it to forward reads to both clusters asynchronously, while writes will always be sent to both clusters concurrently.</p><p dir=\"auto\">An overview of the proxy architecture and logical flow can be viewed <a href=\"https://docs.datastax.com/en/data-migration/introduction.html#migration-phases\" rel=\"nofollow\">here</a>.</p><p dir=\"auto\">In order to run the proxy, you'll need to set some environment variables or pass reference to YAML configuration file. Below you'll find a list with the most important variables along with their default values. The required ones are marked with a comment. Variable names for YAML configuration file do not have <code>ZDM_</code> prefix and are lower-cased.</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"ZDM_ORIGIN_CONTACT_POINTS=10.0.0.1 #required ZDM_ORIGIN_USERNAME=cassandra #required ZDM_ORIGIN_PASSWORD=cassandra #required ZDM_ORIGIN_PORT=9042 ZDM_TARGET_CONTACT_POINTS=10.0.0.2 #required ZDM_TARGET_USERNAME=cassandra #required ZDM_TARGET_PASSWORD=cassandra #required ZDM_TARGET_PORT=9042 ZDM_PROXY_LISTEN_PORT=14002 ZDM_PROXY_LISTEN_ADDRESS=127.0.0.1 ZDM_PRIMARY_CLUSTER=ORIGIN ZDM_READ_MODE=PRIMARY_ONLY ZDM_LOG_LEVEL=INFO\"><pre>ZDM_ORIGIN_CONTACT_POINTS=10.0.0.1  #required\nZDM_ORIGIN_USERNAME=cassandra       #required\nZDM_ORIGIN_PASSWORD=cassandra       #required\nZDM_ORIGIN_PORT=9042\nZDM_TARGET_CONTACT_POINTS=10.0.0.2  #required\nZDM_TARGET_USERNAME=cassandra       #required\nZDM_TARGET_PASSWORD=cassandra       #required\nZDM_TARGET_PORT=9042\nZDM_PROXY_LISTEN_PORT=14002\nZDM_PROXY_LISTEN_ADDRESS=127.0.0.1\nZDM_PRIMARY_CLUSTER=ORIGIN\nZDM_READ_MODE=PRIMARY_ONLY\nZDM_LOG_LEVEL=INFO</pre></div><p dir=\"auto\">The environment variables (or 
YAML configuration file) must be set for the proxy to work.</p><p dir=\"auto\">To get started quickly in your local environment, grab a copy of the binary distribution from the <a href=\"https://github.com/datastax/zdm-proxy/releases\">Releases</a> page. For the recommended installation in a production environment, check the <a href=\"https://github.com/datastax/zdm-proxy#production-setup\">Production Setup</a> section below.</p><p dir=\"auto\">Now, suppose you have two clusters running at <code>10.0.0.1</code> and <code>10.0.0.2</code> with <code>cassandra/cassandra</code> credentials and the same key-value <a href=\"https://github.com/datastax/zdm-proxy/blob/main/nb-tests/schema.cql\">schema</a>. You can start the proxy and connect it to these clusters like this:</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"$ export ZDM_ORIGIN_CONTACT_POINTS=10.0.0.1 \\ export ZDM_TARGET_CONTACT_POINTS=10.0.0.2 export ZDM_ORIGIN_USERNAME=cassandra export ZDM_ORIGIN_PASSWORD=cassandra export ZDM_TARGET_USERNAME=cassandra export ZDM_TARGET_PASSWORD=cassandra ./zdm-proxy-v2.0.0 # run the ZDM proxy executable\"><pre>$ export ZDM_ORIGIN_CONTACT_POINTS=10.0.0.1\n$ export ZDM_TARGET_CONTACT_POINTS=10.0.0.2\n$ export ZDM_ORIGIN_USERNAME=cassandra\n$ export ZDM_ORIGIN_PASSWORD=cassandra\n$ export ZDM_TARGET_USERNAME=cassandra\n$ export ZDM_TARGET_PASSWORD=cassandra\n$ ./zdm-proxy-v2.0.0 # run the ZDM proxy executable</pre></div><p dir=\"auto\">If you prefer to use a YAML configuration file, an equivalent setup would look like:</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"$ cat zdm-config.yml origin_contact_points: 10.0.0.1 target_contact_points: 10.0.0.2 origin_username: cassandra origin_password: cassandra target_username: cassandra target_password: cassandra $ 
./zdm-proxy-v2.0.0 --config=./zdm-config.yml # run the ZDM proxy executable\"><pre>$ cat zdm-config.yml\norigin_contact_points: 10.0.0.1\ntarget_contact_points: 10.0.0.2\norigin_username: cassandra\norigin_password: cassandra\ntarget_username: cassandra\ntarget_password: cassandra\n$ ./zdm-proxy-v2.0.0 --config=./zdm-config.yml # run the ZDM proxy executable</pre></div><p dir=\"auto\">At this point, you should be able to connect some client such as <a href=\"https://downloads.datastax.com/#cqlsh\" rel=\"nofollow\">CQLSH</a> to the proxy and write data to it and the proxy will take care of forwarding the requests to both clusters concurrently.</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"$ cqlsh &lt;proxy-ip-address&gt; 14002 # this is the proxy's default listen port\"><pre>$ cqlsh &lt;proxy-ip-address&gt; 14002 # this is the proxy's default listen port</pre></div><p dir=\"auto\">From the CQLSH prompt:</p><div class=\"highlight highlight-source-sql notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"cqlsh&gt; INSERT INTO test.keyvalue (key, value) VALUES (1, 'ABC'); cqlsh&gt; INSERT INTO test.keyvalue (key, value) VALUES (2, 'DEF'); cqlsh&gt; SELECT * FROM test.keyvalue; cqlsh&gt; UPDATE test.keyvalue SET value='GYEKJF' WHERE key = 1; cqlsh&gt; DELETE FROM test.keyvalue WHERE key = 2;\"><pre>cqlsh&gt; INSERT INTO test.keyvalue (key, value) VALUES (1, 'ABC');\ncqlsh&gt; INSERT INTO test.keyvalue (key, value) VALUES (2, 'DEF');\ncqlsh&gt; SELECT * FROM test.keyvalue;\ncqlsh&gt; UPDATE test.keyvalue SET value='GYEKJF' WHERE key = 1;\ncqlsh&gt; DELETE FROM test.keyvalue WHERE key = 2;</pre></div><p dir=\"auto\">You can confirm that the data is stored in both clusters by querying them directly in other cqlsh sessions.</p><p dir=\"auto\">Note: For the moment, the keyspace must be specified when accessing a table, even after 
using <code>USE &lt;keyspace&gt;</code>.</p><p dir=\"auto\">If you don't have test clusters readily available to try with, check the <a href=\"https://github.com/datastax/zdm-proxy/blob/main/CONTRIBUTING.md#running-on-localhost-with-docker-compose\">alternative</a> method with docker-compose in the <a href=\"https://github.com/datastax/zdm-proxy/blob/main/CONTRIBUTING.md\">Contributor's guide</a>, which will set up all the dependencies, including two test clusters and a proxy instance, in a containerized sandbox environment.</p><p dir=\"auto\"><strong>ZDM Proxy supports protocol versions v2, v3, v4, DSE_V1 and DSE_V2.</strong></p><p dir=\"auto\">It technically doesn't support v5, but handles protocol negotiation so that the client application properly downgrades the protocol version to v4 if v5 is requested. This means that any client application using a recent driver that supports protocol version v5 can be migrated using the ZDM Proxy (as long as it does not use v5-specific functionality).</p><p dir=\"auto\">ZDM Proxy requires origin and target clusters to have at least one protocol version in common. It is therefore not feasible to configure Apache Cassandra 2.0 as origin and 3.x / 4.x as target. 
The table below displays the protocol versions supported by various Apache Cassandra versions:</p><table><thead><tr><th>Apache Cassandra</th>\n<th>Protocol Version</th>\n</tr></thead><tbody><tr><td>2.0</td>\n<td>V2</td>\n</tr><tr><td>2.1</td>\n<td>V2, V3</td>\n</tr><tr><td>2.2</td>\n<td>V2, V3, V4</td>\n</tr><tr><td>3.x</td>\n<td>V3, V4</td>\n</tr><tr><td>4.x</td>\n<td>V3, V4, V5</td>\n</tr></tbody></table><hr /><p dir=\"auto\">⚠️ <strong>Thrift is not supported by ZDM Proxy.</strong> If you are using a very old driver or cluster version that only supports Thrift, then you need to change your client application to use CQL and potentially upgrade your cluster before starting the migration process.</p><hr /><p dir=\"auto\">In practice this means that ZDM Proxy supports the following cluster versions (as Origin and / or Target):</p><ul dir=\"auto\"><li>Apache Cassandra 2.0 up to and including Apache Cassandra 4.x (although both clusters have to support a common protocol version, as mentioned above).</li>\n<li>DataStax Enterprise 4.8+. DataStax Enterprise 4.6 and 4.7 support will be introduced when protocol version v2 is supported.</li>\n<li>DataStax Astra DB (both Serverless and Classic)</li>\n</ul><p dir=\"auto\">The setup we described above is only for testing in a local environment. 
It is <strong>NOT</strong> recommended for a production installation where the minimum number of proxy instances is 3.</p><p dir=\"auto\">For a comprehensive guide with the recommended production setup check the documentation available at <a href=\"https://docs.datastax.com/en/astra-serverless/docs/migrate/introduction.html\" rel=\"nofollow\">Datastax Migration</a>.</p><p dir=\"auto\">There you'll find information about an Ansible-based tool that automates most of the process.</p><p dir=\"auto\">For information on the packaged dependencies of the Zero Downtime Migration (ZDM) Proxy and their licenses, check out our <a href=\"https://app.fossa.com/reports/ccfe72e5-68ea-4c02-ad48-d92061e6d0b0\" rel=\"nofollow\">open source report</a>.</p><p dir=\"auto\">For frequently asked questions, please refer to our separate <a href=\"https://docs.datastax.com/en/astra-serverless/docs/migrate/faqs.html\" rel=\"nofollow\">FAQ</a> page.</p>","id":"4dcc572f-754e-539f-b5e7-e290530b2e94","title":"GitHub - datastax/zdm-proxy: An open-source component designed to seamlessly handle the real-time client application activity while a migration is in progress.","origin_url":"https://github.com/datastax/zdm-proxy","url":"https://github.com/datastax/zdm-proxy","wallabag_created_at":"2024-11-01T17:23:20+00:00","published_at":null,"published_by":"['datastax']","reading_time":4,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/2d9ec70e3a53cb00af12aa2c8f6c9f9568324cfe693483d667268c95dc17d22b/datastax/zdm-proxy","tags":["migration","proxy","datastax","cassandra"],"description":"The ZDM Proxy is client-server component written in Go that enables users to migrate with zero downtime from an Apache Cassandra® cluster to another (which may be an Astra cluster) and not requiring c..."},{"content":"<div id=\"js-flash-container\" data-turbo-replace=\"\"><div class=\"flash flash-full {{ className }} px-2\"><p>{{ message }}</p></div>\n</div><div class=\"application-main\" 
data-commit-hovercards-enabled=\"\" data-discussion-hovercards-enabled=\"\" data-issue-and-pr-hovercards-enabled=\"\"><main id=\"js-repo-pjax-container\"><div id=\"repository-container-header\" class=\"pt-3 hide-full-screen c6\" data-turbo-replace=\"\"><div class=\"d-flex flex-wrap flex-justify-end mb-3 px-3 px-md-4 px-lg-5 c4\"><p> / <strong itemprop=\"name\" class=\"mr-2 flex-self-stretch\"><a data-pjax=\"#repo-content-pjax-container\" data-turbo-frame=\"repo-content-turbo-frame\" href=\"https://github.com/dreamfactorysoftware/dreamfactory\">dreamfactory</a></strong> Public</p><div id=\"repository-details-container\" data-turbo-replace=\"\"><ul class=\"pagehead-actions flex-shrink-0 d-none d-md-inline c3\"><li><a href=\"https://github.com/login?return_to=%2Fdreamfactorysoftware%2Fdreamfactory\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;notification subscription menu watch&quot;,&quot;repository_id&quot;:null,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/dreamfactorysoftware/dreamfactory&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"93c9ab634587da00bc26aa5da68430a299593b5019b6e04c0d5d7f107149eb5d\" aria-label=\"You must be signed in to change notification settings\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn\">Notifications</a></li>\n<li><a id=\"fork-button\" href=\"https://github.com/login?return_to=%2Fdreamfactorysoftware%2Fdreamfactory\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;repo details fork button&quot;,&quot;repository_id&quot;:34855724,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/dreamfactorysoftware/dreamfactory&quot;,&quot;user_id&quot;:null}}\" 
data-hydro-click-hmac=\"24e8715fa5f3a78848129a62d4bbb05f6284e462c62b311fe10d55461cd4e680\" data-view-component=\"true\" class=\"btn-sm btn\">Fork 301</a></li>\n<li>\n<p><a href=\"https://github.com/login?return_to=%2Fdreamfactorysoftware%2Fdreamfactory\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;star button&quot;,&quot;repository_id&quot;:34855724,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/dreamfactorysoftware/dreamfactory&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"dccedfc7c5a9fc4da56f76816691ff5094281f9d73401fc69c4c4ac2ec27fd73\" aria-label=\"You must be signed in to star a repository\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn BtnGroup-item\"> Star 1.5k</a> </p>\n</li>\n</ul></div></div><div class=\"d-block d-md-none mb-2 px-3 px-md-4 px-lg-5\" id=\"responsive-meta-container\" data-turbo-replace=\"\"><p class=\"f4 mb-3\">DreamFactory API Management Platform</p><p><a title=\"https://www.dreamfactory.com\" role=\"link\" target=\"_blank\" class=\"text-bold\" rel=\"noopener noreferrer\" href=\"https://www.dreamfactory.com\">www.dreamfactory.com</a></p><h3 class=\"sr-only\">License</h3><p><a href=\"https://github.com/dreamfactorysoftware/dreamfactory/blob/master/LICENSE\" class=\"Link--muted\" data-analytics-event=\"{&quot;category&quot;:&quot;Repository Overview&quot;,&quot;action&quot;:&quot;click&quot;,&quot;label&quot;:&quot;location:sidebar;file:license&quot;}\"> Apache-2.0 license</a></p><p><a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/dreamfactorysoftware/dreamfactory/stargazers\"> 1.5k stars</a> <a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/dreamfactorysoftware/dreamfactory/forks\"> 301 forks</a> <a class=\"Link--secondary no-underline mr-3 d-inline-block\" 
href=\"https://github.com/dreamfactorysoftware/dreamfactory/branches\"> Branches</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/dreamfactorysoftware/dreamfactory/tags\"> Tags</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/dreamfactorysoftware/dreamfactory/activity\"> Activity</a></p><div class=\"d-flex flex-wrap gap-2\"><p><a href=\"https://github.com/login?return_to=%2Fdreamfactorysoftware%2Fdreamfactory\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;star button&quot;,&quot;repository_id&quot;:34855724,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/dreamfactorysoftware/dreamfactory&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"dccedfc7c5a9fc4da56f76816691ff5094281f9d73401fc69c4c4ac2ec27fd73\" aria-label=\"You must be signed in to star a repository\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn btn-block BtnGroup-item\"> Star</a> </p><p><a href=\"https://github.com/login?return_to=%2Fdreamfactorysoftware%2Fdreamfactory\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;notification subscription menu watch&quot;,&quot;repository_id&quot;:null,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/dreamfactorysoftware/dreamfactory&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"93c9ab634587da00bc26aa5da68430a299593b5019b6e04c0d5d7f107149eb5d\" aria-label=\"You must be signed in to change notification settings\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn btn-block\">Notifications</a></p></div></div></div>\n</main></div><p> You can’t perform that action at this time.</p><details class=\"details-reset details-overlay 
details-overlay-dark lh-default color-fg-default hx_rsm\" open=\"open\">\n\n</details>","id":"4091a3a9-a3a9-58cf-9251-ccc929c4bae4","title":"GitHub - dreamfactorysoftware/dreamfactory: DreamFactory API Management Platform","origin_url":"https://github.com/dreamfactorysoftware/dreamfactory","url":"https://github.com/dreamfactorysoftware/dreamfactory","wallabag_created_at":"2024-03-07T15:31:07+00:00","published_at":null,"published_by":"['']","reading_time":null,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/b311801e40a75be590c1a78266cca311322501629521ecbaf5116ececda6c81e/dreamfactorysoftware/dreamfactory","tags":["api.management","snowflake","oracle","postgresql","cassandra","mysql","api"],"description":"{{ message }}\n / dreamfactory PublicNotifications\nFork 301\n\n Star 1.5k \n\nDreamFactory API Management Platformwww.dreamfactory.comLicense Apache-2.0 license 1.5k stars  301 forks  Branches  Tags  Activ..."},{"content":"I was recently asked to set up a solution for Cassandra open-source log analysis to include in an existing Elasticsearch-Logstash-Kibana (ELK) stack. After some research on more of the newer capabilities of the technologies, I realized I could use \"beats\" in place of the heavier logstash processes for basic monitoring. This basic monitoring would not involve extensive log transformation. The code to run this demo is available to clone or fork at<a href=\"https://github.com/pythian/cassandra-elk\">https://github.com/pythian/cassandra-elk</a>. The only other requirement is Docker (I am using Docker version 18.05.0-ce-rc1) -- using Docker for Mac or Docker for Windows will be most convenient. In a typical production system, you would already have Cassandra running, but all the pieces are included in the Docker stack here so you can start from zero. The model here assumes ELK and a Cassandra cluster are running in your environment, and you need to stream the Cassandra logs into your monitoring system. 
In this setup, the Cassandra logs are being ingested into Elasticsearch and visualized via Kibana. I have included some ways to see data at each step of the workflow in the final section below.<strong>Start the containers:</strong><pre>docker-compose up -d </pre>(Note: The cassandra-env.sh included with this test environment limits the memory used by the setup via MAX_HEAP_SIZE and HEAP_NEWSIZE, allowing it to be run on a laptop with small memory. This would not be the case in production.)<strong>Set up the test Cassandra cluster:</strong>As the Docker containers are starting up, it can be convenient to see resource utilization via ctop:<img class=\"alignnone wp-image-104079 size-full\" src=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image3-4.png?width=826&amp;height=181&amp;name=image3-4.png\" alt=\"Example of ctop resource monitor for Docker containers in open-source log analysis for Cassandra\" width=\"826\" height=\"181\" srcset=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image3-4.png?width=413&amp;height=91&amp;name=image3-4.png 413w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image3-4.png?width=826&amp;height=181&amp;name=image3-4.png 826w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image3-4.png?width=1239&amp;height=272&amp;name=image3-4.png 1239w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image3-4.png?width=1652&amp;height=362&amp;name=image3-4.png 1652w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image3-4.png?width=2065&amp;height=453&amp;name=image3-4.png 2065w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image3-4.png?width=2478&amp;height=543&amp;name=image3-4.png 2478w\" referrerpolicy=\"no-referrer\" /><strong>Set up the filebeat software</strong>Do the following on each Cassandra node.<strong>1. Download the software</strong>You would likely not need to install curl in your environment, but the Docker images used here are bare-bones by design. 
The <em>apt update</em> statement is also necessary because package lists are typically cleared from the image after the requested packages are installed via the Dockerfile.<pre>apt update\napt install curl -y\ncurl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-6.2.3-amd64.deb\ndpkg -i filebeat-6.2.3-amd64.deb</pre>For other operating systems, see: <a href=\"https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html\">https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html</a>.  <strong>2. Configure filebeat</strong>The beats software allows for basic filtering and transformation via this configuration file. Put the configuration below in /etc/filebeat/filebeat.yml. (This is edited from an example at: <a href=\"https://github.com/thelastpickle/docker-cassandra-bootstrap/blob/master/cassandra/config/filebeat.yml\">https://github.com/thelastpickle/docker-cassandra-bootstrap/blob/master/cassandra/config/filebeat.yml</a>.) The values in output.elasticsearch and setup.kibana are their respective IP addresses and port numbers. For filebeat.prospectors -- a <em>prospector</em> manages all the log inputs -- two types of logs are used here, the system log and the debug log. For each, we will exclude any compressed (.zip) files. The multiline* settings define how multiple lines in the log files are handled. Here, filebeat treats each line that starts with one of the patterns shown as the beginning of a new event and appends the following non-matching lines to it until it reaches the next match. 
More options are available at: <a href=\"https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html\">https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html</a>.<pre>output.elasticsearch:\n  enabled: true\n  hosts: [\"172.16.238.31:9200\"]\nsetup.kibana:\n  host: \"172.16.238.33:5601\"\nfilebeat.prospectors:\n  - input_type: log\n    paths:\n      - \"/var/log/cassandra/system.log*\"\n    document_type: cassandra_system_logs\n    exclude_files: ['\.zip$']\n    multiline.pattern: '^(TRACE|DEBUG|WARN|INFO|ERROR)'\n    multiline.negate: true\n    multiline.match: after\n  - input_type: log\n    paths:\n      - \"/var/log/cassandra/debug.log*\"\n    document_type: cassandra_debug_logs\n    exclude_files: ['\.zip$']\n    multiline.pattern: '^(TRACE|DEBUG|WARN|INFO|ERROR)'\n    multiline.negate: true\n    multiline.match: after</pre> <strong>3. Set up Kibana dashboards</strong><pre>filebeat setup --dashboards</pre>  Example output:<pre>Loaded dashboards</pre> <strong>4. Start the beat</strong><pre>service filebeat start</pre>  Example output:<pre>2018-04-12T20:43:03.798Z INFO instance/beat.go:468 Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]\n2018-04-12T20:43:03.799Z INFO instance/beat.go:475 Beat UUID: 2f43562f-985b-49fc-b229-83535149c52b\n2018-04-12T20:43:03.800Z INFO instance/beat.go:213 Setup Beat: filebeat; Version: 6.2.3\n2018-04-12T20:43:03.801Z INFO elasticsearch/client.go:145 Elasticsearch url: https://172.16.238.31:9200\n2018-04-12T20:43:03.802Z INFO pipeline/module.go:76 Beat name: C1\nConfig OK</pre> <strong>View the graphs:</strong>Then view the Kibana graphs in a local browser at: <a href=\"https://localhost:5601\">https://localhost:5601</a>.   
Run some sample load against one of the nodes to get more logs to experiment with:<pre>cassandra-stress write n=20000 -pop seq=1..20000 -rate threads=4</pre><img class=\"alignnone wp-image-104077 size-full\" src=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image1-4.png?width=720&amp;height=171&amp;name=image1-4.png\" alt=\"Example output from Cassandra-stress being used to populate test data\" width=\"720\" height=\"171\" srcset=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image1-4.png?width=360&amp;height=86&amp;name=image1-4.png 360w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image1-4.png?width=720&amp;height=171&amp;name=image1-4.png 720w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image1-4.png?width=1080&amp;height=257&amp;name=image1-4.png 1080w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image1-4.png?width=1440&amp;height=342&amp;name=image1-4.png 1440w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image1-4.png?width=1800&amp;height=428&amp;name=image1-4.png 1800w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image1-4.png?width=2160&amp;height=513&amp;name=image1-4.png 2160w\" referrerpolicy=\"no-referrer\" />Here are some sample queries to run in Kibana:<ul><li class=\"c5\">message:WARN*</li>\n<li class=\"c5\">message:(ERROR* OR WARN*)</li>\n<li class=\"c5\">message:(ERROR* OR WARN*) AND beat.hostname:DC1C2</li>\n</ul>  You can also filter the display by choosing from the available fields on the left.<img class=\"alignnone wp-image-104078 size-full\" src=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image2-5.png?width=1020&amp;height=550&amp;name=image2-5.png\" alt=\"Kibana dashboard example display\" width=\"1020\" height=\"550\" srcset=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image2-5.png?width=510&amp;height=275&amp;name=image2-5.png 510w, 
https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image2-5.png?width=1020&amp;height=550&amp;name=image2-5.png 1020w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image2-5.png?width=1530&amp;height=825&amp;name=image2-5.png 1530w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image2-5.png?width=2040&amp;height=1100&amp;name=image2-5.png 2040w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image2-5.png?width=2550&amp;height=1375&amp;name=image2-5.png 2550w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image2-5.png?width=3060&amp;height=1650&amp;name=image2-5.png 3060w\" referrerpolicy=\"no-referrer\" />  If you would like to see what the logs look like at each step of the workflow, view the logs within the Cassandra container in /var/log/cassandra like this:<pre>tail /var/log/cassandra/debug.log</pre>Example output:<pre>WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-05-07 14:01:09,216 NoSpamLogger.java:94 - Out of 0 commit log syncs over the past 0.00s with average duration of Infinityms, 1 have exceeded the configured commit interval by an average of 80.52ms</pre>  View this data stored in Elasticsearch (in JSON format) in a browser like this: <a href=\"https://localhost:9200/_search?q=(message:(ERROR*%20OR%20WARN*)%20AND%20beat.hostname:DC1C2)\">https://localhost:9200/_search?q=(message:(ERROR*%20OR%20WARN*)%20AND%20beat.hostname:DC1C2)</a>Example output:<img class=\"alignnone size-full wp-image-104082\" src=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image4-3.png?width=836&amp;height=570&amp;name=image4-3.png\" alt=\"\" width=\"836\" height=\"570\" srcset=\"https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image4-3.png?width=418&amp;height=285&amp;name=image4-3.png 418w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image4-3.png?width=836&amp;height=570&amp;name=image4-3.png 836w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image4-3.png?width=1254&amp;height=855&amp;name=image4-3.png 1254w, 
https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image4-3.png?width=1672&amp;height=1140&amp;name=image4-3.png 1672w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image4-3.png?width=2090&amp;height=1425&amp;name=image4-3.png 2090w, https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/image4-3.png?width=2508&amp;height=1710&amp;name=image4-3.png 2508w\" referrerpolicy=\"no-referrer\" />","id":"294431cd-3823-54d9-8293-635cc0794b4a","title":"Cassandra open-source log analysis in Kibana, using filebeat, modeled in Docker","origin_url":"https://www.pythian.com/blog/cassandra-open-source-log-analysis-kibana-using-filebeat-modeled-docker","url":"https://www.pythian.com/blog/cassandra-open-source-log-analysis-kibana-using-filebeat-modeled-docker","wallabag_created_at":"2024-02-16T18:58:29+00:00","published_at":null,"published_by":null,"reading_time":4,"domain_name":"www.pythian.com","preview_picture":"https://www.pythian.com/hubfs/Pythian-Featured-Meta-OG-Image.jpg","tags":["elastic","logging","kibana","cassandra"],"description":"I was recently asked to set up a solution for Cassandra open-source log analysis to include in an existing Elasticsearch-Logstash-Kibana (ELK) stack. 
After some research on more of the newer capabilit..."},{"content":"Vald\n<main role=\"main\"><div class=\"top\"><div class=\"top__content\"><p class=\"top__title\">Vald</p><p class=\"top__lead\">A Highly Scalable Distributed Vector Search Engine</p><p><a href=\"https://vald.vdaas.org/docs/tutorial/get-started/\" class=\"top__linkitem top__linkitem--getstarted\">Get Started</a> <a href=\"https://vald.vdaas.org/docs/\" class=\"top__linkitem top__linkitem--docs\">Documents</a> <a href=\"https://join.slack.com/t/vald-community/shared_invite/zt-db2ky9o4-R_9p2sVp8xRwztVa8gfnPA\" target=\"_blank\" class=\"top__linkitem top__linkitem--valdslack\">Vald Slack</a></p></div></div>\n<div class=\"concept\" id=\"concept\"><div class=\"concept__content\"><img src=\"https://vald.vdaas.org/images/vald_color_1.svg\" alt=\"\" referrerpolicy=\"no-referrer\" /><div class=\"concept__block cf\"><p class=\"concept__text\">Vald is designed and implemented based on the Cloud-Native architecture. It uses the fastest ANN Algorithm NGT to search neighbors. Vald has automatic vector indexing, index backup, and horizontal scaling, which make it suitable for searching billions of feature vectors. Vald is easy to use, feature-rich, and highly customizable to your needs.</p></div></div></div>\n<div class=\"features features__content\"><ul class=\"features__list\"><li class=\"features__item\">\n<div class=\"features__title\"><img class=\"features__icon\" src=\"https://vald.vdaas.org/images/features_icon_asynchronize.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index asynchronize\">Asynchronous Auto Indexing</h3></div>\n<p class=\"features__text\">Usually the graph requires locking during indexing, which causes a stop-the-world pause. 
But Vald uses a distributed index graph, so it continues to work during indexing.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_filtering.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index filtering\">Customizable Ingress/Egress Filtering</h3></div>\n<p class=\"features__text\">Vald implements its own highly customizable Ingress/Egress filter, which can be configured to fit the gRPC interface.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_cloud-native.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index cloud-native\">Cloud-native based vector searching engine</h3></div>\n<p class=\"features__text\">Horizontally scalable on memory and CPU to meet your demand.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_auto-indexing.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index auto-indexing\">Auto Indexing Backup</h3></div>\n<p class=\"features__text\">Vald supports automatic backup using Object Storage or Persistent Volume, which enables disaster recovery.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_distributed.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index distributed\">Distributed Indexing</h3></div>\n<p class=\"features__text\">Vald distributes the vector index across multiple agents; each agent stores a different index.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_replication.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" 
/><h3 class=\"features__index replication\">Index Replication</h3></div>\n<p class=\"features__text\">Vald stores each index in multiple agents, which enables index replicas. It automatically rebalances the replicas when a Vald agent goes down.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_easy.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index easy\">Easy to use</h3></div>\n<p class=\"features__text\">Vald can be easily installed in a few steps.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_customizable.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index customizable\">Highly customizable</h3></div>\n<p class=\"features__text\">You can configure the number of vector dimensions, the number of replicas, and more.</p>\n</li>\n<li class=\"features__item\">\n<div class=\"features__title\"><img src=\"https://vald.vdaas.org/images/features_icon_multi-language.png\" alt=\"\" width=\"48px\" height=\"48px\" referrerpolicy=\"no-referrer\" /><h3 class=\"features__index multi-language\">Multi-language support</h3></div>\n<p class=\"features__text\">Go, Java, Node.js, and Python are supported.</p>\n</li>\n</ul></div>\n</main>","id":"9fb11831-7697-54a5-ad0a-6f3c56c33344","title":"Vald","origin_url":"https://vald.vdaas.org/","url":"https://vald.vdaas.org/","wallabag_created_at":"2024-02-11T20:02:22+00:00","published_at":null,"published_by":null,"reading_time":1,"domain_name":"vald.vdaas.org","preview_picture":"https://vald.vdaas.org//images/ogp_vald.png","tags":["python","java","cassandra","vector.database","go","scylladb","distributed","mysql","vector","vector.search","redis","docker"],"description":"Vald\nValdA Highly Scalable Distributed Vector Search EngineGet Started Documents Vald Slack\nVald is designed 
and implemented based on the Cloud-Native architecture. It uses the fastest ANN Algorithm N..."},{"content":"Para - Backend for busy developers.\n<div class=\"hello row small-12 columns text-center\"><p></p><h3>Para is a flexible and cost-effective backend service that stores, indexes and caches your objects.</h3><p><a href=\"https://paraio.com/signin\" class=\"button radius hello-btn\">Get started for free</a></p></div>\n<section class=\"em5v\"><div class=\"row text-center\"><div class=\"medium-3 columns\"><h4 class=\"mtl\">Open Source</h4><hr class=\"hello-hr\" /><p class=\"text-justify text-lighter pam\">A managed service based on the open source library <a href=\"https://paraio.org\">Para</a>. Adding your own features is a pull request away on <a href=\"https://github.com/erudika/para\">GitHub</a>.</p></div><div class=\"medium-3 columns\"><h4 class=\"mtl\">Cloud-based</h4><hr class=\"hello-hr\" /><p class=\"text-justify text-lighter pam\">We are hosted on Amazon Web Services, inside a private cloud, which allows us to scale up and down quickly.</p></div><div class=\"medium-3 columns\"><h4 class=\"mtl\">No vendor lock-in</h4><hr class=\"hello-hr\" /><p class=\"text-justify text-lighter pam\">You can always switch to another hosting service or run Para on your infrastructure. We're cool with that.</p></div><div class=\"medium-3 columns\"><h4 class=\"mtl\">Client-side focus</h4><hr class=\"hello-hr\" /><p class=\"text-justify text-lighter pam\">Our simple JSON API and client SDKs allow you to easily integrate Para into your code.</p></div></div></section><section id=\"features\" class=\"em3v text-center\"><div class=\"em5v\"><div class=\"row medium-12 columns phl\"><ul class=\"medium-block-grid-3\"><li>\n<div class=\"row text-center medium-10 medium-offset-1 columns\"><h4 class=\"mtl\">Advanced authentication</h4><hr class=\"hello-hr\" /><p class=\"text-justify\">Your users can be authenticated with their social accounts and Para will issue a JSON Web Token (JWT). 
LDAP and Active Directory are also supported. This combined with powerful resource permissions for each user, gives you great flexibility and security.</p></div>\n</li>\n<li>\n<div class=\"row text-center medium-10 medium-offset-1 columns\"><h4 class=\"mtl\">Full-text search</h4><hr class=\"hello-hr\" /><p class=\"text-justify\">All your objects are indexed automatically by <strong>Elasticsearch</strong> - a robust and scalable search engine. Para supports many types of queries like basic field queries, wildcard, geo point, and \"more-like-this\" queries.</p></div>\n</li>\n<li>\n<div class=\"row text-center medium-10 medium-offset-1 columns\"><h4 class=\"mtl\">JSON API</h4><hr class=\"hello-hr\" /><p class=\"text-justify\">The API is simple and secure. Using basic HTTP methods you can create, read, update and delete objects. It also supports pagination and field limiting. All API calls are signed using the latest <strong>AWS signature v4 algorithm</strong>, used by Amazon.</p></div>\n</li>\n</ul></div><div class=\"row medium-12 columns phl\"><ul class=\"medium-block-grid-3\"><li>\n<div class=\"row text-center medium-10 medium-offset-1 columns\"><h4 class=\"mtl\">Distributed cache</h4><hr class=\"hello-hr\" /><p class=\"text-justify\">Your objects are cached when they are created or updated. Caching allows for faster read times and decreases the load on the database. This works great for \"hot\" objects that don't change very often.</p></div>\n</li>\n<li>\n<div class=\"row text-center medium-10 medium-offset-1 columns\"><h4 class=\"mtl\">Scalability</h4><hr class=\"hello-hr\" /><p class=\"text-justify\">Para stores data in several regions of the <strong>AWS cloud</strong> and is powered by <strong>DynamoDB</strong> - a scalable, fault tolerant database with very low latency. 
Your data is stored in globally replicated tables, encrypted, on Solid State Drives (SSDs).</p></div>\n</li>\n<li>\n<div class=\"row text-center medium-10 medium-offset-1 columns\"><h4 class=\"mtl\">Backup &amp; Restore</h4><hr class=\"hello-hr\" /><p class=\"text-justify\">Premium plans feature an easy-to-use backup and restore control panel which enables you to import and export data from <strong>Para</strong>. Simply download the JSON data archive and restore it when needed.</p></div>\n</li>\n</ul></div><p></p><h4 class=\"mtl\">Client libraries for easy integration</h4>\n<hr class=\"hello-hr\" /></div></section><section id=\"pricing\" class=\"em3v text-center\"><div class=\"em3v\"><div class=\"row\"><div class=\"large-3 columns\"><ul class=\"pricing-table\"><li class=\"title\">The Free Plan</li>\n<li class=\"price\">Free</li>\n<li class=\"bullet-item\"><strong>One app only</strong></li>\n<li class=\"bullet-item\">Shared database</li>\n<li class=\"bullet-item\">Maximum 10K objects</li>\n<li class=\"bullet-item\">Global replication</li>\n<li class=\"bullet-item\">Email support</li>\n<li class=\"bullet-item\">Chat with us <a href=\"https://gitter.im/Erudika/para\">on Gitter</a></li>\n</ul></div><div class=\"large-3 columns\"><ul class=\"pricing-table\"><li class=\"title\">The Paid Plan</li>\n<li class=\"price\">€25\n<p><strong>per app</strong>, monthly</p>\n</li>\n<li class=\"bullet-item\">Dedicated database</li>\n<li class=\"bullet-item\">10 GB of storage</li>\n<li class=\"bullet-item\">Automatic scalability</li>\n<li class=\"bullet-item\">Global replication</li>\n<li class=\"bullet-item\">Backup &amp; Restore</li>\n<li class=\"bullet-item\">Email support</li>\n</ul></div></div><div class=\"row large-12 columns text-center\"><p><em>The prices don't include European Union VAT (not applicable to customers outside the EU).</em></p><p>Billing is on a monthly basis. Annual billing is also available - pay 10 months and get 2 months free. 
Additional storage and app credits can be purchased at any time.</p><a href=\"https://paraio.com/faq\">Frequently Asked Questions</a></div></div></section><section id=\"signup\" class=\"em3v text-center\"><div class=\"row large-12 columns text-center\"><p class=\"text-lighter\">No registration required, just sign in with a social account.</p></div><p><a href=\"https://paraio.com/signin\" class=\"button radius large hello-btn\">Sign in</a></p><p></p><h5>Get updates on Twitter</h5>\n</section><footer class=\"row\"><div class=\"small-12 columns\"><hr /><div class=\"row\"><div class=\"medium-6 columns\"><p><a href=\"https://paraio.com/\"><img src=\"https://d1pzt52sl00uiv.cloudfront.net/static/202311261535/apple-touch-icon-precomposed.png\" width=\"30\" height=\"30\" alt=\"logo\" referrerpolicy=\"no-referrer\" /></a>  Crafted in the EU by <a href=\"https://erudika.com\">Erudika</a>.</p></div><div class=\"medium-6 columns\"><ul class=\"inline-list right\"><li><a href=\"https://paraio.com/signin\">Sign in</a></li>\n<li><a href=\"https://paraio.com/#pricing\">Pricing</a></li>\n<li><a href=\"https://paraio.com/support\">Support</a></li>\n<li><a href=\"https://paraio.com/docs\">Docs</a></li>\n<li><a href=\"https://paraio.com/contact\">Contact</a></li>\n<li><a href=\"https://paraio.com/legal\">Legal</a></li>\n</ul></div></div></div>\n</footer>","id":"a8e3cbee-aa7a-52eb-b3ca-a920f69117d1","title":"Para - backend for busy developers","origin_url":"https://paraio.com/","url":"https://paraio.com/","wallabag_created_at":"2024-01-28T14:49:47+00:00","published_at":null,"published_by":null,"reading_time":2,"domain_name":"paraio.com","preview_picture":"https://d1pzt52sl00uiv.cloudfront.net/static/202605261351/images/9b5033a96d21c2cbc32f523412b95ed1.logo-wide.png","tags":["jvm","rest","search","java","cassandra","elastic","lucene","api"],"description":"Para - Backend for busy developers.\nPara is a flexible and cost-effective backend service that stores, indexes and caches your 
objects.Get started for free\nOpen SourceA managed service based on the op..."},{"content":"<p class=\"f4 mb-3\">Multitenant backend server for building web and mobile apps rapidly. The backend for busy developers. (self-hosted or hosted)</p><p><a title=\"https://paraio.org\" role=\"link\" target=\"_blank\" class=\"text-bold\" rel=\"noopener noreferrer\" href=\"https://paraio.org\">paraio.org</a></p><h3 class=\"sr-only\">License</h3><p><a href=\"https://github.com/Erudika/para/blob/master/LICENSE\" class=\"Link--muted\" data-analytics-event=\"{&quot;category&quot;:&quot;Repository Overview&quot;,&quot;action&quot;:&quot;click&quot;,&quot;label&quot;:&quot;location:sidebar;file:license&quot;}\">Apache-2.0 license</a></p><p><a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/Erudika/para/stargazers\">507 stars</a> <a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/Erudika/para/forks\">141 forks</a> <a class=\"Link--secondary no-underline mr-3 d-inline-block\" href=\"https://github.com/Erudika/para/branches\">Branches</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/Erudika/para/tags\">Tags</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/Erudika/para/activity\">Activity</a></p>","id":"0533fa1a-4095-538d-8426-7b0ed0c2b5c1","title":"GitHub - Erudika/para: Multitenant backend server for building web and mobile apps rapidly. The backend for busy developers. 
(self-hosted or hosted)","origin_url":"https://github.com/Erudika/para","url":"https://github.com/Erudika/para","wallabag_created_at":"2024-01-26T14:53:22+00:00","published_at":null,"published_by":"['']","reading_time":null,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/2204f6ef5bbccb71656d9d43ff9116ca157ebf50d578547bbd1d8678bf3a3232/Erudika/para","tags":["mongo","rest","elasticsearch","cassandra","elastic","lucene","api","dynamo","baas"],"description":"Multitenant backend server for building web and mobile apps rapidly. The backend for busy developers. (self-hosted or hosted)paraio.orgLicenseApache-2.0 license507 stars 141 forks Branches Tags Activi..."},{"content":"<div id=\"js-flash-container\" data-turbo-replace=\"\"><div class=\"flash flash-full {{ className }} px-2\"><p>{{ message }}</p></div>\n</div><div class=\"application-main\" data-commit-hovercards-enabled=\"\" data-discussion-hovercards-enabled=\"\" data-issue-and-pr-hovercards-enabled=\"\"><main id=\"js-repo-pjax-container\"><div id=\"repository-container-header\" class=\"pt-3 hide-full-screen c6\" data-turbo-replace=\"\"><div class=\"d-flex flex-wrap flex-justify-end mb-3 px-3 px-md-4 px-lg-5 c4\"><p> / <strong itemprop=\"name\" class=\"mr-2 flex-self-stretch\"><a data-pjax=\"#repo-content-pjax-container\" data-turbo-frame=\"repo-content-turbo-frame\" href=\"https://github.com/loopbackio/loopback-next\">loopback-next</a></strong> Public</p><div id=\"repository-details-container\" data-turbo-replace=\"\"><ul class=\"pagehead-actions flex-shrink-0 d-none d-md-inline c3\"><li><a href=\"https://github.com/login?return_to=%2Floopbackio%2Floopback-next\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;notification subscription menu 
watch&quot;,&quot;repository_id&quot;:null,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/loopbackio/loopback-next&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"ced7971f240c55765bb3ff2cdb5afd6d9eeb730de86743713aeefd69d21f7657\" aria-label=\"You must be signed in to change notification settings\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn\">Notifications</a></li>\n<li><a id=\"fork-button\" href=\"https://github.com/login?return_to=%2Floopbackio%2Floopback-next\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;repo details fork button&quot;,&quot;repository_id&quot;:78452015,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/loopbackio/loopback-next&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"0c6d49360c03c88f58a6e1d2a6aa3ef8e93d5bf6ee1838e5c2c1ed4dad751500\" data-view-component=\"true\" class=\"btn-sm btn\">Fork 1k</a></li>\n<li>\n<p><a href=\"https://github.com/login?return_to=%2Floopbackio%2Floopback-next\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;star button&quot;,&quot;repository_id&quot;:78452015,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/loopbackio/loopback-next&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"3eeb8ce532212919e4a6474c6a187a1b7cb2c2c5ac4da8557b61fb9ad7ac5e23\" aria-label=\"You must be signed in to star a repository\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn BtnGroup-item\"> Star 4.7k</a> </p>\n</li>\n</ul></div></div><div class=\"d-block d-md-none mb-2 px-3 px-md-4 px-lg-5\" id=\"responsive-meta-container\" data-turbo-replace=\"\"><p class=\"f4 mb-3\">LoopBack makes it easy to build modern API applications 
that require complex integrations.</p><p><a title=\"https://loopback.io\" role=\"link\" target=\"_blank\" class=\"text-bold\" rel=\"noopener noreferrer\" href=\"https://loopback.io\">loopback.io</a></p><h3 class=\"sr-only\">License</h3><p><a href=\"https://github.com/loopbackio/loopback-next/blob/master/LICENSE\" class=\"Link--muted\" data-analytics-event=\"{&quot;category&quot;:&quot;Repository Overview&quot;,&quot;action&quot;:&quot;click&quot;,&quot;label&quot;:&quot;location:sidebar;file:license&quot;}\"> View license</a></p><p><a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/loopbackio/loopback-next/stargazers\"> 4.7k stars</a> <a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/loopbackio/loopback-next/forks\"> 1k forks</a> <a class=\"Link--secondary no-underline mr-3 d-inline-block\" href=\"https://github.com/loopbackio/loopback-next/branches\"> Branches</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/loopbackio/loopback-next/tags\"> Tags</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/loopbackio/loopback-next/activity\"> Activity</a></p><div class=\"d-flex flex-wrap gap-2\"><p><a href=\"https://github.com/login?return_to=%2Floopbackio%2Floopback-next\" rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;star button&quot;,&quot;repository_id&quot;:78452015,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/loopbackio/loopback-next&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"3eeb8ce532212919e4a6474c6a187a1b7cb2c2c5ac4da8557b61fb9ad7ac5e23\" aria-label=\"You must be signed in to star a repository\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn btn-block BtnGroup-item\"> Star</a> </p><p><a href=\"https://github.com/login?return_to=%2Floopbackio%2Floopback-next\" 
rel=\"nofollow\" data-hydro-click=\"{&quot;event_type&quot;:&quot;authentication.click&quot;,&quot;payload&quot;:{&quot;location_in_page&quot;:&quot;notification subscription menu watch&quot;,&quot;repository_id&quot;:null,&quot;auth_type&quot;:&quot;LOG_IN&quot;,&quot;originating_url&quot;:&quot;https://github.com/loopbackio/loopback-next&quot;,&quot;user_id&quot;:null}}\" data-hydro-click-hmac=\"ced7971f240c55765bb3ff2cdb5afd6d9eeb730de86743713aeefd69d21f7657\" aria-label=\"You must be signed in to change notification settings\" data-view-component=\"true\" class=\"tooltipped tooltipped-s btn-sm btn btn-block\">Notifications</a></p></div></div></div>\n</main></div><p> You can’t perform that action at this time.</p><details class=\"details-reset details-overlay details-overlay-dark lh-default color-fg-default hx_rsm\" open=\"open\">\n\n</details>","id":"3ea843b4-1c53-50b2-a104-edd4d201d162","title":"GitHub - loopbackio/loopback-next: LoopBack makes it easy to build modern API applications that require complex integrations.","origin_url":"https://github.com/loopbackio/loopback-next","url":"https://github.com/loopbackio/loopback-next","wallabag_created_at":"2024-01-26T14:52:26+00:00","published_at":null,"published_by":"['']","reading_time":null,"domain_name":"github.com","preview_picture":"https://repository-images.githubusercontent.com/78452015/2e9bdb80-6b51-11e9-9630-fdcead4ff24d","tags":["mongo","code.generation","sqlite","cassandra","db2","openapi","mysql","api","grpc","postgres","sql"],"description":"{{ message }}\n / loopback-next PublicNotifications\nFork 1k\n\n Star 4.7k \n\nLoopBack makes it easy to build modern API applications that require complex integrations.loopback.ioLicense View license 4.7k ..."}],"tagSets":[{"tag":"hive","articles":[{"content":"<p>Free multi-platform database tool for developers, database administrators, analysts and all people who need to work with databases. 
Supports all popular databases: MySQL, PostgreSQL, SQLite, Oracle, DB2, SQL Server, Sybase, MS Access, Teradata, Firebird, Apache Hive, Phoenix, Presto, etc.</p><p><a href=\"https://dbeaver.io/download\" class=\"blue button\">Download</a></p>","id":"62edea6c-0d3e-5635-a5ab-ead1f7c1785f","title":"DBeaver Community | Free Universal Database Tool","origin_url":"https://dbeaver.io/","url":"https://dbeaver.io/","wallabag_created_at":"2023-05-18T00:13:27+00:00","published_at":null,"published_by":null,"reading_time":null,"domain_name":"dbeaver.io","preview_picture":null,"tags":["hive","firebird","oracle","sybase","manager","sqlite","sql","database","datastax","cassandra","db2","mysql","postgres","presto"],"description":"Free multi-platform database tool for developers, database administrators, analysts and all people who need to work with databases. Supports all popular databases: MySQL, PostgreSQL, SQLite, Oracle, D..."},{"content":"<figure class=\"dp dq dr ds dt du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Photo credit <a href=\"https://pixabay.com/en/abstract-blur-britain-british-1239439\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">Meditations</a></figcaption></figure><h2 id=\"9ae1\" class=\"fn eu as ar ck fo fp fq fr fs ft fu fv fw fx fy fz ga gb gc gd aw\">Introduction</h2><div class=\"ge\"><div class=\"n gf gg gh gi\"><div class=\"o n\"><div><a rel=\"noopener\" href=\"https://medium.com/@_sandeep_malik?source=post_page-----9cb4dca91c3d----------------------\"><img alt=\"Sandeep Malik\" class=\"r gj gk gl\" src=\"https://miro.medium.com/fit/c/96/96/0*dfwt6aN713ngmttu.jpg\" width=\"48\" height=\"48\" /></a></div></div></div></div><p id=\"733e\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Big data world is messy! The choice of frameworks, platforms, and data sources is simply mind-boggling. 
Several teams at Walmart use a mix of various stacks and tools to make sense of the data. While having choices is great, it does become a bit overwhelming when someone wants to access data in a coherent and uniform fashion. Each system exposes a different set of APIs, a different language to write queries in, and potentially a different way of analyzing the data at large scale.</p><p id=\"a4f4\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">It became clear to us early on that we needed to invest in a tech stack that can provide access to TBs of structured and semi-structured data for quick analysis and still be manageable with minimal overhead. It was also desirable that the query language be as close to SQL as possible so that it could be used with relative ease.</p><p id=\"0ff1\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Some of the systems we looked at were <a href=\"https://impala.incubator.apache.org/\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">Apache Impala</a>, <a href=\"http://spark.apache.org/\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">Apache Spark</a>, and <a href=\"https://prestosql.io\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">Presto</a>. Each of these systems is battle-tested and supports a large number of production systems today. 
We started gravitating towards Presto for the following reasons:</p><ul class=\"\"><li id=\"3286\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il im in io fm\">Support for a large number of data sources (Hive, Cassandra, Kafka, Elastic Search)</li><li id=\"0132\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Extremely rich SQL dialect</li><li id=\"cdc9\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Ability to cross join data from multiple disparate sources</li><li id=\"4095\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">No external dependency on systems like ZooKeeper, etc</li><li id=\"5684\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Great performance</li></ul><p id=\"77c0\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">In this article, we will explore our Presto set up and examine its performance for some common use cases. We will also focus on <a href=\"https://zeppelin.apache.org/\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">Apache Zeppelin</a>, which is an excellent open source data visualization tool for interactive queries and helps team collaborate in real time.</p><h2 id=\"8b4c\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Presto: A distributed SQL engine</h2><p id=\"8e1f\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\"><a href=\"https://prestosql.io/\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">Presto</a> is an excellent distributed SQL engine that is optimized for running queries over huge datasets across multiple data sources. 
It supports distributed joins across multiple data sources like Hive, Kafka, Elastic Search, Cassandra, etc, thereby allowing a uniform access model for analytics.</p><p id=\"925b\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Setting up the Presto cluster is easy. We set up a 8 node (each having 8 cores and 16G RAM) presto cluster in literally under 20 mins. Presto provides live query plans, which is a big plus. It helps us understand how to tune the queries for more performance. Presto also provides a JDBC driver, so accessing it from Java applications is very convenient as well.</p><h2 id=\"6ce3\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Configuring Presto</h2><p id=\"865b\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">Presto requires minimal configuration. We need to provide node.properties file under etc/, which requires bare minimum properties, e.g.</p><pre class=\"jn jo jp jq jr js jt bz\">node.environment=prod<br />node.id=node1<br />node.data-dir=/var/presto/data</pre><p id=\"b405\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Each node also requires a config.properties.</p><pre class=\"jn jo jp jq jr js jt bz\">coordinator=false<br />http-server.http.port=8080<br />query.max-memory=50GB<br />query.max-memory-per-node=8GB<br />discovery-server.enabled=true<br />discovery.uri=http://${coordinator_ip}:8080</pre><p id=\"678f\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">One of the nodes has to be made coordinator and needs some more properties</p><pre class=\"jn jo jp jq jr js jt bz\">coordinator=true<br />node-scheduler.include-coordinator=false<br />http-server.http.port=8080<br />discovery-server.enabled=true<br />discovery.uri=http://localhost:8080</pre><p id=\"3162\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">One issue we ran into while configuring Presto 
was with the discovery.uri property. Presto uses its own discovery mechanism to identify the cluster workers. <strong class=\"ia jy\"><em class=\"jz\">We observed that in the coordinator’s config.properties the discovery.uri should point to localhost, while on all other nodes it should point to the coordinator’s IP.</em></strong></p><p id=\"e0c5\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Finally, we need to add catalogs (connectors for different data sources) under the etc/catalog directory. The example below shows how to add a Cassandra connector with minimal configuration:</p><pre class=\"jn jo jp jq jr js jt bz\">connector.name=cassandra<br />cassandra.contact-points=&lt;comma separated IPs&gt;<br />cassandra.consistency-level=LOCAL_ONE<br />cassandra.username=readonly_u<br />cassandra.password=readonly_p<br />cassandra.load-policy.use-dc-aware=true<br />cassandra.load-policy.dc-aware.local-dc=DC1<br />cassandra.load-policy.use-token-aware=true</pre><p id=\"fca6\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">As mentioned before, Presto can connect to a lot of data sources including Cassandra, Hive, and Kafka among many others. 
A full list of connectors can be found <a href=\"https://prestosql.io/docs/current/connector.html\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">here</a>.</p><p id=\"df65\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Once the cluster is set up, the console can be accessed at the discovery.uri.</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Presto cluster (7 workers + 1 coordinator, 8 cores/16G)</figcaption></figure><h2 id=\"6861\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Measuring Presto Performance</h2><p id=\"5079\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">The first task that we set out to do was to measure query performance. We had a table handy with close to 260M rows. Each row held roughly 1 KB of data. The schema of the table is as follows:</p><pre class=\"jn jo jp jq jr js jt bz\">CREATE TABLE event_lookup (<br />id text,<br />bucket_id timestamp,<br />payload text,<br />PRIMARY KEY (( id ))<br />)</pre><h2 id=\"6dcf\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Use case 1: count(*)</h2><p id=\"6a1f\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">We ran a simple query against the Cassandra table to return its total number of partitions.</p><pre class=\"jn jo jp jq jr js jt bz\">SELECT COUNT(*) FROM event_lookup;</pre><p id=\"3dd8\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">This resulted in a full table scan by Presto at an impressive rate of ~418K rows / second! The inflow rate was ~ 400KB/sec. 
The screen shot below shows the rates along with other metrics.</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Total counts query performance over 260M rows</figcaption></figure><p id=\"3aef\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">During the query execution, there were occasional blips where read rate reached almost <strong class=\"ia jy\">1 million rows / second</strong></p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Max peak performance at 1.06M rows/sec</figcaption></figure><p id=\"8c72\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Worker nodes also had almost equal distribution of load. From the screen shot below, each worker was processing on an average 51K rows / second.</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Worker nodes load distribution (Host IPs hidden for privacy)</figcaption></figure><h2 id=\"7ab4\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Use case 2: group by</h2><p id=\"9261\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">We wanted to execute some thing more complicated than a count(*) query to examine performance better. We used a group by query as follows:</p><pre class=\"jn jo jp jq jr js jt bz\">SELECT bucket_id, count(*) FROM event_lookup GROUP BY bucket_id ORDER BY count(*) LIMIT 10;</pre><p id=\"e875\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">This time around, the results were even more impressive. 
The average query rate reached an impressive 560K rows / sec!</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Presto cluster dashboard view</figcaption></figure><p id=\"b205\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">The worker node distribution was a little skewed, possibly because of the way the “bucket_id” field was partitioned among the Cassandra token ranges. What was stunning was that one of the worker nodes reached a maximum of ~100K rows/sec.</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Worker nodes load distribution for group by query</figcaption></figure><p id=\"fe43\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Having obtained satisfactory results, we moved on to the other end of the spectrum: visualization!</p><h2 id=\"6389\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">The other end of the spectrum: Apache Zeppelin</h2><p id=\"675d\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">Data analytics is not appealing unless it is complemented by strong visualization. Visualization goes a long way in quickly deriving data patterns and correlations. While there are tools and libraries that provide excellent visualization support, they require a fairly good understanding of JavaScript and CSS, which is not something that data analysts are expected to know. <a href=\"https://zeppelin.apache.org/\" class=\"co eo ep eq er es\" target=\"_blank\" rel=\"noopener nofollow\">Apache Zeppelin</a> aims to fill precisely that gap. It provides analysts an extremely convenient way to create interactive web notebooks to write and execute queries and visualize the results immediately. The web notebooks can be shared and collaborated on in real time for valuable feedback. 
The queries can be scheduled so that they can be executed periodically. Like Presto, Zeppelin can connect to many data sources as well. In this article we will focus on JDBC connector for accessing Presto.</p><h2 id=\"bf66\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Configuring JDBC connector for Presto</h2><p id=\"35be\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">Adding a JDBC connector in Zeppelin is very easy. We just need to provide url, user, password, and driver properties. Below screen shot shows how to configure a Presto connection</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Adding Presto connector / interpreter to Zeppelin</figcaption></figure><p id=\"b419\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Once the connector is added, we can create a notebook and add a new note. Inside the new note, the query can be executed as follows and the results can be visualized immediately in a variety of charts.</p><pre class=\"jn jo jp jq jr js jt bz\">%jdbc SELECT bucket_id, COUNT(*) FROM event_buckets GROUP BY bucket_id</pre><p id=\"19b2\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\"><em class=\"jz\">Note: if Presto is not configured as default interpreter, then you need to provide the name of the interpreter in the query</em></p><pre class=\"jn jo jp jq jr js jt bz\">%jdbc(presto) SELECT bucket_id, COUNT(*) FROM event_buckets GROUP BY bucket_id</pre><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Zeppelin visualization for Presto query</figcaption></figure><h2 id=\"238a\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Building an interactive SQL executor</h2><p id=\"1261\" class=\"hy hz as ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">Zeppelin 
makes it very easy to build interactive form based UIs. Let’s try that by building a SQL input form. A general purpose SQL query executor requires following inputs:</p><ul class=\"\"><li id=\"f1f5\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il im in io fm\">Name of the fields/columns to be retrieved</li><li id=\"b248\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Table name (FROM clause)</li><li id=\"f890\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Filter expression or aggregate expression (WHERE or GROUP BY clause)</li><li id=\"8f1f\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Sort field (ORDER BY clause)</li><li id=\"7b60\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">pagination (LIMIT and OFFSET)</li></ul><p id=\"3bbe\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">We wrote a simple template zeppelin query</p><pre class=\"jn jo jp jq jr js jt bz\">%jdbc<br />SELECT ${checkbox:Select Fields=Field1, Field1 | Field2 |  Field3 | Field4} ${Free Form Fields = }<br />FROM <br />${Select Table=keyspace1.table1, keyspace1.table1 | keyspace2.table2 | keyspace3.table3} <br />${PREDICATE CLAUSE e.g. 
WHERE, GROUP BY = } <br />${ORDER BY CLAUSE = }<br />LIMIT ${limit = 10}</pre><p id=\"28d0\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">which resulted in the following UI</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Zeppelin general SQL query form (with query)</figcaption></figure><p id=\"e7f8\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">If we hide the query part, the UI becomes extremely simple</p><figure class=\"jn jo jp jq jr du da db paragraph-image\"><figcaption class=\"ej ek dc da db el em ar ck en at aw\">Zeppelin general SQL query form (query hidden)</figcaption></figure><p id=\"d8e5\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Now that’s the power of Zeppelin! In literally under 10 minutes, we are able to create a UI, where a user can</p><ul class=\"\"><li id=\"5705\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il im in io fm\">Select fields (by selecting checkboxes, or entering in the ‘Free Form Fields’ like Count(*)</li><li id=\"839b\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Provide any ORDER BY clause</li><li id=\"c896\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Provide any WHERE or GROUP BY clause</li><li id=\"dba4\" class=\"hy hz as ia b fo ip ic fr iq ie if ir fw ih is fz ij it gc il im in io fm\">Choose any table to connect to</li></ul><p id=\"a9d8\" class=\"hy hz as ia b fo ib ic fr id ie if ig fw ih ii fz ij ik gc il di fm\">Using Zeppelin’s powerful display system (AngularJS) and variable binding mechanism, it’s very easy to create a chain of paragraphs which can execute in succession or in any arbitrary custom defined manner.</p><h2 id=\"0f40\" class=\"iu iv as ar iw ix iy ic iz ja ie jb jc fw jd je fz jf jg gc jh fm\">Summary</h2><p id=\"e756\" class=\"hy hz as 
ia b fo ji ic fr jj ie if jk fw ih jl fz ij jm gc il di fm\">In this blog we explored how we can leverage Presto to run SQL queries over data sources. We also explored how we can wire Presto with Zeppelin to create compelling visualizations and analyze patterns quickly. This has resulted in quick analysis and collaboration between different teams. In the next article, we will be exploring how the Presto cluster is allowing us to join data across Cassandra, Kafka, Hive, etc for instant analysis on fast moving data. Stay tuned for more updates.</p>","id":"ea514087-85d6-5908-8463-0c121d6f6b68","title":"Exploring Presto and Zeppelin for fast data analytics and visualization","origin_url":"https://medium.com/walmartlabs/exploring-presto-and-zeppelin-for-fast-data-analytics-and-visualization-9cb4dca91c3d","url":"https://medium.com/walmartlabs/exploring-presto-and-zeppelin-for-fast-data-analytics-and-visualization-9cb4dca91c3d","wallabag_created_at":"2020-07-15T17:01:25+00:00","published_at":"2019-02-01T16:47:01+00:00","published_by":"['']","reading_time":7,"domain_name":"medium.com","preview_picture":"https://miro.medium.com/v2/resize:fit:640/1*BIAJryOgIbbUmYix6VOSGQ.jpeg","tags":["elastic","hive","zeppelin","cassandra","kafka","presto"],"description":"Photo credit MeditationsIntroductionBig data world is messy! The choice of frameworks, platforms, and data sources is simply mind-boggling. 
Several teams at Walmart use a mix of various stacks and too..."},{"content":"<div class=\"row\"><div class=\"col-md-6\"><ul><li> Data Ingestion</li>\n      <li> Data Discovery</li>\n      <li> Data Analytics</li>\n      <li> Data Visualization &amp; Collaboration</li>\n    </ul></div><div class=\"col-md-6\"><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/notebook.png\" alt=\"image\" /></div></div><h2>Multiple Language Backend</h2><p><a href=\"https://zeppelin.apache.org/docs/0.7.0/manual/interpreters.html\">Apache Zeppelin interpreter</a> concept allows any language/data-processing-backend to be plugged into Zeppelin.\nCurrently Apache Zeppelin supports many interpreters such as Apache Spark, Python, JDBC, Markdown and Shell.</p><p><img class=\"img-responsive\" width=\"500px\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/available_interpreters.png\" alt=\"image\" /></p><p>Adding new language-backend is really simple. Learn <a href=\"https://zeppelin.apache.org/docs/0.7.0/development/writingzeppelininterpreter.html#make-your-own-interpreter\">how to create your own interpreter</a>.</p><h4>Apache Spark integration</h4><p>Especially, Apache Zeppelin provides built-in <a href=\"http://spark.apache.org/\">Apache Spark</a> integration. You don't need to build a separate module, plugin or library for it.</p><p><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/spark_logo.png\" width=\"140px\" alt=\"image\" /></p><p>Apache Zeppelin with Spark integration provides</p><ul><li>Automatic SparkContext and SQLContext injection</li>\n<li>Runtime jar dependency loading from local filesystem or maven repository. 
Learn more about <a href=\"https://zeppelin.apache.org/docs/0.7.0/interpreter/spark.html#dependencyloading\">dependency loader</a>.</li>\n<li>Canceling job and displaying its progress</li>\n</ul><p>For the further information about Apache Spark in Apache Zeppelin, please see <a href=\"https://zeppelin.apache.org/docs/0.7.0/interpreter/spark.html\">Spark interpreter for Apache Zeppelin</a>.</p><h2>Data visualization</h2><p>Some basic charts are already included in Apache Zeppelin. Visualizations are not limited to Spark SQL query, any output from any language backend can be recognized and visualized.</p><div class=\"row\"><div class=\"col-md-6\"><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/graph1.png\" alt=\"image\" /></div><div class=\"col-md-6\"><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/graph2.png\" alt=\"image\" /></div></div><h3>Pivot chart</h3><p>Apache Zeppelin aggregates values and displays them in pivot chart with simple drag and drop. 
You can easily create chart with multiple aggregated values including sum, count, average, min, max.</p><div class=\"row\"><div class=\"col-md-12\"><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/screenshots/pivot.png\" width=\"480px\" alt=\"image\" /></div></div><p>Learn more about <a href=\"#display-system\">display systems</a> in Apache Zeppelin.</p><h2>Dynamic forms</h2><p>Apache Zeppelin can dynamically create some input forms in your notebook.\n</p><div class=\"row\"><div class=\"col-md-12\"><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/screenshots/dynamicform.png\" alt=\"image\" /></div></div>Learn more about<a href=\"https://zeppelin.apache.org/docs/0.7.0/manual/dynamicform.html\">Dynamic Forms</a>.<h2>Collaborate by sharing your Notebook &amp; Paragraph</h2><p>Your notebook URL can be shared among collaborators. Then Apache Zeppelin will broadcast any changes in realtime, just like the collaboration in Google docs.</p><div class=\"row\"><div class=\"col-md-12\"><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/screenshots/publish.png\" width=\"650px\" alt=\"image\" /></div></div><p>Apache Zeppelin provides an URL to display the result only, that page does not include any menus and buttons inside of notebooks.\nYou can easily embed it as an iframe inside of your website in this way.\nIf you want to learn more about this feature, please visit <a href=\"https://zeppelin.apache.org/docs/0.7.0/manual/publish.html\">this page</a>.</p><h2>100% Opensource</h2><p><img class=\"img-responsive\" src=\"https://zeppelin.apache.org/docs/0.7.0/assets/themes/zeppelin/img/asf_logo.png\" width=\"250px\" alt=\"image\" /></p><p>Apache Zeppelin is Apache2 Licensed software. 
Please check out the <a href=\"http://git.apache.org/zeppelin.git\">source repository</a> and <a href=\"https://zeppelin.apache.org/contribution/contributions.html\">how to contribute</a>.\nApache Zeppelin has a very active development community.\nJoin to our <a href=\"https://zeppelin.apache.org/community.html\">Mailing list</a> and report issues on <a href=\"https://issues.apache.org/jira/browse/ZEPPELIN\">Jira Issue tracker</a>.</p><h2>What is the next ?</h2><h4>Quick Start</h4><ul><li>Getting Started\n<ul><li><a href=\"https://zeppelin.apache.org/docs/0.7.0/install/install.html\">Quick Start</a> for basic instructions on installing Apache Zeppelin</li>\n<li><a href=\"https://zeppelin.apache.org/docs/0.7.0/install/configuration.html\">Configuration</a> lists for Apache Zeppelin</li>\n<li><a href=\"https://zeppelin.apache.org/docs/0.7.0/quickstart/explorezeppelinui.html\">Explore Apache Zeppelin UI</a>: basic components of Apache Zeppelin home</li>\n<li><a href=\"https://zeppelin.apache.org/docs/0.7.0/quickstart/tutorial.html\">Tutorial</a>: a short walk-through tutorial that uses Apache Spark backend</li>\n</ul></li>\n<li>Basic Feature Guide\n</li>\n<li>More\n</li>\n</ul><h4>Interpreter</h4><ul><li><a href=\"https://zeppelin.apache.org/docs/0.7.0/manual/interpreters.html\">Interpreters in Apache Zeppelin</a>: what is interpreter group? 
how can you set interpreters in Apache Zeppelin?</li>\n<li>Usage\n</li>\n<li>Available Interpreters: currently, about 20 interpreters are available in Apache Zeppelin.</li>\n</ul><h4>Display System</h4><ul><li>Basic Display System: <a href=\"https://zeppelin.apache.org/docs/0.7.0/displaysystem/basicdisplaysystem.html#text\">Text</a>, <a href=\"https://zeppelin.apache.org/docs/0.7.0/displaysystem/basicdisplaysystem.html#html\">HTML</a>, <a href=\"https://zeppelin.apache.org/docs/0.7.0/displaysystem/basicdisplaysystem.html#table\">Table</a> is available</li>\n<li>Angular API: a description about avilable backend and frontend AngularJS API with examples\n</li>\n</ul><h4>More</h4><ul><li>Notebook Storage: a guide about saving notebooks to external storage\n</li>\n<li>REST API: available REST API list in Apache Zeppelin\n</li>\n<li>Security: available security support in Apache Zeppelin\n</li>\n<li>Advanced\n</li>\n<li>Contribute\n</li>\n</ul><h4>External Resources</h4><ul><li><a href=\"https://zeppelin.apache.org/community.html\">Mailing List</a></li>\n<li><a href=\"https://cwiki.apache.org/confluence/display/ZEPPELIN/Zeppelin+Home\">Apache Zeppelin Wiki</a></li>\n<li><a href=\"http://stackoverflow.com/questions/tagged/apache-zeppelin\">StackOverflow tag <code>apache-zeppelin</code></a></li>\n</ul>","id":"55152a5a-c510-536e-b34a-56124333e067","title":"Apache Zeppelin 0.7.0 Documentation:","origin_url":"https://zeppelin.apache.org/docs/0.7.0/","url":"https://zeppelin.apache.org/docs/0.7.0/","wallabag_created_at":"2020-07-08T16:59:09+00:00","published_at":null,"published_by":"['']","reading_time":2,"domain_name":"zeppelin.apache.org","preview_picture":null,"tags":["flink","hive","notebooks and ides","python","visualization","elasticsearch","cassandra","elastic","zeppelin","hadoop","postgres","sql"],"description":" Data Ingestion\n Data Discovery\n Data Analytics\n Data Visualization & Collaboration\nMultiple Language BackendApache Zeppelin interpreter concept allows any 
language/data-processing-backend to be plugg..."},{"content":"<div><div><div class=\"speechify-ignore ab cp\"><div class=\"speechify-ignore bh l\"><div class=\"hv hw hx hy hz ab\"><div><div class=\"ab ia\"><div><div class=\"bm\" aria-hidden=\"false\"><a rel=\"noopener follow\" href=\"https://medium.com/@rako?source=post_page---byline--ed55f6e67d17--------------------------------\"><div class=\"l ib ic by id ie\"><div class=\"l fj\"><img alt=\"Arunkumar\" class=\"l fd by dd de cx\" src=\"https://miro.medium.com/v2/resize:fill:88:88/1*Pgfv2m0dFhDO2zF2IUAGMQ.jpeg\" width=\"44\" height=\"44\" data-testid=\"authorPhoto\" referrerpolicy=\"no-referrer\" /></div></div></a></div></div></div></div><div class=\"bn bh l\"><div class=\"ab\"><div><div class=\"ih ab q\"><div class=\"ab q ii\"><div class=\"ab q\"><div><div class=\"bm\" aria-hidden=\"false\"><p class=\"bf b ij ik bk\"><a class=\"af ag ah ai aj ak al am an ao ap aq ar il\" data-testid=\"authorName\" rel=\"noopener follow\" href=\"https://medium.com/@rako?source=post_page---byline--ed55f6e67d17--------------------------------\">Arunkumar</a></p></div></div></div>·<p class=\"bf b ij ik du\"><a class=\"io ip ah ai aj ak al am an ao ap aq ar ex iq ir\" rel=\"noopener follow\" href=\"https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fsubscribe%2Fuser%2Fc92b98ca6070&amp;operation=register&amp;redirect=https%3A%2F%2Fmedium.com%2F%40rako%2Fspark-and-cassandras-sstable-loader-ed55f6e67d17&amp;user=Arunkumar&amp;userId=c92b98ca6070&amp;source=post_page-c92b98ca6070--byline--ed55f6e67d17---------------------post_header-----------\">Follow</a></p></div></div></div></div><div class=\"l is\"><div class=\"ab cn it iu iv\"><div class=\"ab ae\">3 min read<div class=\"iw ix l\" aria-hidden=\"true\">·</div>May 13, 2018</div></div></div></div></div><div class=\"ab cp iy iz ja jb jc jd je jf jg jh ji jj jk jl jm jn\"><div class=\"h k w fg fh q\"><div class=\"kd l\"><div class=\"ab q ke kf\"><div 
class=\"pw-multi-vote-icon fj kg kh ki kj\"><a class=\"af ag ah ai aj ak al am an ao ap aq ar as at\" data-testid=\"headerClapButton\" rel=\"noopener follow\" href=\"https://medium.com/m/signin?actionUrl=https%3A%2F%2Fmedium.com%2F_%2Fvote%2Fp%2Fed55f6e67d17&amp;operation=register&amp;redirect=https%3A%2F%2Fmedium.com%2F%40rako%2Fspark-and-cassandras-sstable-loader-ed55f6e67d17&amp;user=Arunkumar&amp;userId=c92b98ca6070&amp;source=---header_actions--ed55f6e67d17---------------------clap_footer-----------\"><div><div class=\"bm\" aria-hidden=\"false\"><div class=\"kk ao kl km kn ko am kp kq kr kj\"></div></div></div></a></div><div class=\"pw-multi-vote-count l ks kt ku kv kw kx ky\"><p class=\"bf b dv z du\">--</p></div></div></div><div><div class=\"bm\" aria-hidden=\"false\"></div></div><div class=\"ab q jo jp jq jr js jt ju jv jw jx jy jz ka kb kc\"><div class=\"h k\"><div><div class=\"bm\" aria-hidden=\"false\"></div></div><div class=\"fd li cn\"><div class=\"l ae\"><div class=\"ab cb\"><div class=\"lj lk ll lm ln lo ci bh\"><div class=\"ab\"><div class=\"bm bh\" aria-hidden=\"false\"><div><div class=\"bm\" aria-hidden=\"false\"></div></div></div></div></div></div></div><div class=\"bm\" aria-hidden=\"false\" aria-describedby=\"postFooterSocialMenu\" aria-labelledby=\"postFooterSocialMenu\"><div><div class=\"bm\" aria-hidden=\"false\"></div></div></div></div></div></div></div></div><p id=\"d328\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\"><em class=\"nf\">Why: We had a lot of very useful data in our Warehouse and wanted to take advantage of those data in some of our production service to enhance the user’s experience. 
So we chose to serve them from Cassandra for all its pros, which I’m not going to get into in this blog.</em></p><p id=\"4acb\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">As a first stage, we wrote a spark-cassandra exporter. It’s pretty simple, only a couple of lines:</p><figure class=\"ng nh ni nj nk nl\"><div class=\"nm nn l fj\"></div></figure><p id=\"e794\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">This worked and took around 30 mins to write ~150 million rows. But once our services went live, we saw read latencies climb during the bulk insertion.</p><figure class=\"ng nh ni nj nk nl nq nr paragraph-image\"><div role=\"button\" tabindex=\"0\" class=\"nt nu fj nv bh nw\"><div class=\"nq nr ns\"></div></div><figcaption class=\"ny ff nz nq nr oa ob bf b bg z du\">Latencies during Cassandra row writes</figcaption></figure><p id=\"9612\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">The spark-cassandra-connector that we are using here has a few configs that can be used to tune the writes, listed <a class=\"af oc\" href=\"https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#write-tuning-parameters\" rel=\"noopener ugc nofollow\" target=\"_blank\">here</a>. We tried a bunch of tuning along the lines of reducing concurrent writes and reducing throughput_mb_per_sec. They helped a bit, but there was still a clear increase in read latency.</p><p id=\"dba6\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">Cassandra has sstableloader and we thought of testing it for this case. 
So we changed the code to use it and saw barely any notable increase in read latency during this task (only a slight increase in the 99th percentile, caused by IO waits).</p><figure class=\"ng nh ni nj nk nl nq nr paragraph-image\"><div role=\"button\" tabindex=\"0\" class=\"nt nu fj nv bh nw\"><div class=\"nq nr od\"></div></div><figcaption class=\"ny ff nz nq nr oa ob bf b bg z du\">Latencies during Cassandra SSTable loads</figcaption></figure><p id=\"896b\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">Also, if you look at the network graph, the traffic is only on “network in”, as we are now generating SSTables in Spark and then pushing those tables directly to Cassandra. The last spike in the network graph below is from the SSTable method and the rest are from batched writes.</p><figure class=\"ng nh ni nj nk nl nq nr paragraph-image\"><div role=\"button\" tabindex=\"0\" class=\"nt nu fj nv bh nw\"><div class=\"nq nr oe\"></div></div><figcaption class=\"ny ff nz nq nr oa ob bf b bg z du\">Network Traffic (Row writes vs SSTable load)</figcaption></figure><p id=\"4d14\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">Now let’s get into how to do that in code,</p><ul class=\"\"><li id=\"f241\" class=\"mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne of og oh bk\">Using CQLSSTableWriter, build the SSTables per partition</li></ul><figure class=\"ng nh ni nj nk nl\"><div class=\"nm nn l fj\"></div></figure><ul class=\"\"><li id=\"49df\" class=\"mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne of og oh bk\">We need to define the create and insert statements, but it’s easy to build them from the Spark dataframe</li></ul><figure class=\"ng nh ni nj nk nl\"><div class=\"nm nn l fj\"></div></figure><ul class=\"\"><li id=\"f0b8\" class=\"mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my 
mz na nb nc nd ne of og oh bk\">And the script to stream the SSTable to Cassandra. We pick a random Cassandra server and stream the SSTable to it; the host is chosen at random for better load balancing of network traffic.</li></ul><figure class=\"ng nh ni nj nk nl\"><div class=\"nm nn l fj\"></div></figure><ul class=\"\"><li id=\"2a2e\" class=\"mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne of og oh bk\">And finally the code that runs it all,</li></ul><figure class=\"ng nh ni nj nk nl\"><div class=\"nm nn l fj\"></div></figure><ul class=\"\"><li id=\"1fd4\" class=\"mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne of og oh bk\">As for the number of partitions: Cassandra’s suggestion is for SSTables to be several tens of megabytes large to minimize the cost of compacting, so we use a max of 256 MB per SSTable. “sizeInMB” can be calculated from HDFS.</li><li id=\"438e\" class=\"mh mi gu mj b mk oi mm mn mo oj mq mr ms ok mu mv mw ol my mz na om nc nd ne of og oh bk\">Let’s say the size is 60GB; we will have about 240 SSTables of up to 256MB each.</li><li id=\"3dcd\" class=\"mh mi gu mj b mk oi mm mn mo oj mq mr ms ok mu mv mw ol my mz na om nc nd ne of og oh bk\">Set this config “mapreduce.output.bulkoutputformat.streamthrottlembits” to throttle traffic to Cassandra.</li></ul><p id=\"f414\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\"><strong class=\"mj gv\">FYI,</strong></p><ul class=\"\"><li id=\"2ba0\" class=\"mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne of og oh bk\">SSTables have to be at least several tens of megabytes in size to minimize the cost of compacting the partitions on the server side.</li><li id=\"fb6f\" class=\"mh mi gu mj b mk oi mm mn mo oj mq mr ms ok mu mv mw ol my mz na om nc nd ne of og oh bk\">This method increases IO wait since it writes directly to disk and not to memory as regular Cassandra writes do. 
Depending on the size of data and throughput, you need an SSD with high IOPS.</li></ul><p id=\"33fb\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">We’ve been using this method in production for over 6 months now, writing around 300 million rows in &lt; 30 mins without any impact on read latencies.</p><p id=\"1564\" class=\"pw-post-body-paragraph mh mi gu mj b mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne gn bk\">Full example code can be found here, <a class=\"af oc\" href=\"https://github.com/therako/sparkles/blob/master/src/main/scala/util/cassandra/SSTableExporter.scala\" rel=\"noopener ugc nofollow\" target=\"_blank\">https://github.com/therako/sparkles/blob/master/src/main/scala/util/cassandra/SSTableExporter.scala</a></p></div></div></div></div>","id":"64bed0a4-a34b-5c9f-9d9e-57729b842af2","title":"Spark and Cassandra’s SSTable loader","origin_url":"https://medium.com/@rako/spark-and-cassandras-sstable-loader-ed55f6e67d17","url":"https://medium.com/@rako/spark-and-cassandras-sstable-loader-ed55f6e67d17","wallabag_created_at":"2024-11-01T17:13:45+00:00","published_at":"2018-06-08T01:43:43+00:00","published_by":"['Arunkumar']","reading_time":2,"domain_name":"medium.com","preview_picture":"https://miro.medium.com/v2/resize:fit:1200/1*bXczQ0OE6A9iB2X1yGOONA.png","tags":["sstable","cassandra","spark"],"description":"Arunkumar·Follow3 min read·May 13, 2018--Why: We had a lot of very useful data in our Warehouse and wanted to take advantage of those data in some of our production service to enhance the user’s exper..."}]},{"tag":"elasticsearch","articles":[{"content":"<p dir=\"auto\">Visual Flow is an ETL tool designed for effective data manipulation via a convenient and user-friendly interface. 
The tool has the following capabilities:</p><ul dir=\"auto\"><li>Can integrate data from heterogeneous sources:\n<ul dir=\"auto\"><li>AWS S3</li>\n<li>Cassandra</li>\n<li>Click House</li>\n<li>DB2</li>\n<li>Dataframe (for reading)</li>\n<li>Elastic Search</li>\n<li>IBM COS</li>\n<li>Kafka</li>\n<li>Local File</li>\n<li>MS SQL</li>\n<li>Mongo</li>\n<li>MySQL/Maria</li>\n<li>Oracle</li>\n<li>PostgreSQL</li>\n<li>Redis</li>\n<li>Redshift</li>\n</ul></li>\n<li>Leverage direct connectivity to enterprise applications as sources and targets</li>\n<li>Perform data processing and transformation</li>\n<li>Run custom code</li>\n<li>Leverage metadata for analysis and maintenance</li>\n</ul><p dir=\"auto\">Visual Flow application is divided into the following repositories:</p><p dir=\"auto\"><a href=\"https://github.com/ibagroup-eu/Visual-Flow/blob/main/CONTRIBUTING.md\">Check the official guide</a>.</p><p dir=\"auto\">Visual Flow is open-source software licensed under the <a href=\"https://github.com/ibagroup-eu/Visual-Flow/blob/main/LICENSE\">Apache-2.0 license</a>.</p>","id":"86395185-03f1-53f4-94a4-e3a2d2d45779","title":"GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository","origin_url":"https://github.com/ibagroup-eu/Visual-Flow","url":"https://github.com/ibagroup-eu/Visual-Flow","wallabag_created_at":"2024-12-02T13:34:31+00:00","published_at":null,"published_by":"['ibagroup-eu']","reading_time":null,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/9187fdecad3a37939c1971bcdec19ffed4090307ee508b009f47c7bcd49a7f8d/ibagroup-eu/Visual-Flow","tags":["mongo","nocode","elasticsearch","open.source","cassandra","data.pipeline","elastic","aws.s3","etl","low.code","postgres"],"description":"Visual Flow is an ETL tool designed for effective data manipulation via convenient and user-friendly interface. 
The tool has the following capabilities:Can integrate data from heterogeneous sources:\nA..."},{"content":"<p class=\"f4 mb-3\">Multitenant backend server for building web and mobile apps rapidly. The backend for busy developers. (self-hosted or hosted)</p><p><a title=\"https://paraio.org\" role=\"link\" target=\"_blank\" class=\"text-bold\" rel=\"noopener noreferrer\" href=\"https://paraio.org\">paraio.org</a></p><h3 class=\"sr-only\">License</h3><p><a href=\"https://github.com/Erudika/para/blob/master/LICENSE\" class=\"Link--muted\" data-analytics-event=\"{&quot;category&quot;:&quot;Repository Overview&quot;,&quot;action&quot;:&quot;click&quot;,&quot;label&quot;:&quot;location:sidebar;file:license&quot;}\">Apache-2.0 license</a></p><p><a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/Erudika/para/stargazers\">507 stars</a> <a class=\"Link--secondary no-underline mr-3\" href=\"https://github.com/Erudika/para/forks\">141 forks</a> <a class=\"Link--secondary no-underline mr-3 d-inline-block\" href=\"https://github.com/Erudika/para/branches\">Branches</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/Erudika/para/tags\">Tags</a> <a class=\"Link--secondary no-underline d-inline-block\" href=\"https://github.com/Erudika/para/activity\">Activity</a></p>","id":"0533fa1a-4095-538d-8426-7b0ed0c2b5c1","title":"GitHub - Erudika/para: Multitenant backend server for building web and mobile apps rapidly. The backend for busy developers. 
(self-hosted or hosted)","origin_url":"https://github.com/Erudika/para","url":"https://github.com/Erudika/para","wallabag_created_at":"2024-01-26T14:53:22+00:00","published_at":null,"published_by":"['']","reading_time":null,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/2204f6ef5bbccb71656d9d43ff9116ca157ebf50d578547bbd1d8678bf3a3232/Erudika/para","tags":["mongo","rest","elasticsearch","cassandra","elastic","lucene","api","dynamo","baas"],"description":"Multitenant backend server for building web and mobile apps rapidly. The backend for busy developers. (self-hosted or hosted)paraio.orgLicenseApache-2.0 license507 stars 141 forks Branches Tags Activi..."},{"content":"<p>As our business has been <a href=\"https://secondmeasure.com/datapoints/food-delivery-services-grubhub-uber-eats-doordash-postmates/\">growing rapidly</a> over the years, showcasing relevant content in the form of banners and carousels on high-traffic surfaces like the home page has become harder to support reliably. There has been an exponential increase in load on multiple systems such as application pods, databases, and caches, which is expensive to support and maintain. Before diving deeper into the details, let’s define some of the content such as banners and carousels.</p><p><strong>Banners</strong> - These are discovery units represented by a creative with some content that could appear on any page within the app. Examples of banners in the app are shown in Figure 1 and Figure 2. They are usually used to merchandise stores/businesses/deals or to inform consumers about an event. We typically show multiple of them as a horizontally scrollable unit. 
Each of them could be clickable and lead to a carousel, specific store, webpage, etc.</p><figure class=\"wp-block-image size-large is-resized is-style-default\"><img src=\"https://doordash.engineering/wp-content/uploads/2022/06/image5-473x1024.jpg\" alt=\"Figure 1: Banner showcasing an M&amp;M deal\" class=\"wp-image-7665\" width=\"261\" height=\"565\" referrerpolicy=\"no-referrer\" /><figcaption>Figure 1: Banner showcasing an M&amp;M deal</figcaption></figure><figure class=\"wp-block-image size-large is-resized is-style-default\"><img src=\"https://doordash.engineering/wp-content/uploads/2022/06/image3-1-473x1024.jpg\" alt=\"Figure 2: Informational banner on store page indicating this store is a top-rated store\" class=\"wp-image-7666\" width=\"285\" height=\"617\" referrerpolicy=\"no-referrer\" /><figcaption>Figure 2: Informational banner on store page indicating this store is a top-rated store</figcaption></figure><p><strong>Carousels</strong> - These are discovery units that could appear on any page within the app. They are usually used to group stores into a common theme/category so that consumers are able to discover content in a more organized way. The stores inside these units are horizontally scrollable. On clicking the gray arrow, a broader selection of the stores belonging to this theme is shown. Examples of carousels in the app are shown in Figure 3 and Figure 4.</p><figure class=\"wp-block-image size-large is-resized is-style-default\"><img src=\"https://doordash.engineering/wp-content/uploads/2022/06/image2-473x1024.png\" alt=\"Figure 3: Multiple carousels shown on the home page. Some of them are manually  curated or rule-based or are auto-generated based on machine learning algorithms  \" class=\"wp-image-7669\" width=\"291\" height=\"629\" referrerpolicy=\"no-referrer\" /><figcaption>Figure 3: Multiple carousels shown on the home page. 
Some of them are manually curated or rule-based or are auto-generated based on machine learning algorithms  <br /></figcaption></figure><figure class=\"wp-block-image size-large is-resized is-style-default\"><img src=\"https://doordash.engineering/wp-content/uploads/2022/06/image1-473x1024.png\" alt=\"Figure 4: Viewing more options for a carousel\" class=\"wp-image-7668\" width=\"316\" height=\"683\" referrerpolicy=\"no-referrer\" /><figcaption>Figure 4: Viewing more options for a carousel</figcaption></figure><h2>The challenge of fetching relevant content at scale</h2><p>The challenge we faced was that too many discovery units had to be fetched in real-time which could be relevant for a consumer address’ deliverable radius. This scaling challenge was causing a huge toll on the availability and reliability of carousels. </p><figure class=\"wp-block-image size-large is-style-default\"><img src=\"https://doordash.engineering/wp-content/uploads/2022/06/search-service-14-1024x321.jpg\" alt=\"Figure 5: Illustrates a high-level fan-out issue. Since campaigns are created and stored at a per-store level, to ensure high recall, we fetch campaigns for all stores which results in a fan-out from Campaign Service to Cassandra\" class=\"wp-image-7686\" referrerpolicy=\"no-referrer\" /><figcaption>Figure 5: Illustrates a high-level fan-out issue. Since campaigns are created and stored at a per-store level, to ensure high recall, we fetch campaigns for all stores which results in a fan-out from Campaign Service to Cassandra</figcaption></figure><p>When using DoorDash the user experience starts the second you open the consumer app. On our backend systems, a lot starts happening immediately. 
One of the first things that happens is the set of stores (includes restaurants, grocery stores, pet stores, and so on) that are in the consumer address’ deliverable radius are fetched from search service which has business logic to determine what stores are relevant for customers given the logistical and geographical constraints. The number of stores available in a dense location like LA or NYC could easily reach thousands compared to hundreds in suburban areas.</p><p>Once relevant context like store data, consumer data, geographical information (like lat/long, city, district), etc. is calculated, a call is made from the Discovery system to the <a href=\"https://doordash.engineering/2022/06/28/taming-content-discovery-scaling-challenges-with-hexagons-and-elasticsearch/#campaign\">Campaign system</a> to get a list of carousels and banners eligible, available, and relevant for the context that was passed along.</p><p>The Discovery system is responsible for content gathering, grouping and ranking of different entities for a given surface such as the home page.</p><p>The Campaign system internally tries to fetch campaigns for each store in the context to maximize recall.</p><h3 id=\"campaign\">How our Campaign system works</h3><p>Our banner and carousel system relies on campaign objects, which are containers that hold configuration rules such as: </p><ul><li><strong>what to show</strong></li>\n<li><strong>who to show to</strong></li>\n<li><strong>when to show</strong> </li>\n<li><strong>how to show</strong></li>\n</ul><p>These objects are configured at the store/business or a higher-order geographical level such as city, district, country, etc. Here, an example of a store could be the <a href=\"https://www.doordash.com/convenience/store/1741590/?pickup=false\">Safeway</a> at 303 2nd St in San Francisco. 
A business is a bigger entity than a store that could have a list of stores belonging to it; for example, McDonalds could have 10,000+ stores.</p><p>The campaign system gives DoorDash strategy operators a very powerful way to control discovery surface content. Today we have banners and carousels that are manually curated, machine learning curated, and rule-based curated. All of them can be highly targeted to a set of users, shown during certain times of the day, have discounts associated with them, capped on how often they could show during a given time period, displayed at different start and end dates, and so on.</p><p>A single campaign could be targeting <strong>thousands of stores</strong> and each store in turn could have its own specific targeting, for example, a consumer needs to be new to the store to be eligible for the campaign.</p><p>Below is a demonstration of a simple campaign configuration that <strong>targets</strong> a store with store id = 9999999 to show a <strong>banner on the store page,</strong> has specific <strong>start and end dates between which it should show,</strong> and is only visible on the DoorDash app.</p><pre lang=\"json\" class=\"language-json\" xml:lang=\"json\">{\n  \"campaign\": {\n    \"limitations\": [\n      {\n        \"type\": \"LIMITATION_TYPE_IS_ACTIVE\",\n        \"is_active\": {\n          \"value\": true\n        },\n        \"value\": \"is_active\"\n      },\n      {\n        \"type\": \"LIMITATION_TYPE_EXPERIENCE\",\n        \"experiences\": {\n          \"experience\": [\n            \"DOORDASH\"\n          ]\n        },\n        \"value\": \"experiences\"\n      },\n      {\n        \"type\": \"LIMITATION_TYPE_ACTIVE_DATES\",\n        \"active_dates\": {\n          \"start_time\": {\n            \"seconds\": \"1613635200\",\n            \"nanos\": 0\n          },\n          \"end_time\": {\n            \"seconds\": \"1672559940\",\n            \"nanos\": 0\n          }\n        },\n        \"value\": 
\"active_dates\"\n      }\n    ],\n    \"placements\": [\n      {\n        \"limitations\": [\n          {\n            \"type\": \"LIMITATION_TYPE_IS_ACTIVE\",\n            \"is_active\": {\n              \"value\": true\n            },\n            \"value\": \"is_active\"\n          }\n        ],\n        \"type\": \"PLACEMENT_TYPE_STORE_PAGE_BANNER\",\n        \"content_id\": {\n          \"value\": \"most-loved-2022-store\"\n        },\n        \"sort_order\": {\n          \"value\": 5\n        },\n        \"experiment_name\": {\n          \"value\": \"testMostLoved2022\"\n        }\n      }\n    ],\n    \"memberships\": [\n      {\n        \"ids\": [\n          \"9999999\"\n        ],\n        \"limitations\": [],\n        \"user_criteria\": [],\n        \"type\": \"MEMBERSHIP_ENTITY_TYPE_STORE\"\n      }\n    ],\n    \"user_criteria\": [],\n    \"id\": {\n      \"value\": \"35145320-69bc-45cd-bb89-fc721b94a21d\"\n    },\n    \"name\": {\n      \"value\": \"Campaign - BNY - Most Loved (Feb 2021)\"\n    },\n    \"description\": {\n      \"value\": \"Most Loved tile - February refresh\"\n    },\n    \"created_by\": \"ujjwal.gulecha@doordash.com\",\n    \"created_at\": {\n      \"seconds\": \"1613690199\",\n      \"nanos\": 0\n    }\n  }\n}</pre><h3>Explaining the fan-out problem</h3><p>For dense locations like Los Angeles, a single request would fan out to thousands of calls to our internal systems. During peak traffic, we would easily reach millions of queries per second to our database systems. This volume is particularly bad because it puts a lot of load on all our microservice <a href=\"https://doordash.engineering/2020/12/02/how-doordash-transitioned-from-a-monolith-to-microservices/\">systems</a> involved such as BFFs, service apps, and database systems. We had to massively horizontally scale all of our systems to meet this demand. 
As the number of stores and campaigns is increasing at a rapid pace to highlight content, it becomes harder to support everything at such a scale.</p><h2>Our approach to taming the fan-out problem</h2><p>So to summarize, there was a massive fan-out problem that kept growing and we were not sure how to proceed with it. We tried a few solutions to tame this problem.</p><h3>Batching</h3><p>The most obvious attempt to reduce the load on the application servers was to batch the calls. We started experimenting with batching the calls to send X stores at a time, instead of all at once.</p><p>After doing some performance testing, we empirically derived the optimal batch size that worked for us. However, we soon started seeing that even this approach was ultimately not able to support our ever-growing expansion, selection, and discovery content. We could theoretically horizontally scale all our systems to support this; however, that had its own challenges and we felt that was not the best use of our resources, nor was it sustainable in the longer term.</p><p>The four factors that did not allow us to support this in the long run can be summarized by this fan-out formula: </p><p>T * V * S * C (Traffic * Verticals * Stores * Campaigns)</p><ul><li>Traffic - Expansion into more geographical areas: this means more incoming traffic to our systems</li>\n<li>Verticals - Expansion into new verticals apart from restaurants, such as grocery, convenience, pet supply, etc</li>\n<li>Stores - Onboarding of more stores into the DoorDash system</li>\n<li>Campaigns - Explosion in the number of campaigns to merchandise stores</li>\n</ul><h2><strong>Researching geographical based grouping</strong></h2><p>Going back to the original problem, we were able to alleviate the load on application pods, but still had a load on our database systems. 
We had to research how to alleviate the load on our database systems.</p><p>As we began thinking more about this problem, one thing became clear to us: we need to reduce the <strong>cardinality of this fan-out</strong>. We needed a way to not request so many stores at a time but also not reduce the selection of stores; a way to group these stores which reduced this fan-out while fetching. <strong><em>Grouping stores by their geographical location</em></strong> made the most sense, specifically in dense areas where lots of stores are packed into a small area, so that we could then choose the best campaigns in those areas.</p><p>We looked into multiple existing solutions that would help us achieve this in a consistent, reliable, and scalable way. We looked at systems such as <a href=\"https://s2geometry.io/\">S2</a>, <a href=\"https://h3geo.org/docs/comparisons/geohash\">Geohash</a>, and <a href=\"https://h3geo.org/\">H3</a>.</p><p>We did some testing, and based on empirical evidence, we chose <strong>H3</strong> over other libraries. Here we outline some of the reasons that we thought H3 was a better fit.</p><p><strong>H3 is Open source</strong></p><p>H3 is an open-source project and is maintained by an active community with a wide list of high traffic production use cases. It is used by other technology companies, <a href=\"https://h3geo.org/docs/community/libraries\">libraries</a> like geojson2H3, and <a href=\"https://h3geo.org/docs/community/applications\">applications</a> like kepler.gl.</p><p><strong>High Availability and reliability</strong></p><p>The API is simple, fast, and available in the languages DoorDash uses most frequently.</p><p><strong>Relevance to DoorDash use case</strong></p><p><strong>H3</strong> uses a hexagonal system, which makes it easier to approximate a circle, which is closer to what DoorDash uses for calculating delivery radii. 
We compared the APIs and tested circle filling between <strong>S2</strong> and <strong>H3</strong> in our use cases. We found that <strong>H3</strong> fits our use cases better and both <strong>S2</strong> and <strong>H3</strong> performed similarly in computational complexity. We would have needed to build geometric approximation on top of <strong>geohash</strong>, while <strong>H3</strong> and <strong>S2</strong> are both mature, full solutions out of the box with good performance.</p><h2><strong>How we used H3 for our fan-out solution</strong></h2><p>We could use the H3 library to partition the world into hexagons. H3 offers resolutions 0-15 that allow us to geographically condense stores into a larger entity.</p><p>This solution allowed us to organize geos by hexes instead of by stores, as we were doing before. We could now query by hexes instead of individual stores and fetch the best campaigns for each hex, thereby reducing cardinality. </p><p>Then the question arose: what size hexagon should we use? We wanted to run some benchmark tests to see what the best fit was for our situation. We did real-time analysis for proof of concept and were able to reduce the fan-out by a factor of <strong>500x</strong> for non-dense areas and roughly <strong>200x</strong> for dense areas. </p><p>We found that we reached the empirical optimal balance between computational complexity and approximation effectiveness at an H3 resolution level of 9.</p><p>Once we settled on using H3 geo-hashes as our geographical filter for campaigns, we started looking at other ways of optimizing our fetching. Formerly we were fetching all campaigns and doing in-memory eligibility/filtering. This meant that the amount of data we fetched online was large. </p><p>We saw room for optimization if we could reduce the amount of data fetched by filtering closer to the storage layer. Essentially we wanted to move from “fetch all and filter in-memory” to “fetch filtered data”. 
This optimization was challenging to do with our existing non-relational database, Cassandra, which is great for fast lookups but not for filtering on multiple keys. </p><h2><strong>Using Elasticsearch to filter data retrieval </strong></h2><p>Based on existing technologies at DoorDash, we chose Elasticsearch, as it seemed a good fit for filtering at the data retrieval layer at high scale. The Elasticsearch index contained campaign data denormalized for efficient filtering and retrieval based on request context such as the geohash, start/end date, time of day and so on. </p><h3>Why Elasticsearch</h3><p>Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable efficient data retrieval system. We selected it for the following reasons: </p><p><strong>Needle in a haystack</strong></p><p>Elasticsearch was great for needle-in-a-haystack queries where we would want to filter out and retrieve a smaller number of campaigns compared to the total data-set. We calculated that we could avoid fetching ~50% of campaigns if we could filter them at the data-retrieval layer.</p><p><strong>Boosting/Ranking</strong></p><p>Elasticsearch has built-in support for boosting search results in case we want to prefer some campaigns over others while fetching. There could be cases where we would want to manually prefer certain campaigns over others for business-logic reasons, and Elasticsearch provided an easy way to achieve this.</p><p><strong>Scalability</strong></p><p>We knew with our growth, we would need a system that could easily scale by simply adding more servers. Elasticsearch is <a href=\"https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html\">highly horizontally scalable</a>.</p><p><strong>Multi-tenancy</strong></p><p>We wanted to ensure we can use a system that can be extended for other use cases if needed. 
Elasticsearch can support our needs by <a href=\"https://www.bigeng.io/elasticsearch-scaling-multitenant/\">allowing multiple indexes</a> to be created, each having its own configurations.</p><p><strong>Support</strong></p><p>It was already widely used at DoorDash. This meant we would have expert support in case we ran into issues.</p><h2>Results</h2><p>We were able to massively reduce our operational costs while still maintaining high reliability and quality. In particular we were able to reduce costs by ~50% for our Cassandra and Redis clusters and by around 75% for our Kubernetes application hosting.</p><h2>Things to explore</h2><p>DoorDash is constantly evolving and expanding every single day. We believe this system has helped us serve our needs at this rapid growth pace; however, we believe this is not the final solution. With DoorDash going into more countries internationally, expanding into other verticals, acquiring more consumers, and adding more stores to its platform, we will continue investing and iteratively improving our platform. Some ideas we are considering include, but are not limited to:</p><ul><li>Hierarchical H3 geo-hashes.</li>\n<li>Using dynamic Hexagon resolution levels instead of a static one based on market density. Benefits might include a more optimized way of fetching depending on density. E.g.: a dense location like NYC could use fewer hexes to represent it as it is super dense compared to a less dense location like Alaska.</li>\n<li>Using a tiered storage system for data retrieval - offline for long term data and online for real time data.</li>\n<li>Based on the above formula of the fan-out: T * V * S * C (Traffic * Verticals * Stores * Campaigns), optimizing the fetching of relevant but smaller sets of stores and campaigns. Using a first-pass ranker to reduce the candidates of stores and/or campaigns to evaluate could help alleviate issues. 
E.g.: For a dense location like SF, instead of fetching thousands of campaigns online, we could use a smaller but more relevant subset using relevancy scores between users and campaigns.</li>\n</ul><h2>Acknowledgements</h2><p>Thank you to Grace Chin, Chao Li, Fahd Ahmed, Jacob Greenleaf,  Jennifer Be, Yichen Qiu, Shaohua Zhou, Shahrooz Ansari and Sonic Wang for their involvement and contribution to this project</p>","id":"7fe5400a-6b47-5795-b6aa-b84017025d43","title":"Taming Content Discovery Scaling Challenges with Hexagons and Elasticsearch","origin_url":"https://doordash.engineering/2022/06/28/taming-content-discovery-scaling-challenges-with-hexagons-and-elasticsearch/","url":"https://doordash.engineering/2022/06/28/taming-content-discovery-scaling-challenges-with-hexagons-and-elasticsearch/","wallabag_created_at":"2022-09-08T20:49:07+00:00","published_at":"2022-06-28T14:34:00+00:00","published_by":"['Ujjwal Gulecha']","reading_time":12,"domain_name":"doordash.engineering","preview_picture":"https://careersatdoordash.com/wp-content/uploads/2024/03/antoine-merour-QJDj2VWuOQg-unsplash.jpg","tags":["elasticsearch","cassandra"],"description":"As our business has been growing rapidly over the years, showcasing relevant content in the form of banners and carousels on high-traffic surfaces like the home page has become harder to support relia..."},{"content":"<div class=\"entry clearfix\"><p>In Apache Cassandra Lunch #61, we will discuss different ways of indexing and working with Elassandra as well as showcasing a project I built utilizing Kafka with Elassandra. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live. If you would like to attend Apache Cassandra Lunch live, it is hosted every Wednesday at 12 PM EST. 
Register <a href=\"https://www.meetup.com/Cassandra-DataStax-DC/events/\" target=\"_blank\" rel=\"noreferrer noopener\">here</a> now!</p><h2>Elassandra</h2><figure class=\"wp-block-image is-resized\"><img src=\"https://www.elassandra.io/elassandra.png\" alt=\"Elassandra Logo\" width=\"800\" height=\"375\" referrerpolicy=\"no-referrer\" /></figure><h3>Cross Datacenter Replication</h3><p>Apache Cassandra supports asynchronous multi-datacenters replication and various mechanisms to repair lost data. By closely integrating Elasticsearch with Cassandra, Elassandra provides search features on many datacenters.</p><h3>Scale On-Demand</h3><p>When you need to increase read/write throughput, Elassandra automatically re-shards your Elasticsearch indices as new machines are added, allowing you to smoothly scale out to fit your business needs without downtime or heavy maintenance operations requirements.</p><h3>Real-Time Analytics</h3><p>By indexing Cassandra’s data into Elasticsearch, <a href=\"https://www.elastic.co/fr/products/kibana\" target=\"_blank\" rel=\"noreferrer noopener\">Kibana</a> will allow you to get continuous and real-time data visualization of your applications.</p><h3>A Masterless Architecture</h3><p>By using a distributed transaction, Elassandra removes the single point of failure of Elasticsearch to manage its configuration.</p><h3>A Reliable Primary Datastore</h3><p>Cassandra is designed for write-intensive workloads, hence, making Elassandra suitable for applications where a large amount of data is to be inserted (such as infrastructure logging, IoT, or events). So, Elasticsearch indices can be rebuilt whenever needed using the Cassandra tables without the creation of data duplication.</p><h3>Continuous Operations in the Cloud</h3><p>Failover-based approaches do not truly achieve high availability as far as write operations are concerned. 
Thanks to its multi-master design, Elassandra remains available even when a server/container fails or restarts because of maintenance operations.</p><h2>Elassandra Architecture</h2><p>Elassandra closely integrates Elasticsearch within Apache Cassandra as a secondary index, allowing near-realtime search with all existing Elasticsearch APIs, plugins, and tools like Kibana. When you index a document, the JSON document is stored as a row in a Cassandra table and synchronously indexed in Elasticsearch.</p><figure class=\"wp-block-image\"><img src=\"http://doc.elassandra.io/en/latest/_images/elassandra1.jpg\" alt=\"Diagram of an Elasticsearch Cluster\" referrerpolicy=\"no-referrer\" /></figure><h3>Shards and Replicas</h3><p>Unlike in Elasticsearch, sharding depends on the number of nodes in the datacenter, and the number of replicas is defined by your keyspace Replication Factor. The Elasticsearch number of shards is just an indication of the number of nodes.</p><ul><li>When adding a new Elassandra node, the Cassandra bootstrap process gets some token ranges from the existing ring and pulls the corresponding data. Pulled data is automatically indexed and each node updates its routing table to distribute search requests according to the ring topology.</li>\n<li>When updating the Replication Factor, you will need to run a <a href=\"http://docs.datastax.com/en/cql/3.0/cql/cql_using/update_ks_rf_t.html\">nodetool repair &lt;keyspace&gt;</a> on the new node to effectively copy and index the data.</li>\n<li>If a node becomes unavailable, the routing table is updated on all nodes to route search requests to available nodes. The current default strategy routes search requests to the primary token ranges’ owners first, then to replica nodes when available. 
If some token ranges become unreachable, the cluster status is red; otherwise, the cluster status is yellow.</li>\n</ul><h3>Elassandra: Write path</h3><p>Write operations (Elasticsearch index, update, delete and bulk operations) are converted into CQL write requests managed by the coordinator node. The Elasticsearch document <em>_id</em> is converted into an underlying primary key, and the corresponding row is stored on many nodes according to the Cassandra replication factor. Then, on each node hosting this row, an Elasticsearch document is indexed through a Cassandra custom secondary index. Every document includes a _token field used when searching.</p><figure class=\"wp-block-image\"><img src=\"http://doc.elassandra.io/en/latest/_images/write-path.png\" alt=\"Diagram of the Elasticsearch write path.\" referrerpolicy=\"no-referrer\" /></figure><h3>Elassandra: Search path</h3><p>The search request is done in two phases. First, in the query phase, the coordinator node adds a token_ranges filter to the query and broadcasts a search request to all nodes. This token_ranges filter covers the entire Cassandra ring and avoids duplicating results. 
Second, in the fetch phase, the coordinator fetches the required fields by issuing a CQL request against the underlying Cassandra table and builds the final JSON response.</p><figure class=\"wp-block-image\"><img src=\"http://doc.elassandra.io/en/latest/_images/search-path.png\" alt=\"Diagram of the Elasticsearch search path.\" referrerpolicy=\"no-referrer\" /></figure><figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><iframe title=\"Apache Cassandra Lunch #61: Elassandra\" width=\"900\" height=\"506\" src=\"https://www.youtube.com/embed/jhmYb2xcdXo?feature=oembed\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"> </iframe>\n</figure><h2>Cassandra.Link</h2><p><a href=\"https://cassandra.link/\" target=\"_blank\" rel=\"noreferrer noopener\">Cassandra.Link</a> is a knowledge base that we created for all things Apache Cassandra. Our goal with <a href=\"https://cassandra.link/\" target=\"_blank\" rel=\"noreferrer noopener\">Cassandra.Link</a> was to not only fill the gap of <a href=\"https://web.archive.org/web/*/http://www.planetcassandra.org\" target=\"_blank\" rel=\"noreferrer noopener\">Planet Cassandra</a> but to bring the <strong>Cassandra</strong> community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.</p><p>We are a technology company that specializes in building business platforms. 
If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!</p></div><p class=\"blog-post-meta\">Posted in <a href=\"https://blog.anant.us/category/platform/data-analytics/\" rel=\"category tag\">Data &amp; Analytics</a>, <a href=\"https://blog.anant.us/category/events/\" rel=\"category tag\">Events</a> <strong>|</strong> Comments Off on Apache Cassandra Lunch #61: Elassandra</p>","id":"77a7339b-dabf-5d8c-aa8f-2a857b7284ec","title":"Apache Cassandra Lunch #61: Elassandra - Business Platform Team","origin_url":"https://blog.anant.us/apache-cassandra-lunch-61-elassandra/","url":"https://blog.anant.us/apache-cassandra-lunch-61-elassandra/","wallabag_created_at":"2022-06-22T22:00:35+00:00","published_at":"2021-08-06T17:34:07+00:00","published_by":"['Stefan Nikolovski']","reading_time":3,"domain_name":"blog.anant.us","preview_picture":"https://blog.anant.us/wp-content/uploads/2021/08/Ellasandra-1.png","tags":["elassandra","cassandra.lunch","elasticsearch","cassandra"],"description":"In Apache Cassandra Lunch #61, we will discuss different ways of indexing and working with Elassandra as well as showcasing a project I built utilizing Kafka with Elassandra. The live recording of Cas..."}]},{"tag":"cassandra","articles":[{"content":"<p>This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the <a href=\"https://rustyrazorblade.com/post/2025/03-streaming/\">previous post</a>, I examined how streaming performance impacts node density and operational costs. In this post, I’ll focus on compaction throughput, and a recent optimization in Cassandra 5.0.4 that significantly improves it, <a href=\"https://issues.apache.org/jira/browse/CASSANDRA-15452\" target=\"_blank\">CASSANDRA-15452</a>.</p><p>This post assumes some familiarity with Apache Cassandra storage engine fundamentals. 
The documentation has a nice <a href=\"https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html\" target=\"_blank\">section covering the storage engine</a> if you’d like to brush up before reading this post.</p><h2 id=\"the-compaction-bottleneck\">The Compaction Bottleneck</h2><p>Compaction in Cassandra is the process of merging multiple SSTables and writing out new ones, discarding tombstones, resolving overwrites, and generally organizing data for efficient reads. It’s an I/O intensive background operation that directly competes with foreground operations for system resources. In a later post I’ll look at how compaction strategies impact node density, but for now, I’ll just focus on throughput.</p><h2 id=\"why-compaction-throughput-matters-for-node-density\">Why Compaction Throughput Matters for Node Density</h2><p>As we continue to increase the amount of data we store per node, compaction performance becomes increasingly important. It affects:</p><ul><li>How quickly the system can reclaim disk space</li>\n<li>Whether the cluster can keep up with incoming writes</li>\n<li>Read latency, minimizing SSTables per read</li>\n<li>How fast nodes are able to join a new cluster</li>\n</ul><p>Simply put: as your data volume and write throughput increase, compaction throughput must as well. If it doesn’t, you’ll hit a performance wall that effectively caps your maximum practical node density.</p><p>Despite the significant improvements to compaction throughput over the years, there are some circumstances where compaction performance is inadequate. Let’s take a look at the reason why, then dive into what can be done about it.</p><p>When doing any performance evaluation, it’s important to understand how to measure where your time is spent. A lot of folks make incorrect assumptions, and then waste a lot of time trying to optimize something that doesn’t matter. 
I’ve written several posts about how useful profiling with the <a href=\"https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler\">async-profiler</a> can be from an application perspective. For looking at the OS and hardware, the eBPF-based toolkit <a href=\"https://rustyrazorblade.com/post/2023-11-14-bcc-tools\">bcc-tools</a> can help you identify process bottlenecks. I’ve used these tools extensively over the years, and in this post I’ll show how they’ve helped identify two major performance bottlenecks in compaction. My <a href=\"https://github.com/rustyrazorblade/easy-cass-lab\" target=\"_blank\">easy-cass-lab</a> software includes all these tools, as well as integration with <a href=\"https://axonops.com/\" target=\"_blank\">AxonOps</a> for Cassandra dashboards and operational tooling.</p><h2 id=\"being-10x-smarter-with-our-disk-access\">Being 10x Smarter With Our Disk Access</h2><p>When investigating compaction behavior, I discovered a major inefficiency in how Cassandra was accessing disk. The problem was especially severe in cloud environments with disaggregated storage like AWS EBS, where IOPS (Input/Output Operations Per Second) are both limited and expensive when used improperly.</p><p>When Cassandra would read in data during compaction, it would read individual compressed chunks off disk, one small read at a time. Using bcc-tools, we can monitor every filesystem operation. 
Here I’m using <code>xfsslower</code> to record every read operation on the filesystem (original headers back in for clarity):</p><div class=\"highlight\"><pre class=\"language-shell\" data-lang=\"shell\">$ sudo /usr/share/bcc/tools/xfsslower 0 -p 26988 | awk '$4 == \"R\" { print $0 }'\nTracing XFS operations\nTIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME\n22:27:38 CompactionExec 26988  R 4096    0           0.01 nb-7-big-Statistics.db\n22:27:38 CompactionExec 26988  R 4096    4           0.00 nb-7-big-Statistics.db\n22:27:38 CompactionExec 26988  R 2062    8           0.00 nb-7-big-Statistics.db\n22:27:38 CompactionExec 26988  R 14907   0           0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14924   14          0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14896   29          0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14844   43          0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14923   58          0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14931   72          0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14905   87          0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14891   101         0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14919   116         0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14965   130         0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14918   145         0.01 nb-7-big-Data.db\n22:27:38 CompactionExec 26988  R 14930   160         0.01 nb-7-big-Data.db\n</pre></div><p>The above is showing we’re reading about 14KB at a time. That’s the size of the compressed page. 
This pattern is terrible for performance on cloud storage systems like EBS, where:</p><ol><li>Each read operation, no matter how small, counts against your provisioned IOPS</li>\n<li>Small reads waste IOPS quota while delivering minimal data</li>\n<li>You pay for IOPS allocation whether you use it efficiently or not</li>\n</ol><p>Looking at a wall clock performance profile, we can see compaction is spending a LOT of time waiting on disk, in the really wide column with the <code>pread</code> call at the top:</p><p><img src=\"https://rustyrazorblade.com/images/2025/wall-clock-profile-compaction.png\" alt=\"wall-clock-profile-compaction.png\" referrerpolicy=\"no-referrer\" /></p><p>Readahead is a disk optimization strategy where the operating system reads a larger block of data than was requested into memory. The objective is to reduce latency and improve performance for sequential read operations. Unfortunately, when you don’t need the data it’s reading, it can be the source of major performance problems. In my experience, readahead is one of the worst culprits in the world of Cassandra performance. It’s especially terrible for lightweight transactions and counters, where we perform a read before write.</p><p>My advice to Cassandra operators is to reduce readahead to 4KB to avoid unnecessary read amplification on the read path.</p><p>Readahead does have one place, however, where it can benefit performance. You <em>may</em> have already guessed that it’s compaction. Let’s take a step back and look at how the size of our reads impacts our throughput in a simple benchmark. Larger reads, initiated either by readahead or by the user, should deliver improved throughput, especially when we’re dealing with a quota on our IOPS (EBS), our drives have higher latency (SAN), or both.</p><h2 id=\"benchmarking\">Benchmarking</h2><p>I ran benchmark tests with sequential <code>fio</code> workloads using different request sizes on a 3K IOPS GP3 EBS volume. 
Here’s the configuration used:</p><div class=\"highlight\"><pre class=\"language-text\" data-lang=\"text\">[global]\nrw=read\ndirectory=data\ndirect=1\ntime_based=1\nfile_service_type=normal\nstonewall\nsize=100M\nnumjobs=12\ngroup_reporting\n[bs4]\nstonewall\nruntime=60s\nblocksize=4k\n[bs8]\nstonewall\nruntime=60s\nblocksize=8k\n[bs16]\nstonewall\nruntime=60s\nblocksize=16k\n[bs32]\nstonewall\nruntime=60s\nblocksize=32k\n[bs64]\nstonewall\nruntime=60s\nblocksize=64k\n[bs128]\nstonewall\nblocksize=128k\nruntime=60s\n[bs256]\nstonewall\nruntime=60s\nblocksize=256k\n</pre></div><p>When reviewing the results, the benefits of using larger request sizes were evident:</p><table><thead><tr><th>Request Size</th>\n<th>IOPS</th>\n<th>Throughput</th>\n</tr></thead><tbody><tr><td>4K</td>\n<td>3049</td>\n<td>11.9 MB/s</td>\n</tr><tr><td>8K</td>\n<td>3012</td>\n<td>23 MB/s</td>\n</tr><tr><td>16K</td>\n<td>3013</td>\n<td>47 MB/s</td>\n</tr><tr><td>32K</td>\n<td>3013</td>\n<td>94 MB/s</td>\n</tr><tr><td>64K</td>\n<td>1938</td>\n<td>121 MB/s</td>\n</tr><tr><td>128K</td>\n<td>957</td>\n<td>120 MB/s</td>\n</tr><tr><td>256K</td>\n<td>478</td>\n<td>120 MB/s</td>\n</tr></tbody></table><p>The data shows that using 256KB reads instead of 16KB reads would deliver almost 3x the throughput while using only 1/6th of the provisioned IOPS. That’s a massive efficiency improvement. Rather than chewing through all our IOPS to deliver a paltry 47MB/s of throughput, we’re only using about 500 for 120MB/s. 
That means if we can see these gains in the database, we’ll be able to compact faster, put more data on each node, and lower our total cost.</p><h2 id=\"the-solution-internally-buffering-sequential-reads\">The Solution: Internally Buffering Sequential Reads</h2><p>In <a href=\"https://issues.apache.org/jira/browse/CASSANDRA-15452\" target=\"_blank\">CASSANDRA-15452</a>, I worked with my fellow Cassandra committer Jordan West to implement a solution: an efficient, internal read-ahead buffer for bulk reading operations. Here’s how it works:</p><ol><li>Instead of reading tiny chunks, we use a 256KB off-heap buffer</li>\n<li>Each read operation pulls in a full 256KB of data at once</li>\n<li>Compressed chunks are extracted from this buffer as needed</li>\n<li>The buffer is refilled only when necessary</li>\n</ol><p>This approach maximizes IOPS efficiency by using larger reads during compaction (as well as repair and range reads) that deliver more data per operation. For cloud environments, it’s a game-changer that directly aligns with storage provider recommendations. AWS EBS, for instance, <a href=\"https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-io-characteristics.html#ebs-io-iops\" target=\"_blank\">considers any I/O operation up to 256KB as a single operation</a>, so by using the largest possible size we should get optimal performance.</p><h2 id=\"real-world-impact-a-major-improvement-in-compaction-throughput\">Real-World Impact: A Major Improvement in Compaction Throughput</h2><p>When Jordan and I tested the implementation using <a href=\"https://github.com/rustyrazorblade/easy-cass-lab\" target=\"_blank\">easy-cass-lab</a> on EBS, the results were nothing short of spectacular. The <code>10.0.2.171</code> node is running our patched version, the other two nodes are running an unpatched release. 
The graphs clearly show a 2-3x improvement to throughput and a 3x reduction in IOPS.</p><p><img src=\"https://rustyrazorblade.com/images/2025/15452-bytes-read.png\" alt=\"15452-bytes-read.png\" referrerpolicy=\"no-referrer\" /></p><p><img src=\"https://rustyrazorblade.com/images/2025/15452-compaction.png\" alt=\"Compaction Throughput Comparison\" referrerpolicy=\"no-referrer\" /></p><p>You can see the results in the flamegraph as well. The calls to <code>pread</code> take up significantly less time.</p><p><img src=\"https://rustyrazorblade.com/images/2025/wall-clock-profile-compaction-after-15452.png\" alt=\"wall-clock-profile-compaction-after-15452.png\" referrerpolicy=\"no-referrer\" /></p><p>We can use <code>xfsslower</code> from <code>bcc-tools</code> again to watch the filesystem access:</p><div class=\"highlight\"><pre class=\"language-shell\" data-lang=\"shell\">$ sudo /usr/share/bcc/tools/xfsslower 0 -p $(cassandra-pid) | awk '$4 == \"R\" { print $0 }'\nTIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME\n14:40:29 CompactionExec 1782   R 262144  256         0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  512         0.06 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  768         0.06 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  1024        0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  1280        0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  1536        0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 241123  1792        0.07 nb-4-big-Data.db\n</pre></div><p>This is a lot better, now we’re fetching 256KB at a time using way fewer requests.</p><p>The EBS test configuration used a GP3 volume with 3K IOPS and 256MB throughput. With the existing code, compaction was bottlenecked by IOPS, peaking at exactly 3K IOPS but achieving only about 51MB/s throughput. 
With our optimization, the same operation used only ~500 IOPS to achieve around 106MB/s, a more than 2x improvement in throughput with 1/3 of the IOPS.</p><p>In our most aggressive testing, <strong>we actually hit the EBS throughput limit rather than the IOPS limit</strong>. That’s a significant transformation in Cassandra’s resource utilization profile.</p><p>The patch also has the benefit of applying to anti-compaction, repair, and range reads. We can see a significant reduction in range reads, aka table scans:</p><p><img src=\"https://rustyrazorblade.com/images/2025/15452-range-reads.png\" alt=\"15452-range-reads.png\" referrerpolicy=\"no-referrer\" /></p><p>If you’re running Spark jobs using the Cassandra connector, you should see an improvement in performance, and your repair times should decrease.</p><h2 id=\"whats-next--can-we-do-more\">What’s next? Can we do more?</h2><p>Yes, absolutely! There are several more improvements to IO that will help improve things. I’ll cover them here very quickly, and if there’s interest I’ll write about them in detail in a future post.</p><h3 id=\"avoid-reading-the-statistics\">Avoid Reading the Statistics</h3><p>When compacting, we read data out of the Statistics.db file before reading the data itself. This is completely unnecessary, as it’s stats about the data we’re about to read. Skipping this can reduce IO even further. 
Looking at a compaction’s IO activity, I see about 30% of the filesystem access is reading from <code>Statistics.db</code>:</p><div class=\"highlight\"><pre class=\"language-text\" data-lang=\"text\">14:40:29 CompactionExec 1782   R 4096    0           0.00 nb-3-big-Statistics.db\n14:40:29 CompactionExec 1782   R 701     4           0.00 nb-3-big-Statistics.db\n14:40:29 CompactionExec 1782   R 4096    0           0.00 nb-4-big-Statistics.db\n14:40:29 CompactionExec 1782   R 4096    4           0.00 nb-4-big-Statistics.db\n14:40:29 CompactionExec 1782   R 1962    8           0.00 nb-4-big-Statistics.db\n14:40:29 CompactionExec 1782   R 262144  0           0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 2115    0           0.01 nb-3-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  256         0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  512         0.06 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  768         0.06 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  1024        0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  1280        0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 262144  1536        0.07 nb-4-big-Data.db\n14:40:29 CompactionExec 1782   R 241123  1792        0.07 nb-4-big-Data.db\n</pre></div><p>This has already been fixed in <code>trunk</code> by Branimir Lambov in <a href=\"https://issues.apache.org/jira/browse/CASSANDRA-20092\" target=\"_blank\">CASSANDRA-20092</a> and is being backported to 5.0 by Jordan.</p><h3 id=\"direct-io-for-compaction\">Direct I/O for Compaction</h3><p>Let’s talk more about page cache. Since we go through the Linux page cache when doing reads, we want to make sure it’s working optimally. Page cache lets us avoid going to disk! Unfortunately we also use it when reading for compaction. This is a problem because we’re pulling data into the page cache that we plan on deleting. To make room for the new data, other data will be evicted. 
If we compact 10GB of data, we’re pushing out a lot of valuable data from the page cache, meaning it needs to be fetched back into memory later on. Using <a href=\"https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/5/html/global_file_system/s1-manage-direct-io\" target=\"_blank\">Direct I/O</a> we can bypass the page cache entirely, which will prevent data from being evicted. This can be a huge help in latency sensitive systems or systems where IOPS are limited like EBS.</p><p>I’ve filed <a href=\"https://issues.apache.org/jira/browse/CASSANDRA-19987\" target=\"_blank\">CASSANDRA-19987</a> to look at this.</p><h3 id=\"non-blocking-compression\">Non Blocking Compression</h3><p>Next, compression. When we’re writing to disk, we fill a buffer, sized by the <code>chunk_length_in_kb</code> table setting, compress, and write to disk. The compression here is a blocking call, which means we can spend a lot of time waiting on compression to finish, when we could be reading and merging the next chunk in parallel. This can show up as a performance bottleneck, so I’ve filed <a href=\"https://issues.apache.org/jira/browse/CASSANDRA-20085\" target=\"_blank\">CASSANDRA-20085</a> to look into it.</p><h3 id=\"better-memory-management\">Better Memory Management</h3><p>When a system is not bottlenecked on disk I/O, such as when using NVMe, the main issue we’ll run into is our heap allocation rate. I’ll go into details in a future post, but for now, it’s enough to know that the more memory we allocate, the worse our performance. Being smart about memory allocations can make a big difference in overall time spent, as allocations aren’t free. It also reduces both the frequency and duration of Garbage Collection. Big wins all around.</p><p>I recently profiled an instance where the row size was about 2KB (not out of the ordinary) and found that a single call was accounting for roughly 50% of memory allocated. 
Fixing this <em>one</em> thing has the potential to deliver a massive performance improvement, especially in workloads where we have either lots of fields, or large fields like serialized blobs.</p><p>Reaching again for async-profiler, this time we run it with <code>-e alloc</code> to track allocations and <code>--reverse</code> to reverse the stacks. I do this because the same underlying call comes from the read path and compaction, and I want to see the time in aggregate.</p><p><img src=\"https://rustyrazorblade.com/images/2025/allocation-profile-compaction.png\" alt=\"allocation-profile-compaction.png\" referrerpolicy=\"no-referrer\" /></p><p>Addressing this single allocation won’t just deliver faster compaction, but will reduce pressure on the heap, which in turn reduces GC overhead. As part of this series I’ll also be covering GC, as a lot’s changed since I wrote about it last.</p><p>I’ve filed <a href=\"https://issues.apache.org/jira/browse/CASSANDRA-20428\" target=\"_blank\">CASSANDRA-20428</a> and there’s already a fair bit of discussion about different approaches to solving the problem.</p><h2 id=\"conclusion\">Conclusion</h2><p>Maximizing compaction throughput is critical for achieving higher node density with Apache Cassandra. 
The improvements in <a href=\"https://issues.apache.org/jira/browse/CASSANDRA-15452\" target=\"_blank\">CASSANDRA-15452</a> have removed one of the primary bottlenecks that previously limited practical node size in a lot of clusters.</p><p>By upgrading to Cassandra 5.0.4 (or later) you can:</p><ol><li>Dramatically improve compaction throughput</li>\n<li>Reduce IOPS consumption significantly</li>\n<li>Improve overall system stability during write-heavy workloads</li>\n<li>Increase the maximum practical data density per node</li>\n<li>Significantly reduce your cloud storage costs</li>\n</ol><p>This improvement, combined with the streaming optimizations discussed in the <a href=\"https://rustyrazorblade.com/post/2025/03-streaming/\">previous post</a>, creates a multiplier effect on your ability to increase node density. Each optimization removes a bottleneck, allowing you to push your hardware further and achieve more with less.</p><p>In my next post, I’ll be discussing how and why compaction strategies affect node density. Picking the right strategy can have a significant impact on your cluster’s performance and cost efficiency. Make sure you sign up for my <a href=\"https://rustyrazorblade.com/mailing-list/\">mailing list</a> if you’re interested in getting notified when it’s released!</p>If you found this post helpful, please consider sharing to your network. I'm also available to help you be successful with your distributed systems! 
Please <a href=\"mailto:info@rustyrazorblade.com?subject=Consulting%20Services%20Inquiry\">reach out</a> if you're interested in working with me, and I'll be happy to schedule a free one-hour consultation.","id":"50f3d448-afe6-5e2f-a1f9-87786beecc12","title":"Cassandra Compaction Throughput Performance Explained","origin_url":"https://rustyrazorblade.com/post/2025/04-compaction-throughput/","url":"https://rustyrazorblade.com/post/2025/04-compaction-throughput/","wallabag_created_at":"2025-04-24T12:03:02+00:00","published_at":"2025-04-16T00:00:00+00:00","published_by":"['']","reading_time":14,"domain_name":"rustyrazorblade.com","preview_picture":"https://rustyrazorblade.com/images/2025/wall-clock-profile-compaction.png","tags":["cassandra","performance"],"description":"This is the second post in my series on improving node density and lowering costs with Apache Cassandra. In the previous post, I examined how streaming performance impacts node density and operational..."},{"content":"<p dir=\"auto\">Welcome to the Awesome Accord repository! This guide provides resources and examples for implementing ACID transactions in Apache Cassandra. Learn how to leverage distributed transactions for building reliable applications.</p><ul dir=\"auto\"><li><strong>Quick Start with Docker</strong>: Single-node deployment for immediate testing</li>\n<li><strong>Lab Environment</strong>: Multi-node cluster setup for development</li>\n<li><strong>Use Cases &amp; Examples</strong>: Production-ready implementations</li>\n<li><strong>Learning Resources</strong>: Documentation and best practices</li>\n</ul><p dir=\"auto\">Accord is in active development and still a feature branch in the Apache Cassandra® repo. You will find bugs. 
What we ask is that you help by contributing a bug report.</p><p dir=\"auto\">You can use the <a href=\"https://github.com/pmcfadin/awesome-accord/discussions\">GitHub Discussions</a> bug report forum for this or use the Planet Cassandra Discord channel for Accord listed below. A bug report should have the following:</p><ul dir=\"auto\"><li>The data model used</li>\n<li>Actions to reproduce the bug</li>\n<li>Full stack trace from system.log</li>\n</ul><p dir=\"auto\">If you have suggestions about syntax or improving the overall developer experience, we want to hear about that too! Add it as a suggestion or feature request using <a href=\"https://github.com/pmcfadin/awesome-accord/discussions\">GitHub Discussions</a> or let us know in the Planet Cassandra Discord.</p><p dir=\"auto\">Now, on to the fun!</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker pull pmcfadin/cassandra-accord docker run -d --name cassandra-accord -p 9042:9042 pmcfadin/cassandra-accord\"><pre>docker pull pmcfadin/cassandra-accord\ndocker run -d --name cassandra-accord -p 9042:9042 pmcfadin/cassandra-accord</pre></div><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"brew tap rustyrazorblade/rustyrazorblade brew install easy-cass-lab\"><pre>brew tap rustyrazorblade/rustyrazorblade\nbrew install easy-cass-lab</pre></div><ul dir=\"auto\"><li><strong>Banking Transactions</strong>: Account transfers with ACID guarantees</li>\n<li><strong>Inventory Management</strong>: Race-free inventory tracking</li>\n<li><strong>User Management</strong>: Multi-table atomic operations</li>\n</ul><ul dir=\"auto\"><li>Provide feedback and bug reports in the <a href=\"https://github.com/pmcfadin/awesome-accord/discussions\">repository forum</a></li>\n<li><a href=\"https://discord.gg/GrRCajJqmQ\" rel=\"nofollow\">Join our 
Discord Community</a> for discussions and support</li>\n<li>Review our <a href=\"https://github.com/pmcfadin/awesome-accord/blob/main/CONTRIBUTING.md\">Contributor Guide</a></li>\n<li>Submit issues and improvements through GitHub</li>\n</ul><div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"/ ├── docker/ # Docker configuration and setup ├── easy-cass-lab/ # Multi-node testing environment ├── examples/ # Implementation examples │ ├── banking/ # Financial transaction examples │ ├── inventory/ # Stock management examples │ └── user-mgmt/ # User operations examples └── docs/ # Guides and documentation\"><pre>/\n├── docker/              # Docker configuration and setup\n├── easy-cass-lab/      # Multi-node testing environment\n├── examples/           # Implementation examples\n│   ├── banking/       # Financial transaction examples\n│   ├── inventory/     # Stock management examples\n│   └── user-mgmt/     # User operations examples\n└── docs/              # Guides and documentation\n</pre></div><p dir=\"auto\">Our <a href=\"https://github.com/pmcfadin/awesome-accord/blob/main/docs/README.md\">documentation</a> includes:</p><ul dir=\"auto\"><li>Comprehensive setup instructions</li>\n<li>Transaction patterns and implementations</li>\n<li>Performance optimization guides</li>\n<li>Troubleshooting and best practices</li>\n</ul><ol dir=\"auto\"><li>Choose your deployment option:\n<ul dir=\"auto\"><li><a href=\"https://github.com/pmcfadin/awesome-accord/blob/main/docker/README.md\">Docker Guide</a></li>\n<li><a href=\"https://github.com/pmcfadin/awesome-accord/blob/main/easy-cass-lab/README.md\">Easy-Cass-Lab Guide</a></li>\n</ul></li>\n<li>Follow the <a href=\"https://github.com/pmcfadin/awesome-accord/blob/main/docs/quickstart.md\">Quick Start Guide</a></li>\n<li>Explore <a href=\"https://github.com/pmcfadin/awesome-accord/blob/main/examples\">example implementations</a></li>\n<li>Connect with our <a 
href=\"https://discord.gg/GrRCajJqmQ\" rel=\"nofollow\">Discord community</a></li>\n<li>Feedback! <a href=\"https://github.com/pmcfadin/awesome-accord/discussions\">Github Discussions</a></li>\n</ol><div class=\"highlight highlight-source-sql notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"BEGIN TRANSACTION LET fromBalance = (SELECT account_balance FROM ks.accounts WHERE account_holder='alice'); IF fromBalance.account_balance &gt;= 20 THEN UPDATE ks.accounts SET account_balance -= 20 WHERE account_holder='alice'; UPDATE ks.accounts SET account_balance += 20 WHERE account_holder='bob'; END IF COMMIT TRANSACTION;\"><pre>BEGIN TRANSACTION\n    LET fromBalance = (SELECT account_balance \n                      FROM ks.accounts \n                      WHERE account_holder='alice');\n    IF fromBalance.account_balance &gt;= 20 THEN\n        UPDATE ks.accounts \n        SET account_balance -= 20 \n        WHERE account_holder='alice';\n        UPDATE ks.accounts \n        SET account_balance += 20 \n        WHERE account_holder='bob';\n    END IF\nCOMMIT TRANSACTION;</pre></div><p dir=\"auto\">Apache License 2.0</p>","id":"f304c7a0-b03c-5bc4-aa5d-df399a9124f4","title":"GitHub - pmcfadin/awesome-accord: Repository of all kinds of things to help you get up and running with ACID transactions on Apache Cassandra®","origin_url":"https://github.com/pmcfadin/awesome-accord","url":"https://github.com/pmcfadin/awesome-accord","wallabag_created_at":"2025-01-16T16:28:31+00:00","published_at":null,"published_by":"['pmcfadin']","reading_time":1,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/3e477fb2dd2b1ded1c5b53477f4848297badc75ece00c5b49bad1476fdb76167/pmcfadin/awesome-accord","tags":["acid","open.source","cassandra","accord"],"description":"Welcome to the Awesome Accord repository! This guide provides resources and examples for implementing ACID transactions in Apache Cassandra. 
Learn how to leverage distributed transactions for building..."},{"content":"<p dir=\"auto\">Visual Flow is an ETL tool designed for effective data manipulation via a convenient and user-friendly interface. The tool has the following capabilities:</p><ul dir=\"auto\"><li>Integrate data from heterogeneous sources:\n<ul dir=\"auto\"><li>AWS S3</li>\n<li>Cassandra</li>\n<li>ClickHouse</li>\n<li>DB2</li>\n<li>Dataframe (for reading)</li>\n<li>Elasticsearch</li>\n<li>IBM COS</li>\n<li>Kafka</li>\n<li>Local File</li>\n<li>MS SQL</li>\n<li>Mongo</li>\n<li>MySQL/Maria</li>\n<li>Oracle</li>\n<li>PostgreSQL</li>\n<li>Redis</li>\n<li>Redshift</li>\n</ul></li>\n<li>Leverage direct connectivity to enterprise applications as sources and targets</li>\n<li>Perform data processing and transformation</li>\n<li>Run custom code</li>\n<li>Leverage metadata for analysis and maintenance</li>\n</ul><p dir=\"auto\">The Visual Flow application is divided into the following repositories:</p><p dir=\"auto\"><a href=\"https://github.com/ibagroup-eu/Visual-Flow/blob/main/CONTRIBUTING.md\">Check the official guide</a>.</p><p dir=\"auto\">Visual Flow is open-source software licensed under the <a href=\"https://github.com/ibagroup-eu/Visual-Flow/blob/main/LICENSE\">Apache-2.0 license</a>.</p>","id":"86395185-03f1-53f4-94a4-e3a2d2d45779","title":"GitHub - ibagroup-eu/Visual-Flow: Visual-Flow main repository","origin_url":"https://github.com/ibagroup-eu/Visual-Flow","url":"https://github.com/ibagroup-eu/Visual-Flow","wallabag_created_at":"2024-12-02T13:34:31+00:00","published_at":null,"published_by":"['ibagroup-eu']","reading_time":null,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/9187fdecad3a37939c1971bcdec19ffed4090307ee508b009f47c7bcd49a7f8d/ibagroup-eu/Visual-Flow","tags":["mongo","nocode","elasticsearch","open.source","cassandra","data.pipeline","elastic","aws.s3","etl","low.code","postgres"],"description":"Visual Flow is an ETL tool designed for effective 
data manipulation via convenient and user-friendly interface. The tool has the following capabilities:Can integrate data from heterogeneous sources:\nA..."},{"content":"<p dir=\"auto\"><a href=\"https://github.com/datastax/cql-proxy/actions/workflows/test.yml\"><img src=\"https://github.com/datastax/cql-proxy/actions/workflows/test.yml/badge.svg\" alt=\"GitHub Action\" class=\"c13\" referrerpolicy=\"no-referrer\" /></a> <a href=\"https://goreportcard.com/report/github.com/datastax/cql-proxy\" rel=\"nofollow\"><img src=\"https://camo.githubusercontent.com/e1c32ff51117d37ba38fd853bb54c63214d25a3a367d0de90a00a03124924acb/68747470733a2f2f676f7265706f7274636172642e636f6d2f62616467652f6769746875622e636f6d2f64617461737461782f63716c2d70726f7879\" alt=\"Go Report Card\" data-canonical-src=\"https://goreportcard.com/badge/github.com/datastax/cql-proxy\" class=\"c13\" referrerpolicy=\"no-referrer\" /></a></p><p dir=\"auto\"><a target=\"_blank\" rel=\"noopener noreferrer\" href=\"https://github.com/datastax/cql-proxy/blob/main/cql-proxy.png\"><img src=\"https://github.com/datastax/cql-proxy/raw/main/cql-proxy.png\" alt=\"cql-proxy\" class=\"c13\" referrerpolicy=\"no-referrer\" /></a></p><p dir=\"auto\"><code>cql-proxy</code> is designed to forward your application's CQL traffic to an appropriate database service. It listens on a local address and securely forwards that traffic.</p><p dir=\"auto\">The <code>cql-proxy</code> sidecar enables unsupported CQL drivers to work with <a href=\"https://astra.datastax.com/\" rel=\"nofollow\">DataStax Astra</a>. 
These drivers include both legacy DataStax <a href=\"https://docs.datastax.com/en/driver-matrix/doc/driver_matrix/common/driverMatrix.html\" rel=\"nofollow\">drivers</a> and community-maintained CQL drivers, such as the <a href=\"https://github.com/gocql/gocql\">gocql</a> driver and the <a href=\"https://github.com/scylladb/scylla-rust-driver\">rust-driver</a>.</p><p dir=\"auto\"><code>cql-proxy</code> also enables applications that are currently using <a href=\"https://cassandra.apache.org/\" rel=\"nofollow\">Apache Cassandra</a> or <a href=\"https://www.datastax.com/products/datastax-enterprise\" rel=\"nofollow\">DataStax Enterprise (DSE)</a> to use Astra without requiring any code changes. Your application just needs to be configured to use the proxy.</p><p dir=\"auto\">If you're building a new application using DataStax <a href=\"https://docs.datastax.com/en/driver-matrix/doc/driver_matrix/common/driverMatrix.html\" rel=\"nofollow\">drivers</a>, <code>cql-proxy</code> is not required, as the drivers can communicate directly with Astra. DataStax drivers support Astra out of the box, and are documented in the <a href=\"https://docs.datastax.com/en/astra/docs/connecting-to-astra-databases-using-datastax-drivers.html\" rel=\"nofollow\">driver guide</a>.</p><p dir=\"auto\">Use the <code>-h</code> or <code>--help</code> flag to display a listing of all flags and their corresponding descriptions and environment variables (shown below as items starting with <code>$</code>):</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"$ ./cql-proxy -h Usage: cql-proxy Flags: -h, --help Show context-sensitive help. -b, --astra-bundle=STRING Path to secure connect bundle for an Astra database. Requires '--username' and '--password'. Ignored if using the token or contact points option ($ASTRA_BUNDLE). 
-t, --astra-token=STRING Token used to authenticate to an Astra database. Requires '--astra-database-id'. Ignored if using the bundle path or contact points option ($ASTRA_TOKEN). -i, --astra-database-id=STRING Database ID of the Astra database. Requires '--astra-token' ($ASTRA_DATABASE_ID) --astra-api-url=&quot;https://api.astra.datastax.com&quot; URL for the Astra API ($ASTRA_API_URL) --astra-timeout=10s Timeout for contacting Astra when retrieving the bundle and metadata ($ASTRA_TIMEOUT) -c, --contact-points=CONTACT-POINTS,... Contact points for cluster. Ignored if using the bundle path or token option ($CONTACT_POINTS). -u, --username=STRING Username to use for authentication ($USERNAME) -p, --password=STRING Password to use for authentication ($PASSWORD) -r, --port=9042 Default port to use when connecting to cluster ($PORT) -n, --protocol-version=&quot;v4&quot; Initial protocol version to use when connecting to the backend cluster (default: v4, options: v3, v4, v5, DSEv1, DSEv2) ($PROTOCOL_VERSION) -m, --max-protocol-version=&quot;v4&quot; Max protocol version supported by the backend cluster (default: v4, options: v3, v4, v5, DSEv1, DSEv2) ($MAX_PROTOCOL_VERSION) -a, --bind=&quot;:9042&quot; Address to use to bind server ($BIND) -f, --config=CONFIG YAML configuration file ($CONFIG_FILE) --debug Show debug logging ($DEBUG) --health-check Enable liveness and readiness checks ($HEALTH_CHECK) --http-bind=&quot;:8000&quot; Address to use to bind HTTP server used for health checks ($HTTP_BIND) --heartbeat-interval=30s Interval between performing heartbeats to the cluster ($HEARTBEAT_INTERVAL) --idle-timeout=60s Duration between successful heartbeats before a connection to the cluster is considered unresponsive and closed ($IDLE_TIMEOUT) --readiness-timeout=30s Duration the proxy is unable to connect to the backend cluster before it is considered not ready ($READINESS_TIMEOUT) --idempotent-graph If true it will treat all graph queries as idempotent by default and 
retry them automatically. It may be dangerous to retry some graph queries -- use with caution ($IDEMPOTENT_GRAPH). --num-conns=1 Number of connection to create to each node of the backend cluster ($NUM_CONNS) --proxy-cert-file=STRING Path to a PEM encoded certificate file with its intermediate certificate chain. This is used to encrypt traffic for proxy clients ($PROXY_CERT_FILE) --proxy-key-file=STRING Path to a PEM encoded private key file. This is used to encrypt traffic for proxy clients ($PROXY_KEY_FILE) --rpc-address=STRING Address to advertise in the 'system.local' table for 'rpc_address'. It must be set if configuring peer proxies ($RPC_ADDRESS) --data-center=STRING Data center to use in system tables ($DATA_CENTER) --tokens=TOKENS,... Tokens to use in the system tables. It's not recommended ($TOKENS)\"><pre>$ ./cql-proxy -h\nUsage: cql-proxy\nFlags:\n  -h, --help                                              Show context-sensitive help.\n  -b, --astra-bundle=STRING                               Path to secure connect bundle for an Astra database. Requires '--username' and '--password'. Ignored if using the\n                                                          token or contact points option ($ASTRA_BUNDLE).\n  -t, --astra-token=STRING                                Token used to authenticate to an Astra database. Requires '--astra-database-id'. Ignored if using the bundle path\n                                                          or contact points option ($ASTRA_TOKEN).\n  -i, --astra-database-id=STRING                          Database ID of the Astra database. Requires '--astra-token' ($ASTRA_DATABASE_ID)\n      --astra-api-url=\"https://api.astra.datastax.com\"    URL for the Astra API ($ASTRA_API_URL)\n      --astra-timeout=10s                                 Timeout for contacting Astra when retrieving the bundle and metadata ($ASTRA_TIMEOUT)\n  -c, --contact-points=CONTACT-POINTS,...                 Contact points for cluster. 
Ignored if using the bundle path or token option ($CONTACT_POINTS).\n  -u, --username=STRING                                   Username to use for authentication ($USERNAME)\n  -p, --password=STRING                                   Password to use for authentication ($PASSWORD)\n  -r, --port=9042                                         Default port to use when connecting to cluster ($PORT)\n  -n, --protocol-version=\"v4\"                             Initial protocol version to use when connecting to the backend cluster (default: v4, options: v3, v4, v5, DSEv1,\n                                                          DSEv2) ($PROTOCOL_VERSION)\n  -m, --max-protocol-version=\"v4\"                         Max protocol version supported by the backend cluster (default: v4, options: v3, v4, v5, DSEv1, DSEv2)\n                                                          ($MAX_PROTOCOL_VERSION)\n  -a, --bind=\":9042\"                                      Address to use to bind server ($BIND)\n  -f, --config=CONFIG                                     YAML configuration file ($CONFIG_FILE)\n      --debug                                             Show debug logging ($DEBUG)\n      --health-check                                      Enable liveness and readiness checks ($HEALTH_CHECK)\n      --http-bind=\":8000\"                                 Address to use to bind HTTP server used for health checks ($HTTP_BIND)\n      --heartbeat-interval=30s                            Interval between performing heartbeats to the cluster ($HEARTBEAT_INTERVAL)\n      --idle-timeout=60s                                  Duration between successful heartbeats before a connection to the cluster is considered unresponsive and closed\n                                                          ($IDLE_TIMEOUT)\n      --readiness-timeout=30s                             Duration the proxy is unable to connect to the backend cluster before it is considered not ready\n                                 
                         ($READINESS_TIMEOUT)\n      --idempotent-graph                                  If true it will treat all graph queries as idempotent by default and retry them automatically. It may be\n                                                          dangerous to retry some graph queries -- use with caution ($IDEMPOTENT_GRAPH).\n      --num-conns=1                                       Number of connection to create to each node of the backend cluster ($NUM_CONNS)\n      --proxy-cert-file=STRING                            Path to a PEM encoded certificate file with its intermediate certificate chain. This is used to encrypt traffic\n                                                          for proxy clients ($PROXY_CERT_FILE)\n      --proxy-key-file=STRING                             Path to a PEM encoded private key file. This is used to encrypt traffic for proxy clients ($PROXY_KEY_FILE)\n      --rpc-address=STRING                                Address to advertise in the 'system.local' table for 'rpc_address'. It must be set if configuring peer proxies\n                                                          ($RPC_ADDRESS)\n      --data-center=STRING                                Data center to use in system tables ($DATA_CENTER)\n      --tokens=TOKENS,...                                 Tokens to use in the system tables. It's not recommended ($TOKENS)</pre></div><p dir=\"auto\">To pass configuration to <code>cql-proxy</code>, either command-line flags, environment variables, or a configuration file can be used. 
Using the <code>docker</code> method as an example, the following samples show how the token and database ID are defined with each method.</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 --rm datastax/cql-proxy:v0.1.5 --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;\"><pre>docker run -p 9042:9042 \\\n  --rm datastax/cql-proxy:v0.1.5 \\\n  --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;</pre></div><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 --rm -e ASTRA_TOKEN=&lt;astra-token&gt; -e ASTRA_DATABASE_ID=&lt;astra-database-id&gt; datastax/cql-proxy:v0.1.5\"><pre>docker run -p 9042:9042 \\\n  --rm \\\n  -e ASTRA_TOKEN=&lt;astra-token&gt; -e ASTRA_DATABASE_ID=&lt;astra-database-id&gt; \\\n  datastax/cql-proxy:v0.1.5</pre></div><p dir=\"auto\">Proxy settings can also be passed using a configuration file with the <code>--config /path/to/proxy.yaml</code> flag. This can be mixed and matched with command-line flags and environment variables. 
Here are some example configuration files:</p><div class=\"highlight highlight-source-yaml notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"contact-points: - 127.0.0.1 username: cassandra password: cassandra port: 9042 bind: 127.0.0.1:9042 # ...\"><pre>contact-points:\n  - 127.0.0.1\nusername: cassandra\npassword: cassandra\nport: 9042\nbind: 127.0.0.1:9042\n# ...</pre></div><p dir=\"auto\">or with an Astra token:</p><div class=\"highlight highlight-source-yaml notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"astra-token: &lt;astra-token&gt; astra-database-id: &lt;astra-database-id&gt; bind: 127.0.0.1:9042 # ...\"><pre>astra-token: &lt;astra-token&gt;\nastra-database-id: &lt;astra-database-id&gt;\nbind: 127.0.0.1:9042\n# ...</pre></div><p dir=\"auto\">All configuration keys match their command-line flag counterpart, e.g. <code>--astra-bundle</code> is <code>astra-bundle:</code>, <code>--contact-points</code> is <code>contact-points:</code>, etc.</p><p dir=\"auto\">Multi-region failover with a DC-aware load balancing policy is the most useful case for a multiple-proxy setup.</p><p dir=\"auto\">When configuring <code>peers:</code>, you must set <code>--rpc-address</code> (or <code>rpc-address:</code> in the YAML) for each proxy, and it must match its corresponding <code>peers:</code> entry. Also, <code>peers:</code> is only available in the configuration file and cannot be set using a command-line flag.</p><p dir=\"auto\">Here's an example of configuring multi-region failover with two proxies. A proxy is started for each region of the cluster and connects to it using that region's bundle. 
They all share a common configuration file that contains the full list of proxies.</p><p dir=\"auto\"><em>Note:</em> Only bundles are supported for multi-region setups.</p><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"cql-proxy --astra-bundle astra-region1-bundle.zip --username token --password &lt;astra-token&gt; --bind 127.0.0.1:9042 --rpc-address 127.0.0.1 --data-center dc-1 --config proxy.yaml\"><pre>cql-proxy --astra-bundle astra-region1-bundle.zip --username token --password &lt;astra-token&gt; \\\n  --bind 127.0.0.1:9042 --rpc-address 127.0.0.1 --data-center dc-1 --config proxy.yaml</pre></div><div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"cql-proxy --astra-bundle astra-region2-bundle.zip --username token --password &lt;astra-token&gt; --bind 127.0.0.2:9042 --rpc-address 127.0.0.2 --data-center dc-2 --config proxy.yaml\"><pre>cql-proxy --astra-bundle astra-region2-bundle.zip --username token --password &lt;astra-token&gt; \\\n  --bind 127.0.0.2:9042 --rpc-address 127.0.0.2 --data-center dc-2 --config proxy.yaml</pre></div><p dir=\"auto\">The peers settings are configured using a YAML file. It's a good idea to explicitly provide the <code>--data-center</code> flag; otherwise, these values are pulled from the backend cluster, and you would need to look them up in the <code>system.local</code> and <code>system.peers</code> tables to properly set up the peers' <code>data-center:</code> values. 
Here's an example <code>proxy.yaml</code>:</p><div class=\"highlight highlight-source-yaml notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"peers: - rpc-address: 127.0.0.1 data-center: dc-1 - rpc-address: 127.0.0.2 data-center: dc-2\"><pre>peers:\n  - rpc-address: 127.0.0.1\n    data-center: dc-1\n  - rpc-address: 127.0.0.2\n    data-center: dc-2</pre></div><p dir=\"auto\"><em>Note:</em> It's okay for the <code>peers:</code> to contain entries for the current proxy itself because they'll just be omitted.</p><p dir=\"auto\">There are three methods for using <code>cql-proxy</code>:</p><ul dir=\"auto\"><li>Locally build and run <code>cql-proxy</code></li>\n<li>Run a docker image that has <code>cql-proxy</code> installed</li>\n<li>Use a Kubernetes container to run <code>cql-proxy</code></li>\n</ul><ol dir=\"auto\"><li>\n<p dir=\"auto\">Build <code>cql-proxy</code>.</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"go build\"><pre>go build</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Run with your desired database.</p>\n<ul dir=\"auto\"><li>\n<p dir=\"auto\"><a href=\"https://astra.datastax.com/\" rel=\"nofollow\">DataStax Astra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"./cql-proxy --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;\"><pre>./cql-proxy --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;</pre></div>\n<p dir=\"auto\">The <code>&lt;astra-token&gt;</code> can be generated using these <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html\" rel=\"nofollow\">instructions</a>. 
The proxy also supports using the <a href=\"https://docs.datastax.com/en/astra/docs/obtaining-database-credentials.html#_getting_your_secure_connect_bundle\" rel=\"nofollow\">Astra Secure Connect Bundle</a> along with a client ID and secret generated using these <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html\" rel=\"nofollow\">instructions</a>:</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"./cql-proxy --astra-bundle &lt;your-secure-connect-zip&gt; --username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;\"><pre>./cql-proxy --astra-bundle &lt;your-secure-connect-zip&gt; \\\n--username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;\n</pre></div>\n</li>\n<li>\n<p dir=\"auto\"><a href=\"https://cassandra.apache.org/\" rel=\"nofollow\">Apache Cassandra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"./cql-proxy --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]\"><pre>./cql-proxy --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]</pre></div>\n</li>\n</ul></li>\n</ol><ol dir=\"auto\"><li>\n<p dir=\"auto\">Run with your desired database.</p>\n<ul dir=\"auto\"><li>\n<p dir=\"auto\"><a href=\"https://astra.datastax.com/\" rel=\"nofollow\">DataStax Astra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 datastax/cql-proxy:v0.1.5 --astra-token &lt;astra-token&gt; --astra-database-id &lt;astra-database-id&gt;\"><pre>docker run -p 9042:9042 \\\n  datastax/cql-proxy:v0.1.5 \\\n  --astra-token &lt;astra-token&gt; --astra-database-id 
&lt;astra-database-id&gt;</pre></div>\n<p dir=\"auto\">The <code>&lt;astra-token&gt;</code> can be generated using these <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html\" rel=\"nofollow\">instructions</a>. The proxy also supports using the <a href=\"https://docs.datastax.com/en/astra/docs/obtaining-database-credentials.html#_getting_your_secure_connect_bundle\" rel=\"nofollow\">Astra Secure Connect Bundle</a>, but it requires mounting the bundle to a volume in the container:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -v &lt;your-secure-connect-bundle.zip&gt;:/tmp/scb.zip -p 9042:9042 --rm datastax/cql-proxy:v0.1.5 --astra-bundle /tmp/scb.zip --username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;\"><pre>docker run -v &lt;your-secure-connect-bundle.zip&gt;:/tmp/scb.zip -p 9042:9042 \\\n--rm datastax/cql-proxy:v0.1.5 \\\n--astra-bundle /tmp/scb.zip --username &lt;astra-client-id&gt; --password &lt;astra-client-secret&gt;</pre></div>\n</li>\n<li>\n<p dir=\"auto\"><a href=\"https://cassandra.apache.org/\" rel=\"nofollow\">Apache Cassandra</a> cluster:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"docker run -p 9042:9042 datastax/cql-proxy:v0.1.5 --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]\"><pre>docker run -p 9042:9042 \\\n  datastax/cql-proxy:v0.1.5 \\\n  --contact-points &lt;cluster node IPs or DNS names&gt; [--username &lt;username&gt;] [--password &lt;password&gt;]</pre></div>\n</li>\n</ul></li>\n</ol><p dir=\"auto\">If you wish to have the docker image removed after you are done with it, add <code>--rm</code> before the image name <code>datastax/cql-proxy:v0.1.5</code>.</p><p dir=\"auto\">Using Kubernetes with 
<code>cql-proxy</code> requires a number of steps:</p><ol dir=\"auto\"><li>\n<p dir=\"auto\">Generate a token following the Astra <a href=\"https://docs.datastax.com/en/astra/docs/manage-application-tokens.html#_create_application_token\" rel=\"nofollow\">instructions</a>. This step will display your Client ID, Client Secret, and Token; make sure you download the information for the next steps. Store the secure bundle in <code>/tmp/scb.zip</code> to match the example below.</p>\n</li>\n<li>\n<p dir=\"auto\">Create <code>cql-proxy.yaml</code>. You'll need to add three sets of information: arguments, volume mounts, and volumes. A full example can be found <a href=\"https://github.com/datastax/cql-proxy/blob/main/k8s/cql-proxy.yml\">here</a>.</p>\n</li>\n</ol><ul dir=\"auto\"><li>\n<p dir=\"auto\">Arguments: In the container arguments, set the local bundle location, username, and password, using the client ID and client secret obtained in the previous step.</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"command: [&quot;./cql-proxy&quot;] args: [&quot;--astra-bundle=/tmp/scb.zip&quot;,&quot;--username=Client ID&quot;,&quot;--password=Client Secret&quot;]\"><pre>command: [\"./cql-proxy\"]\nargs: [\"--astra-bundle=/tmp/scb.zip\",\"--username=Client ID\",\"--password=Client Secret\"]\n</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Volume mounts: Modify <code>/tmp/</code> as a volume mount as required.</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"volumeMounts: - name: my-cm-vol mountPath: /tmp/\"><pre>volumeMounts:\n  - name: my-cm-vol\n    mountPath: /tmp/\n</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Volume: Modify the <code>configMap</code> filename as required. In this example, it is named <code>cql-proxy-configmap</code>. 
Use the same name for the <code>volumes</code> that you used for the <code>volumeMounts</code>.</p>\n<div class=\"snippet-clipboard-content notranslate position-relative overflow-auto\" data-snippet-clipboard-copy-content=\"volumes: - name: my-cm-vol configMap: name: cql-proxy-configmap\"><pre>volumes:\n  - name: my-cm-vol\n    configMap:\n      name: cql-proxy-configmap        \n</pre></div>\n</li>\n</ul><ol start=\"3\" dir=\"auto\"><li>\n<p dir=\"auto\">Create a configmap. Use the same secure bundle that was specified in the <code>cql-proxy.yaml</code>.</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl create configmap cql-proxy-configmap --from-file /tmp/scb.zip\"><pre>kubectl create configmap cql-proxy-configmap --from-file /tmp/scb.zip </pre></div>\n</li>\n<li>\n<p dir=\"auto\">Check the configmap that was created.</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl describe configmap cql-proxy-configmap Name: cql-proxy-configmap Namespace: default Labels: &lt;none&gt; Annotations: &lt;none&gt; Data ==== BinaryData ==== scb.zip: 12311 bytes\"><pre>kubectl describe configmap cql-proxy-configmap\n  Name:         cql-proxy-configmap\n  Namespace:    default\n  Labels:       &lt;none&gt;\n  Annotations:  &lt;none&gt;\n  Data\n  ====\n  BinaryData\n  ====\n  scb.zip: 12311 bytes</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Create a Kubernetes deployment with the YAML file you created:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl create -f cql-proxy.yaml\"><pre>kubectl create -f cql-proxy.yaml</pre></div>\n</li>\n<li>\n<p dir=\"auto\">Check the logs:</p>\n<div class=\"highlight highlight-source-shell notranslate position-relative overflow-auto\" 
dir=\"auto\" data-snippet-clipboard-copy-content=\"kubectl logs &lt;deployment-name&gt;\"><pre>kubectl logs &lt;deployment-name&gt;</pre></div>\n</li>\n</ol><p dir=\"auto\">Drivers that use token-aware load balancing may print a warning or may not work when using cql-proxy. Because cql-proxy abstracts the backend cluster as a single endpoint, it doesn't always work well with token-aware drivers that expect there to be at least \"replication factor\" number of nodes in the cluster. Many drivers print a warning (which can be ignored) and fall back to something like round-robin, but other drivers might fail with an error. For drivers that fail with an error, disable token-aware load balancing or configure the round-robin load balancing policy.</p>","id":"17fac8a9-8b96-51ec-a7dd-dd3809bba528","title":"GitHub - datastax/cql-proxy: A client-side CQL proxy/sidecar.","origin_url":"https://github.com/datastax/cql-proxy","url":"https://github.com/datastax/cql-proxy","wallabag_created_at":"2024-11-01T17:26:01+00:00","published_at":null,"published_by":"['datastax']","reading_time":8,"domain_name":"github.com","preview_picture":"https://opengraph.githubassets.com/c2528e3426d98910ed27819e048b4c1081fab2ed2c7adbea6e6a3b1872deb30a/datastax/cql-proxy","tags":["migration","proxy","cassandra","cql"],"description":" cql-proxy is designed to forward your application's CQL traffic to an appropriate database service. It listens on a local address and securely forwards that traffic.The cql-proxy sidecar enables unsu..."}]}]}},"staticQueryHashes":[]}