Cassandra.Link

7/15/2020

Reading time: 5 min

Real-time Cassandra

by Acunu

  1. Real-time Cassandra. Richard Low, richard@acunu.com, @richardalow
  2. Outline
     • What is real-time?
     • How do databases implement real-time queries?
     • Why is Cassandra ideal for real-time applications?
     • Writing real-time applications with Cassandra
  3. What is real-time?
  4. “Of or relating to a system in which input data is processed within milliseconds” (dictionary.com)
     “Occurring immediately” (webopedia)
     “...the most important requirement of a real-time system is predictability and not performance” (wikipedia)
     “...a time frame that is very brief, appearing to be immediate.” (wisegeek.com)
     “Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds” (wikipedia)
  5. Real-time queries
     • ‘Give me X’
     • ‘How many Y?’
     • ‘What is the top K?’
     • ‘How many distinct Z from P?’
  6. Real-time definition
     • Definition: a query is processed in real time if the time to get the answer is at most a constant times the transfer time plus the round-trip time:
       $t_{\text{response}} \le C\,(t_{\text{transfer}} + t_{\text{ping}})$
  7. Real-time definition
     • The more you ask for, the longer it takes
     • For small queries, request dominated by round-trip time
     • No query can take less time than the time to receive it
  8. Real-time definition
     • Users on faster networks expect a faster response
     • What we mean by real-time is getting faster
  9. Implications
     • What does this mean for the database?
     • Use Google Analytics example
     • Simple query: ‘How many page views have there been from France in the last 24 hours?’
  10. Requirement
      • Response is one number
      • With overhead, say ~1KB
      • Ping time 1ms
      • 10Mbit connection => 1KB in ~1ms
      • 2ms total
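
The 2ms budget follows from the definition t_response ≤ C(t_transfer + t_ping). A minimal sketch of the arithmetic, assuming 1KB = 1000 bytes and C = 1 (both are assumptions made here for illustration, and the function name is hypothetical):

```python
def response_budget_ms(payload_bytes, link_bits_per_s, ping_ms, c=1.0):
    """Upper bound on response time: C * (t_transfer + t_ping), in milliseconds."""
    transfer_ms = payload_bytes * 8 / link_bits_per_s * 1000
    return c * (transfer_ms + ping_ms)

# Figures from the slide: ~1KB response, 10Mbit/s link, 1ms ping.
budget = response_budget_ms(payload_bytes=1000, link_bits_per_s=10_000_000, ping_ms=1.0)
print(f"{budget:.1f} ms")
```

The transfer alone comes to about 0.8ms, which the slide rounds to ~1ms, giving roughly 2ms in total.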
  11. Solution 1
      • grep *.fr /var/log/apache2/*.log
      • Suppose you have 1M hits an hour => 7GB of logs a day
      • A single disk would take 70s to scan them
      • Need a beefy server to do this
      • Needs to grow as your audience grows
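
The figures on this slide can be checked with a few lines of arithmetic. This is a sketch only: the 100MB/s sequential read rate is an assumption chosen to match the 70s figure, and decimal units (1GB = 10^9 bytes) are assumed.

```python
HITS_PER_HOUR = 1_000_000
LOG_BYTES_PER_DAY = 7 * 10**9      # 7GB of logs a day (from the slide)
DISK_BYTES_PER_S = 100 * 10**6     # assumed sequential read rate: 100MB/s

hits_per_day = HITS_PER_HOUR * 24
bytes_per_hit = LOG_BYTES_PER_DAY / hits_per_day    # implied size of one log line
scan_seconds = LOG_BYTES_PER_DAY / DISK_BYTES_PER_S

print(f"{hits_per_day:,} hits/day, ~{bytes_per_hit:.0f} bytes per log line")
print(f"full scan: {scan_seconds:.0f} s")
```

A 70s scan is tens of thousands of times over the 2ms budget, and it grows linearly with traffic.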
  12. Solution 2
      • Maintain a counter for each country
      • Increment the counter on each hit
      • On query, just read the counter
      • Maybe it is on disk: 5ms seek
      • No need to scale speed with traffic
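
A minimal in-memory sketch of the counter approach. The names `record_hit` and `page_views` are hypothetical, and a plain `Counter` stands in for whatever store holds the counts:

```python
from collections import Counter

page_views_by_country = Counter()

def record_hit(country_code):
    """Called once per page view: constant work at write time."""
    page_views_by_country[country_code] += 1

def page_views(country_code):
    """Answering the query reads one number, however many hits there were."""
    return page_views_by_country[country_code]

for cc in ["fr", "de", "fr", "us", "fr"]:
    record_hit(cc)

print(page_views("fr"))  # 3
```

This sketch ignores the "last 24 hours" part of the query. Handling it needs one counter per time bucket, as the Counters slide suggests.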
  13. Implications
      • Real-time queries can only read about as much data as they send to the requester
      • Need to precompute answers
      • Store data in a query-centric rather than data-centric view
  14. Age of data
      • A real-time query will often need to query new data
      • But not necessarily
        • Could run a batch process to pre-compute answers
  15. Solutions
  16. Solutions
      • How do you make sure you don’t read any more than you have to?
        • Denormalisation
        • Organisation of data
        • Counters
  17. Denormalisation
      • Hard drive performance constraints:
        • Sequential IO at 100s of MB/s
        • Seek at 100 IO/s
      • Avoid random IO
      • Effective block size 1MB
  18. Denormalisation
      • Store items accessed at similar times near to each other
      • Involves copying
      • Copying isn’t bad
        • Storage costs <$100 per TB
  19. Organisation of data
      • If you read 100 items off disk, ensure they are next to each other
      • Saves reading extra data around them and index lookups
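
The "effective block size" and the cost of scattering 100 items can be worked out from the hard drive figures on the Denormalisation slide. A rough sketch, assuming 100MB/s sequential IO, 100 seeks per second and 1KB items (the item size is an assumption made here, not a figure from the slides):

```python
SEQUENTIAL_BYTES_PER_S = 100 * 10**6   # ~100MB/s sequential IO
SEEKS_PER_S = 100                      # ~100 random IOs per second
ITEM_BYTES = 1000                      # assumed item size, for illustration only
ITEMS = 100

# Amount of data that could be streamed in the time one seek takes.
effective_block_bytes = SEQUENTIAL_BYTES_PER_S // SEEKS_PER_S

# 100 scattered items: one seek each (the small transfer time is ignored).
scattered_ms = ITEMS * 1000 / SEEKS_PER_S
# 100 adjacent items: one seek, then a single sequential read.
adjacent_ms = 1000 / SEEKS_PER_S + ITEMS * ITEM_BYTES * 1000 / SEQUENTIAL_BYTES_PER_S

print(effective_block_bytes)      # 1000000 bytes, i.e. 1MB
print(scattered_ms, adjacent_ms)  # 1000.0 11.0
```

Under these assumptions, keeping the items adjacent is roughly 90 times faster, which is why copying data into the layout each query needs is worth the storage.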
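  20. Fast range queries
      • Get me all keys in the range E to I
      [Diagram: sorted keys A, F, H, I, M, X; the range [E, I] covers F, H, I]
  21. Fast range queries
      • What happens when you insert?
      [Diagram: new keys G and Q arrive; the range [E, I] is shown against G, Q kept apart from A, F, H, I, M, X vs. the fully sorted A, F, G, H, I, M, Q, X]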
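
A small sketch of why sorted order makes range queries cheap, using Python's `bisect` module on an in-memory list. The list stands in for keys stored in order on disk, and `range_query` is a hypothetical helper:

```python
import bisect

def range_query(sorted_keys, lo, hi):
    """Return the keys in the closed range [lo, hi] from a sorted list."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]   # one contiguous slice, no scattered reads

keys = ["A", "F", "H", "I", "M", "X"]
print(range_query(keys, "E", "I"))   # ['F', 'H', 'I']

# Inserting in sorted position keeps the range contiguous.
for new_key in ["G", "Q"]:
    bisect.insort(keys, new_key)
print(keys)                          # ['A', 'F', 'G', 'H', 'I', 'M', 'Q', 'X']
print(range_query(keys, "E", "I"))   # ['F', 'G', 'H', 'I']
```

Inserting into the middle of a sorted list shifts every later element. On disk that would mean rewriting data on every write, so the sketch shows the read-side benefit only.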
  22. Counters
      • For queries that simply count, increment the counter
      • Implement inc, dec, get
      • Store multiple counts e.g. week, day, hour
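
One possible shape for such counters, sketched in memory. The class name, the key layout and the bucket label formats are all assumptions made for illustration:

```python
from collections import defaultdict
from datetime import datetime

class BucketedCounter:
    """Keeps one count per hour, day and ISO week, so a read touches one value."""

    def __init__(self):
        self._counts = defaultdict(int)

    @staticmethod
    def _buckets(ts):
        iso_year, iso_week, _ = ts.isocalendar()
        return [
            ("hour", ts.strftime("%Y-%m-%dT%H")),
            ("day", ts.strftime("%Y-%m-%d")),
            ("week", f"{iso_year}-W{iso_week:02d}"),
        ]

    def inc(self, name, ts, amount=1):
        # Each event updates every granularity at write time.
        for granularity, label in self._buckets(ts):
            self._counts[(name, granularity, label)] += amount

    def dec(self, name, ts, amount=1):
        self.inc(name, ts, -amount)

    def get(self, name, granularity, label):
        return self._counts.get((name, granularity, label), 0)

counter = BucketedCounter()
counter.inc("views:fr", datetime(2020, 7, 15, 9, 30))
counter.inc("views:fr", datetime(2020, 7, 15, 17, 5))
counter.dec("views:fr", datetime(2020, 7, 15, 17, 5))
print(counter.get("views:fr", "day", "2020-07-15"))      # 1
print(counter.get("views:fr", "hour", "2020-07-15T09"))  # 1
```

Each write costs three updates instead of one. In exchange, a query for any supported granularity is a single lookup.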
  23. Cassandra and real-time
      • Write optimised
      • Fast merging
      • Distributed counters
  24. Write optimised
      • All writes are sequential on disk
      • Each write is written multiple times during compactions
  25. Fast merging
      How do you get from this to this?
      [Diagram: new keys G, Q + existing keys A, F, H, I, M, X → A, F, G, H, I, M, Q, X]
  26. Fast merging
      • Write out new ordered SSTable
      • When big enough, merge with existing
  27. [Diagram: successive batches of new keys are written out in sorted order and merged with the existing sorted keys]
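
The merge step can be sketched with `heapq.merge`, which combines already-sorted inputs in one pass over each. Python lists stand in for SSTables here, and the function names are hypothetical. The sketch ignores duplicate keys, deletions and timestamps, which a real compaction has to handle.

```python
import heapq

def write_sorted_run(pending_writes):
    """Sort a batch of buffered writes into one immutable ordered run."""
    return sorted(pending_writes)

def merge_runs(*sorted_runs):
    """Merge already-sorted runs by reading each one sequentially."""
    return list(heapq.merge(*sorted_runs))

existing = ["A", "F", "H", "I", "M", "X"]
new_run = write_sorted_run(["Q", "G"])
print(merge_runs(existing, new_run))   # ['A', 'F', 'G', 'H', 'I', 'M', 'Q', 'X']
```

Both inputs are read front to back and the output is written front to back, so the merge needs only sequential IO.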
  28. How fast?
  29. Distributed counters
      • Distributed, fault-tolerant replicated counters
      • No need for distributed locks
      • Super fast
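
One general way to count across replicas without locks is to give each replica its own shard and combine the shards when reading. The sketch below illustrates that idea only. It is not a description of Cassandra's counter implementation, the class and method names are hypothetical, and it supports increments only.

```python
class ShardedCounter:
    """Grow-only counter: each replica only ever updates its own shard."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.shards = {}   # replica id -> highest count seen from that replica

    def inc(self, amount=1):
        # Local update only: no coordination with other replicas.
        self.shards[self.replica_id] = self.shards.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Fold in another replica's state; safe to repeat or reorder."""
        for rid, count in other.shards.items():
            self.shards[rid] = max(self.shards.get(rid, 0), count)

    def value(self):
        return sum(self.shards.values())

a, b = ShardedCounter("a"), ShardedCounter("b")
a.inc()
a.inc()
b.inc()
a.merge(b)
b.merge(a)
a.merge(b)   # a repeated merge changes nothing
print(a.value(), b.value())   # 3 3
```

Because no two replicas write the same shard, concurrent increments never conflict, and replicas converge once they have exchanged state.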
  30. Other requirements
  31. What else do we need?
      • Real-time analytics
      • High value getting quick response
      • High cost if service is down
      • Need high availability
  32. What else do we need?
      • Real-time analytics
      • High value getting quick response
      • Need low latency
      • Need data geographically close
  33. Cassandra and HA
      • No SPOF
      • Choose point on consistency and availability curve
        • Tuneable consistency
        • Replication
      • Multi data-centre support
  34. Cassandra and low latency
      • Can configure caches
      • Can parallelise reads
      • Multi-DC support enables world-wide replication
      • Can choose lower consistency to avoid round-trips to other DCs
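
The trade-off behind tuneable consistency can be illustrated with the usual overlap rule for replicated stores: a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number read exceeds the replication factor. This rule is not stated on the slides, and the function names below are hypothetical.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when every read set must overlap every acknowledged write set."""
    return write_acks + read_acks > replication_factor

rf = 3
print(quorum(rf))                                           # 2
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))   # True
print(reads_see_latest_write(rf, 1, 1))                     # False
```

Waiting for fewer replicas lowers latency and avoids cross-DC round-trips, but a read may then return stale data.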
  35. Writing real-time apps with Cassandra
  36. Real-time apps
      • Need to write code using a client library
      • Design data-model
      • If queries change, code changes
  37. Acunu Analytics
      • Provides simple RESTful interface to Cassandra counters
      • Push processing into ingest phase
      [Diagram: an event enters Acunu Analytics (AA), which issues Cassandra counter updates]
  38. Acunu Analytics
      • Event template, e.g.,
          select : ["COUNT", "AVG(loadTime)"],
          type : {
              time : [TIME(HOUR; MIN; SEC), ?, 0],
              page : PATH(/),
              loadTime : [LONG, 0, 0]
          }
      • Specifies “blow-up” strategy according to supported queries
      • Need to know basics of query in advance, but not whole thing
  39. Features
      • Simple, real-time, incremental analytics
      • Work done on ingest
      • sum, count, distinct, avg, stddev, min-max etc.
      • Time + hierarchy bucketing
      • Efficient ‘group’ semantics
      • Works with Apache Cassandra
  40. Summary
      • Formalised what real-time means
      • Deduced how data must be stored
      • Explored how Cassandra has these properties
      • Discussed how Acunu Analytics helps when writing real-time apps
