7/15/2020

Reading time: 5 min

Real-time Cassandra

by Acunu



1. Real-time Cassandra
Richard Low, richard@acunu.com, @richardalow

2. Outline
• What is real-time?
• How do databases implement real-time queries?
• Why is Cassandra ideal for real-time applications?
• Writing real-time applications with Cassandra

3. What is real-time?

4. Definitions of real-time
• “Of or relating to a system in which input data is processed within milliseconds” (dictionary.com)
• “Occurring immediately” (webopedia)
• “...the most important requirement of a real-time system is predictability and not performance” (wikipedia)
• “...a time frame that is very brief, appearing to be immediate.” (wisegeek.com)
• “Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds” (wikipedia)

5. Real-time queries
• ‘Give me X’
• ‘How many Y?’
• ‘What is the top K?’
• ‘How many distinct Z from P?’

6. Real-time definition
• Definition: a query is processed in real time if the time to get the answer is at most a constant times the sum of the transfer time and the round-trip time:
  $t_{\mathrm{response}} \le C\,(t_{\mathrm{transfer}} + t_{\mathrm{ping}})$

7. Real-time definition
• The more you ask for, the longer it takes
• For small queries, the request is dominated by round-trip time
• No query can take less time than the time to receive it

8. Real-time definition
• Users on faster networks expect a faster response
• What we mean by real-time is getting faster

9. Implications
• What does this mean for the database?
• Use Google Analytics as an example
• Simple query: ‘How many page views have there been from France in the last 24 hours?’

10. Requirement
• Response is one number
• With overhead, say ~1 KB
• Ping time 1 ms
• 10 Mbit/s connection => 1 KB in ~1 ms
• 2 ms total
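
The budget on slide 10 follows directly from the definition on slide 6. As a rough sketch of that arithmetic (the function names are illustrative, and C = 1 is an assumption since the slides do not fix the constant):

```python
def transfer_time_ms(payload_bytes, link_bits_per_s):
    """Time to push the payload over the link, in milliseconds."""
    return payload_bytes * 8 * 1000 / link_bits_per_s


def realtime_budget_ms(payload_bytes, link_bits_per_s, ping_ms, c=1.0):
    """Upper bound C * (t_transfer + t_ping) from the slide-6 definition.

    c=1.0 is an assumption; the slides leave the constant unspecified.
    """
    return c * (transfer_time_ms(payload_bytes, link_bits_per_s) + ping_ms)


def is_realtime(response_ms, payload_bytes, link_bits_per_s, ping_ms, c=1.0):
    """True if a measured response time fits inside the budget."""
    return response_ms <= realtime_budget_ms(payload_bytes, link_bits_per_s, ping_ms, c)


# Slide 10: ~1 KB response, 10 Mbit/s link, 1 ms ping.
budget = realtime_budget_ms(1000, 10_000_000, 1.0)
print(f"budget: {budget:.1f} ms")
```

With these inputs the transfer takes a little under 1 ms, so the budget comes out just under the slide's rounded figure of 2 ms.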

11. Solution 1
• grep *.fr /var/log/apache2/*.log
• Suppose you have 1M hits an hour => 7 GB of logs a day
• A single disk would take 70 s
• Need a beefy server to do this
• Needs to grow as your audience grows

12. Solution 2
• Maintain a counter for each country
• Increment the counter on each hit
• On query, just read the counter
• Maybe it is on disk: 5 ms seek
• No need to scale speed with traffic
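
To make the gap between the two solutions concrete, the figures from slides 11 and 12 can be put side by side. This is only a back-of-the-envelope sketch; the 100 MB/s sequential read rate is an assumption inferred from "7 GB in 70 s".

```python
HITS_PER_HOUR = 1_000_000      # slide 11
LOG_BYTES_PER_DAY = 7e9        # slide 11: ~7 GB of logs a day
SEQ_BYTES_PER_S = 100e6        # assumed: 100 MB/s, implied by 7 GB in 70 s
COUNTER_SEEK_S = 0.005         # slide 12: one 5 ms seek


def scan_seconds(log_bytes, throughput_bytes_per_s):
    """Time to read a full day of logs sequentially (Solution 1)."""
    return log_bytes / throughput_bytes_per_s


bytes_per_hit = LOG_BYTES_PER_DAY / (HITS_PER_HOUR * 24)
scan_s = scan_seconds(LOG_BYTES_PER_DAY, SEQ_BYTES_PER_S)

print(f"log line size: ~{bytes_per_hit:.0f} bytes per hit")
print(f"Solution 1 (scan): {scan_s:.0f} s, grows with traffic")
print(f"Solution 2 (counter): {COUNTER_SEEK_S * 1000:.0f} ms, independent of traffic")
print(f"ratio: ~{scan_s / COUNTER_SEEK_S:.0f}x")
```

Doubling the traffic doubles the scan time in Solution 1, while the counter read in Solution 2 stays at one seek.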

13. Implications
• Real-time queries can only read about as much data as they send to the requester
• Need to precompute answers
• Store data in a query-centric rather than data-centric view

14. Age of data
• A real-time query will often need to query new data
• But not necessarily
  • Could run a batch process to precompute answers

15. Solutions

16. Solutions
• How do you make sure you don’t read any more than you have to?
  • Denormalisation
  • Organisation of data
  • Counters

17. Denormalisation
• Hard drive performance constraints:
  • Sequential IO at 100s of MB/s
  • Seeks at 100 IO/s
• Avoid random IO
• Effective block size 1 MB
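
One way to read the "effective block size 1 MB" figure on slide 17: a seek costs as much time as reading about 1 MB sequentially, so smaller reads are dominated by the seek. A small sketch, assuming 100 MB/s (the low end of "100s of MB/s") and 100 seeks per second:

```python
SEQ_BYTES_PER_S = 100e6   # assumed: low end of "100s of MB/s"
SEEKS_PER_S = 100         # slide 17: 100 IO/s


def effective_block_bytes(seq_bytes_per_s=SEQ_BYTES_PER_S, seeks_per_s=SEEKS_PER_S):
    """Bytes that could be read sequentially in the time one seek takes."""
    return seq_bytes_per_s / seeks_per_s


def read_efficiency(block_bytes, seq_bytes_per_s=SEQ_BYTES_PER_S, seeks_per_s=SEEKS_PER_S):
    """Fraction of the time spent transferring data rather than seeking."""
    transfer_s = block_bytes / seq_bytes_per_s
    seek_s = 1 / seeks_per_s
    return transfer_s / (seek_s + transfer_s)


print(f"effective block size: {effective_block_bytes() / 1e6:.0f} MB")
for size in (4096, 65536, 1_000_000, 16_000_000):
    print(f"{size:>10} bytes per seek -> {read_efficiency(size):.1%} of time spent reading")
```

Under these assumptions a 4 KB random read spends well under 1% of its time transferring data, which is the motivation for storing related items next to each other.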
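
18. Denormalisation
• Store items accessed at similar times near to each other
• Involves copying
• Copying isn’t bad
  • Storage costs <$100 per TB

19. Organisation of data
• If you read 100 items off disk, ensure they are next to each other
• Saves reading extra data around them and index lookups

20. Fast range queries
• Get me all keys in the range E to I
[Diagram: keys stored in sorted order A, F, H, I, M, X, with the range [E, I] marked over the neighbouring keys F, H, I]

21. Fast range queries
• What happens when you insert?
[Diagram: inserting G and Q. Stored out of order (G, Q, A, F, H, I, M, X) the range [E, I] is scattered; stored in order (A, F, G, H, I, M, Q, X) it stays contiguous]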
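
The range query on slides 20 and 21 can be sketched with an in-memory sorted list standing in for sorted on-disk storage. The function name is illustrative; only the standard-library `bisect` module is used.

```python
import bisect


def range_query(sorted_keys, low, high):
    """Return all keys in [low, high] from a sorted list.

    Two binary searches find the boundaries; everything between them is
    one contiguous slice, which is what makes the on-disk equivalent a
    single sequential read.
    """
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]


keys = ["A", "F", "H", "I", "M", "X"]
print(range_query(keys, "E", "I"))   # ['F', 'H', 'I']

# Inserting while keeping sorted order preserves the contiguous range.
for new_key in ["G", "Q"]:
    bisect.insort(keys, new_key)
print(range_query(keys, "E", "I"))   # ['F', 'G', 'H', 'I']
```

Keeping a large on-disk structure sorted under inserts is the expensive part, which is the problem the later slides on merging address.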

22. Counters
• For queries that simply count, increment the counter
• Implement inc, dec, get
• Store multiple counts, e.g. week, day, hour
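
A minimal in-memory sketch of the counter interface from slide 22, keeping one count per week, day and hour. The class name and the bucket key format are assumptions made for illustration; a real deployment would keep these counts in the database.

```python
from collections import defaultdict
from datetime import datetime


class BucketedCounter:
    """Hypothetical counter that maintains week, day and hour buckets."""

    def __init__(self):
        self._counts = defaultdict(int)

    @staticmethod
    def _bucket_keys(name, ts):
        iso_year, iso_week, _ = ts.isocalendar()
        return [
            (name, "week", f"{iso_year}-W{iso_week:02d}"),
            (name, "day", ts.strftime("%Y-%m-%d")),
            (name, "hour", ts.strftime("%Y-%m-%dT%H")),
        ]

    def inc(self, name, ts, amount=1):
        # One write per granularity, so a later read touches a single value.
        for key in self._bucket_keys(name, ts):
            self._counts[key] += amount

    def dec(self, name, ts, amount=1):
        self.inc(name, ts, -amount)

    def get(self, name, granularity, bucket):
        return self._counts.get((name, granularity, bucket), 0)


hits = BucketedCounter()
hits.inc("fr", datetime(2012, 3, 5, 10, 15))
hits.inc("fr", datetime(2012, 3, 5, 10, 45))
hits.inc("fr", datetime(2012, 3, 5, 11, 5))
print(hits.get("fr", "hour", "2012-03-05T10"))
print(hits.get("fr", "day", "2012-03-05"))
```

Each hit costs three small writes, and in exchange a question such as "page views from France today" becomes a single read.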

23. Cassandra and real-time
• Write optimised
• Fast merging
• Distributed counters

24. Write optimised
• All writes are sequential on disk
• Each write is written multiple times during compactions

25. Fast merging
How do we get from this to this?
[Diagram: new keys G, Q plus existing sorted keys A, F, H, I, M, X become the single sorted sequence A, F, G, H, I, M, Q, X]

26. Fast merging
• Write out a new ordered SSTable
• When big enough, merge with existing

27. [Diagram: successive merges of small sorted SSTables (such as G, Q) with the existing sorted data (A, F, H, I, M, X) into a single sorted SSTable]
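
The merge on slides 25 to 27 can be sketched with plain sorted lists standing in for SSTables. This is a simplified illustration, not Cassandra's compaction code: it ignores timestamps, tombstones and duplicate keys, and the function names are made up for the example.

```python
import heapq


def flush(memtable_keys):
    """Write buffered keys out as a new sorted run (a stand-in for an SSTable)."""
    return sorted(memtable_keys)


def merge_runs(*sorted_runs):
    """Merge already-sorted runs in one sequential pass over each input."""
    return list(heapq.merge(*sorted_runs))


existing = ["A", "F", "H", "I", "M", "X"]
new_run = flush({"Q", "G"})

merged = merge_runs(existing, new_run)
print(merged)   # ['A', 'F', 'G', 'H', 'I', 'M', 'Q', 'X']
```

Because every input is already sorted, the merge reads each run once from start to end and writes the output sequentially, so no random IO is needed.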

28. How fast?

29. Distributed counters
• Distributed, fault-tolerant replicated counters
• No need for distributed locks
• Super fast
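
To illustrate why replicated counters need no distributed locks, here is a sketch of one general technique: each replica only ever updates its own slot, and replicas reconcile by taking the per-slot maximum. This is a generic illustration under that assumption and is not presented as Cassandra's actual counter implementation; the class and method names are hypothetical.

```python
class ReplicaCounter:
    """Hypothetical lock-free replicated counter (increment/decrement slots)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.incs = {}   # replica_id -> total increments applied at that replica
        self.decs = {}   # replica_id -> total decrements applied at that replica

    def inc(self, amount=1):
        # Only this replica's own slot is written, so no coordination is needed.
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + amount

    def dec(self, amount=1):
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Reconcile with another replica; safe to repeat and to apply in any order."""
        for mine, theirs in ((self.incs, other.incs), (self.decs, other.decs)):
            for replica_id, value in theirs.items():
                mine[replica_id] = max(mine.get(replica_id, 0), value)

    def get(self):
        return sum(self.incs.values()) - sum(self.decs.values())


a, b = ReplicaCounter("a"), ReplicaCounter("b")
a.inc(3)
b.inc(2)
b.dec(1)
a.merge(b)
b.merge(a)
print(a.get(), b.get())
```

Merging is idempotent and order-independent in this sketch, so replicas can exchange state whenever convenient and still converge on the same total.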

30. Other requirements

31. What else do we need?
• Real-time analytics
• High value in getting a quick response
• High cost if service is down
• Need high availability

32. What else do we need?
• Real-time analytics
• High value in getting a quick response
• Need low latency
• Need data geographically close

33. Cassandra and HA
• No SPOF
• Choose point on consistency and availability curve
  • Tuneable consistency
  • Replication
• Multi data-centre support
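
The consistency and availability trade-off on slide 33 can be illustrated with the usual replica-overlap rule: with N replicas, a read of R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The helper names below are illustrative, and the sketch only does the arithmetic; it does not talk to a cluster.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_replicas, read_replicas):
    """True if every read set must intersect every acknowledged write set."""
    return read_replicas + write_replicas > replication_factor


def tolerated_failures(replication_factor, required_replicas):
    """How many replicas can be down while the operation still succeeds."""
    return replication_factor - required_replicas


n = 3
levels = {"ONE": 1, "QUORUM": quorum(n), "ALL": n}
for write_name, w in levels.items():
    for read_name, r in levels.items():
        print(
            f"write {write_name:<6} read {read_name:<6} "
            f"overlap={read_overlaps_write(n, w, r)} "
            f"write survives {tolerated_failures(n, w)} down, "
            f"read survives {tolerated_failures(n, r)} down"
        )
```

Lower levels tolerate more failed replicas and avoid waiting on slow ones, at the price of possibly reading stale data; that is the curve the slide refers to.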

34. Cassandra and low latency
• Can configure caches
• Can parallelise reads
• Multi-DC support enables world-wide replication
• Can choose lower consistency to avoid round-trips to other DCs

35. Writing real-time apps with Cassandra

36. Real-time apps
• Need to write code using a client library
• Design data-model
• If queries change, code changes

37. Acunu Analytics
• Provides simple RESTful interface to Cassandra counters
• Push processing into ingest phase
[Diagram: event → AA → counter updates → Cassandra]

38. Acunu Analytics
• Event template, e.g.,
  select : ["COUNT", "AVG(loadTime)"],
  type : { time : [TIME(HOUR; MIN; SEC), ?, 0], page : PATH(/), loadTime : [LONG, 0, 0] }
• Specifies “blow-up” strategy according to supported queries
• Need to know basics of query in advance, but not whole thing
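
As a rough illustration of what a "blow-up" on ingest could look like for the template on slide 38, the sketch below expands one event into a counter update for every combination of time bucket and page-path prefix. This is a guess at the general idea only: the class, function names and key layout are hypothetical and are not Acunu Analytics' API or storage format.

```python
from collections import defaultdict
from datetime import datetime
from itertools import product

# Assumed bucket formats for TIME(HOUR; MIN; SEC).
TIME_FORMATS = {
    "HOUR": "%Y-%m-%dT%H",
    "MIN": "%Y-%m-%dT%H:%M",
    "SEC": "%Y-%m-%dT%H:%M:%S",
}


def time_buckets(ts):
    """One bucket label per granularity."""
    return [ts.strftime(fmt) for fmt in TIME_FORMATS.values()]


def path_prefixes(path, sep="/"):
    """'/blog/a' -> ['/', '/blog', '/blog/a'] for PATH(/)-style hierarchy."""
    parts = [p for p in path.split(sep) if p]
    return [sep + sep.join(parts[:i]) for i in range(len(parts) + 1)]


class EventStore:
    """Hypothetical store: all aggregation work happens at ingest time."""

    def __init__(self):
        self.count = defaultdict(int)
        self.load_time_sum = defaultdict(int)

    def ingest(self, ts, page, load_time):
        for key in product(time_buckets(ts), path_prefixes(page)):
            self.count[key] += 1                 # supports COUNT
            self.load_time_sum[key] += load_time  # with count, supports AVG(loadTime)

    def avg_load_time(self, time_bucket, page_prefix):
        key = (time_bucket, page_prefix)
        n = self.count.get(key, 0)
        return self.load_time_sum[key] / n if n else None


store = EventStore()
store.ingest(datetime(2012, 3, 5, 10, 15, 0), "/blog/a", 100)
store.ingest(datetime(2012, 3, 5, 10, 45, 30), "/blog/b", 300)
print(store.count[("2012-03-05T10", "/blog")])
print(store.avg_load_time("2012-03-05T10", "/blog"))
```

Each event here fans out into nine updates (three time buckets times three path prefixes), and any supported query is then answered by reading one or two counters.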

39. Features
• Simple, real-time, incremental analytics
• Work done on ingest
• sum, count, distinct, avg, stddev, min-max etc.
• Time + hierarchy bucketing
• Efficient ‘group’ semantics
• Works with Apache Cassandra

40. Summary
• Formalised what real-time means
• Deduced how data must be stored
• Explored how Cassandra has these properties
• Discussed how Acunu Analytics helps when writing real-time apps
