
columns:
  - column: "Failure Type"
  - column: "RPO"
  - column: "RTO (RF1 - single AZ)"
  - column: "RTO (RF2 - multiple AZs)"

rows:
  - Failure Type: "**Machine failure**"
    RPO: 0
    RTO (RF1 - single AZ): |
      Time to spin up a new machine + possible rehydration time, depending on
      the objects on the machine:

      - If non-upsert sources, no rehydration time (i.e., does not require
        rehydration).
      - If upsert sources, rehydration time.
      - If sinks, no rehydration time (i.e., does not require rehydration).
      - If compute, rehydration time.
      - If serving, rehydration time.

      Additionally, there may be some time to catch up with changes that may
      have occurred during the downtime.

      To reduce rehydration time, you can scale up the cluster.
    RTO (RF2 - multiple AZs): |
      Can be:

      - 0 if only compute and serving objects are on the machine.
      - Time to spin up a new machine if sources or sinks are on the machine.

      In addition, cluster RTO is affected if [`environmentd` is
      down](#environmentd) (seconds to minutes).
  - Failure Type: "**Single AZ failure**"
    RPO: 0
    RTO (RF1 - single AZ): |
      *For managed clusters*

      Time to spin up a new machine + possible rehydration time, depending on
      the objects on the machine:

      - If non-upsert sources, no rehydration time (i.e., does not require
        rehydration).
      - If upsert sources, rehydration time.
      - If sinks, no rehydration time (i.e., does not require rehydration).
      - If compute, rehydration time.
      - If serving, rehydration time.

      Additionally, there may be some time to catch up with changes that may
      have occurred during the downtime.

      To reduce rehydration time, you can scale up the cluster.

      During downtime, single-AZ PrivateLinks are impacted.
    RTO (RF2 - multiple AZs): |
      Can be:

      - 0 if only compute and serving objects are on the machine.
      - Time to spin up a new machine if sources or sinks are on the machine.

      In addition, cluster RTO is affected if [`environmentd` is
      down](#environmentd) (seconds to minutes).
  - Failure Type: "**Regional failure (or failure of 2 AZs)**"
    RPO: |
      At most 1 hour (time since last backup, based on hourly backups).<br>
    RTO (RF1 - single AZ): |
      ~1 hour (time to check pointers).
    RTO (RF2 - multiple AZs): |
      High/Significant. Consider using a [regional failover strategy](/manage/disaster-recovery/#level-3-a-duplicate-materialize-environment-inter-region-resilience).
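
# Illustrative sketch only (comments, not part of the rendered table).
# The "RF1" and "RF2" columns above are read here as replication factor 1 and
# 2, and "scale up the cluster" as choosing a larger cluster size. Assuming a
# managed cluster with the hypothetical name my_cluster and a hypothetical
# size value, the corresponding SQL might look like:
#
#   ALTER CLUSTER my_cluster SET (REPLICATION FACTOR 2);  -- RF1 -> RF2
#   ALTER CLUSTER my_cluster SET (SIZE = '200cc');        -- scale up
#
# Check the ALTER CLUSTER reference for the sizes available in your
# environment before relying on either statement.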