This post is a personal study note on the Nutanix hyperconverged solution, drafted while updating my own knowledge. It is not intended to cover every part of the solution, and some notes are based on my own understanding of it. My intention is to outline the key solution elements for quick readers.
This is the first part of the full note.
The Nutanix solution is composed of the following components:
- Hardware: Nutanix Node/Block
- Software: Nutanix Controller VM & Acropolis Operation System
- Distributed Storage Fabric (DSF)
- App Mobility Fabric (AMF)
- Acropolis Hypervisor (AHV)
- Software: Industry standard hypervisor (ESXi, Hyper-V and AHV)
- Software: Nutanix Prism Element/Central
Nutanix Node – Standard x86 server
- Local storage (SSD & HDD)
- The home partition is mirrored across the first two SSDs, metadata is mirrored across SSDs, and the OpLog is distributed across SSDs.
- Curator Reservation in each SSD/HDD.
- Five network adapters
- Two 10GbE adapters (empty SFP+ slots)
- Two 1GbE adapters
- One 10/100 Ethernet adapter for out-of-band management (IPMI)
A Nutanix Block is a hardware/software bundle housing up to four nodes in 2RU. Node naming within a block:
- A(front left bottom)
- B(front left up)
- C(front right up)
- D(front right bottom)
Example models and drive configurations:
- NX-1050 (1 metadata SSD + 4 data HDDs)
- NX-3050/3060 (1 metadata SSD + 1 hot-tier data SSD + 4 data HDD)
- SX-1065-G5, SX-1065S, SX-1065-G4, NX-1020 (1 SSD + 2 data HDD)
The Controller VM includes the following components:
- Stargate – Data I/O management for the cluster (moves data between the hypervisor and Nutanix storage)
- Medusa – Access interface for Cassandra (abstraction layer over Cassandra)
- Cassandra – Distributed metadata store (runs on each node as a distributed database)
- Zeus – Access interface for Zookeeper
- Prism – Management interface (UI, nCLI and API)
- Zookeeper – Manages cluster configuration (runs on 3-5 nodes with one leader for write operations)
- Curator – MapReduce and cleanup for the cluster (runs on every node with a master node)
Other services include:
- Genesis – Cluster component & service manager (runs on each node)
- Chronos – Job and task scheduler (runs on each node with master elected)
- Cerebro – Replication/DR manager(runs on each node with master elected)
- Pithos – vDisk configuration manager
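The leader-based write pattern noted for Zookeeper above (one leader handles write operations while any node can serve reads) can be sketched minimally. This is a toy illustration of the pattern only, not Zookeeper's actual protocol; all names here are hypothetical.

```python
# Toy sketch of leader-coordinated writes: followers forward writes to
# the single elected leader; the leader applies them to its state.
class ConfigNode:
    def __init__(self, name, leader=None):
        self.name = name
        self.leader = leader or self   # a leader points at itself
        self.data = {}

    def write(self, key, value):
        if self.leader is not self:    # followers forward to the leader
            return self.leader.write(key, value)
        self.data[key] = value         # only the leader applies the write
        return True

leader = ConfigNode("zk1")
follower = ConfigNode("zk2", leader=leader)
follower.write("cluster_name", "lab")
print(leader.data["cluster_name"])  # lab
```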
The Controller VM network interfaces are configured as follows:
- Network: Backplane LAN – eth2
- Network: Hypervisor LAN – eth1
- Network: Management LAN – eth0
- Cluster size: minimum three nodes, with no upper limit
- Maximum 12 nodes with a Starter license
- Acropolis Slave runs on every CVM with an elected Acropolis Master (scheduling, execution, IPAM, etc.)
A single-node cluster is supported for running a limited number of VMs.
- Requires AOS 5.5 or later
- Not the same as a single-node replication target (VM creation and snapshot restoration are not supported on a replication target)
- Two SSDs with a minimum of two HDDs
- A single SSD failure puts the node into read-only mode; it returns to normal once an SSD holding the Cassandra data is picked up again. (An override mode is provided but is not best practice.)
- Read operations behave the same as in a multi-node cluster
- Write operations replicate the data to two different disks on the same node
- No cluster expansion
- No Encryption
File System – DSF
Acropolis Distributed Storage Fabric (DSF)
- Storage Pool – A group of physical storage devices (HDDs/SSDs); can span multiple Nutanix nodes.
- Storage Container – A logical segmentation of a storage pool that contains VMs or files. Mapped to hosts via NFS/SMB.
- vDisk – A subset of available storage within a container that provides storage to VMs, composed of vBlocks. For NFS containers, vDisk creation is handled by the cluster.
- vBlock – A 1MB chunk of vDisk address space.
- Volume Group – A collection of logically related virtual disks. Provides benefits for backup, protection, restoration and migration.
- Datastore/SMB share – A logical container for files necessary for VM operations.
- The CVM accesses the SCSI controller directly (ESXi: VMDirectPath I/O; Hyper-V: disk pass-through).
In general, for a single cluster, one storage pool with one container using all available storage will suit the needs of most customers.
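As a rough illustration of the vDisk/vBlock relationship above, a byte offset into a vDisk's address space maps to a vBlock index by dividing by the 1MB vBlock size. This is a conceptual sketch only; the function name is hypothetical, not Nutanix code.

```python
# Conceptual sketch: mapping a vDisk byte offset to its vBlock.
# vBlock size is 1 MB, per the DSF constructs above.
VBLOCK_SIZE = 1 * 1024 * 1024  # 1 MB

def vblock_for_offset(offset_bytes: int) -> tuple:
    """Return (vblock_index, offset_within_vblock) for a vDisk byte offset."""
    return offset_bytes // VBLOCK_SIZE, offset_bytes % VBLOCK_SIZE

# A write at byte offset 3.5 MB lands in vBlock 3, 512 KB into that vBlock.
index, inner = vblock_for_offset(3 * VBLOCK_SIZE + 512 * 1024)
print(index, inner)  # 3 524288
```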
Multiple storage tiers in DSF: MapReduce tiering technology migrates data between SSD and HDD depending on the data temperature.
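The temperature-based placement idea can be sketched as follows: recently accessed (hot) data stays on the SSD tier, while data untouched beyond a threshold is migrated down to HDD. The threshold and structure here are illustrative assumptions, not Nutanix's actual Curator logic.

```python
# Conceptual sketch of temperature-based tiering: place data on SSD or
# HDD based on how recently it was accessed. Threshold is illustrative.
import time

COLD_AFTER_SECONDS = 3600  # hypothetical "cold" threshold (1 hour)

def place_tier(last_access_ts: float, now: float) -> str:
    """Return the tier a piece of data belongs on, given its last access time."""
    return "SSD" if now - last_access_ts < COLD_AFTER_SECONDS else "HDD"

now = time.time()
print(place_tier(now - 60, now))    # SSD (hot: accessed a minute ago)
print(place_tier(now - 7200, now))  # HDD (cold: untouched for two hours)
```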
Storage Capacity Optimization:
- Erasure Coding
- Increases capacity efficiency by applying erasure coding to cold data
- Data blocks are initially written to two or three nodes, with erasure coding performed later for efficiency
- Option to choose post-process or inline processing
- Post-process requires a customer-defined delay time; there is no single recommended value (a 4-6 hour delay suits general user data and file servers)
- Inline processing is recommended for workloads that perform batch processing
- Enabled on a container or vDisk
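The idea behind erasure coding can be sketched with simple XOR parity: a parity block computed across data strips lets any single lost strip be rebuilt from the survivors, so full replicas of cold data can be dropped. Nutanix's actual implementation uses more general codes; this is only a toy single-parity illustration.

```python
# Toy single-parity erasure coding sketch (XOR across data strips).
# Any one lost strip can be rebuilt from the remaining strips + parity.
def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

strips = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(strips)

# Lose strip 1, then rebuild it from the survivors plus the parity block.
rebuilt = xor_blocks([strips[0], strips[2], parity])
print(rebuilt == strips[1])  # True
```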
- Deduplication: choice of cache or capacity deduplication
- Cache deduplication applies to the read cache; disabled by default. Requires a Starter or higher license.
- Capacity deduplication applies to persistent data; disabled by default. Requires a Pro or higher license. Requires cache deduplication to be enabled as well.
- CVM RAM requirement: 24GB for cache deduplication, 32GB for capacity deduplication
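The fingerprint-based mechanism behind deduplication can be sketched briefly: each chunk is fingerprinted (Nutanix fingerprints data with SHA-1), and a chunk whose fingerprint is already known is stored only once. The storage structure below is a hypothetical illustration, not the DSF implementation.

```python
# Conceptual sketch of fingerprint-based deduplication: identical chunks
# hash to the same SHA-1 fingerprint and are stored only once.
import hashlib

store = {}  # fingerprint -> unique chunk (hypothetical backing store)

def dedup_write(chunk: bytes) -> str:
    """Record a chunk, storing its bytes only if the fingerprint is new."""
    fp = hashlib.sha1(chunk).hexdigest()
    store.setdefault(fp, chunk)  # keep only the first copy
    return fp

refs = [dedup_write(c) for c in (b"hot-data", b"cold-data", b"hot-data")]
print(len(refs), len(store))  # 3 2  (3 logical writes, 2 unique chunks)
```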
Write IO Data Flow
- Data I/O is passed from the VM to the Controller VM
- The Controller VM writes the I/O to the OpLog portion of the metadata SSD
- The data is then replicated to the OpLog on the metadata SSDs of other nodes
- The data in the OpLog is drained asynchronously to the lower-tier extent store by ILM
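The write steps above can be sketched as a minimal simulation: the CVM persists the write to its local OpLog, replicates it to peer OpLogs before acknowledging, and a later drain step moves OpLog entries to the extent store. All class and function names are hypothetical illustrations of the flow, not Nutanix internals.

```python
# Conceptual sketch of the DSF write path described above.
class Node:
    def __init__(self, name):
        self.name = name
        self.oplog = []         # SSD-backed write log
        self.extent_store = []  # persistent lower tier

def write(data, local, peers):
    local.oplog.append(data)   # 1. write to the local OpLog (SSD)
    for p in peers:            # 2. replicate to peer OpLogs before acking
        p.oplog.append(data)
    return "ack"               # 3. acknowledge the VM

def drain(node):
    # 4. asynchronously drain OpLog entries to the extent store (ILM)
    node.extent_store.extend(node.oplog)
    node.oplog.clear()

a, b = Node("A"), Node("B")
write(b"io-1", a, [b])
drain(a)
print(len(a.oplog), len(a.extent_store), len(b.oplog))  # 0 1 1
```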
Read IO Data Flow
- A read can be initiated from any node.
- If the local cache (OpLog or unified cache) does not hold a copy, the read falls through to the local extent store.
- If no local copy is available, a remote copy is fetched and stored locally for future reads.
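The read fall-through above can be sketched the same way: check the local cache, then the local extent store, and only then fetch the replica from a remote node, localizing it for future reads. A minimal illustration with hypothetical names, not the actual DSF read path code.

```python
# Conceptual sketch of the DSF read path described above.
def read(block_id, local_cache, local_extents, remote_extents):
    if block_id in local_cache:      # 1. OpLog / unified cache hit
        return local_cache[block_id], "cache"
    if block_id in local_extents:    # 2. local extent store
        return local_extents[block_id], "local"
    data = remote_extents[block_id]  # 3. fetch the remote replica...
    local_cache[block_id] = data     #    ...and localize it for next time
    return data, "remote"

cache, local, remote = {}, {}, {"blk": b"payload"}
print(read("blk", cache, local, remote)[1])  # remote (first read)
print(read("blk", cache, local, remote)[1])  # cache (now localized)
```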
To be continued in the next part …