This post is part of a multi-part series.
Okay, so you’re sold on using ZooKeeper to be your service locator or configuration repository. Your services will all talk to ZooKeeper when they start up to find out who they are, who their neighbors are, and generally how to get on with all the other animals at the zoo.
But what service locator do you use to find out where ZooKeeper itself is? (ZooKeeper is actually in one or more places, since it typically runs on multiple servers in a production environment.) The answer probably depends on the scope of your problem: a 10,000-node cluster will call for something different from a few dozen services. Your best options are drawn from the service locator patterns already built into your OS or environment. Here we’ll talk about three options.
Your ZooKeeper instance itself is configured with a simple text configuration file. The quorum of machines will read a file like:
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
If everyone agreed where this file was, and everyone mounted the same file system, you could parse the file to give your ZK client the list of servers to connect to on startup. Now you’ve got a filename hard-coded into your system, but you have to start somewhere, right? Each machine image you work from could have /conf/zk/zoo.config as a soft link to the right configuration for that machine.
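As a rough illustration of the parsing step, here is a minimal Python sketch that turns configuration text like the sample above into a comma-separated host:port list, the form ZooKeeper clients accept as a connect string. The function name and the default port argument are made up for this example, and it assumes plain hostnames (no IPv6 literals) in the server.N entries.

```python
def connect_string_from_config(text, default_client_port=2181):
    """Build a 'host:port,host:port' string from zoo.cfg-style text.

    Hypothetical helper for illustration; not part of any ZooKeeper client.
    """
    client_port = default_client_port
    hosts = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "clientPort":
            client_port = int(value)
        elif key.startswith("server."):
            # server.N=host:quorumPort:electionPort -> keep only the host.
            hosts.append(value.split(":", 1)[0])
    # Clients connect on clientPort, not on the quorum/election ports.
    return ",".join(f"{host}:{client_port}" for host in hosts)


SAMPLE = """\
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
"""

print(connect_string_from_config(SAMPLE))  # zoo1:2181,zoo2:2181,zoo3:2181
```

In a real service you would read the text from the agreed-upon path (the soft link, say) instead of a string constant, and hand the result to whichever ZooKeeper client library you use.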
If you knew you’d always have 5 ZK machines running and your infrastructure supported dynamic DNS entries, you could hardcode your list as “zookeeper1.app.company.com”, “zookeeper2.app.company.com”, etc. If a ZK node failed and was restarted by your services team or cloud infrastructure, the new host address could be re-registered under the appropriate DNS entry. Now you’ve got some DNS names hard-coded into your system, but you have to start somewhere, right?
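A small Python sketch of the DNS approach might look like the following. The naming pattern and the count of five come from the paragraph above; the function names, and the use of port 2181 (borrowed from the sample configuration earlier), are assumptions for illustration. The resolvable helper is optional: it only checks which names currently resolve, which can be handy for logging at startup.

```python
import socket


def zk_dns_names(count=5, domain="app.company.com"):
    """Return the fixed, well-known DNS names for the ZK ensemble."""
    return [f"zookeeper{i}.{domain}" for i in range(1, count + 1)]


def connect_string(names, client_port=2181):
    """Join names into the host:port,host:port form ZK clients accept."""
    return ",".join(f"{name}:{client_port}" for name in names)


def resolvable(names):
    """Return the subset of names that resolve in DNS right now."""
    live = []
    for name in names:
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            continue  # not registered (yet); skip it
        live.append(name)
    return live


print(connect_string(zk_dns_names(2)))
# zookeeper1.app.company.com:2181,zookeeper2.app.company.com:2181
```

Because the names stay fixed while the addresses behind them change, the client code never needs to be touched when a node is replaced.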
Cluster Job Management
We took a rather unorthodox approach once whereby we asked an unexpected source for help in finding ZooKeeper. Because we ran it on an HPC cluster, it was managed by a job system: in this case LSF, though SLURM would amount to the same thing. The job management system launched ZooKeeper, found server(s) for it to run on, and relaunched it if it crashed.
We arranged for ZooKeeper to be launched with a distinctive command line:
$ zkserver.sh start -PRODUCTION_ZK
And when we wanted to find it, we would ask the system to tell us about jobs with that distinctive command line:
$ runningjobs -user account | grep 'PRODUCTION_ZK' | ...
Where the ... was some lovely text processing to finesse the actual hostnames out of the running-job record produced by the runningjobs command. (Real executable names changed to protect the innocent.)
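To give a flavor of what that text processing could look like, here is a Python sketch. The record layout is entirely invented for this example (whitespace-separated job ID, user, host, then the command line); a real job system's output will differ, so the column index is a parameter, not a fact about LSF or SLURM.

```python
def hosts_for_marker(job_listing, marker="PRODUCTION_ZK", host_column=2):
    """Extract hostnames from job records whose command line has the marker.

    Assumes a made-up layout: JOBID USER HOST COMMAND...
    """
    hosts = []
    for line in job_listing.splitlines():
        if marker not in line:
            continue  # not one of our ZooKeeper jobs
        fields = line.split()
        # Guard against short records and drop duplicate hosts.
        if len(fields) > host_column and fields[host_column] not in hosts:
            hosts.append(fields[host_column])
    return hosts


# Invented sample output, standing in for the runningjobs command.
SAMPLE_LISTING = """\
1001 account node17 zkserver.sh start -PRODUCTION_ZK
1002 account node04 some_other_job --flag
1003 account node23 zkserver.sh start -PRODUCTION_ZK
"""

print(hosts_for_marker(SAMPLE_LISTING))  # ['node17', 'node23']
```

In practice the listing would come from running the job system's query command and capturing its output, and the resulting hostnames would be joined with the client port to form the connect string.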
It appears that others have come along and attempted to expose ZooKeeper at a higher level, and they have their own ways of dealing with this problem. Of particular interest are