General understanding of Oracle NoSQL - NoSQL Database

Hi all,
Can you please help me understand the following:
1) How is the master initially designated? Using the PAXOS algorithm, or is it chosen by the administrator during install?
2) Can one node change from master to replica? In which cases? Is it only due to a network partition?
3) How many versions of records are maintained in storage nodes? Is that number configurable?
4) From your point of view, is Oracle NoSQL CP? AP? Or even CA? In the latter case, can Oracle NoSQL not meet the partition-tolerance requirement, for example when the master is in the minority partition?

893771 wrote:
Can you please help me understand the following:
1) How is the master initially designated? Using the PAXOS algorithm, or is it chosen by the administrator during install?
When the nodes in a Rep Group come up, they elect a master. The algorithm and protocol is PAXOS, but the election of a master is based on which node is most up to date. Presently there is no way to choose which node is the master during the install, but this is something we are working on for a future release. We are well aware that this capability is needed.
2) Can one node change from master to replica? In which cases? Is it only due to a network partition?
In general, any node can be elected as master. Failure of the master is a good reason for an election to be held among the remaining replicas. A network partition is not required for an election to be held.
3) How many versions of records are maintained in storage nodes? Is that number configurable?
One. No, it is not configurable.
4) From your point of view, is Oracle NoSQL CP? AP? Or even CA? In the latter case, can Oracle NoSQL not meet the partition-tolerance requirement, for example when the master is in the minority partition?
The answer to this question is complicated, but I will offer you the following food for thought and leave the answer as an exercise to the reader:
We recommend that the system be configured to always use at least simple majority replica ack policy.
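As an illustration of why the simple-majority recommendation matters for the CAP question above, here is a toy quorum model (my own sketch in Python, not Oracle NoSQL code; the function name and numbers are illustrative): a master that lands on the minority side of a partition cannot gather a majority of acks, so it stops committing, while the majority side can elect a new master and carry on.

```python
# Toy quorum model (illustration only, not Oracle NoSQL code).
def can_commit(group_size, reachable_nodes):
    """True if a master that can reach `reachable_nodes` members of its
    replication group (counting itself) can satisfy a simple-majority
    replica ack policy."""
    majority = group_size // 2 + 1
    return reachable_nodes >= majority

# Replication factor 3, network partition:
print(can_commit(3, 1))  # False: master alone on the minority side stalls
print(can_commit(3, 2))  # True: the majority side can keep accepting writes
```

Under this policy a partition costs availability on the minority side rather than consistency, which is the usual reason such systems are described as CP rather than AP.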
Charles Lamb


Dual Site Cluster behaviour in case of fabric failure between sites

I have a question regarding a dual-site, 4-node (2 nodes per site), Solaris Cluster 3.2 and Oracle RAC 10gR2 installation with the following configuration:
Two site cluster with two nodes and one storage per site. All nodes are connected to both storages over two different fabrics.
One multi-owner metaset is configured and consists of following metadevices:
1. SVM mirror(one side from site1 storage, one side from site2 storage) raw device for crs Voting.
2. SVM mirror(one side from site1 storage, one side from site2 storage) raw device for crs OCR.
3. Two plain SVM stripes from site1 storage for ASM failgroup 1
4. Two plain SVM stripes from site2 storage for ASM failgroup 2
Items 3 and 4 are mirrored with ASM as a two-way mirror.
Oracle db instance on all four nodes will write actively to this ASM mirror.
Solaris Cluster quorum is solved with quorum server in third site.
Now, what happens when both fabrics (which connect the two sites together) fail in such a way that the nodes in one site cannot see the storage in the other site, but can still see the storage in their own site (for example, someone manages to cut all fabric cables between the sites), while the interconnect between the two sites remains functional? In this case, exactly 50% of the metadbs of the shared metaset will remain in each site, which is enough for SVM to carry on.
So what happens then? Does the whole cluster panic? Do the two nodes in one site panic? Or do both sites carry on reading and writing independently and produce data corruption?
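The premise that exactly 50% of the metadbs is enough for SVM to carry on can be sketched as follows (my own toy model of the state database replica quorum rule as I understand it, not SVM code): a running system survives with at least half of its metadb replicas, but a boot needs a strict majority.

```python
# Toy model of the SVM metadb replica quorum rule (assumption: half is
# enough to stay up, half + 1 is needed to boot).
def svm_stays_up(total_metadbs, available):
    return available * 2 >= total_metadbs

def svm_can_boot(total_metadbs, available):
    return available > total_metadbs // 2

# After the fabric cut, each site still sees exactly 50% of the metadbs:
print(svm_stays_up(4, 2))  # True: both sites keep running, as described
print(svm_can_boot(4, 2))  # False: a rebooting node would need 3 of 4
```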
This would be a fairly unusual double failure and thus one that Solaris Cluster doesn't guarantee to survive automatically. I've discussed it with some of my colleagues and we currently believe (although we are having further discussions) that the outcome might be non-deterministic. Note: it's not one of the standard tests performed.
The symmetry of the problem means that timing will probably govern the outcome. Each node will try to declare the remote drives unavailable and put the submirror into an errored (maintenance) state. It's possible, in a worst case scenario, that all the mirrors could end up with both submirrors in maintenance state. If one arrived in that state before the other, the likelihood is that CRS would terminate that node because the voting disk would be unavailable.
I'll post a follow up if I get more information.
So, let me see if I get this right:
You have a 4 node cluster, 2 at either site. You have a dual, stretched fabric SAN. (four switches total).
Based on the way you phrased your question, chances are all of the SAN fibers run through the same conduit (thus creating a single point of failure in what is otherwise a fully redundant setup)
If this is the case, then the first question is this: are your heartbeats also going through this same fiber conduit? If so, then the cluster will split 2 and 2, and it will be a race to the quorum server. The first group there wins and the losers will panic. Problem solved. :-)
If the heartbeats are going over some physically divergent path (we have a wireless shot for one and a fiber shot for the other), then you are left with an Oracle question and not a Sun question.
ASM should note the loss of #3 (or #4) and vice versa, and take corrective action by evicting 2 of the 4 nodes. Some voting will occur and a pair of randomly chosen nodes will be evicted from the cluster.
Then again, RAC is not exactly designed for geographical clustering; that is what they have Data Guard for. Of course, this will not stop some folks from using it as such.
See for a discussion of some of the scenarios with RAC and geo clustering.
As an aside, the RAC clusters I've seen tend to be very sensitive to latency on the interconnects. Splitting them up like this increases latency, thus destroying the performance improvement of RAC that the Oracle sales folks like to sell you on. Would you be better off with a 2-node RAC cluster at one site, with another DB instance at the other site and Data Guard to keep the two in sync?
I concur with John on this one. For me, Oracle RAC is all about performance and being able to distribute load across all the nodes.
Having latency across some nodes defeats the purpose. John's recommendation of having 2 local nodes with sync/replication to the other site is what I would have gone with design-wise.
But I am sure there was a reason why yours is designed that way.
Erick Ramirez
Melbourne, Australia 
One word of caution on the latency issue - while low latency is definitely important it doesn't play as big a part in RAC scalability as one might expect. Broadly speaking application scalability is dependent on:
* Application design: minimising table/index hotspots, locking, etc
* Distributed lock manager (DLM) code path
* Interconnect bandwidth and latency
So, a poorly written application could blow huge numbers of CPU cycles doing more work than it has to; cf. table scans compared to index access. Minimising this will mean that the cycles spent traversing the DLM code will be minimised. However, remember that this code too takes time to execute. Finally, packets are sent out over the interconnect to request locks or data. Lack of bandwidth can strangle the scalability just as much as high latency.
So given enough bandwidth, the benefit of a faster interconnect will be related (but usually in a non-linear fashion) to the ratio of the time taken to traverse the DLM stack vs the network latency. If, therefore, the DLM takes 50 micro-secs (usec) to execute [I don't know what this figure is], changing from a gigabit ethernet with a latency of say 30usec to an InfiniBand interconnect with say 10usec latency won't treble your performance. The gain will almost certainly be much lower than (50+30)/(50+10), i.e. x1.3.
What happens when you separate your cluster sites? The latency must increase by a minimum of 5 nano-sec/meter (due to the speed of light in fiber). Therefore a 10 km separation adds a 20,000 m round trip, or 100 usec. Using the same formula as above, this might result in a slowdown of (50+30)/(100+50), i.e. x0.53.
The same is true for disk I/O, it is the relative increase in latency that is the killer, specifically for single threaded, sequential writes (such as the lgwr does). This latency cannot be 'magic-ed' away. If your application only works well with a disk I/O latency of 1milli-sec (1ms), then adding an extra 1ms that comes from 100Km node separation will probably cause your application to stop performing adequately.
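The arithmetic in the last three paragraphs can be checked with a few lines (the 50 usec DLM figure is the post's hypothetical, not a measured value):

```python
# Back-of-the-envelope model from the discussion above (illustrative numbers).
NS_PER_METER = 5  # minimum propagation delay in fiber

def speedup(dlm_usec, old_latency_usec, new_latency_usec):
    """Ratio of per-lock-request times: (DLM + old latency) / (DLM + new)."""
    return (dlm_usec + old_latency_usec) / (dlm_usec + new_latency_usec)

# GigE (30 usec) -> InfiniBand (10 usec) with a 50 usec DLM code path:
print(round(speedup(50, 30, 10), 2))   # 1.33, not the 3x the raw latencies suggest

# 10 km site separation: 20,000 m round trip at 5 ns/m = 100 usec.
added_usec = 20_000 * NS_PER_METER / 1000
print(added_usec)                      # 100.0

print(round(speedup(50, 30, 100), 2))  # 0.53: a slowdown, as computed above

# 100 km separation: 200,000 m round trip = 1 ms of extra disk I/O latency.
added_ms = 2 * 100_000 * NS_PER_METER / 1e6
print(added_ms)                        # 1.0
```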
Unfortunately, the only real way to determine all this for certain is to benchmark it or try it!
Thank you all for giving feedback!
Tim, yes, it is a double failure. And we are possibly not looking for automatic survival of this; our wish is to retain data integrity, be it a whole four-node panic if necessary. And yes, there is no such test case in the WINGS documentation, if you meant the WINGS acceptance tests. I hope you will get more information; waiting for your follow-up.
John, yes, there are two fabrics (four switches). The requirement for the network infrastructure is that the separated interconnects and fabrics should leave the data centers in different directions - but one thing is the requirement, another thing is real life. The buildings have, for example, only one entrance to the cable tubes. It would be nice to lose the interconnect :) then it's clear what happens. We are talking about a hypothetical risk, but it exists - physical cable damage to just the two fabric connections, or an inexperienced fabric admin.
The whole thing is not really a geo cluster; it depends on naming conventions, distances and of course clustering theory. We named it a type of campus cluster. We have connections of up to 10 km.
Erick, the requirement for such a setup comes from service availability requirements. Yes, RAC is intended for a performance boost, but our priority is service availability, not performance. And of course data integrity - it is obvious that a double failure can produce downtime. The initial question was raised by the fear that we could produce data corruption.
I agree with all of you that a disaster recovery solution should be preferred (Data Guard and/or geo cluster).
Actually, Oracle supports such a configuration under the name of "Extended" RAC, where the Oracle vote resides at a third site. We have neither a fabric connection nor storage at a third site. OK, we could use NFS (I don't know what to think about voting over NFS) for the Oracle voting disk. We have the Solaris Cluster vote as a TCP/IP service, which is new to us as well, but it has shown itself to be quite a robust voting mechanism during tests. Link for this:
We will most likely test this situation (both fabric connections lost between sites) tomorrow with fabric zoning (we have a test installation in one site only at the moment and need to simulate the situation with fabric zones). I will post the outcome.
It's not part of the WINGS acceptance plans (to my knowledge) or the standard fault injection tests we do in engineering, though we do test for various failures of single and multiple network and disk interconnects. This symmetric failure would be peculiar to campus clusters that had inter-site cables exiting the building via the same route (which is not recommended, for exactly this reason).
However, regardless of the failure scenario, data integrity must be preserved. If it isn't, it classes as a priority 1 bug! I expect data integrity to be maintained under this failure scenario.

Sun Cluster Geographic Edition and Oracle

I have to set up a system with an emergency system in a remote location. The system consists of a database server with an Oracle database, an MQ Server and an application server. The connection to the remote site is a 4 Gbit link which also carries other traffic. The database is about 10 GBytes; there is no requirement to have the data replicated in 'real time', so the data on the emergency system may follow the production system by a few minutes to hours. The emergency system is only used in the case of a disaster, e.g. an earthquake at the main site; it is mainly built because the client has a legal obligation to do so.
The current idea is to run a two-node cluster on the main site, with 3 resource groups, each containing a zone: one with the database, one with the MQ Server, and the last one with the application.
On the remote site would be a one node cluster, this cluster would be synchronized with Sun Cluster Geographic Edition and Sun StorageTek Availability Suite.
Has anyone built such a system and has any experience to share, or would it be wiser to rely on Oracle tools to replicate the data to the remote site?
Obviously, my colleagues and I, in the Geo Cluster group build and test Geo clusters all the time :-)
We have certainly built and tested Oracle (non-RAC) configurations on AVS. One issue you do have, unfortunately, is that of zones plus AVS (see my Blueprint for more details). Consequently, you can't build the configuration you described. The alternative is to sacrifice zones for now and wait for the fixes to RG affinities (no idea on the schedule for this feature), or find another way to do this - probably hand-crafted.
If you follow the OHAC pages ( and look at the endorsed projects you'll see that there is a Script Based Plug-in on the way (for OHACGE) that I'm writing. So, if you are interested in playing with OHACGE source or the SCXGE binaries, you might see that appear at some point. Of course, these aren't supported solutions though.
Hi Tim,
I have been reading the blue print you wrote. If I understood it correctly, the restriction regarding AVS does not apply to clustered fail over containers.
This of course means I have to sacrifice the Oracle HA agent, but as the primary reason to use Solaris Cluster is to survive a hardware failure, this is not that much of a concern to us.
There remains at least one question, if my assumption regarding fail-over containers is correct: can Oracle live with a file system which has been replicated by AVS? That is, will Oracle start on the remote node without any manual intervention (except activating the remote node in Sun Cluster Geographic Edition)?
Correct, a fail-over container has a physical node list and therefore it can have the same affinities as the AVS RG. You are also correct that you can't currently use the HA-Oracle agent in conjunction with a fail-over container.
The answer to the question of whether Oracle will start without intervention on a replicated file system is yes, assuming the file system recovers cleanly on the standby site. It's probably worth an entire paper on the pros and cons of the various approaches. If you can tolerate synchronous replication, then that will avoid any data loss. If you have to replicate asynchronously, then you will probably lose some data. Mixing async and sync replication for various parts of the database can lead to awkward problems where a tablespace is not recoverable due to 'block fractures'.

Replication fail-over and reconfiguration

I would like to get a conversation going on the topic of replication. I have set up replication on several sites using the Netscape / iPlanet 4.x server and all has worked fine so far. I now need to produce some documentation and testing for replication fail-over for the master. I would like to hear from anyone with some experience in promoting a consumer to a supplier. I'm looking for the best practice on this issue. Here is what I am thinking; please feel free to correct me or add input.
Disaster recovery plan:
1.) Select a consumer from the group of read-only replicas
2.) Change the database from read-only to read-write
3.) Delete the replication agreement (in my case I am using a SIR)
4.) Create a new agreement to reflect the supplier status of the chosen replica (again a SIR for me)
5.) Reinitialize the consumers (online or via LDIF, depending on your number of consumers)
That is the general plan so far. Other questions and topics might include:
1.) What to do when the original master comes back online
2.) DNS round-robin strategies (Hardware assistance, Dynamic DNS, etc)
3.) General backup and recovery procedures when: 1.) the directory is corrupted, 2.) the link is down / the network is partitioned, 3.) disk / server corruption /
Well, I hope that is a good basis for getting a discussion going. Feel free to email me if you have questions, or if I can help you with one of your issues.
Best regards,
Ray Cormier

Only 2 Replication Groups

After following the NoSQL administrator's guide and installing NoSQL in my testing environment, I found that it only created 2 replication groups.
I read the administrator's guide over and over and couldn't find how to increase the number of replication groups. The guide only talks about how to identify the number of replication groups, but nowhere does it say how to set or change it.
Am I missing something? 
The system decides on the number of replication groups automatically. In the current release, the decision is fairly simple: it depends on the number of Storage Nodes and the Replication Factor. For a given Replication Factor, having more Storage Nodes will tend to result in more groups.
For example, if you had nine Storage Nodes, and a Replication Factor of 3, the system would create 3 replication groups. 
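The rule described above can be sketched as follows (a simplification I'm inferring from the example; the actual topology planner may weigh more constraints than this):

```python
# Simplified sketch of the rep-group count rule (inferred, not the real code).
def replication_groups(storage_nodes, replication_factor):
    return storage_nodes // replication_factor

print(replication_groups(9, 3))   # 3, matching the example above
print(replication_groups(12, 3))  # 4: more Storage Nodes -> more groups
```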
Hi Alan,
How does NoSQL select the replication nodes that go into a group?
I'm sorry, I don't understand your question. Please elaborate.

Balanced Storage Nodes

The administration guide defines "balanced" storage nodes as storage nodes with exactly one replication group, and recommends making the entire data store balanced.
I have a workload with an almost 50/50 update/read ratio. Each record is read exactly once before it is updated again. I think my data set takes about 20 GB, so 60 GB with a replication factor of 3. With 16 GB RAM servers, I'm thinking of getting 9 machines to store my data set and allow some room for growth.
To have a balanced configuration, I should create 3 replication groups of 3 servers each? But given my read/write ratio, this will leave most of the servers underutilized.
Wouldn't 9 replication groups of 3, with each server hosting 3 replication nodes, one of which can perform writes, make better sense? Do I have a way to make sure I'm running with 1 master per server? 
I assume you have gone through the sizing exercise in the Admin Guide and worked with the spreadsheet.
Whatever number of Rep Groups you end up having, NoSQL Database will evenly distribute the records across all of them. I'm not sure if your comment about many of the nodes going underutilized was in reference to that or not. Assuming a uniform workload across the keyspace, it will be a uniform workload across the Rep Groups.
All writes will go to the master Rep Node in the relevant Rep Group. Presently we do not have much support for 'master affinity', but we may in a future release.
Depending on the Consistency parameter passed to the API call, a read may or may not go to a replica. It sounds like you are doing RMW (read-modify-write), so it may be perfectly reasonable to use a Consistency policy which allows reading from a replica. The subsequent write would go to the master in any case.
I hope this is useful.
Charles Lamb 
Almost useful :)
One clarification: with a replication factor of 3, I have one node serving writes and two serving reads in each group. With all the writes going to the masters, I have 50% of the workload going to 30% of the servers, and the other 50% of the workload going to 60% of the servers. That's the lack of balance I'm worried about.
The sizing exercise addresses disk space, and was helpful in that regard. It did not address throughput (or I missed something significant). In theory, I'd love to place one master on each storage node, to maximize the number of servers that serve writes. 
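To make the imbalance concrete, here is a toy per-server load model for the two layouts under discussion (my own arithmetic, assuming a 50/50 write/read mix, uniform keys, replication factor 3, and reads served only by replicas):

```python
# Toy load model (illustration only): the fraction of total cluster workload
# landing on one server, for a given layout.
RF = 3  # replication factor

def per_server_load(servers, groups, masters_on_server):
    write_share = 0.5 / groups              # each group's master takes this
    read_share = 0.5 / (groups * (RF - 1))  # each replica takes this
    rep_nodes_per_server = groups * RF // servers
    replicas_on_server = rep_nodes_per_server - masters_on_server
    return masters_on_server * write_share + replicas_on_server * read_share

# Layout A: 3 groups of 3, one Rep Node per server.
print(round(per_server_load(9, 3, 1), 3))  # 0.167 on each master server
print(round(per_server_load(9, 3, 0), 3))  # 0.083 on each replica-only server
# Layout B: 9 groups, 3 Rep Nodes per server, one of them a master.
print(round(per_server_load(9, 9, 1), 3))  # 0.111 on every server: even load
```

Layout B evens things out only if exactly one master lands on each server, which the replies note is not something the product currently lets you control.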
You can have multiple Rep Nodes per Storage Node, but you have to be careful of resource contention (i.e. you should have each Rep Node on a separate spindle, and you need to be careful that the Rep Nodes are from different Rep Groups). Also, we don't give you very good control over master affinity, so it would be possible for multiple masters to end up on the same Storage Node. Better support for this may be available in the future.
Charles Lamb