State of data nodes - NoSQL Database

Hi all,
Does the client (and other nodes) maintain the current state of all nodes?
If yes, what information do they maintain, and how do they obtain it?
Thanks 

Yes, the client knows about the topology of the rep groups and rep nodes in the rep groups. That information is initialized when the client first makes contact with one (any) of the nodes in the system on the first request. It is kept updated through the values returned from any calls the client makes to any rep nodes.
This information includes (among other things): topology (partition -> Rep Groups), rep node state (e.g. master, replica), replica lag behind the master, average response time.
Charles Lamb 

Hi Charles,
> That information is initialized when the client first makes contact with one (any) of the nodes in the system on the first request.
1) If that information is initialized when the client first makes contact with one of the nodes, how does the client contact that node in the first place if it does not yet know the topology?
> It is kept updated through the values returned from any calls the client makes to any rep nodes.
2) Each time a node answers a request, does it send that information along with the answer? Isn't that bad for latency?
3) Is that information centralized anywhere? Who is responsible for keeping the topology current? How can we be sure all nodes have the same information?
> This information includes (among other things): topology (partition -> Rep Groups), rep node state (e.g. master, replica), replica lag behind the master, average response time.
4) Is "replica lag behind the master" calculated per record, per node, per primary key? How is this information calculated?
Thanks. 

user962305 wrote:
> Hi Charles,
> 1) If that information is initialized when the client first makes contact with one of the nodes, how does the client contact that node in the first place if it does not yet know the topology?
It must know the ipaddr/name:port of at least one node in the system. It only has to make contact with some node to start up.
> 2) Each time a node answers a request, does it send that information along with the answer? Isn't that bad for latency?
Correct. It is fine for latency. Changes to the topology are very infrequent, and the topology itself is relatively small, so latency is not an issue. If it were large (and it's not), throughput might be an issue, but it is not.
> 3) Is that information centralized anywhere? Who is responsible for keeping the topology current? How can we be sure all nodes have the same information?
Yes. Have you read about the "admin" process?
> 4) Is "replica lag behind the master" calculated per record, per node, per primary key? How is this information calculated?
It is per node in the rep group. It is calculated using VLSNs, a topic which is too complex for me to go into here.
Charles 

Charles,
You said the information about the topology is centralized. Where is it centralized? At the client? Only at the master of each replication group? Where does the admin process reside? I have never heard of that process. 

893771 wrote:
> You said the information about the topology is centralized. Where is it centralized? At the client? Only at the master of each replication group? Where does the admin process reside? I have never heard of that process.
The admin is a replicated process backed by a replicated database (residing on Rep Nodes). Yes, it contains the topology. The topology is cached on the client and on each of the rep nodes.
Charles Lamb 

Hi,
I created a kvstore on 3 nodes with 2 replicas and 200 partitions. Can you please tell me how to connect YCSB remotely to this 3-node setup? 

Abu Taiyyab wrote:
> I created a kvstore on 3 nodes with 2 replicas and 200 partitions. Can you please tell me how to connect YCSB remotely to this 3-node setup?
I'm not sure I understand the question. You always connect to any NoSQL Database cluster the same way: you supply the name of the KVStore, and one or more hostname:port pairs of any of the underlying rep nodes.
Charles Lamb 
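For readers landing here later, here is a minimal Java sketch of that connection step. The store name "kstore" and the host:port pairs are placeholders for your own deployment; a remote YCSB client needs the same two pieces of information (store name plus at least one reachable helper host).

    import oracle.kv.KVStore;
    import oracle.kv.KVStoreConfig;
    import oracle.kv.KVStoreFactory;

    public class ConnectExample {
        public static void main(String[] args) {
            // Store name and helper host:port pairs are placeholders; any one
            // reachable rep node is enough, since the client learns the full
            // topology from whichever node it contacts first.
            KVStoreConfig config = new KVStoreConfig("kstore",
                    "node1:5000", "node2:5000", "node3:5000");
            KVStore store = KVStoreFactory.getStore(config);
            System.out.println("Connected to store: " + config.getStoreName());
            store.close();
        }
    }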

Hi Charles,
We have created a 3-storage-node kvstore setup (replication factor 3 and one replication group). We wish to run the YCSB tool against this setup from an external node which is not part of the setup. Could you guide us on how to achieve this, i.e. any changes to the config file, etc.?


Network Partitions

I am unclear on how Oracle NoSQL Database should behave in the event of a network partition.
Assume a replication group is split between two data centers that have ceased to communicate. Does the master replica stay where it is (even if it is in a network partition with only a minority of the replicas)? How does the "other" network partition "know" that it shouldn't elect a new master?
-- Gwen 
user12003335 wrote:
> I am unclear on how Oracle NoSQL Database should behave in the event of a network partition.
> Assume a replication group is split between two data centers that have ceased to communicate. Does the master replica stay where it is (even if it is in a network partition with only a minority of the replicas)? How does the "other" network partition "know" that it shouldn't elect a new master?
When a network partition occurs, a new election is held. A master is only elected if a majority of nodes is available to elect one. Hence, on the minority side of a network partition, no master will be elected. The replicas on the minority side may continue to service read requests as long as the consistency properties passed into the requests by the client can be satisfied, e.g. Consistency.NONE requests could be satisfied, but Consistency.ABSOLUTE requests could not.
Charles Lamb 
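To make that consistency distinction concrete, here is a hedged Java sketch of the two kinds of reads (the key path and the 5-second timeout are placeholders, and 'store' is assumed to be an already-open KVStore handle). Note that in the public API the relaxed setting is spelled Consistency.NONE_REQUIRED:

    import java.util.concurrent.TimeUnit;

    import oracle.kv.Consistency;
    import oracle.kv.FaultException;
    import oracle.kv.KVStore;
    import oracle.kv.Key;
    import oracle.kv.ValueVersion;

    public class PartitionReadExample {

        // 'store' is an already-open KVStore handle; the key is a placeholder.
        static void readDuringPartition(KVStore store) {
            Key key = Key.createKey("user", "42");

            // A relaxed read can be served by a replica on the minority side.
            ValueVersion relaxed =
                store.get(key, Consistency.NONE_REQUIRED, 5, TimeUnit.SECONDS);
            System.out.println("Relaxed read returned: " + relaxed);

            try {
                // An absolutely consistent read must reach the master; on the
                // minority side of a partition this is expected to fail,
                // typically with a ConsistencyException or a timeout.
                store.get(key, Consistency.ABSOLUTE, 5, TimeUnit.SECONDS);
            } catch (FaultException e) {
                System.err.println("Absolute read failed: " + e);
            }
        }
    }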
A colleague has added more details to what I said above:
"If as a result of a network partition the "minority node partition" had a node in the master state, it will continue to remain in the master state but not be able to process durable writes (since it's not in communication with a simple majority) until the partition is resolved and it notices the presence of the new master. The master on the minority side does not call for an election. This is because the master cannot distinguish between a temporary disconnect of a node and a true network partition.
A downside of having the node on the minority side continue to think that it's in the master state, is that it thinks it's absolutely consistent and as a result may respond incorrectly to read requests with time based or absolute consistency requirements.
So it's desirable for the master to relinquish mastership and call for an election, when it's not in touch with a simple majority, that is, it's not authoritative, but we want to avoid doing so on temporary network disconnects, since there is a cost associated with holding an election.
A solution we have discussed in the past, is for a non-authoritative master to call for an election, when it is consistently in this state for some configurable amount of time."
Charles Lamb 
To make sure I understand:
If I have a 7 node replication group, and 4 crash, the remaining nodes will not elect a new master and will not be able to accept writes until I bring the crashed nodes back? 
user12003335 wrote:
> To make sure I understand:
> If I have a 7 node replication group, and 4 crash, the remaining nodes will not elect a new master and will not be able to accept writes until I bring the crashed nodes back?
That's correct. Break the above into two scenarios: (1) the master was one of the 4 nodes that crashed, and (2) the master was one of the 3 surviving nodes. In case (1), there is no surviving master so no one to accept writes. Further, since there is not a majority, no master can be elected. In case (2), a master survives, but it will not be able to commit any write requests sent to it, assuming that all write requests specify a durability with a replica ack policy of Simple Majority or All. The above post is pointing out that in case (2), an election will not be held, but the mastership will remain in the minority.
I hope this helps.
Charles Lamb 
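As an illustration of case (2), a minimal Java sketch of a write that requires a simple-majority acknowledgement (the key, value, sync policies and timeout are placeholders, and 'store' is assumed to be an already-open KVStore handle):

    import java.util.concurrent.TimeUnit;

    import oracle.kv.Durability;
    import oracle.kv.DurabilityException;
    import oracle.kv.KVStore;
    import oracle.kv.Key;
    import oracle.kv.Value;

    public class MajorityWriteExample {

        // 'store' is an already-open KVStore handle; key, value and timeout
        // are placeholders.
        static void writeWithMajorityAck(KVStore store) {
            Durability durability = new Durability(
                Durability.SyncPolicy.WRITE_NO_SYNC,    // master sync policy
                Durability.SyncPolicy.WRITE_NO_SYNC,    // replica sync policy
                Durability.ReplicaAckPolicy.SIMPLE_MAJORITY);

            Key key = Key.createKey("user", "42");
            Value value = Value.createValue("hello".getBytes());

            try {
                // With only a minority of the shard reachable, the surviving
                // master cannot collect a majority of acks, so the write is
                // rejected (a timeout is also possible, depending on timing).
                store.put(key, value, null /* prevValue */, durability,
                          5, TimeUnit.SECONDS);
            } catch (DurabilityException e) {
                System.err.println("Write rejected, no majority: " + e);
            }
        }
    }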
user12003335 wrote:
> If I have a 7 node replication group, and 4 crash, the remaining nodes will not elect a new master and will not be able to accept writes until I bring the crashed nodes back?
Or, bring back at least one of the crashed nodes. As long as the total number of nodes that are up and communicating forms a majority (assuming the usual default "majority" configuration), you're back in business. 
Hi Charles,
Does that mean that for this release no election will be held for a non-authoritative master?
Thanks 
Hi all,
In such a case, what do all the other nodes and clients know about the master for that RG?
Do they all have the same master for that RG?
Can the client know that the master on the minority side is not the true master?
I am wondering whether a time threshold is sufficient to trigger a new election.
An election should be held when the other partition comes back online, or when the communicating nodes form a majority. I think defining just a time will not be sufficient.
Can you clarify?
Thanks 
893771 wrote:
> Does that mean that for this release no election will be held for a non-authoritative master?
Correct. We have an SR open to provide for an election when the nodes notice they are in this state after some configurable amount of time.
> In such a case, what do all the other nodes and clients know about the master for that RG?
They know which node was, and still is, the master. If the master is a node in the minority group, they know which one is the master. Assuming clients can reach that node (i.e. they are not on the wrong side of the network split), then they will continue to send write requests to that node. Assuming the clients specify a durability of simple majority, and assuming that the master still cannot reach a majority of the nodes, these write requests will be rejected (because the durability constraints can't be satisfied).
> Do they all have the same master for that RG?
There is only one master at any given time.
> Can the client know that the master on the minority side is not the true master?
Assuming the client specifies a durability of simple majority, and assuming that the master cannot reach a majority of the nodes to commit the transaction, the write requests will be rejected. The client will still think that the node is a master because, after all, at some point the other nodes in the group may reappear and write requests will then succeed. Until there is an election, the node that is the master is the only master in the system that any other nodes know about.
> I am wondering whether a time threshold is sufficient to trigger a new election. An election should be held when the other partition comes back online, or when the communicating nodes form a majority. I think defining just a time will not be sufficient.
If other nodes come on line, and if those nodes coming on line form a majority, then writes will proceed and there will be no need for an election. The "fix" we are thinking of doing is that when the system notices that there is not a majority, then after some user-configurable time an election will be called for. This is different from the case you mention where some of the missing nodes come back up. In that case, no election is necessary.
Charles Lamb 
Hi Charles,
I am missing something. I don't understand why there will be no elections.
The scenario is the following:
The master is in the minority partition. Why can't the nodes in the majority partition call for an election?
They will see the absence of a heartbeat. So why can't we have one master in each partition?
And if it is possible to have a master in each partition, can you please re-answer the preceding questions?
Thanks for your answer. 
user962305 wrote:
> Hi Charles,
> I am missing something. I don't understand why there will be no elections.
> The scenario is the following:
> The master is in the minority partition. Why can't the nodes in the majority partition call for an election?
> They will see the absence of a heartbeat. So why can't we have one master in each partition?
> And if it is possible to have a master in each partition, can you please re-answer the preceding questions?
Yes, the nodes in the majority partition will hold an election. Yes, there would be two masters. However, assuming the clients request simple_majority for all requests, the master in the minority partition will not ack commits because the rep ack policy can't be met.
The issue we mentioned earlier is that the nodes in the minority partition should call for an election, but do not. We believe that our fix would be to have them call for an election when they notice that they are in this state after a configurable amount of time.
Charles Lamb 
Thanks Charles.
That means you will have to deal with conflict resolution in some cases (depending on the configuration).
From your point of view, will the client know the new master in the majority partition or the previous master in the minority partition? 
893771 wrote:
> Thanks Charles.
> From your point of view, will the client know the new master in the majority partition or the previous master in the minority partition?
It would depend on how the partitioning affected the client, wouldn't it?
Charles Lamb 
I agree with you. 
Hi Charles
=> The client will still think that the node is a master because, after all, at some point the other nodes in the group may reappear and write requests will then succeed.
Is that certain, since the master in the minority will roll back data not sent to the replicas?
=> That means you will have to deal with conflict resolution in some cases (depending on the configuration).
Is that certain, since the master in the minority will roll back data not sent to the replicas?
Thanks for your update

Multi master replication and Directory 5.0/5.1

With regards to multi-master configurations with Directory 5.0/5.1, can
anyone elaborate on how to specifically set the replica ID numbers for
such a configuration?
For a given suffix:
(a) Do the masters have to have the same or different replica IDs
(from what I understand, different)?
(b) What number do you put for the replica IDs on the consumer servers?
(In the replica config there is room for one number, which according to
the docs is supposed to be the same as the master replica ID from which
it will receive updates. But in MMR isn't it supposed to be able to
receive updates from both masters, so how do you reference the
replica IDs of both masters instead of just one?)
Since first writing this question I tried it out on some boxes, and what I
had to do, for the replicated suffix o=internet, was:
master 1: replica ID set to 10
master 2: replica ID set to 11 (different from the first master)
slave 1: set to 10. If I didn't set this to the same as one of the
masters, the replica initialization from the master would fail with errors
(I assume somewhere in the initialization file there are replica ID
references?). After the initialization from the first master, I created a
replication agreement from the second master to this slave, and subsequent
updates seemed to work fine; but the slave had to be set to the same ID
as the master for the initialization to work.
slave 2: set to 11. If I didn't set this to the same as one of the
masters, the replica initialization would fail with errors. After the
initialization from the second master, I created a replication agreement
from the first master to this one and it seemed to work fine.
After this I tested some updates and they seemed to work fine in the
various elements of the MMR config.
I guess what I'm after is clarification that this is in fact the right
approach, as the docs are not clear enough (to me at least).
Thanks in advance. 
I have had some problems with MM replication due to what I too regard as
anemic documentation. First, a dedicated consumer with 5.0 SP1 and
above should have a replica ID of 255; this is the case for all
dedicated consumers. That leaves 1-254 for other replica IDs. Masters
MUST have different IDs; however, it is not clear to me whether or not
the individual suffixes/db nodes on the same system must have different
IDs. In my testing I had 7 suffix/db nodes, and on master 1 they all
had an ID of 1, and on master 2 all IDs of 2. Although MM worked (I
could load tens of thousands of entries over and over again without
problems), I experienced inconsistent and random failures of the
directory server with no errors in the error log at all. My changelogs
would all reset to 0 and require that all my replication agreements be
reinitialized. On two occasions my ACIs were wiped out. I am now in
the process of testing master 1 with each suffix/db node having a unique
ID, and the same for master 2; basically I am using IDs 1-7 on master 1
and 8-14 on master 2. I have not tested much thus far, so at this point
I am not able to give you definitive information or clarification on whether
this strategy is more stable or not, but will post if/when I do.
Thought you would want to know someone else is experiencing similar
frustrations as you . . . .
Cullhane Gibbs wrote:
> With regards to multi-master configurations with Directory 5.0/5.1, can
> anyone elaborate on how to specifically set the replica ID numbers for
> such a configuration?
Hi,
Thanks for that. I didn't notice the SP1 service release note item you
quote there, and that clears it all up.
Will give it a go next time I have a couple of unpaid hours to spare ;)
Thanks!
"Cullhane Gibbs" <cullhane#hotmail.com> wrote in message
news:9vbmgu$n3i1#ripley.netscape.com...
With regards to multi master configurations with directory 5.0/5.1 can

Oracle NoSQL Database White Paper (Implementation)

Hi all,
Please, I would like to have an answer to the following questions:
1) What do you mean by topology when you say "the client driver maintains a copy of the topology"?
2) Does Oracle NoSQL Database work like client/server? Do we have to install the client on every PC that has to use the store? What do we do if we have a new PC? Should end users have a client on their PC? I don't really understand where the client should be installed.
3) You say: "The 30 replication groups are stored on 30 storage nodes, spread across the two data centers." Why 30 storage nodes? It should have been 90 nodes, since we have a replication factor of 3.
4) How does the client get the response to a read request? Is the client reading the answer directly from a log file on one of the replication nodes, i.e. the one to which the read request was forwarded?
Thanks 
user962305 wrote:
> Hi all,
> Please, I would like to have an answer to the following questions:
> 1) What do you mean by topology when you say "the client driver maintains a copy of the topology"?
Topology refers to the location of the nodes in the system and which partitions they contain. By knowing the topology, the client can contact the node holding the requested data directly.
> 2) Does Oracle NoSQL Database work like client/server? Do we have to install the client on every PC that has to use the store? What do we do if we have a new PC? Should end users have a client on their PC? I don't really understand where the client should be installed.
Any application which wants to use NoSQL Database needs to have kvclient-M.N.P.jar in its classpath. This jar can be found in the KVHOME/lib directory.
> 3) You say: "The 30 replication groups are stored on 30 storage nodes, spread across the two data centers." Why 30 storage nodes? It should have been 90 nodes, since we have a replication factor of 3.
Can you tell me the location of the paper and what page you are referring to?
> 4) How does the client get the response to a read request? Is the client reading the answer directly from a log file on one of the replication nodes, i.e. the one to which the read request was forwarded?
If a request has been forwarded, then the response is returned to the client through the forwarding node.
Charles Lamb 
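To illustrate how the cached topology is used by the client driver, a small Java sketch of a put and a get (the key path and value are placeholders, and 'store' is assumed to be an already-open KVStore handle):

    import oracle.kv.KVStore;
    import oracle.kv.Key;
    import oracle.kv.Value;
    import oracle.kv.ValueVersion;

    public class RoutingExample {

        // 'store' is an already-open KVStore handle; key and value are placeholders.
        static void putAndGet(KVStore store) {
            Key key = Key.createKey("user", "42");
            Value value = Value.createValue("some bytes".getBytes());

            // The client driver hashes the major key path to a partition, looks
            // up the owning rep group in its cached topology, and sends the
            // write to that group's master; no central router is involved.
            store.put(key, value);

            // A read can go to any node in the owning rep group, subject to the
            // requested consistency (relaxed consistency by default).
            ValueVersion vv = store.get(key);
            System.out.println(new String(vv.getValue().getValue()));
        }
    }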
Hi Charles,
You said:
> Can you tell me the location of the paper and what page you are referring to?
http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf
Subsection "Implementation", page 9, near "Figure 4: Architecture".
> If a request has been forwarded, then the response is returned to the client through the forwarding node.
Do you mean every node (including the client) can forward a request?
How does the log-structured storage system help in returning the response to the client?
Thanks. 
user962305 wrote:
> 3) You say: "The 30 replication groups are stored on 30 storage nodes, spread across the two data centers." Why 30 storage nodes? It should have been 90 nodes, since we have a replication factor of 3.
You are correct. It should be 90 storage nodes. Thanks for pointing this out.
Charles Lamb

Error while configuring KVStore

Hi,
I am trying to install Oracle NoSQL release 2.0.26. I have created just two storage nodes for the time being. While I am trying to create a topology, I see an error:
"Can't build an initial topology with 0 shards. Please increase the number or capacity of storage nodes"
Can anyone tell me what this error means?
Thanks. 
> Hi,
> I am trying to install Oracle NoSQL release 2.0.26. I have created just two storage nodes for the time being. While I am trying to create a topology, I see an error:
> "Can't build an initial topology with 0 shards. Please increase the number or capacity of storage nodes"
> Can anyone tell me what this error means?
That error says that each KVStore has to contain at least one shard. The more shards your store contains, the better your store is at servicing write requests.
Every Storage Node hosts one or more Replication Nodes. Replication Nodes are organized into shards. A shard contains a single Replication Node which is responsible for performing database writes, and which copies those writes to the other Replication Nodes in the shard. This is called the master node. All other Replication Nodes in the shard are used to service read-only operations. These are called the replicas.
Production KVStores should contain multiple shards. At installation time you provide information that allows Oracle NoSQL Database to automatically decide how many shards the store should contain. The more shards that your store contains, the better your write performance is, because the store contains more nodes that are responsible for servicing write requests.
Please let me know if you have any more questions. You can find some basic instructions for installing a simple multi-node system with 1 shard replicated across 3 hosts at http://docs.oracle.com/cd/NOSQL/html/quickstart.html
Related documentation:
Identify the Target Number of Shards - http://docs.oracle.com/cd/NOSQL/html/AdminGuide/store-config.html#num-rep-group
The KVStore - http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/introduction.html#kvstore
Thanks,
Bogdan 
In addition to the information that Bogdan supplied, the most likely issue is that you have specified a replication factor of 3, and you only have two storage nodes. NoSQL DB will not place more than one Replication Node from the same shard on a single Storage Node, because that would make the shard vulnerable to storage node failure.
We will try to make that error message more explicit, to provide a better hint as to what the problem is. 
Thanks Linda Lee and Bogdan. When I created the third storage node, everything worked fine.
Thank you so much for the help!

Node in detached state causes partial unavailability

Hi, I am using Oracle NoSQL Ver: 12cR1.2.1.8. I have 9 nodes: 3 replication groups, each with 3 replication nodes. Due to an unknown issue, one of the nodes became detached: "Status: RUNNING,DETACHED at sequence number: 106,381,564". After that, some of the clients were still connecting (or maybe remained connected) to this detached node, and all requests were failing with timeouts (after 5 seconds).
Is that something that the user of the client API should take care of? I was actually expecting that the client updates the list of nodes in the background and would connect to another node as soon as such a condition is detected. That is the way it behaves when nodes get killed / shut down, at least.
Best regards,
Dimo
dimo wrote:
 
> I am using Oracle NoSQL Ver: 12cR1.2.1.8. I have 9 nodes: 3 replication groups, each with 3 replication nodes. Due to an unknown issue, one of the nodes became detached: "Status: RUNNING,DETACHED at sequence number: 106,381,564". After that, some of the clients were still connecting (or maybe remained connected) to this detached node, and all requests were failing with timeouts (after 5 seconds).
> Is that something that the user of the client API should take care of? I was actually expecting that the client updates the list of nodes in the background and would connect to another node as soon as such a condition is detected. That is the way it behaves when nodes get killed / shut down, at least.
No, applications that use the client API are not responsible for doing anything about server node failures. The client API will use whatever replication nodes are available. Depending on the client consistency and durability requirements, one or more server node failures may not be noticed by callers of the client API at all. In many cases, if a single server node fails, as the DETACHED status indicates, the client API calls will continue to succeed. For example, if you are using a durability with ReplicaAckPolicy.SIMPLE_MAJORITY and the store has a replication factor of 3, the remaining two nodes in the shard will be able to handle the request.
What operation were you performing here when you got a timeout? The DETACHED status means that there is a problem with that node. You should look at the debugging log files in the KVROOT/STORENAME/log directory for information about what may be going wrong.
- Tim
Hi Tim,
We are using simple majority consistency; it works if one node per replication group is down (2/3 is a majority). In contrast, when this node was in the detached state, some clients were still connecting to it and all requests from those clients were failing. Exactly the opposite of what one would expect.
Cheers,
Dimo
Hi Dimo,
If there were a set of update requests that were directed to a node that failed, all of those requests would fail with an exception and would need to be retried by the application. That problem should not affect future requests or read requests, though. Can you confirm whether these were read or update requests, and whether by "all requests" you mean that the requests were continuing to fail? Is it possible that the problem involves a network partition where the client can only reach the failed node? If none of these ideas cover what you are seeing, then you should probably contact me offline and send me stack traces. You can email me at my first name dot last name at the usual company name.
- Tim Blackman
Hi Tim,
The read requests were failing. We do not write to the store very often (we batch load the whole store from our SQL database and then only read during normal operation). This was going on for a few days, so I assume there was no network partition during that whole period. It also started working correctly as soon as we fixed that one node (I guess as soon as we brought it down for repair). I will try to get you some stack traces, although I assume that the logs have been rolled away already.
Cheers,
Dimo
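For the pattern Tim describes (retrying requests that fail against a sick node), a minimal Java retry sketch; the attempt count is arbitrary and 'store' is assumed to be an already-open KVStore handle:

    import oracle.kv.FaultException;
    import oracle.kv.KVStore;
    import oracle.kv.Key;
    import oracle.kv.RequestTimeoutException;
    import oracle.kv.ValueVersion;

    public class RetryingReadExample {

        // 'store' is an already-open KVStore handle; the retry count is arbitrary.
        static ValueVersion readWithRetry(KVStore store, Key key) {
            final int maxAttempts = 3;
            FaultException lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return store.get(key);
                } catch (RequestTimeoutException e) {
                    // A request dispatched to a sick node can time out; retrying
                    // gives the driver a chance to pick a healthy replica.
                    lastFailure = e;
                } catch (FaultException e) {
                    // Other transient server-side failures can also be retried.
                    lastFailure = e;
                }
            }
            throw lastFailure;    // give up after maxAttempts
        }
    }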
