Is there a good way to import a bulk of records into a kvstore? - NoSQL Database

I have one billion records.
I wrote a Java program and call store.put(), but the performance is very poor.
Is there a better way? 

894377 wrote:
I have one billion records.
I wrote a Java program and call store.put(), but the performance is very poor.

Writing one billion records (in any system) is going to require some tuning. You didn't say anything about your configuration (number of machines, size of machines, amount of memory, size of cache, amount of disk, type of disk, etc.). You may be using kvlite. If so, kvlite is not intended for any kind of serious performance -- it is only meant to be a way for someone to use the API in a relatively "small" environment.
I suggest that you start by reading Chapter 2 in the Admin Guide, [Planning Your Installation|http://download.oracle.com/docs/cd/NOSQL/html/AdminGuide/installplanning.html]. It contains information about sizing your NoSQL Database properly.
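For what it's worth, a single client thread calling put() serially is usually the first bottleneck on the client side. Below is a minimal, hypothetical sketch of a multi-threaded loader using the plain oracle.kv API -- the store name, helper host:port, key layout, and the record source are all illustrative placeholders, and the thread count would need tuning against a properly sized store.

import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.Value;

// Sketch only: spread put() calls across several worker threads so the
// client keeps the store busy. Error handling and retries are omitted.
public class ParallelLoader {

    public static void main(String[] args) throws InterruptedException {
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("mystore", "node1:5000"));

        int threads = 8; // tune to the client machine and the store's capacity
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int t = 0; t < threads; t++) {
            final int slice = t;
            pool.submit(() -> {
                // nextRecord() stands in for whatever feeds this thread its
                // share of the records (a file split, a queue, etc.).
                for (Record r = nextRecord(slice); r != null; r = nextRecord(slice)) {
                    Key key = Key.createKey(Arrays.asList("user", r.id));
                    store.put(key, Value.createValue(r.bytes));
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        store.close();
    }

    // Placeholder record type and source; not part of the oracle.kv API.
    static class Record { String id; byte[] bytes; }

    static Record nextRecord(int slice) { return null; }
}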
Charles
Edited by: Charles Lamb on Nov 2, 2011 8:59 AM

Related

Data loads are slower on the server?

Howdy,
We're inserting about 100,000 byte arrays, each of 608,000 bytes, into an Informix List column via JDBC.
My problem is that performance drops (it's nearly twice as slow) when I run it in the test environment as compared to my development environment, and I expected it to be substantially faster. I'm getting between 7 and 9 rows a second in development, but it drops to 3 to 4 rows per second in test... and that's not fast enough.
My development environment uses THE test database on the unix server (SPARC Solaris, 16 CPUs & 8 Gig), and a local WebLogic server on a Windows development machine (2.8 duo with 2 Gig), talking over a really fast LAN (I think it's a million megabits per second, but that's not my area).
The test database and the test weblogic server are both on the same unix machine... so there's no network involved... so I expected a substantial performance increase... but got the opposite. Solaris's prstat utility reports that the server is barely working during the load... which is indicative that the process is I/O bound... but throughput has actually dropped when we're NOT using the network.
I'm stumped... Please does anyone have any brilliant insights? My next step is vendor support... and that's always a joy.
Thanx all. Keith.
Edited by: corlettk on 31/03/2008 12:18 - I should have said I'm using a script to start the JVM in both cases, so they've got the same memory allocations. 
Hello Keith,
These are the things I actually can think of:
1) Do you use the same driver? There may be some restrictions. I had a similar problem some time ago using DB/2; changing the driver (I think I used the network driver before and changed to the application driver) increased performance.
2) Is it the same database? There may be some logging facility active on your server.
3) Are indexes involved? They may slow down the insert. You could then insert into some table without indexes and do an "insert into XXXX ... select" afterwards.
4) How do you insert the rows? There may be a problem when commit is done too often. But this should also reduce performance in your developer scenario. (A rough JDBC sketch of batched inserts with periodic commits follows below.)
I will continue thinking about it ....
Lars. 
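Following up on points 3) and 4), here is a rough, generic JDBC sketch of what I mean by batching the inserts and committing every N rows -- the connection URL, table, and column names are made up, and any Informix-specific tuning is left out:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch only: insert byte arrays in batches and commit every
// COMMIT_INTERVAL rows to keep lock lists small.
public class BatchLoader {

    private static final int COMMIT_INTERVAL = 100;

    public static void load(Iterable<byte[]> rows) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:informix-sqli://dbhost:1526/testdb:INFORMIXSERVER=ol_test",
                "user", "password")) {

            con.setAutoCommit(false);

            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO blob_table (payload) VALUES (?)")) {

                int count = 0;
                for (byte[] row : rows) {
                    ps.setBytes(1, row);
                    ps.addBatch();

                    if (++count % COMMIT_INTERVAL == 0) {
                        ps.executeBatch();   // one round trip for the whole batch
                        con.commit();
                    }
                }
                ps.executeBatch();           // flush the remainder
                con.commit();
            }
        }
    }
}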
Thanx Lars.
1) I'll definitely look into "network vs local" optimised JDBC drivers (and settings)... that might explain it. Thank you!
2) It is the very same database and table in both instances. It's a "real" table (as opposed to a session temporary table or whatever) which is (re)created in both instances at the beginning of each run. Logging is active on that database, but has not been activated on that table. More food for thought though, thanx.
3) Only Informix's internal primary index is involved... it's faster to create indexes & stats in Informix after the table is loaded with data.
4) The rows are inserted in a "tight loop". It commits every 100 rows... or else we'd run out of lock-lists before the load finished. Committing every row didn't appear to impact performance... From experience it doesn't UNLESS logging is active on that table.
I already owe you four beers. Thanking you sir.
Keith. 
Hello Keith,
I just looked into the Informix guide (http://publib.boulder.ibm.com/infocenter/idshelp/v111/index.jsp?topic=/com.ibm.jdbc_pg.doc/jdbc32.htm). There seems to be a server-side driver available.
Where can I get that beer?
Lars. 
My modem keeps throwing a DontPourThatBeerIntoMeYouIdiotException... so maybe the next time you're in Australia.
Hello Keith,
So I presume it will be kind of brackish by the time I reach Australia. Keith, please drink that beer for me! Skoal!
Curiously waiting for whatever solution you find,
Lars. 
@OP: Sounds like your disk can't handle the load from both the application server and your database. What happens if you e.g. turn off all logging from the application server and other processes?
kaj,
Sounds like your disk can't handle the load from both the application server and your database.

Yep, that's a possibility... except... umm... I wouldn't have a clue where to start to "turn off logging", and I don't imagine it'll make any difference. Presuming this machine is our "standard" setup (and that the "standard" setup hasn't changed from how I'm used to them being set up; certainly nobody has informed me, but they wouldn't, especially now we've got ITIL to ensure that nobody knows what's going on), /home is mounted on an internal disk (for transfer speed) and /data is on a T3 disk array (for cost abatement... or come to think of it, it might even be on the SAN these days; see the above mushroom syndrome, and I don't even know how to check).
I'll eliminate some other possibilities first though
Big thanks ;-) Keith.

Heap memory size question.

Hi all! I have developed a web app and I want to host it with a hosting company, so depending on my memory needs I will need to apply for one plan or another. I'm not an expert in this field, so if you have any hints or observations please let me know.
The app makes use of SQL and PHP.
The results below are from the NetBeans profiler.
This is a graphic:
[http://i201.photobucket.com/albums/aa166/juanmanuelsanchez/profile.png|http://i201.photobucket.com/albums/aa166/juanmanuelsanchez/profile.png]
Timestamp                  Heap Size (Bytes)   Used Heap (Bytes)
12-Jan-2010 23:54:21       21229568            15472504
12-Jan-2010 23:54:22       21229568            15472504
12-Jan-2010 23:54:24       21229568            15472504

Which would be acceptable levels, regarding memory, loaded classes, etc.?
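(In case it helps to compare against other environments, the same numbers can also be logged from inside the app itself; the following is a minimal sketch using the standard java.lang.management API, nothing NetBeans-specific.)

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch: print current heap figures, comparable to the
// "Heap Size" / "Used Heap" columns reported by the profiler.
public class HeapSnapshot {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        System.out.println("Heap committed (bytes): " + heap.getCommitted());
        System.out.println("Heap used (bytes):      " + heap.getUsed());
        System.out.println("Heap max (bytes):       " + heap.getMax());
        System.out.println("Loaded classes:         "
                + ManagementFactory.getClassLoadingMXBean().getLoadedClassCount());
    }
}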
Thanks a lot !
Edited by: juanmanuelsanchez on Jan 13, 2010 11:11 AM 
juanmanuelsanchez wrote:
..
Which should be acceptable levels, regarding memory, ...
Depends on what the application is doing.
juanmanuelsanchez wrote:
.. loaded classes, etc.?
If someone says 20, are you going to change your design so that you only have 20 classes?
Don't fall into the premature optimization traps. 
The application will make heavy use of SQL queries to find information. There is little processing and there are no heavy operations in the servlet.
I'm not going to change my design; the only reason would be if the number were extremely high, or if there were a real flaw that someone pointed out.
I can't remember the number of classes in the app at the moment, but it should be between 30 and 40.
Each object involves 3 classes: one for constructors, etc., another for the DB queries related to the object, and another to organize/perform operations based on the results of the query.
Thanks for the help.

How to find memory used by page tables

Is there a way to find out how much memory is currently being used by page tables? I am new to Solaris ;-)
I want to quantify the advantages of Intimate Shared Memory (in the context of a large Oracle database with lots of concurrent users). I want to contrast this against Linux which does not have a method of allowing different processes to share page tables that map onto shared memory. Thus, with a large number of concurrent connections where each connection creates a new process that maps onto the Oracle shared memory, a significant amount of memory can be consumed just by the page table entries. 
Do you have access to the "Solaris Internals" book (volume 2)? It has a lot of information about how memory is used.
--
Darren 
Yes, a very recent acquisition :-) ...I'm busy working my way through it. I just taught myself about mdb today, but I still can't figure out how to get the amount of memory being used by page tables only. The dcmd ::memstat can give me the total amount of memory being used by the kernel, but I would like some more detail than that.
On Linux you can simply look at /proc/meminfo and it contains a wealth of information. I was hoping Solaris would be similar... but isn't that always the case when doing something new? We hope it's like what we already know :-) Below is an example from Linux showing that 1860 kB have been used to store page table entries.
[root@makalu ~]# cat /proc/meminfo | grep PageTables
PageTables: 1860 kB
[root@makalu ~]#
Edited by: BrettSchroeder on Mar 12, 2008 1:37 PM 
Yeah, sorry. One of those (many) areas that I have no experience with. However, it appears to me that "page tables" are an x86 architecture thing that doesn't exist in the same way on SPARC. So I find it very easy to believe that x86-only details are less likely to be "front and center" in documentation and debugging.
You can also pull up the OpenSolaris source browser and type "page tables" into the search. There's not that many files that have that term referenced. Maybe one of the comments will make sense to what you're looking for.
Good luck!
--
Darren

Java Caching Framework

Hi All,
I'm in the process of evaluating some open source Java caching frameworks which could help our web application reduce response time.
I have a few open source caching frameworks on my list:
JCS
OSCache
JOCache
But I have never used any of these caching frameworks. If anyone in the group has used them in the past or is working with an open source framework, please share your experience so that it can help us decide on the best available solution.
Thanks in advance
-Umesh 
You might want to add ehcache to your list of possibilities. 
I kw about it and its there in the list but what i kw abt this solution is that it require a high amount of memory which on initial stage we don't have 
umesh_awasthi wrote:
I kw about it and its there in the list but what i kw abt this solution is that it require a high amount of memory which on initial stage we don't have
"High"? You mean its memory footprint can't be configured? You do realize that using a cache is automatically going to increase the amount of memory your application uses, don't you?
I'm assuming here that "kw" is a new text-speak abbreviation for "know". I haven't seen this one before. It's bad enough trying to decipher the ones I do know. Could you post in English in future? 
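(For what it's worth, the memory footprint is normally just a configuration knob in these frameworks. As a rough illustration of the idea only -- not tied to ehcache, JCS, OSCache, or any other specific product -- a size-bounded LRU cache can be as simple as the plain-JDK sketch below; the frameworks expose the same bound as a maximum entry count or maximum heap size.)

import java.util.LinkedHashMap;
import java.util.Map;

// Rough illustration only: a size-bounded LRU map using plain JDK classes.
public class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true);   // access-order, so gets count as "recently used"
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // evict the least recently used entry
    }
}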
Yes "Kw" means know sorry for this short cut.
as i have already told that i have not used any one of them this is the first time we are in process of using them..
we were told by one of our friend about the high memory usage about it..
Regarding disk space that is not going to be a problem at any stage but the current memory can be a issue at initial stage 
I would advise you to avoid the DBR* technology. Try things out to find out how they actually work in your environment. Sometimes it may be difficult to build a test system where you can do a proper stress test, but it's still better than relying on a vague remark your friend (or some anonymous person on the Oracle forums) made.
* DBR: Design By Rumour 
Thanks Clap for the suggestion, and I agree that can be done.
This will help us get a real-world evaluation of the positives and negatives of the solutions.
You may want to check out Hazelcast. It is an open source, transactional, distributed cache for Java. A Hibernate second-level cache plug-in is also available.
Hazelcast is released under the Apache license. It also has distributed lock, topic, multimap, queue and executor service implementations. [This 10 minute video|http://www.hazelcast.com/screencast.jsp] is very good for getting started.
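A minimal sketch of the distributed map API (assuming a recent Hazelcast release; the exact bootstrap calls may differ between versions):

import java.util.Map;

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Sketch only: every JVM that starts an instance with the same config joins
// the same cluster, and "my-cache" is shared (and partitioned) across them.
public class HazelcastExample {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        Map<String, String> cache = hz.getMap("my-cache");
        cache.put("greeting", "hello");
        System.out.println(cache.get("greeting"));

        hz.shutdown();
    }
}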
-talip
Edited by: talip_ozturk on May 2, 2010 1:59 PM
Edited by: talip_ozturk on May 2, 2010 2:16 PM

Shared objects across JVMs?

I'm sure this topic must come up fairly frequently, but try as I might, I couldn't find a satisfying answer for what I'm looking for.
I have an object that loads a very large in-memory datastructure from disk early in its lifetime, after which time the datastructure is read-only.
My application runs a number of different JVMs on my machine, and each of these JVMs has to load the same datastructure, wasting time and immense amounts of memory.
I know I can likely improve the load time by storing a serialized version of the object itself on disk, but that doesn't help with the more serious problem: the memory usage.
Any suggestions for a good, lightweight way to share this read-only memory across the JVMs? 
You could have one server instance that loads and serves portions of this structure to other jvms. Possible ways to do that are e.g. RMI or web service with text or binary data. 
I've looked at RMI in the past, but if I understand correctly, though I can use it to pass the object between JVMs, it won't actually let me share the memory. Which leaves me no better off than I am at present. Am I misunderstanding? (I hope so; I'd like a nice solution to this problem!) 
I was thinking that you could use RMI to serve requested parts of this structure as needed. You can't, AFAIK, share memory directly between JVMs, but this way you would only communicate what's needed each time. One question though, is something preventing you from just using a database to store and serve this structure?
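As a rough sketch of what that could look like -- the interface, names, and data layout are made up, and error handling is omitted:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical remote interface: one JVM loads the structure once and
// serves requested slices to the others.
interface ChunkService extends Remote {
    byte[] getChunk(long offset, int length) throws RemoteException;
}

// Server side: the single JVM that actually holds the data in memory.
class ChunkServer implements ChunkService {

    private static ChunkServer instance; // strong reference so the exported object isn't GC'd

    private final byte[] data; // stand-in for the real structure

    ChunkServer(byte[] data) { this.data = data; }

    @Override
    public byte[] getChunk(long offset, int length) {
        byte[] slice = new byte[length];
        System.arraycopy(data, (int) offset, slice, 0, length);
        return slice;
    }

    public static void main(String[] args) throws Exception {
        instance = new ChunkServer(loadData());
        ChunkService stub = (ChunkService) UnicastRemoteObject.exportObject(instance, 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("chunks", stub);
        System.out.println("ChunkService bound");
    }

    static byte[] loadData() { return new byte[1 << 20]; } // placeholder loader
}

// Client side: each worker JVM asks only for the slice it needs.
class ChunkClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("localhost", 1099);
        ChunkService chunks = (ChunkService) registry.lookup("chunks");
        byte[] slice = chunks.getChunk(0, 1024);
        System.out.println("got " + slice.length + " bytes");
    }
}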
Brynjar wrote:
One question though, is something preventing you from just using a database to store and serve this structure?
Mainly performance. The structure gets accessed an unbelievable number of times inside an already compute-bound bit of code (that takes days to run as it is). That would also be my concern with RMI; you're right, I could serve up individual requests for bits of data (looking much like a database), but then I'd worry that the overhead of RMI would kill me. Thus the hope that I can somehow just share the read-only memory between JVMs.
Edited by: dougcook on Feb 8, 2009 9:54 AM 
dougcook wrote:
Brynjar wrote:
One question though, is something preventing you from just using a database to store and serve this structure?
Mainly performance. The structure gets accessed an unbelievable number of times inside an already compute-bound bit of code (that takes days to run as it is). That would also be my concern with RMI; you're right, I could serve up individual requests for bits of data (looking much like a database), but then I'd worry that the overhead of RMI would kill me. Thus the hope that I can somehow just share the read-only memory between JVMs.
Edited by: dougcook on Feb 8, 2009 9:54 AM

If you need fast access to all of the structure in multiple applications concurrently, then I think you're out of luck using RMI as well, or any other socket-based communication for that matter, for the same performance reason. How large is this structure and how is it currently loaded? Also, do the applications need to be separate? You would be able to share the memory if they were all running within the same JVM.
The data structure is currently a few hundred megabytes and growing, eating a few gigs total across 8 JVMs on an 8-core machine with 24 gigs of RAM.
The architecture of the compute platform I'm using (hadoop) is what's forcing me to multiple JVMs, and I've little control over that, unfortunately.
Much appreciate your help and suggestions. 
Well if you have to use hadoop for distributing workload then I think you'll have to live with the memory usage. Best bet then I would think is to optimize the loading time, e.g. with serialization like you suggested yourself. What kind of data is this, and does all of it need to be loaded up front? 
dougcook wrote:
I'm sure this topic must come up fairly frequently, but try as I might, I couldn't find a satisfying answer for what I'm looking for.
I have an object that loads a very large in-memory datastructure from disk early in its lifetime, after which time the datastructure is read-only.
My application runs a number of different JVMs on my machine, and each of these JVMs has to load the same datastructure, wasting time and immense amounts of memory.
What do you mean "on your machine"?
Realistically the only time this should come up would be in a server situation, and one in which the number of machines is really irrelevant.
At any rate look at the following.
[http://www.danga.com/memcached/]
Or perhaps the following, which I accidentally came across when looking for the above link. I haven't used this, although I have used the previous one.
[http://memcachedb.org/]
