I finally jumped on the NoSQL bandwagon and gave Redis a try.
I’ve been hearing about NoSQL for quite some time as a lightweight but much faster kind of database system (speed and ease of use being the advantages over an RDBMS, and the lack of relations being the disadvantage). One of the several NoSQL systems is Redis, which I read about recently.
import redis
rs = redis.Redis('localhost') # or the host address
rs.zcard('en:1gms') # returns the cardinality (number of members) of the sorted set 'en:1gms'
rs.zscore('en:1gms', 'hello') # returns the score stored for 'hello', i.e. its count
Of course it doesn’t do smoothing, but I read about Redis and I was excited to try it. I can now use Redis for anything that has large data and needs fast look-ups =D
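The lookups above assume the counts are already sitting in a sorted set. A minimal sketch of how that loading step might look is below. It assumes each input line has the form `ngram<TAB>count` and that the client is a redis-py (3.0 or later) object whose `zadd` takes a dict of member to score; the helper names `parse_count_line` and `index_counts` and the batch size are my own, not anything Redis prescribes.

```python
def parse_count_line(line):
    """Split one 'ngram<TAB>count' line into (ngram, count).

    The tab-separated layout is an assumption about the input files.
    """
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return ngram, int(count)


def index_counts(client, key, lines, batch_size=10000):
    """Add n-gram counts to the sorted set `key`, one ZADD call per batch.

    `client` is expected to behave like redis.Redis (redis-py >= 3.0),
    i.e. client.zadd(key, {member: score, ...}).
    Returns the number of members sent to the server.
    """
    batch = {}
    total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ngram, count = parse_count_line(line)
        batch[ngram] = count
        if len(batch) >= batch_size:
            client.zadd(key, batch)
            total += len(batch)
            batch = {}
    if batch:  # flush whatever is left over
        client.zadd(key, batch)
        total += len(batch)
    return total


# Hypothetical usage against a running server (the file name is made up):
#   import redis
#   rs = redis.Redis('localhost')
#   with open('1gms/vocab', encoding='utf-8') as f:
#       index_counts(rs, 'en:1gms', f)
```

Batching keeps the number of round trips to the server down, which is likely to matter once the files have millions of lines.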
At its core a key-value store like Redis is essentially a flat dictionary, but I am sure Redis uses some efficient mechanism for storing the data; it is written in ANSI C, after all. Besides, Redis comes with ‘batteries included’: it has a server which, once running, can serve any client; a persistent data store, which comes back even if the server is stopped and restarted; values that don’t have to be strings but can be more complex data structures themselves; and, finally, clients in a number of languages.
Redis provides quite a few benefits over a classical RDBMS, as enumerated here – for example better, more efficient data structures. However, a caveat is that once the data becomes greater than the memory and the system starts paging, the performance degrades radically. So perhaps this method is not the silver bullet for looking up all n-grams instantaneously if you don’t have enough memory. But it can still be useful (and it’s easy to use, and it comes with batteries included as mentioned earlier) for several other scenarios.
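One rough way to keep an eye on that memory caveat is to ask the server how much memory it is using and compare that with a budget you choose. The sketch below assumes a redis-py client, whose `info("memory")` call returns a dict that includes `used_memory` in bytes; the helper name `memory_headroom` and the budget value are made up for illustration.

```python
def memory_headroom(client, budget_bytes):
    """Return budget_bytes minus the memory Redis reports using.

    A negative result means the data set has outgrown the budget.
    `client` is expected to behave like redis.Redis.
    """
    used = int(client.info("memory")["used_memory"])
    return budget_bytes - used


# Hypothetical usage against a running server, with a 6 GB budget:
#   import redis
#   rs = redis.Redis('localhost')
#   if memory_headroom(rs, 6 * 1024**3) < 0:
#       print("data no longer fits in the memory budget")
```

Checking this every so often while indexing would at least show when the data is about to outgrow RAM, instead of finding out when everything slows to a crawl.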
Go learn yourself some Redis

I’d be interested in knowing how Redis does with the 4-grams or 5-grams. The 1-grams can easily be stored in memory, which obviates the need for Redis, but the 4-grams and 5-grams cannot. Can you provide some info on how these fared?
You are right. The experiment started off great, but on my workstation it got stuck somewhere in the middle of the 4-grams… it successfully indexed the 1-grams, but then it switched to the 4-grams folder and just wouldn’t finish. Redis is apparently only good if you can fit the entire data set in RAM, and I had 8G of RAM as opposed to about 80G of n-grams.
Maybe Redis is interesting if someone does a smaller-scale project where fast access is needed.
I am facing a situation in Python and I just stumbled upon your blog. I have 1 million key-value pairs where the keys are strings and the values are, in most cases, lists of 1000 items. Each item in such a list is itself a two-element pair (integer, float). I am using a normal machine (4GB RAM).
What would be your recommendation?
I also posted a query on StackOverflow and received various responses.
http://stackoverflow.com/questions/8923387/storing-a-list-of-1-million-key-value-pairs-in-python
1 million key-value pairs really feels like something Redis can handle. I worked with 13 million 1-grams from the Web1T data and it worked just fine on an 8G machine. I think the lengths of the values themselves should not be an issue, but do let me know if it worked well. Once the indexing is done, the lookup is instantaneous.
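For what it’s worth, here is a rough sketch of one way such values could be stored: pack each (integer, float) pair into 12 bytes and keep the whole list as a single Redis string. It assumes the integers fit in 32 bits and that the client is a redis-py object returning raw bytes from `get` (its default); all the helper names are hypothetical. The last helper just does the arithmetic on the raw payload, which is worth checking against your RAM given the memory caveat in the post.

```python
import struct

# One pair = 4-byte signed int + 8-byte float, little-endian, no padding.
# Assumption: the integers fit in 32 bits.
PAIR = struct.Struct("<id")


def pack_pairs(pairs):
    """Serialize a list of (int, float) pairs into one bytes value."""
    return b"".join(PAIR.pack(i, x) for i, x in pairs)


def unpack_pairs(blob):
    """Inverse of pack_pairs: bytes back to a list of (int, float) tuples."""
    return list(PAIR.iter_unpack(blob))


def store_pairs(client, key, pairs):
    """Store the packed list under `key` (client behaves like redis.Redis)."""
    client.set(key, pack_pairs(pairs))


def load_pairs(client, key):
    """Fetch and unpack the list under `key`, or None if the key is missing."""
    blob = client.get(key)
    return None if blob is None else unpack_pairs(blob)


def raw_payload_bytes(num_keys, pairs_per_key):
    """Bytes of packed values alone, ignoring keys and any Redis overhead."""
    return num_keys * pairs_per_key * PAIR.size
```

With 1 million keys and 1000 pairs each, `raw_payload_bytes(1_000_000, 1000)` comes to 12,000,000,000 bytes, i.e. about 12 GB before any Redis overhead, so on a 4GB machine the full data set may not fit in memory unless the values are shorter on average or only part of the data is loaded.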