Using Redis for fast access to Google Web1T n-gram data

I finally jumped on the NoSQL bandwagon and gave Redis a try.

I’ve been hearing about NoSQL for quite some time as a lightweight, much faster alternative to an RDBMS: the advantages are speed and ease of use, and the disadvantage is the lack of relations. One of the several NoSQL systems is Redis, which I read about recently.

It is essentially a flat dictionary that stores keys and values, but the values themselves can be data structures such as sets, lists, dictionaries (hashes in Redis terms) and ordered sets (sorted sets in Redis terms). It is an in-memory data store, and it is also persistent: Redis saves the data to disk, so once you populate the dictionary you can stop the server, restart it, and the data you uploaded will still be there.
The main advantage is that it’s blazingly fast; its benchmarks report roughly 100,000 SET and 80,000 GET operations per second.
I installed Redis on my work machine, which took four command-line instructions, and started the Redis server. Then I installed the Python client for Redis, python-redis, which is available from the Ubuntu repos. Finally I wrote a script that reads through the n-grams and uploads them into the Redis data store. The script is available from my GitHub account.
I chose ordered sets for this. It seemed a good idea to have five of them, one for 1gms, one for 2gms, etc., each stored under its own key with the following structure:
key = {1gms/ 2gms/ 3gms/ 4gms/ 5gms}
value = ordered set, all n-grams with the scores
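
The layout above could be populated roughly as follows. This is a minimal sketch, not the script from my GitHub account: it assumes each input line has the form "n-gram<TAB>count", that the client is a redis-py 3.0+ object (where zadd takes a dict of member to score), and the names parse_ngram_line, batches and load_ngrams are made up for illustration.

```python
from itertools import islice


def parse_ngram_line(line):
    """Split one 'ngram<TAB>count' line into (ngram, count).

    Assumes the count is the last tab-separated field on the line.
    """
    ngram, count = line.rstrip("\n").rsplit("\t", 1)
    return ngram, int(count)


def batches(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_ngrams(client, key, lines, batch_size=10000):
    """Add every n-gram in `lines` to the ordered set stored at `key`.

    `client` is anything with a redis-py style zadd(key, mapping) method.
    Returns the number of n-grams sent to the server.
    """
    total = 0
    for chunk in batches(lines, batch_size):
        # One ZADD per batch keeps the number of round trips low.
        mapping = dict(parse_ngram_line(line) for line in chunk)
        client.zadd(key, mapping)
        total += len(mapping)
    return total


# Hypothetical usage, assuming a local server and a file of 1-grams:
#   import redis
#   rs = redis.Redis('localhost')
#   with open('1gms/vocab', encoding='utf-8') as f:
#       load_ngrams(rs, 'en:1gms', f)
```

Sending the n-grams in batches rather than one at a time is a design choice of this sketch; the batch size of 10,000 is arbitrary.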
It took 20 minutes to upload 13 million 1gms into the store, and now the look-up is instantaneous. And the best thing is that as long as the Redis server is running on my work machine, anybody who can reach that machine over the network can access the store.
The steps, from the Python REPL, are:

import redis
rs = redis.Redis('localhost') # or the host address
rs.zcard('en:1gms') # will return the cardinality of the ordered set 'en:1gms'
rs.zscore('en:1gms', 'hello') # will return the count for hello

Even if I stop the server now and restart it, all these counts will be there. Redis has methods for sets, dictionaries, ordered sets, lists, etc. The methods above that start with ‘z’ are the ones meant for ordered sets.
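
Two small helpers built on the ‘z’ methods may make the look-ups more convenient. This is only a sketch: the function names are made up, and it relies on zscore returning None for a missing member and on zrevrange returning members from the highest score down, which is how redis-py behaves as far as I know.

```python
def ngram_count(client, key, ngram):
    """Return the stored count for `ngram` as an int, or 0 if it is absent."""
    score = client.zscore(key, ngram)  # a float, or None for a missing member
    return 0 if score is None else int(score)


def top_ngrams(client, key, k=10):
    """Return the k highest-count n-grams as (ngram, count) pairs."""
    # Indices are inclusive, so 0..k-1 selects k members.
    return client.zrevrange(key, 0, k - 1, withscores=True)


# Hypothetical usage, assuming the store populated earlier:
#   import redis
#   rs = redis.Redis('localhost', decode_responses=True)
#   ngram_count(rs, 'en:1gms', 'hello')
#   top_ngrams(rs, 'en:1gms', k=5)
```

Scores in an ordered set are stored as floating-point numbers, which is why the count is converted back to an int. Without decode_responses=True the members come back as bytes rather than strings.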

Of course it doesn’t do smoothing, but I read about Redis and I was excited to try it. I can now use Redis for anything that has large data and needs fast look-ups =D

Obviously a key-value store like this is just a flat dictionary, but I am sure Redis is using some efficient mechanism for storing the data. It is written in ANSI C, so that would only make sense. Besides, Redis comes with ‘batteries included’: it has a server which, once running, can serve any client; a persistent data store, which comes back alive even if the server is stopped and restarted; values that don’t have to be strings but can be more complex data structures themselves; and finally, clients in a number of languages.

Redis provides quite a few benefits over a classical RDBMS, as enumerated here, for example better, more efficient data structures. However, a caveat is that once the data becomes larger than the available memory and the system starts paging, the performance degrades radically. So this method is probably not the silver bullet for looking up all n-grams instantaneously if you don’t have enough memory. But it can still be useful (and it’s easy to use, and it comes with batteries included as mentioned earlier) for several other scenarios.
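
One way to watch for the memory caveat is to ask the server how much memory it is using before loading more data. The sketch below assumes the redis-py info("memory") call and its used_memory field (reported in bytes); the function name and the budget are hypothetical, and choosing a sensible budget for your machine is up to you.

```python
def memory_headroom(client, budget_bytes):
    """Return how many bytes remain before Redis exceeds `budget_bytes`.

    A negative result means the data set is already over budget and the
    machine may start paging.
    """
    used = client.info("memory")["used_memory"]  # bytes allocated by Redis
    return budget_bytes - used


# Hypothetical usage: stop loading once less than 1 GB of a 6 GB budget is left.
#   import redis
#   rs = redis.Redis('localhost')
#   if memory_headroom(rs, 6 * 1024**3) < 1024**3:
#       print('Running low on memory, stop loading n-grams')
```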

Go learn yourself some Redis 🙂


4 thoughts on “Using Redis for fast-access to Google Web1T n-gram data”

  1. I’d be interested in knowing how Redis does with the 4-grams or 5-grams. The 1-grams can easily be stored in memory, which obviates the need for Redis, but the 4-grams and 5-grams cannot. Can you provide some info on how these fared?

    • You are right. The experiment started off great but on my workstation it got stuck somewhere in the middle of doing the 4grams… it successfully indexed the 1-grams, but then switched to the 4-grams folder and just wouldn’t finish. Redis is apparently only good if you can fit the entire data in the RAM, and I had 8G RAM as opposed to about 80G n-grams 🙂 Maybe Redis is interesting if someone does a smaller-scale project where fast access is needed.

  2. I am facing a situation in Python and I just stumbled upon your blog. I have 1 million key-value pairs where the keys are strings and the values are, in most cases, lists of 1000 items. Each item in such a list is an (integer, float) pair. I am using a normal machine (4GB RAM).

    What would be your recommendation?

    I also posted a query on StackOverflow and received various responses.

    • 1 million key-value pairs really feels like something Redis can handle. I worked with 13 million 1-grams from the Web1T data and it worked just fine on an 8G machine. I think the lengths of the values themselves should not be an issue, but do let me know if it worked well. Once the indexing is done, the lookup is instantaneous.
