
December 30, 2015

Let us not ignore GoogleWebLight, another walled garden in the name of faster web content

by viggy — Categories: internet, social

Free Basics is evil, but let us not ignore GoogleWebLight. It is a new firewall to the internet, one which centralizes entry to the internet and tries to capture all user data, with privacy completely sacrificed. All this comes with the sugar coating of fast access to web content.
On the evil side of the digital world, there are now two giants competing over who can rise to the supreme position of the 'read-everything, know-all' giant. The end goal is very simple: understanding the user personally enough for successful targeted advertising. The battle that netizens/hacktivists are fighting against Facebook's Free Basics to protect net neutrality has received a lot of attention, and we will clearly see its results in due course. That battle is intense, with Facebook going full force to the extent of directly attacking net neutrality supporters in its full-page print media ads about Free Basics.

However, the other giant has been fighting a much quieter battle and winning quite remarkably. Seriously, in terms of capturing the market and pursuing long-term goals, Facebook could learn some major lessons from Google. Let us look at how an average internet user in India browses the web. Many studies show that smartphones are the primary devices through which the majority of internet users in India go online. That market is completely captured by the Android operating system, which is controlled by Google. On Android, Chrome and the default browser seem to be the major browsers in use, and both, again, are controlled by Google.

As if this were not enough, Google has added a feature to Chrome which, when it detects that the user is on a slow connection, redirects the user to Google's own site, GoogleWebLight, and loads the webpage inside it. This, Google suggests, makes the site load faster. GoogleWebLight strips out JavaScript and CSS, and possibly some heavy images. Whether the site actually loads faster may be debatable, but the fact that Google now redirects all such users through its own site, instead of sending them directly to the site they clicked on, works completely against the privacy of the user. It also breaks the web, since every hyperlink is rewritten to redirect through GoogleWebLight, and in effect GoogleWebLight acts like a firewall to the internet.

There is no doubt that a large section of internet users in India are on slow connections, so there is a real need for a solution that loads content faster and keeps data usage limited. But redirecting the user to another site, with the site the user actually intended to visit embedded inside it, is definitely not the best design; in fact, it seems like a very bad one. One simpler approach could have been a Chrome browser add-on that disables loading of certain types of web elements when a site is opened. Another could have been a standard that encourages website developers to also offer a version of their site for slow connections.

From the perspective of a website developer, GoogleWebLight is evil because I now have no idea how my website will be shown to the user. It seems you can add a meta tag to stop the page from being served through GoogleWebLight, but then Chrome shows the user a warning to the effect of 'This page is slow to load'. That is definitely not something I want my users to see while waiting for my page.
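
For fellow developers who want to opt out: the mechanism I am aware of is the Cache-Control: no-transform HTTP header, which asks transcoding proxies (GoogleWebLight included, as far as I can tell) not to rewrite the page; the meta tag mentioned above should express the same intent. Here is a minimal sketch using only the Python standard library; the port and handler class are just my choices for illustration:

```python
# Serve the current directory with Cache-Control: no-transform on every
# response, asking transcoding proxies such as GoogleWebLight to leave
# the page untouched. Purely illustrative; any web server can set the
# same header.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class NoTransformHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header("Cache-Control", "no-transform")
        super().end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), NoTransformHandler).serve_forever()
```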

Overall, it is important for us to watch carefully how GoogleWebLight evolves, and as hacktivists we need to come up with better ways to ensure that people on slow internet connections get a better browsing experience.

March 13, 2015

Bigtable: snippets/notes from Google's original paper

by viggy — Categories: Uncategorized

Following are the lines that I marked in Google's original paper on Bigtable.

Introduction:

Bigtable has achieved several goals: wide applicability, scalability, high performance and high availability.
Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.

Data Model:
A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
(row:string, column:string, time:int64) -> string
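
To make the model concrete, here is a toy sketch of my own (nothing like the real implementation): the whole table is just a map keyed by (row, column, timestamp), and the sample cells are taken from the paper's webtable example.

```python
# Toy model of the Bigtable data model: a map keyed by (row, column, timestamp).
table = {}

def put(row: str, column: str, ts: int, value: bytes) -> None:
    table[(row, column, ts)] = value

def read(row: str, column: str):
    """Return all versions of one cell, newest first."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if (r, c) == (row, column)]
    return sorted(versions, reverse=True)

# The paper's webtable example: URL row keys store the hostname reversed,
# so pages from the same domain sort next to each other.
put("com.cnn.www", "contents:", 3, b"<html>...")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
```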

Row:
The row keys in a table are arbitrary strings (currently up to 64 KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row).
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines.

Columns:
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.
A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. Access control and both disk and memory accounting are performed at the column-family level.

Timestamps:
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers.
The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept.
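
A small sketch of what that per-cell garbage-collection rule might look like (the function and parameter names are mine, not Bigtable's API):

```python
def gc_versions(versions, keep_last_n=None, min_timestamp=None):
    """versions: list of (timestamp, value) pairs, in any order.
    Keep the newest n versions and/or only versions newer than a cutoff."""
    versions = sorted(versions, key=lambda tv: tv[0], reverse=True)
    if keep_last_n is not None:
        versions = versions[:keep_last_n]
    if min_timestamp is not None:
        versions = [tv for tv in versions if tv[0] >= min_timestamp]
    return versions
```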

API:

Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.
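
An illustrative sketch (my own, not Google's API) of why single-row transactions suffice for atomic read-modify-write: serialize every mutation to a row behind a per-row lock.

```python
import threading
from collections import defaultdict

row_locks = defaultdict(threading.Lock)   # one lock per row key
rows = defaultdict(dict)                  # row key -> {column: value}

def read_modify_write(row, column, fn):
    # Atomic only within a single row key, which is all Bigtable guarantees.
    with row_locks[row]:
        old = rows[row].get(column)
        rows[row][column] = fn(old)
        return rows[row][column]

# e.g. an atomic counter increment:
read_modify_write("com.cnn.www", "stats:hits", lambda v: (v or 0) + 1)
```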

Building Blocks:

Bigtable uses the distributed Google File System (GFS) to store log and data files. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64 KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
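
The single-seek lookup is easy to picture with a toy model (the layout here is invented for illustration; only the binary-search-then-read-one-block idea is from the paper):

```python
import bisect

class SSTable:
    """Toy SSTable: sorted blocks on 'disk' plus an in-memory block index."""

    def __init__(self, block_index, read_block):
        # block_index: sorted list of (first_key_in_block, file_offset)
        self.first_keys = [k for k, _ in block_index]
        self.offsets = [off for _, off in block_index]
        self.read_block = read_block  # offset -> {key: value}; the one disk seek

    def lookup(self, key):
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None  # key sorts before the first block
        return self.read_block(self.offsets[i]).get(key)

# Usage with an in-memory stand-in for the disk:
blocks = {0: {"a": b"1", "b": b"2"}, 100: {"m": b"3", "z": b"4"}}
sst = SSTable([("a", 0), ("m", 100)], blocks.__getitem__)
assert sst.lookup("m") == b"3" and sst.lookup("c") is None
```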

Bigtable relies on a highly-available and persistent distributed lock service called Chubby. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. Each Chubby client maintains a session with a Chubby service. A client’s session expires if it is unable to renew its session lease within the lease expiration time. When a client’s session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

Implementation:

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads.
Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.

Tablet Location:
We use a three-level hierarchy analogous to that of a B+ tree to store tablet location information.
With a modest limit of 128 MB for METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets (each METADATA row stores roughly 1 KB of data in memory, so a 128 MB tablet holds about 2^17 entries, and two METADATA levels give 2^17 × 2^17 = 2^34 addressable tablets).
The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses.

<Need to go through Tablet Location, Tablet Assignment and Tablet Serving again>

Compactions:
When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies.
Every minor compaction creates a new SSTable.
We bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. A major compaction produces an SSTable that contains no deletion information or deleted data.
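
The whole compaction flow, reduced to a few lines (the structure and threshold are mine; real thresholds are sizes in bytes and the SSTables live on GFS):

```python
MEMTABLE_LIMIT = 8            # entry count stands in for a byte-size threshold
memtable, sstables = {}, []   # sstables: list of immutable sorted dicts

def write(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        minor_compaction()

def minor_compaction():
    # Freeze the memtable and write it out as a new immutable SSTable.
    global memtable
    sstables.append(dict(sorted(memtable.items())))
    memtable = {}

def merging_compaction(inputs):
    # Fold several SSTables (and possibly the memtable) into one new SSTable;
    # inputs are ordered oldest to newest so newer values win. When 'inputs'
    # is *all* live SSTables, this is a major compaction.
    merged = {}
    for t in inputs:
        merged.update(t)
    return dict(sorted(merged.items()))
```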

Refinements:
Locality groups:

Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet.

Compression:

The user-specified compression format is applied to each SSTable block. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy’s scheme, which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data.

Caching for read performance:

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read.

Bloom Filters:

A read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.
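
A minimal Bloom filter sketch to show the check involved (the sizing and hashing here are arbitrary; a real filter derives them from the target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, key: bytes):
        # k independent hash positions derived from one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# A read first asks might_contain(row + b"/" + column) for each SSTable; only
# a True answer (possibly a false positive) costs a disk access, so lookups of
# absent rows or columns usually touch no disk at all.
```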

Exploiting immutability:

Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable.
The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet’s SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.

September 2, 2014

Taming googlegroups for those without Gmail Account

by viggy — Categories: FOSS, FSMK

I recently migrated my primary email address to my own domain and started using it for all communication. One of the main things to do was to subscribe to the various Google Groups that the GNU/Linux Users' Groups under FSMK have. The straightforward way, you would think, would be to search for the group on the groups.google.com website, find a subscribe button, enter your email address, and then receive an email asking you to confirm that you want to subscribe. This is what most mailing list software does. Not Google Groups, however. When you click to join a group, it asks you to sign in with a Google Account, so a non-Google account cannot subscribe at all.

The other way is to request the group admins to add your email address to the group. This will not always work, as most of the time I would not even know who the admins of a group are. It also means that you now depend on another person finding the time to add you, which is slow. The idea of public mailing lists is that anyone interested should be able to subscribe. Unfortunately, Google does a rather good job of ensuring that this becomes very difficult for you unless you have a Google Account.

So I found a more roundabout way to subscribe to and unsubscribe from Google Groups, again borrowed from traditional mailing list software.

If you know the Google Group's mailing address, say

glugdsce@googlegroups.com

You can subscribe to the group, by sending a mail to

glugdsce+subscribe@googlegroups.com

To unsubscribe, send a mail to

glugdsce+unsubscribe@googlegroups.com

Similarly, to directly visit the page of the group in a browser, you can use the following link

https://groups.google.com/forum/#!forum/glugdsce

In all the above cases, just replace glugdsce with the name of the group that you are interested in.
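
If you prefer a script over a mail client, the same probe can be sent with a few lines of Python (the addresses and the localhost relay are placeholders for illustration):

```python
# Send the '+subscribe' probe to a Google Group from a script.
# Assumes an SMTP relay on localhost; any mail client works just as well.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "you@yourdomain.example"
msg["To"] = "glugdsce+subscribe@googlegroups.com"
msg["Subject"] = "subscribe"

with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)
```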

Overall, I think Google Groups continues the evils of Google, and if you are a well-established group/community, it is always better to host your own mailing list and move out of Google Groups as soon as possible.

 

October 18, 2013

Being committed to free software is the only way forward

by viggy — Categories: FOSS, FSMK, linux, software

One of my friends shared this article about how Facebook is now a leading open source contributor and how most of the web giants are encouraging open innovation. The article also mentioned how this trickle-down is helping the community in general. Though I largely agree with the points mentioned in the article, I think the author missed one main point:
It only makes sense to open source your changes rather than maintaining forks
It is common practice for any start-up to go with free software tools. When you are experimenting with a lot of ideas, free software tools help you quickly come up with prototypes that you can validate with potential customers. Teams can also work independently without having to deal with licensing costs and other agreements with proprietary software companies.
What I think the article failed to point out is that when companies start developing their infrastructure around free software tools, it becomes a necessity for them to contribute their code back to the parent repository. Otherwise, maintaining and merging divergent code bases whenever a new release of the project comes out becomes very time consuming, defeating the whole purpose of using free software tools. Hence I think it is not simply that Facebook and Google want to open source their technologies (though it may well be that they also want to contribute to the community); rather, the choice of being free software users ensures that you contribute your changes back to the repository.
For example, we know that Google dumped its Google Compute Engine Linux and shifted to Debian as per this report.