gsoc « My pensieve

Tag: gsoc

August 23, 2015

Running GeoWave from command line for HBase options

by viggy — Categories: Uncategorized — Tags: commandline, Geowave, gpx, gsoc, hbase, jar, tools

GeoWave provides a GeoWaveMain class that can be used to run GeoWave from the command line. For regular users, GeoWave provides a wrapper script for this, which is installed when you use the RPM or install GeoWave through other packages. Since I was running in a development environment, I used the GeoWaveMain class directly with options from the command line. Here I document the results I got while running the HBase-related commands through GeoWaveMain. The GeoWave documentation talks about running the commands from the command line here. I used it, along with the existing tests in the GeoWave source code, to deduce the commands.

In the following examples, I am running a local HBase and ZooKeeper instance.

First, to run the commands, we need to generate the tools jar by running the following command in the deploy source directory:

mvn package -P geowave-tools-singlejar -Dmaven.test.skip=true

I am skipping tests because some of them fail in my local setup for various reasons, and I do not want that to fail the Maven build.

Next, I also generate the HBase datastore jar using the following Maven command:

mvn package -P hbase-container-singlejar -Dmaven.test.skip=true -Dfindbugs.onlyAnalyze=true

Here I am also asking FindBugs only to analyze, because FindBugs currently fails the build on the autogenerated code in FilterProtos, which is used for Filters in HBase. I have not yet found a fix for this, and the general workaround was to skip FindBugs during the build using this switch. The build still generates FindBugs.xml, which can be analysed later in the FindBugs GUI.

The last two commands should generate geowave-deploy-0.8.8-SNAPSHOT-tools.jar and geowave-deploy-0.8.8-SNAPSHOT-hbase-singlejar.jar in the target directory. Since the following commands also use geotools-vector and gpx as formats, we need to either copy their jars here or refer to them in the classpath. In our case, we just copy the jars to the target directory.

Assuming that the user's current working directory is target, the following command runs localhbaseingest:

java -cp geowave-deploy-0.8.8-SNAPSHOT-tools.jar:geowave-deploy-0.8.8-SNAPSHOT-hbase-singlejar.jar:geowave-format-vector-0.8.8-SNAPSHOT.jar mil.nga.giat.geowave.core.cli.GeoWaveMain -localhbaseingest -z localhost:2181 -n geowave -f geotools-vector -b ../../test/data/hail_test_case/hail-box-temporal-filter.shp

To get hdfshbaseingest to work, you need to start YARN from your Hadoop installation. Here we also use the "gpx" format rather than "geotools-vector" (I still need to understand why), so we copy the gpx format jar into our current working directory. This command also expects the gpx directory to already exist in the HDFS file system. To create it, you can run the following command from your Hadoop installation:

<Hadoop-Installation-Dir>/bin/hadoop fs -mkdir /gpx

Then we run the following command:

java -cp geowave-deploy-0.8.8-SNAPSHOT-tools.jar:geowave-deploy-0.8.8-SNAPSHOT-hbase-singlejar.jar:geowave-format-gpx-0.8.8-SNAPSHOT.jar: mil.nga.giat.geowave.core.cli.GeoWaveMain -hdfshbaseingest -z localhost:2181 -n geowave -f gpx -b ../../test/data/hail_test_case/hail-box-temporal-filter.shp -hdfs localhost:9000 -hdfsbase "/" -jobtracker "localhost:8032"

-hdfs is the hostname:port of the Hadoop installation.

-hdfsbase is the parent directory into which we want to ingest.

-jobtracker is the hostname:port of the YARN installation.

Currently, for hdfshbasestage, I am getting the following error, which needs to be fixed:

➜ target git:(GEOWAVE-406) ✗ java -cp geowave-deploy-0.8.8-SNAPSHOT-tools.jar:geowave-deploy-0.8.8-SNAPSHOT-hbase-singlejar.jar:geowave-format-gpx-0.8.8-SNAPSHOT-tools.jar:. mil.nga.giat.geowave.core.cli.GeoWaveMain -hdfshbasestage -b ~/workspace/geowave/extensions/formats/gpx/src/test/resources/ -hdfs localhost:9000 -hdfsbase "/gpx/" -f gpx
2015-08-23 04:41:27,085 FATAL [main] ingest.AbstractIngestHBaseCommandLineDriver (AbstractIngestHBaseCommandLineDriver.java:applyArguments(146)) - Error parsing plugins
java.lang.IllegalArgumentException: Unable to find SPI plugin provider for ingest format 'gpx'
at mil.nga.giat.geowave.core.ingest.AbstractIngestHBaseCommandLineDriver.getPluginProviders(AbstractIngestHBaseCommandLineDriver.java:196)
at mil.nga.giat.geowave.core.ingest.AbstractIngestHBaseCommandLineDriver.applyArguments(AbstractIngestHBaseCommandLineDriver.java:141)
at mil.nga.giat.geowave.core.ingest.AbstractIngestHBaseCommandLineDriver.run(AbstractIngestHBaseCommandLineDriver.java:75)
at mil.nga.giat.geowave.core.cli.GeoWaveMain.main(GeoWaveMain.java:48)

Similarly for posthbasestage I am getting the same error:

➜ target git:(GEOWAVE-406) ✗ java -cp geowave-deploy-0.8.8-SNAPSHOT-tools.jar:geowave-deploy-0.8.8-SNAPSHOT-hbase-singlejar.jar:geowave-format-gpx-0.8.8-SNAPSHOT-tools.jar:. mil.nga.giat.geowave.core.cli.GeoWaveMain -posthbasestage -hdfs localhost:9000 -hdfsbase "/gpx/" -f gpx -z localhost:2181 -n geowave -jobtracker "localhost:8032"
2015-08-23 04:44:16,261 FATAL [main] ingest.AbstractIngestHBaseCommandLineDriver (AbstractIngestHBaseCommandLineDriver.java:applyArguments(146)) - Error parsing plugins
java.lang.IllegalArgumentException: Unable to find SPI plugin provider for ingest format 'gpx'
at mil.nga.giat.geowave.core.ingest.AbstractIngestHBaseCommandLineDriver.getPluginProviders(AbstractIngestHBaseCommandLineDriver.java:196)
at mil.nga.giat.geowave.core.ingest.AbstractIngestHBaseCommandLineDriver.applyArguments(AbstractIngestHBaseCommandLineDriver.java:141)
at mil.nga.giat.geowave.core.ingest.AbstractIngestHBaseCommandLineDriver.run(AbstractIngestHBaseCommandLineDriver.java:75)
at mil.nga.giat.geowave.core.cli.GeoWaveMain.main(GeoWaveMain.java:48)

August 23, 2015

Using Google’s Protocol Buffer library to write GeoWave Filters for HBase datastore

by viggy — Categories: Uncategorized — Tags: filter, Geowave, gsoc, hbase, iterator

Accumulo provides Iterators, which can run on tablet servers as filters during a scan. GeoWave uses this in the form of local client filters and distributable filters, which run on tablet servers when a scan is performed. As part of adding support for HBase, I needed to implement these filters in HBase. I have currently implemented two filters, SingleEntryFilter and CqlHBaseQueryFilter, which are the counterparts of SingleEntryFilterIterator and CqlQueryFilterIterator in Accumulo.

HBase uses Google's Protocol Buffers to serialize data on the client side and send it across to the region servers. In this post, I explain how I used the protobuf-java library to write the SingleEntryFilter for HBase in GeoWave.

Protobuf generates part of the code from a .proto file using its own code generator. The .proto file needs to describe the arguments your class accepts in its constructor, the package the class should be generated in, and so on. The arguments the class accepts need to be serializable. Since we are migrating from iterators, all the needed data is already expected to be serializable, because it has to be serialized for the Accumulo iterators as well. I created a 'protobuf' directory inside the local source directory 'extensions/datastores/hbase/src/main/' in the GeoWave source code and created the following SingleEntryFilters.proto file in it.

option java_package = "mil.nga.giat.geowave.datastore.hbase.query.generated";
option java_outer_classname = "FilterProtos";
option java_generic_services = true;
option java_generate_equals_and_hash = true;
option optimize_for = SPEED;

message SingleEntryFilter {
  required bytes adapterId = 1;
  required bytes dataId = 2;
}

Now, to generate the classes using protobuf, we need to install the protobuf compiler on the machine. You can download the compiler from here. The README.txt that ships with the compiler explains the installation well.

After a successful installation, the protoc compiler executable is in the src directory by default. Go to the source directory in which you want the generated package to be added. In my case, it was <geowave-src-directory>/extensions/datastores/hbase/src/main/.

Now you can run the following command.

<path-to-protoc-installation-dir>/src/protoc -I=. --java_out=java/ protobuf/SingleEntryFilters.proto

protobuf/SingleEntryFilters.proto is the path to the .proto file from your current directory.

This generated the necessary FilterProtos class. Next we need to create the SingleEntryFilter class itself. HBase provides the FilterBase class as the base for new custom filters. Lars George's book, HBase: The Definitive Guide, explains how to develop custom filters for HBase, and I used the example shared in the book's GitHub repo as the starting point for SingleEntryFilter.

It was only through that example that I learned I also need to implement the toByteArray and parseFrom methods in the custom filter. Later I also found in the HBase log that the base class's parseFrom method throws a DeserializationException, which tells the user to implement it in the derived class.
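
To get a feel for what toByteArray and parseFrom have to produce and consume, here is a minimal Python sketch of the protobuf wire format for a message shaped like SingleEntryFilter (two bytes fields, numbered 1 and 2). It is not the generated FilterProtos code and not the GeoWave filter; the function names to_byte_array and parse_from are only borrowed for illustration, and the sketch handles nothing but length-delimited fields.

```python
from typing import Dict, Tuple

LENGTH_DELIMITED = 2  # protobuf wire type used for bytes fields


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def to_byte_array(adapter_id: bytes, data_id: bytes) -> bytes:
    """Serialize adapterId (field 1) and dataId (field 2) as bytes fields."""
    out = bytearray()
    for field_number, payload in ((1, adapter_id), (2, data_id)):
        out += encode_varint((field_number << 3) | LENGTH_DELIMITED)  # tag
        out += encode_varint(len(payload))                            # length
        out += payload                                                # value
    return bytes(out)


def parse_from(buf: bytes) -> Dict[int, bytes]:
    """Parse a buffer of length-delimited fields into {field_number: bytes}."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != LENGTH_DELIMITED:
            raise ValueError("this sketch only handles bytes fields")
        length, pos = decode_varint(buf, pos)
        fields[field_number] = buf[pos:pos + length]
        pos += length
    return fields
```

In the real filter, the generated FilterProtos builder and parser do this work; the custom filter only has to hand its adapterId and dataId to them in toByteArray and read them back in parseFrom.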

June 23, 2015

Accumulo and Hbase – Differences in its API, classes

by viggy — Categories: tech — Tags: Accumulo, api, classes, difference, Geowave, gsoc, hbase, Key, KeyValue, value

As part of my GSoC project, I am working on adding HBase support to GeoWave. Accumulo is already well supported, so one of my main tasks is to understand the existing tests and examples and add support for the same with an HBase store.
One of the main challenges I am facing is that there is no easily available mapping of Accumulo classes to HBase classes. So in this post, I will try to build one.

Basic Entity Classes
Accumulo	HBase	Remarks
Key	KeyValue	HBase doesn't have separate entities for Key and Value. It stores everything as byte[].
Value	KeyValue
Mutation	RowMutations	Accumulo exposes put/putDelete methods, whereas HBase has Put and Delete classes to put and delete a single row, respectively.

Scanner/Writer Classes
Accumulo	HBase	Remarks
Scanner/ScannerBase	Scan	Accumulo has a structured class hierarchy for scanning, whereas in HBase I have only used Scan as of now.

TODO: Add notes on establishing a connection, authentication, writing to the store, and Iterators/Coprocessors.

April 8, 2015

Setting up Geowave and integrating it with GeoServer for the development environment

by viggy — Categories: FSMK, project, Uncategorized — Tags: Accumulo, GeoServer, Geowave, gsoc, Hadoop, Installation, ubuntu, Zookeeper

GeoWave is a library used to store, index and analyze geospatial data on top of Accumulo, which is a free software implementation of Google's Bigtable. Accumulo in turn uses ZooKeeper to handle distributed synchronization and the Hadoop file system for distributed, scalable storage of the data. GeoWave decomposes multi-dimensional data into single-dimensional data using a transformation called a space-filling curve. GeoServer is a Java-based server that provides a platform to view and edit geospatial data. So, at an abstract level, data that is transformed by GeoWave and stored (ingested) in Accumulo can be extracted and viewed from GeoServer.
The following article explains how to set up GeoWave and GeoServer to work with Accumulo in a development environment.
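
To make the space-filling curve idea concrete, here is a minimal Python sketch of a Z-order (Morton) curve, one of the simplest such curves. It is only an illustration: I am not claiming this is the curve GeoWave uses, and the function names, the 16-bit resolution and the longitude/latitude bounds are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] onto an integer cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)


def z_index(lon: float, lat: float, bits: int = 16) -> int:
    """Turn a (longitude, latitude) pair into a single sortable integer."""
    return interleave_bits(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        bits,
    )
```

Points that are close in two dimensions tend to get nearby index values, which is what makes a one-dimensional sorted key-value store usable for spatial range queries.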


March 13, 2015

Bigtable: snippets/notes from the original Google’s paper

by viggy — Categories: Uncategorized — Tags: bigtable, chubby, google, gsoc, mapreduce, sstable

The following are the lines that I marked in Google's original paper on Bigtable.

Introduction:

Bigtable has achieved several goals: wide applicability, scalability, high performance and high availability.
Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.

Data Model:
A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
(row:string, column: string, time:int64) -> string
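
A small Python sketch of this data model, to make the "sorted map" part tangible. This is a toy in-memory structure under my own assumptions (the class name TinyTable and its methods are made up for illustration), not how Bigtable, Accumulo or HBase store data. Keys are kept sorted by row, then column, then newest timestamp first.

```python
import bisect


class TinyTable:
    """Toy sorted map: (row, column, timestamp) -> value."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._cells = {}  # same key -> value (uninterpreted bytes)

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newest sorts first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1
```

Because the keys are sorted by row first, a scan over a short row range touches one contiguous slice of the list, which mirrors why short row-range reads are cheap in the real system.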

Row:
The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row).
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines.

Columns:
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.
A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. Access control and both disk and memory accounting are performed at the column-family level.

Timestamps:
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers.
The client can specify either that only the last n versions of a cell be kept, or that only new enough versions be kept.
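
The two retention settings can be sketched in Python as below. This is only an illustration under my own assumptions: the function names are made up, and a cell's versions are modelled as a plain dict from timestamp to value.

```python
def keep_last_n(versions, n):
    """Keep only the n newest versions of a cell (dict: timestamp -> value)."""
    newest = sorted(versions, reverse=True)[:n]
    return {ts: versions[ts] for ts in newest}


def keep_newer_than(versions, now, max_age):
    """Keep only versions written within the last max_age time units."""
    return {ts: value for ts, value in versions.items() if now - ts <= max_age}
```

For example, keep_last_n({1: "a", 2: "b", 3: "c"}, 2) keeps the versions at timestamps 3 and 2 and drops the one at timestamp 1.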

API:

Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.

Building Blocks:

Bigtable uses the distributed Google File System (GFS) to store log and data files. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
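
The block-index lookup can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the real SSTable format: blocks are plain lists held in memory, the index stores the last key of each block, and the block_reads counter stands in for disk seeks.

```python
import bisect


class TinySSTable:
    """Toy immutable sorted table split into fixed-size blocks."""

    def __init__(self, sorted_items, block_size=2):
        # Each block stands in for one on-disk block of (key, value) pairs.
        self.blocks = [
            sorted_items[i:i + block_size]
            for i in range(0, len(sorted_items), block_size)
        ]
        # In-memory index: the last key stored in each block.
        self.index = [block[-1][0] for block in self.blocks]
        self.block_reads = 0  # counts simulated disk reads

    def get(self, key):
        # Binary search for the first block whose last key is >= key.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.index):
            return None  # key is past the end of the table, no read needed
        self.block_reads += 1  # the single "disk" read for this lookup
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None
```

Each lookup reads at most one block, which is the point of keeping the index in memory.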

Bigtable relies on a highly-available and persistent distributed lock service called Chubby. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. Each Chubby client maintains a session with a Chubby service. A client’s session expires if it is unable to renew its session lease within the lease expiration time. When a client’s session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

Implementation:

The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads.
Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.

Tablet Location:
We use a three-level hierarchy analogous to that of a B+ tree to store tablet location information.
With a modest limit of 128 MB METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets.
The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses.
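
The 2^34 figure can be checked with a little arithmetic. The one number not in my notes above is the size of a METADATA row, which the paper gives as approximately 1 KB; the Python below just spells out the calculation and the variable names are mine.

```python
METADATA_TABLET_BYTES = 128 * 2**20  # 128 MB limit per METADATA tablet
METADATA_ROW_BYTES = 2**10           # approx. 1 KB per row (figure from the paper)

# Each METADATA tablet can describe this many tablets: 2**27 / 2**10 = 2**17.
rows_per_metadata_tablet = METADATA_TABLET_BYTES // METADATA_ROW_BYTES

# The root tablet points at 2**17 METADATA tablets,
# and each of those points at 2**17 user tablets.
addressable_tablets = rows_per_metadata_tablet ** 2

# With 128 MB user tablets, that is this many bytes of addressable data.
addressable_bytes = addressable_tablets * METADATA_TABLET_BYTES
```

So two levels of 2^17 entries each give 2^34 tablets, or 2^61 bytes when every tablet is 128 MB.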

Compactions:
When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies.
Every minor compaction creates a new SSTable.
We bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. A major compaction produces an SSTable that contains no deletion information or deleted data.
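
Here is a toy Python sketch of the memtable, minor compaction and major compaction described above. It is a simplification under my own assumptions: everything lives in memory, the class and method names are invented, the threshold counts entries rather than bytes, and deletes are modelled with a tombstone marker.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


class TinyTablet:
    """Toy tablet: one mutable memtable plus a list of immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # oldest first; each is a sorted tuple of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def minor_compaction(self):
        """Freeze the memtable, write it out as a new SSTable, start a fresh one."""
        frozen = tuple(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.sstables.append(frozen)
        self.memtable = {}

    def major_compaction(self):
        """Merge everything into exactly one SSTable and drop deleted data."""
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values win
            merged.update(sstable)
        merged.update(self.memtable)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [tuple(sorted(live.items(), key=lambda kv: kv[0]))]
        self.memtable = {}

    def get(self, key):
        """Read through the memtable first, then SSTables from newest to oldest."""
        value = TOMBSTONE
        if key in self.memtable:
            value = self.memtable[key]
        else:
            for sstable in reversed(self.sstables):
                found = dict(sstable)
                if key in found:
                    value = found[key]
                    break
        return None if value is TOMBSTONE else value
```

Until the major compaction runs, the deleted key still occupies space as a tombstone in the newer SSTable and as a value in the older one; afterwards neither survives.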

Refinements:
Locality groups:

Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet.

Compression:

The user-specified compression format is applied to each SSTable block. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy’s scheme, which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data.

Caching for read performance:

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read.

Bloom Filters:

A read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.
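
A minimal Bloom filter sketch in Python, to show why the answer is "might contain" rather than "contains". The bit-array size, the number of hash functions and the use of SHA-256 with a seed byte are my own choices for the example, not what Bigtable does.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over byte-string keys."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by hashing the key with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

A tablet server would consult such a filter per SSTable before reading it: a False answer lets it skip the disk access entirely, while a True answer may occasionally be a false positive.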

Exploiting immutability:

Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable.
The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.
Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet’s SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.
