Monday, September 24, 2018

AWS - concepts and features

AWS is a leader among *aaS cloud infrastructure providers. It is one of the prime reasons why so many startups can spin up and have an IT system running in no time...

Some AWS concepts are explained below -

IAM - Identity and Access Management - this is the authentication and authorization system for AWS services. It is needed for security, identity, compliance and segregation of duties.

Features -
  • Centralized access to AWS account
  • Shared access to AWS account
  • Access to a program / user
  • Granular access
  • Password rotation policy
  • Multifactor authentication - this can be set up in 3 ways - a. virtual device b. hardware device c. SMS. Option c is not going to be supported from Q2 2019. Option b has some cost. Option a is totally free: one simply downloads a token-generating mobile app, e.g. Google Authenticator, and adds the AWS account there.
  • Identity federation (Google, LinkedIn, Facebook or Active Directory :))
  • Supports PCI-DSS compliance
  • Integrates with many other AWS services
Terms:
  • Users and User Groups (say admin group, dev group, hr group)
  • Roles (assigned to AWS resources)
  • Resources (e.g. an AWS EC2 instance, an S3 bucket, or a DB instance)
  • Permissions (set of policies)
  • Policy - Technical representation (in JSON) of an entitlement telling what actions can be done on what resources.
         Example of a Policy (it allows all actions on all resources, could be an admin level permission)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

AWS comes with a lot of pre-canned (managed) policies for almost all of the AWS resources, so all we need to do is create a user, optionally put him/her in a user group, and attach permissions to the user or the group.

Region:

Select the region which is closest to you. It is quite possible that some AWS services are not available in that region, in which case a different region needs to be selected.
  • IAM is universal and not Region dependent.
  • Root account has complete Admin access
  • New users have NO permissions to start with. They are assigned an Access Key ID and a Secret Access Key when first created. These are used to access AWS via the command line or via APIs. The secret access key can be viewed only once, at creation time; if it is lost, a new key needs to be generated.
  • Always set up MFA on the root account
  • You can create and customize password strength and rotation policies.
  • Power User allows access to all AWS services except management of groups and users within IAM.

AWS S3 - It is simply a "file system as a service".

  • It is object-based storage. Files and documents can be stored. You simply upload files via HTTP.
  • An object can be from 0 bytes to 5 TB in size.
  • Unlimited storage.
  • Files are stored in buckets (having a universal namespace), e.g. s3-eu-west-1.amazonaws.com/
  • Data consistency - Read-after-write consistency for PUTs of new objects; eventual consistency for overwrite PUTs and DELETEs. S3 is spread across multiple availability zones, so there can be stale reads due to replication latency (as of 2018; S3 later moved to strong read-after-write consistency).
  • S3 is an object-based key-value store: an object consists of Key, Value, Metadata, Version ID and Sub-resources (ACLs, Torrents).
  • Availability: designed for 99.99% availability; the SLA guarantees 99.9%.
  • Durability: 99.999999999% (eleven 9s) durability.
  • Tiered storage, lifecycle management, versioning, encryption, and access control using ACLs and bucket policies.

MongoDB - features

MongoDB is a very popular schema-less NoSQL database written in C++, and it is easy to configure and administer. MongoDB is well suited for big data, mobile and social infrastructure. It was initially released in 2009.

Some of its features are:

1. Support for ad-hoc queries 
In MongoDB, you can search by field or by range, and regular expression searches are also supported.
2. Indexing
You can index any field in a document, i.e. it supports what are known as secondary indexes.
3. Replication & Consistency
MongoDB supports master-slave style replication (called replica sets, with a primary and secondaries, in current versions). The master can perform reads and writes; a slave copies data from the master and can only be used for reads or backups (not writes). Reads from slaves are eventually consistent.
4. Duplication of data (High-Availability)
MongoDB can run over multiple servers. The data is duplicated to keep the system up and running in case of hardware failure.
5. Load balancing
It balances load automatically because the data is placed in shards across multiple servers (#4).
6. Supports map reduce and aggregation tools.
7. Uses JavaScript functions instead of stored procedures.
8. Provides high performance.
9. Stores files of any size easily without complicating your stack.
10. Easy to administer in the case of failures.
11. Supports JSON data model with dynamic schemas
12. No triggers

Friday, September 21, 2018

System Design concepts

Following are some of the system design concepts asked in interviews for a technical architect/senior software engineer position. They can be a good way to judge a candidate's breadth -

High Level Design:
  • Vertical vs Horizontal scaling
  • CAP theorem
  • ACID vs BASE (atomicity, consistency, isolation & durability; basically available, soft state, eventual consistency)
  • Partitioning/Sharding of data
  • Consistent Hashing
  • Optimistic vs Pessimistic locking
  • Strong vs Eventual consistency
  • Relational DB vs NoSQL DB
  • Types of NoSQL DBs -
      • Key Value (Redis cache)
      • Wide Column (Cassandra)
      • Document based (MongoDB)
      • Graph based (Neo4J)
  • Caching
  • Data center/Racks/Hosts
  • CPU/Memory/Hard drive/Network bandwidth
  • Random vs sequential read write on disk
Low Level Design:
  • HTTP vs HTTP/2 vs WebSockets
  • TCP/IP model
  • IPv4 vs IPv6
  • TCP vs UDP
  • DNS lookup
  • HTTPS & TLS/SSL
  • Public key infrastructure & certificate authority
  • Symmetric vs Asymmetric key
  • Load balancer
  • CDNs and Edge servers
  • Bloom filters and Count-min sketch
  • Design Patterns & OOD
  • Virtual Machines and containers
  • Publish-Subscribe/Queues
  • Map Reduce
  • Multi-threading, concurrency, locks, synchronization, CAS.


Thursday, September 20, 2018

Distributed Systems - CAP Theorem

Distributed caches and databases are designed around the CAP theorem. CAP stands for Consistency, Availability and Partition Tolerance. Let's see what each of these means and then understand the theorem.

Consistency - This is not the Consistency of ACID in the RDBMS world. In the NoSQL world, it means that every read returns the most up-to-date copy of the data, and every write modifies the most up-to-date copy of the data or returns an error, i.e. the system gives consistent results no matter which node the request lands on.

Availability - This means that every request receives a non-error response, but without the guarantee that it contains the most recent write.

Partition Tolerance - A network partition is created when a few nodes in the cluster are not able to talk to the other nodes due to network failures, packet losses etc. Partition Tolerance means that the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between the nodes.

Now, the CAP theorem says that it is not possible for a distributed system to simultaneously provide more than 2 of the above 3 guarantees.

Many distributed systems are AP, i.e. they give up on Consistency but ensure Availability and Partition Tolerance.

In the presence of a network partition, which is quite normal, a request can either be denied service (unavailable, but the system remains consistent), or can return results (hence available) but be serviced from an inconsistent copy of the data (created due to the network partition). Thus, you have to choose between consistency and availability in the presence of a network partition.

If there is no network partition, the system is consistent and available.
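
The trade-off above can be sketched with a toy two-replica store. This is only an illustration under simplifying assumptions: the class TinyReplicatedStore, its Mode enum and its methods are hypothetical names, writes always arrive at replica A, reads always arrive at replica B, and replication is instantaneous when the network is healthy.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy two-replica store; every name here is hypothetical and for illustration only. */
class TinyReplicatedStore {
    enum Mode { CP, AP }

    private final Map<String, String> replicaA = new HashMap<>();
    private final Map<String, String> replicaB = new HashMap<>();
    private final Mode mode;
    private boolean partitioned = false;

    TinyReplicatedStore(Mode mode) {
        this.mode = mode;
    }

    /** Simulates the network link between the two replicas going down or coming back. */
    void setPartitioned(boolean partitioned) {
        this.partitioned = partitioned;
    }

    /** Writes land on replica A and are copied to B only while the network is healthy. */
    void write(String key, String value) {
        if (partitioned && mode == Mode.CP) {
            // CP choice: refuse the write rather than let the replicas diverge.
            throw new IllegalStateException("unavailable: cannot replicate during partition");
        }
        replicaA.put(key, value);
        if (!partitioned) {
            replicaB.put(key, value);
        }
    }

    /** Reads land on replica B. */
    String readFromB(String key) {
        if (partitioned && mode == Mode.CP) {
            // CP choice: refuse the read because B cannot confirm it has the latest value.
            throw new IllegalStateException("unavailable: cannot confirm latest value");
        }
        // AP choice (or healthy network): answer from the local copy, which may be stale.
        return replicaB.get(key);
    }
}
```

In AP mode a read during a partition succeeds but may return the value from before the partition; in CP mode the same read fails instead. Without a partition both modes behave identically.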

Database systems designed with traditional ACID guarantees in mind, such as RDBMS, choose consistency over availability, whereas systems designed around the BASE philosophy, common in the NoSQL movement for example, choose availability over consistency.

Wednesday, September 19, 2018

Big Data Landscape

Analytics tools & frameworks:
  • Zookeeper
  • Hadoop (HDFS)
  • YARN
  • Pig
  • Hive
  • HBase
  • Kafka
  • Spark
  • Storm
  • Flume
  • Oozie, Thriftserver
  • Packaged distributions of the above are provided by Cloudera and Hortonworks

No SQL data stores:
  • Amazon Dynamo DB
  • Apache Cassandra
  • MongoDB
  • Couchbase
  • HBase

Search tools:
  • Solr
  • Elasticsearch
  • Amazon CloudSearch

Distributed Caches:
  • Redis
  • Memcached
  • Aerospike
Cloud (*aaS) providers:
  • Amazon AWS
  • Microsoft Azure
  • Google Cloud
Programming languages for analytics and machine learning algorithms:
  • Python
  • Scala
  • R

Product Ideas

You sign up on various tools and websites, use their interfaces, and some user experience simply makes you shout wow! and then you forget. This blog is an effort towards capturing those ideas -

a. A simple, short sign-up email directly from the co-founder, CEO or CTO enquiring about:
  • what made you come here?
  • what did you like, and what didn't you?
  • what can be improved?
b. Make the interface more catchy and cool. Use cool, catchy names, e.g. for a customer service chatbot.

more to come...as and when discovered!

Monday, September 17, 2018

What's new in Java 7 and Java 8

Java 8 was released quite some time back now, with the much awaited functional programming constructs: lambda expressions and the Streams API. Numerous other related enhancements, like default methods and functional interfaces, had to be introduced to support these much wider and more powerful constructs. Here is a summary of the enhancements.

It is to be noted that some of these features are also present in (or borrowed from) other programming languages like Scala or Python.

Java 7 Features:

Code Conciseness, Redundancy:
  • Automatic type inference in generic instance creation (the diamond operator <>).
  • try-with-resources for automatic cleanup of resources without the need to write a finally block; if there is a finally block, it is executed after the resources are auto-closed.
  • switch statements can work on String values.
  • Multiple exception types can be caught in one catch block, which avoids duplicated handler code.
Readability:
  • Numeric literals can contain _ for better readability
Performance and Concurrency:
  • JVM Performance improvements
  • Multithreaded custom class loaders
  • Concurrency Utilities - Fork/Join framework, ThreadLocalRandom, and the Phaser class (similar to CyclicBarrier)
Other Features:
  • SafeVarargs annotation
  • Internationalization support
  • Several enhancements in the Java Swing framework
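
A few of the Java 7 language features listed above, shown side by side in a minimal sketch. The class name Java7Features and its methods are made up for illustration; only the language constructs themselves are the point.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder class for small Java 7 feature demos. */
class Java7Features {
    // Underscores in numeric literals improve readability.
    static final int ONE_MILLION = 1_000_000;

    // switch on a String value.
    static int daysIn(String month) {
        switch (month) {
            case "feb":
                return 28;
            case "apr":
            case "jun":
            case "sep":
            case "nov":
                return 30;
            default:
                return 31;
        }
    }

    // try-with-resources: the reader is closed automatically, no finally block needed.
    static String firstLine(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Multi-catch: one handler for two unrelated exception types.
    static int parseOrDefault(String value, int fallback) {
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException | NullPointerException e) {
            return fallback;
        }
    }

    // Diamond operator: the type argument on the right-hand side is inferred.
    static List<String> names() {
        List<String> names = new ArrayList<>();
        names.add("julie");
        return names;
    }
}
```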

Java 8 Features:


Functional Programming:
  • Functional interfaces
  • Lambda expressions - treat functionality as a method argument and code as data; a lambda is an instance of a single-method interface, called a functional interface.
  • Method references
Performance and Concurrency:
  • Sequential or parallel map-reduce transformations on streams.
  • Parallel sorting using Arrays.parallelSort(). It uses the fork/join pool introduced in Java 7.
Annotations:
  • Repeating Annotations
  • Type annotations (annotations can now be added wherever a type is used)
New APIs:
  • New Stream API (java.util.stream) - provides functional-style operations on a stream of elements, integrated with the Collections API to enable collection-to-stream conversion.
  • Interfaces can now have default and static methods, avoiding the need for utility classes, e.g. forEach is now a default method on the Iterable interface (and so is available on every Collection).
  • Java IO API improvements
  • Java Collection API improvements
  • New Date-Time package.
  • Java DB 10.10.
JVM:
  • PermGen is removed in HotSpot JVM.

Others:
  • Improved type inference when passing generic arguments
  • Method parameter reflection
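
Several of the Java 8 features listed above fit into one short sketch. The class Java8Features, the Greeter interface and the method names are hypothetical and exist only for this illustration.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical holder class for small Java 8 feature demos. */
class Java8Features {
    // A functional interface (one abstract method) with a default and a static method.
    interface Greeter {
        String name();

        default String greet() {
            return "Hello, " + name();
        }

        static Greeter of(String name) {
            return () -> name; // the lambda implements name()
        }
    }

    // Stream pipeline using a lambda (filter) and a method reference (map).
    static List<String> longNamesUpperCased(List<String> names) {
        return names.stream()
                .filter(n -> n.length() > 4)
                .map(String::toUpperCase)
                .sorted()
                .collect(Collectors.toList());
    }

    // Parallel map-reduce: square each number, then sum.
    static int sumOfSquares(List<Integer> numbers) {
        return numbers.parallelStream().mapToInt(n -> n * n).sum();
    }

    // Arrays.parallelSort sorts in place, so work on a copy.
    static int[] sortedCopy(int[] values) {
        int[] copy = Arrays.copyOf(values, values.length);
        Arrays.parallelSort(copy);
        return copy;
    }

    // New date-time package: immutable value types.
    static LocalDate nextDay(LocalDate date) {
        return date.plusDays(1);
    }
}
```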

Consistent Hashing

"Hashing" is a well known mechanism to enable fast searches. A "hash" is obtained by applying "some" (mathematical or a characteristic) function on the input(s). For an example -

Say we want to store n words in a hash table, such that retrieval of a word is faster than the normal O(n) search. The simplest hash function in this case would be to take the first letter of the word. The set of possible hashes will be [a, b, ..., z]. "america" will map to bucket 'a', and "hash" will map to bucket 'h'. To search for a word, simply take its first letter and jump directly to that bucket, and then perhaps do a linear search.

Another hash function could be len(string).

Obviously, these approaches are sub-optimal, since multiple words will share a common hash and hence a linear search is needed once the bucket is known. This can be improved by using a hash function that generates a unique hash for each word, but the trade-off is that it leads to a much bigger and dynamically sized hash table.

A balance is generally struck by adopting the following approach. A mathematical function is used to calculate the hash first, and a hash table of fixed size (say N) is formed. The hash obtained is mod'ed (hash % N) to obtain a number between 0 and N-1. Thus, whatever the hash, the mod ensures that the bucket index for the input is always between 0 and N-1. Hence, a fixed-size hash table works. To keep a fixed cap on the search operation, we also need to ensure that the size of each bucket is bounded by, say, k, with k being called the load factor. If we find that a bucket has grown beyond size k, it suggests that we need to increase the hash table size to a value greater than N (say 2N).

Now that is a heavy operation, in the sense that all the keys need to be remapped by doing the new mod with 2N.

As an example, consider the following words and, say, 3 buckets. This is how they are assigned (the hash values and bucket numbers are illustrative):

WORD   | HASH  | BUCKET
jue    | 32434 | 1
same   | 34244 | 3
jelly  | 7686  | 1
aman   | 35734 | 2
anuj   | 45642 | 1
deepak | 226   | 3
julie  | 24324 | 2

So "jue", "jelly" and "anuj" got assigned to bucket 1; "aman" and "julie" to bucket 2; and "same" and "deepak" to bucket 3. This is how the map looks:

 1 -> jue, jelly, anuj
 2 -> aman, julie
 3 -> same, deepak

Say the load factor was 4. Adding a new word, say "foo", that falls into bucket 1 makes that bucket reach the load factor threshold, so we increase the number of buckets to, say, 6 (2*3). We then re-mod each entry by 6, calculate its new bucket position, and done! This costs a good number of CPU cycles, since all the keys need to be remapped to new buckets.
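
The cost of a resize can be made concrete with a small sketch. It reuses the hash values from the example table; the class ModuloRehash and its methods are hypothetical, and bucket indices here are 0-based (hash % N), so they need not match the bucket numbers shown in the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that counts how many keys change bucket on a resize. */
class ModuloRehash {
    /** Bucket index in [0, bucketCount) for a given hash. */
    static int bucket(int hash, int bucketCount) {
        return Math.floorMod(hash, bucketCount);
    }

    /** Re-hashes every key and counts those whose bucket index changes. */
    static int countMoved(Map<String, Integer> hashes, int oldCount, int newCount) {
        int moved = 0;
        for (int hash : hashes.values()) {
            if (bucket(hash, oldCount) != bucket(hash, newCount)) {
                moved++;
            }
        }
        return moved;
    }

    /** The words and hash values from the example table. */
    static Map<String, Integer> sampleHashes() {
        Map<String, Integer> hashes = new LinkedHashMap<>();
        hashes.put("jue", 32434);
        hashes.put("same", 34244);
        hashes.put("jelly", 7686);
        hashes.put("aman", 35734);
        hashes.put("anuj", 45642);
        hashes.put("deepak", 226);
        hashes.put("julie", 24324);
        return hashes;
    }
}
```

With these values, countMoved(sampleHashes(), 3, 6) reports 3 of the 7 keys landing in a different bucket, and going from 3 to 4 buckets moves 6 of the 7. Either way, every key has to be re-hashed to find out.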

The approach of CONSISTENT HASHING comes to the rescue here and is detailed below. Before delving into it, let's also understand that the same hashing/consistent hashing approach is used in distributed caches and distributed data stores (Redis, Kafka, load balancers, sharding systems) to spread the load or data across multiple nodes (machines). Since the volume dealt with in these systems is huge, on the order of billions or trillions of records, re-hashing everything is not an option, hence CONSISTENT HASHING is used.

The approach ensures that upon resizing the buckets in the cache, or with an increase or decrease in the number of nodes in the cluster, only about k/N keys need to be remapped on average, as opposed to remapping all k keys, where k is the number of keys and N is the larger of the node counts (before and after).

It starts with mapping the hash to a circle instead of an array of fixed size, and also mapping the available nodes (servers) onto the same circle. E.g. within a hash range of, say, 0 to MAX_INT, with 0 mapped to 0 degrees and MAX_INT mapped to 360 degrees, any hash between 0 and MAX_INT maps to a point on the circle whose angle in degrees is given by this formula:
(hash/MAX_INT)*360.

Similarly, using the same pattern, the nodes are also put on the circle.

A convention can be chosen that any "input" is placed on the nearest node found by moving clockwise (or anti-clockwise, depending on the agreed convention).

Example - four nodes fall as below on the circle:

A (10 degrees)
B (60 degrees)
C (135 degrees)
D (200 degrees)

Say input "julie" hashes to 100 degrees; moving in the clockwise direction, it maps to Node C. Say "aman" hashes to 50 degrees, so it maps to Node B.

Say node B fails: only B's entries are moved to Node C. The entries on A, C and D remain as is.
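
The ring in this example can be sketched with a sorted map keyed by angle. This is a minimal illustration, assuming the clockwise convention and non-negative hashes; the class DegreeRing and its method names are made up for this post.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical ring that places nodes and keys by angle in degrees. */
class DegreeRing {
    private final TreeMap<Double, String> nodesByAngle = new TreeMap<>();

    /** Maps a hash in [0, Integer.MAX_VALUE] to an angle in [0, 360]. */
    static double angleOf(int hash) {
        return (hash / (double) Integer.MAX_VALUE) * 360.0;
    }

    void addNode(String node, double angle) {
        nodesByAngle.put(angle, node);
    }

    void removeNode(String node) {
        nodesByAngle.values().removeIf(node::equals);
    }

    /** Finds the first node at or after the key's angle, wrapping past 360 back to the start. */
    String nodeFor(double keyAngle) {
        if (nodesByAngle.isEmpty()) {
            throw new IllegalStateException("no nodes on the ring");
        }
        Map.Entry<Double, String> next = nodesByAngle.ceilingEntry(keyAngle);
        if (next == null) {
            next = nodesByAngle.firstEntry(); // wrap around the circle
        }
        return next.getValue();
    }
}
```

With A, B, C and D added at 10, 60, 135 and 200 degrees, nodeFor(100) gives C and nodeFor(50) gives B. After removeNode("B"), nodeFor(50) gives C while nodeFor(100) is unchanged, which is the behaviour described above.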

Now, you might argue that this approach could overload a particular node with most of the data (or requests). This can be fixed by assigning each node equal weights (or different weights if nodes differ in capacity). Say we assign a weight of 10 to each node. We then place A1 to A10 randomly on the circle, B1 to B10 randomly on the circle, and so on up to D1 to D10.
Now the angle of an input can fall close to any of An, Bn, Cn or Dn, so it gets mapped to an effectively random node; the higher a node's weight, the better the chances of an input landing on that node.
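
The weighted placement can be sketched as below. Instead of truly random positions, this sketch hashes labels such as "A1" to "A10" to spread each node's points around the ring. The use of CRC32 as the hash, and the names VirtualNodeRing, addNode, removeNode and nodeFor, are assumptions made for the illustration, not what any particular system does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical ring where each node owns several points ("virtual nodes"). */
class VirtualNodeRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int weight;

    VirtualNodeRing(int weight) {
        this.weight = weight;
    }

    /** CRC32 is used here only as a convenient, deterministic hash. */
    static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Places points node1..nodeN on the ring, N being the weight. */
    void addNode(String node) {
        for (int i = 1; i <= weight; i++) {
            ring.put(hash(node + i), node);
        }
    }

    /** Removes every point owned by the node; other points stay where they are. */
    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    /** First point at or after the key's hash, wrapping around at the end. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes on the ring");
        }
        Map.Entry<Long, String> next = ring.ceilingEntry(hash(key));
        if (next == null) {
            next = ring.firstEntry();
        }
        return next.getValue();
    }
}
```

Because removing a node only deletes that node's own points, a key that was on a surviving node keeps its node; only the removed node's keys are redistributed, and they spread over several neighbours rather than all landing on one.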

That, in summary, is what is called consistent hashing, and it doesn't suffer from the cluster re-size performance problem. It is used very heavily in distributed systems: NoSQL data stores like Redis Cache and Cassandra, Kafka, CDNs and load balancers.

For further reading refer -

https://en.wikipedia.org/wiki/Consistent_hashing

Sunday, September 16, 2018

Coding Guidelines

I think writing code is an "art": it has to be intuitive, free-flowing, and easy to understand and maintain. That is easier said than done, hence some guidelines should be followed.

The following guidelines help achieve this. They were gathered (some learnt, some discovered) over my last 11+ years of software development experience. They are primarily Java-centric but (in most cases) apply to other programming languages as well.

Naming Conventions:

1. variable naming - a variable name should be a descriptive noun in camel case, e.g. secondsPerMinute. If a variable name is used many times in the code, it is better to shorten it and add a comment, e.g. fooCount can be fooCnt with a nice comment that it stores the number of foo in the system. Adding context to the name helps understand the code better. Avoid names like a, b, j; use "counter" or "index" instead of j, k for simple loop variables.

Sense tells me that there should be a limit to the length as well. Try to use sensible short forms for words if the full form makes the name go beyond 25 chars, e.g. noOfEmployeesInAfghanistan can be emplInAF //not bad!

This applies to all types of variables - local, class-level, method parameters.

2. constants - general guideline is to use CAPITAL_CASE (or UPPER_CASE) with an underscore separating words.

3. function naming - should again be descriptive and in camel case. It is a verb, e.g. getDefectedCars(), calculateFoo() etc. Methods that would otherwise be overloads with the same type and number of arguments are better distinguished by including the differentiator in the method name, e.g.
  • getEmployeeById(String employeeId), 
  • getEmployeeByName(String employeeName)
4. Class, Interface naming - should be in UpperCamelCase, e.g. InsurancePolicy, NumberComparator. Class names are singular and are nouns or verbs. Again, be descriptive... there is absolutely no cost (except a little bit of memory, but that's okay) involved in long names.

5. Package names - they are lowercase, with each word separated by a dot, e.g. com.hgoyal.metrics.domain
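
The naming conventions above, put together in one small sketch. The class and its members are hypothetical and reuse the example names from the guidelines.

```java
// Package names are lowercase and dot-separated, e.g.:
// package com.hgoyal.metrics.domain;

/** Class name: singular noun in UpperCamelCase. */
class InsurancePolicy {
    // Constant: UPPER_CASE with underscores separating words.
    static final int SECONDS_PER_MINUTE = 60;

    // Variable: descriptive noun in camelCase.
    private final String policyHolderName;

    InsurancePolicy(String policyHolderName) {
        this.policyHolderName = policyHolderName;
    }

    // Method: verb in camelCase.
    String getPolicyHolderName() {
        return policyHolderName;
    }

    // The differentiator ("ToSeconds") is part of the method name.
    static int convertMinutesToSeconds(int minutes) {
        return minutes * SECONDS_PER_MINUTE;
    }
}
```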

Class Design

1. A class should cater to a single functionality or domain.
   e.g. a separate class for reading data from the database, a different class for accessing the cache, and a different one for transformations.

2. Always implement an interface, or more appropriately, first define an interface and then a class implementing it.

3. Avoid very long classes - a big class usually means there is scope to split it into multiple classes.

4. Pay careful attention to declaring members as public, protected or private. Limit the visibility as much as possible.

5. Methods should be small in size; encapsulate logic in smaller methods.

6. Use composition over inheritance.

7. Use "design patterns" as much as possible, with appropriate tweaking.

Organization

1. Organize classes into meaningful packages. Again, pay attention to growing package size.

2. Choose different folders for source code, test cases, images, and static content like HTML, JSPs and JavaScript:
src/main/java - for java source files
src/test/java - for corresponding test files - the package names for test classes are the same as the package names for the source classes.
resources - for test case inputs, images.
WEB-INF/images/ - for images
WEB-INF/html/ - for html files
WEB-INF/js/  - for javascript files

3. A read-me.txt file at the root of the project containing a brief description of the project and its usage helps.

4. Always format the code to ensure consistency. Stick to a "brace" convention - e.g.

for ()
{

}

or

for () {
}

Needless to say, nicely formatted code is easier to read, debug and understand.

Documentation

1. This aspect is as important as the rest of the above. Self-documenting code should address 50% of the documentation concerns, but it is also essential to provide additional documentation for fairly involved logic snippets, methods, class usage, use case patterns, edge cases, and possible exceptions and errors.

2. Do write class-level and method-level comments. Use the Javadoc format for Java.

3. package-info.java gets created by the Eclipse IDE for writing a brief description of the package. It is used by Javadoc.

4. Generate Javadocs, ask for a review from the end users and incorporate their feedback. Add version, date and author info wherever necessary. Refer to other classes/methods in the documentation wherever necessary.

Error Handling & Logging

1. Use logging as much as possible.

2. Choose the log level carefully from - debug, info, warn, error, fatal.

3. Check whether the logger implementation used is an asynchronous one. If not, extensive logging can lead to a lot of IO and hence degraded performance.

4. Tune logging configuration appropriately. There are ways to turn off logging for a library at a package level or to choose the logging level at package level. Check the log file size and log rotation policy.

5. Log statements should be consistent as well. Use string formatters.

6. Ensure confidential information is not logged, e.g. passwords, or information which is stored encrypted.

7. Use error codes mapped to error messages. Consider externalizing this mapping either in a database table or a config file. Consider internationalization use cases.

8. Basic information like sessionID, transactionID, threadID, userID should always be associated with the log statement. Very important.

9. Throw exceptions as and when necessary. Use exceptions that come with the JDK instead of inventing your own. e.g. IllegalArgumentException, FileNotFoundException etc.
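
A few of these points in one minimal sketch, using the JDK's built-in java.util.logging (whose levels are named FINE, INFO, WARNING and SEVERE rather than debug, info, warn and error). The class TransferService, its transfer method and the error code ERR-1002 are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical service showing parameterized log messages and JDK exceptions. */
class TransferService {
    private static final Logger LOG = Logger.getLogger(TransferService.class.getName());

    /** Returns the new balance after the transfer. */
    static long transfer(String transactionId, String userId, long balance, long amount) {
        if (amount <= 0) {
            // Reuse a JDK exception instead of inventing a new one.
            throw new IllegalArgumentException("amount must be positive: " + amount);
        }
        if (amount > balance) {
            // Identifiers go into every log line; the message uses a formatter pattern.
            LOG.log(Level.WARNING, "txn={0} user={1} insufficient funds",
                    new Object[] {transactionId, userId});
            // Error code (hypothetical) that could be mapped to a message elsewhere.
            throw new IllegalStateException("ERR-1002: insufficient funds");
        }
        LOG.log(Level.INFO, "txn={0} user={1} transferred amount={2}",
                new Object[] {transactionId, userId, amount});
        return balance - amount;
    }
}
```

Note that nothing confidential is logged here, only identifiers and the amount.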

Unit Testing / TDD

Always write a test case for new functionality added. Make sure to cover all paths in the method by writing multiple tests. A coded test case helps in the long run, when the method is tweaked after a long time and a quick way is needed to ensure it hasn't broken the rest of the functionality. Only well-tested code with passing tests should be checked in to the repository.