Technical Requirements for Uber Case (Draft)
Team Members: Leora Hertan, Zi Ye, Karen Kirker, Shuyan Yang, Liqiong Chen, Xiuyang Guan

1. Business Case
Uber Technologies Inc. is a publicly held company that provides billing, payment, and scheduling services to independent licensed operators in 200 cities across 60 countries. Its smartphone app provides on-demand transportation by connecting passengers with Uber drivers. Although Uber is considered a successful company, it faces several problems: lost revenue due to the “Uber deleting” wave, poor public relations, and pressure from strong competitors. To address these problems and reduce customer churn, we will use data analytics to diagnose customer churn segments and the performance of driver segments, then use those insights to modify and enhance Uber’s product features and offerings and to identify new revenue opportunities.
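As a rough illustration of what diagnosing churn by segment could look like, the following Python sketch computes a churn rate per customer segment. The segment names, the record layout, and all values are made up for this example and do not come from the case.

```python
from collections import defaultdict

# Hypothetical sample records: (rider_id, segment, is_active_this_period).
# Segment names and values are invented for illustration only.
riders = [
    (1, "commuter", True),
    (2, "commuter", False),
    (3, "occasional", False),
    (4, "occasional", False),
    (5, "occasional", True),
    (6, "tourist", True),
]

def churn_rate_by_segment(records):
    """Return {segment: churned riders / total riders} for the records."""
    total = defaultdict(int)
    churned = defaultdict(int)
    for _rider_id, segment, is_active in records:
        total[segment] += 1
        if not is_active:
            churned[segment] += 1
    return {segment: churned[segment] / total[segment] for segment in total}

rates = churn_rate_by_segment(riders)

# List the segments with the highest churn first.
for segment, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{segment}: {rate:.0%}")
```

In practice the definition of "churned" (for example, no trips in a given period) would have to be agreed on before any such calculation is meaningful.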

2. Database Engines
A database engine is the underlying software component that stores, retrieves, and manipulates the data in a database. This section compares three engines (MySQL, PostgreSQL, and Spark) by their advantages, disadvantages, and suitability to the Uber business case. The section ends with a recommendation on which engine is most suitable for solving Uber’s business problems.
2.1. MySQL
– Advantages:
● Able to retrieve large amounts of data efficiently and in a timely manner.
● Compatible with many operating systems, which could improve cooperation with other database engines.
● Open-source database engine, which means more help and insight are available for database maintenance.
– Disadvantages:
● Although designed to handle large amounts of data, its performance can degrade when too many operations run at the same time.
● Analyzing unstructured data can be troublesome.
2.2. PostgreSQL
– Advantages:
● Open source, easy to use, with a supportive community offering help.
● Capable of handling many tasks efficiently at the same time, an area where MySQL is weak.
● Reliable and stable database engine that typically does not crash during highly active operations.
● Unstructured data can be stored and analyzed, which makes it possible to analyze the experience of both user groups (passengers and drivers).
– Disadvantages:
● Hard to set up, so it requires significant effort at the beginning.
● Does not support the entire ANSI SQL-92 standard.
2.3. Spark
– Advantages:
● Capable of using unstructured data, whereas Oracle SQL is only able to use structured data, so Spark may be better equipped to manage incoming data.
– Disadvantages:
● Has no file management system of its own; it must be integrated with Hadoop or another cloud-based data platform.
● Consumes a lot of memory, and memory-related issues are not handled in a user-friendly manner.
2.4. Suitability to the Uber business case
The table below compares the three database engines against the technical requirements of the Uber case.

| Technical requirement for Uber case | MySQL | PostgreSQL | Spark |
| --- | --- | --- | --- |
| As stated in the case, the database should be capable of “handling at least thousands in real time data and up to millions in peak periods caused by new policies” | Yes | Yes | Yes |
| The case specifies a real-time analysis tool is needed | No | Yes | Yes |
| Analyze unstructured data and have the ability to handle a large amount of data at the same time | No | Yes | Yes |
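The comparison above can also be read as a simple filter: an engine stays on the shortlist only if it answers "Yes" to every requirement. The Python sketch below encodes the table as data and applies that rule. The short requirement keys are labels chosen for this sketch, not terms from the case.

```python
# Requirement matrix copied from the comparison table above.
# The requirement keys are shorthand labels invented for this sketch.
REQUIREMENTS = ["peak_load", "real_time_analysis", "unstructured_data"]

ENGINES = {
    "MySQL": {
        "peak_load": True,
        "real_time_analysis": False,
        "unstructured_data": False,
    },
    "PostgreSQL": {
        "peak_load": True,
        "real_time_analysis": True,
        "unstructured_data": True,
    },
    "Spark": {
        "peak_load": True,
        "real_time_analysis": True,
        "unstructured_data": True,
    },
}

def engines_meeting_all(engines, requirements):
    """Return the names of engines that satisfy every listed requirement."""
    return [
        name
        for name, capabilities in engines.items()
        if all(capabilities[req] for req in requirements)
    ]

shortlist = engines_meeting_all(ENGINES, REQUIREMENTS)
print(shortlist)  # ['PostgreSQL', 'Spark']
```

The filter only narrows the field; choosing between the remaining engines still depends on weighing their disadvantages.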
2.5. Recommended Database Engine
Since this tool will support data analysis but will not directly generate income, we need to keep costs low, which favors open-source tools. As noted in sections 2.1 and 2.2, MySQL and PostgreSQL are both open source. Of the two, PostgreSQL is relatively easier to use and handles many tasks efficiently at the same time, which makes it a better fit for the requirements of our case. In addition, according to the table in section 2.4, MySQL does not meet the requirements for real-time analysis or for analyzing unstructured data.
We will therefore consider only PostgreSQL and Spark at this stage. Both fulfill all the basic requirements of the case: processing tasks efficiently, conducting real-time analysis, and handling large amounts of unstructured data. The choice therefore comes down to their disadvantages. PostgreSQL is difficult to set up at the beginning, but that is the only disadvantage that would affect the user experience, and it causes no further trouble once setup is complete. Spark, on the other hand, must be integrated with Hadoop or another data platform, and it consumes a lot of memory in a way that is not user friendly. These characteristics would affect every future use of Spark, and choosing PostgreSQL avoids them.
Overall, PostgreSQL is the best fit. It satisfies all the needs of the Uber case with the fewest potential problems for future data analysis.

3. Data Lake Components
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. Until the data is prepared for use, it may contain noise and uncleaned records. A data lake uses a flat architecture to store data.
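A minimal Python sketch of this idea follows: raw lines are kept exactly as they arrived, and parsing and cleaning happen only when the data is read. The record fields (trip_id, city, fare) and the sample lines are hypothetical.

```python
import json

# Hypothetical raw events as they might land in a data lake: stored
# unmodified, including malformed or incomplete lines ("noise").
raw_lake = [
    '{"trip_id": 1, "city": "Boston", "fare": 12.5}',
    'not json at all',
    '{"trip_id": 2, "city": "Boston"}',
    '{"trip_id": 3, "city": "Austin", "fare": "9.75"}',
]

def read_trips(raw_lines):
    """Parse and clean raw lines at read time, skipping unusable records."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # noise: the line is not valid JSON
        if not isinstance(record, dict) or "fare" not in record:
            continue  # incomplete record: no fare to analyze
        try:
            fare = float(record["fare"])  # accept numbers stored as text
        except (TypeError, ValueError):
            continue
        yield {"trip_id": record["trip_id"], "city": record["city"], "fare": fare}

clean_trips = list(read_trips(raw_lake))
print(clean_trips)
```

The raw list is never modified, so a different analysis could later apply different cleaning rules to the same stored data.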
3.1. Data Lake Architecture
This section explores the advantages and disadvantages of two data lake architectures: Apache Hadoop + Relational database and Hadoop + Spark + NoSQL. The comparison focuses on how the two data lakes are designed and set up.
3.1.1. Apache Hadoop + Relational database
– Advantages:
● Open source, so it is a cost-effective option.
● Scales simply, making it easier to solve larger problems.
● Uses the Hadoop Distributed File System (HDFS), which makes it more difficult for the system to fail; in the new version, errors are also less likely to occur.
– Disadvantages:
● Difficult to integrate with existing databases, especially since no support is provided.
● Difficult to learn and use, since it requires knowledge of MapReduce.
● Designed as a “batch-processing” engine, so responses can take anywhere from seconds to hours.
● Limited security functionality, making it risky for enterprise deployments that deal with sensitive or private data.
● Before Hadoop 2.0, the node responsible for data location was a single point of failure; if it failed, the system became useless.
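For readers unfamiliar with MapReduce, the model mentioned in the list above can be sketched in plain Python as three steps: map each record to key-value pairs, group the pairs by key, and reduce each group to one result. This is only an illustration of the programming model, not Hadoop's API, and the trip records are made up.

```python
from collections import defaultdict

# Hypothetical trip records: (city, fare). Values are invented.
trips = [
    ("Boston", 12.5),
    ("Austin", 9.0),
    ("Boston", 7.5),
    ("Austin", 6.0),
    ("Denver", 20.0),
]

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    city, fare = record
    yield city, fare

def shuffle(pairs):
    """Group all values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

mapped = [pair for record in trips for pair in map_phase(record)]
grouped = shuffle(mapped)
revenue_by_city = dict(reduce_phase(key, values) for key, values in grouped.items())
print(revenue_by_city)
```

Because every map call and every reduce call is independent, a framework can spread them across many machines, which is also why the whole job tends to run as one batch.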
3.1.2. Hadoop + Spark + NoSQL
– Advantages:
● Front-end and back-end security to protect against numerous types of threats.
● Easy to scale computing power up or down, so clients only pay for what they use.
● Cost-effective, because clients can launch customer applications and internal applications within the cloud, saving on infrastructure costs.
● Capable of using unstructured data, whereas Oracle SQL is only able to use structured data, so Spark may be better equipped to manage incoming data.
– Disadvantages:
● Has no file management system of its own; it must be integrated with Hadoop or another cloud-based data platform.
● Consumes a lot of memory, and memory-related issues are not handled in a user-friendly manner.
3.2. Recovery/Continuity of Business
Business recovery and continuity are business practices that describe how prepared a business is to deal with unforeseen risks and what steps it will take to get operations up and running again if one of those risks materializes. This section explores how the two data lake configurations plan for risk and continuity within the data analysis and storage process.

3.2.1. Suitability to the Uber business case
| Technical requirement for Uber case | Hadoop + Relational database | Hadoop + Spark + NoSQL |
| --- | --- | --- |
| As stated in the case, the database should be capable of “handling at least thousands in real time data and up to millions in peak periods caused by new policies” | Yes | Yes |
| The case specifies a real-time analysis tool is needed | No | Yes |
| Analyze unstructured data and have the ability to handle a large amount of data at the same time | Yes | Yes |
3.2.2. Data Lake Architecture Comparison
Apache Hadoop and Spark form a well-functioning data lake architecture. Hadoop is inexpensive to set up and is capable of dealing with large unstructured files. Spark is added to provide the real-time analysis that Uber requires. The two work well together: Spark benefits from Hadoop’s cluster management, which lets it draw on the resources available in the Hadoop cluster and so mitigates its heavy memory use. Spark also adds value to Hadoop: with the YARN resource manager, Spark can run on the Hadoop cluster and extend it from batch-based to stream-based data analysis. As a result, the combination of Hadoop and Spark was chosen as the data lake architecture for Uber.
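The difference between batch-based and stream-based analysis can be sketched in plain Python. The batch function waits for the complete data set, while the stream-style function emits a result for each fixed time window as soon as that window closes. This mimics the general micro-batch idea only; it does not use Spark's API, and the event data and window size are assumptions.

```python
from collections import Counter

# Hypothetical event stream: (timestamp_seconds, city). Values are invented.
events = [
    (0, "Boston"),
    (1, "Austin"),
    (4, "Boston"),
    (5, "Boston"),
    (9, "Austin"),
    (10, "Denver"),
]

def batch_counts(all_events):
    """Batch style: wait for the full data set, then compute once."""
    return Counter(city for _ts, city in all_events)

def micro_batch_counts(event_iter, window_seconds):
    """Stream style: yield (window_start, counts) as each window closes."""
    window_start = None
    current = Counter()
    for ts, city in event_iter:
        if window_start is None:
            window_start = ts - ts % window_seconds
        # Close every window that ended before this event arrived.
        while ts >= window_start + window_seconds:
            yield window_start, dict(current)
            window_start += window_seconds
            current = Counter()
        current[city] += 1
    if window_start is not None:
        yield window_start, dict(current)  # flush the final window

totals = batch_counts(events)
windows = list(micro_batch_counts(events, window_seconds=5))
print(totals)
print(windows)
```

With the sample events, the stream-style version produces three partial results (for the windows starting at 0, 5, and 10 seconds), whereas the batch version produces one result only after all events are available.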
3.2.3. Recommendation of data lake
The combination of Apache Hadoop and Spark is a good fit for Uber when it comes to ensuring recovery and continuity of business. Hadoop uses a file system that makes it hard for file management to collapse in the middle of an operation. Spark adds value to Hadoop because it can handle unstructured data, which helps the company deal with incoming data that is not yet structured. This is important because Uber is looking for a database that factors in real-time data immediately: some of the data entering the system may be largely unstructured, and Spark ensures that such data immediately becomes part of the database.
The disadvantages that might undermine the suitability of this combination for business continuity have been addressed. Earlier versions of Hadoop had a single point of failure, whereby the failure of one node affected the whole system. This problem has been eliminated in the current version, which is designed to be difficult to fail, so Hadoop file systems are reliable. Spark consumes a lot of memory, which can create issues that are not easy to handle, including downtime when a large data request exceeds the in-memory computing capacity of the available systems. However, Hadoop integrates well with Spark and can provide it with the space it requires to process requests.
The combination of Hadoop and Spark is therefore recommended for Uber, as it allows real-time data processing while ensuring data continuity.
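One reason a distributed file system can survive a node failure is that it keeps several copies of each data block on different nodes (HDFS uses three copies by default). The toy Python model below shows the effect. The node names, block IDs, and round-robin placement rule are assumptions for this sketch, and it models data blocks only, not the metadata node discussed above.

```python
# Toy model of block replication. Node names, block IDs, and the
# round-robin placement rule are assumptions made for this sketch.
def place_blocks(block_ids, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {
            nodes[(i + offset) % len(nodes)] for offset in range(replication)
        }
    return placement

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return {block for block, replicas in placement.items() if replicas - failed}

nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = ["blk-1", "blk-2", "blk-3", "blk-4"]

single_copy = place_blocks(blocks, nodes, replication=1)
three_copies = place_blocks(blocks, nodes, replication=3)

# Simulate the loss of one node under both placement policies.
lost_without_replicas = set(blocks) - readable_blocks(single_copy, ["node-b"])
lost_with_replicas = set(blocks) - readable_blocks(three_copies, ["node-b"])

print(lost_without_replicas)
print(lost_with_replicas)
```

With a single copy, the block stored on the failed node is lost; with three copies, every block remains readable after the same failure.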
