CCA 175 Spark and Hadoop Developer — Common Mistakes

Durga Gadiraju
Published in itversity
Apr 14, 2018 · 4 min read

ITVersity is known for providing high-quality content for CCA 175 Spark and Hadoop Developer, along with hands-on labs.

For the material, click here to sign up for one of our Udemy courses using our coupons.

We have a history of training thousands of students to clear the CCA 175 Spark and Hadoop Developer certification, with a striking pass rate of more than 95%. However, some people have failed the certification, and this is an attempt to provide some inputs based on their experience so that new aspirants will be careful.

Syllabus

Let us revisit the syllabus so that we understand what it takes to prepare for the exam.

  • Basics of Programming — using Scala or Python
  • Data Ingestion using Sqoop, HDFS commands and Flume
  • Stage, Transform and Load using Core Spark APIs — Transformations and Actions
  • Data Analysis using Spark SQL and DataFrames

It is a scenario-based exam: a dedicated environment with all the necessary tools set up will be provided for taking the exam. The environment is accessible via a Linux-based remote desktop. For further details, visit Cloudera's official page.

Typical Preparation Process

Here is the typical preparation process most of our students follow.

  • Understand the syllabus of the certification
  • Sign up for and complete our course(s), using Python, Scala, or both
  • Sign up for our labs for hands-on practice if you do not have a proper environment
  • Go through Arun's blog or our exercises for practice

Even after such rigorous preparation, some people fail the certification.

Common Pitfalls/Mistakes

Here are some of the mistakes that cause people to fail the certification.

  • Overconfidence — at times people tend to be overconfident and take certain aspects lightly.
  • Environment — the certification exam is conducted remotely on a Linux-based desktop. Some people are not comfortable with a Linux-based desktop.
  • Sqoop Connectivity — Sqoop is the simplest of the technologies, and it is an open-book exam where the Sqoop material is provided, yet people struggle to connect to the database. This is the most common mistake I have heard of.
  • File Formats and Delimiters — there is a lot of emphasis on file formats and delimiters. At times they are not the defaults.
  • Dropping the directories — once people produce results that need to be stored in a designated location, if they realize the results are not in line with the question, they have to delete the output, fix the code, and rerun. In the process of deletion, people end up deleting the entire question directory rather than just the solution directory.

Most people still pass the exam after committing one of the above mistakes. But if one commits more than one mistake, the probability of failing is very high.

Our Suggestions

Here are our suggestions for avoiding the common mistakes.

  • Overconfidence — do not assume the exam is easy. Because we provide a lot of resources and simplify the learning process, people end up thinking the exam is easy. It is not; there is a breadth of topics that need to be prepared well.
  • Environment — it is similar (but not identical) to the Cloudera QuickStart VM. Make sure to get familiar with the Linux-based remote desktop. At ITVersity, we provide a simulator: a remote desktop with some exercises. Please send an email to training@itversity.com. It costs $14.45 for a week.

Make sure you are comfortable with the Linux terminal

Open multiple terminals

Use GUI-based editors such as gedit, Sublime Text, etc., rather than command-line editors

Make sure to keep a copy of your code

  • Sqoop Connectivity — many people are not able to successfully connect to the database via Sqoop on the first attempt, for several reasons.

Do not memorize; instead, use the documentation to build the JDBC URL

Use sqoop eval to get the structure of the table
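
A minimal sketch of this, assuming a MySQL source as in typical practice setups (host, port, database, and credentials are placeholders; use the ones given in the question):

    sqoop eval \
      --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
      --username retail_user \
      --password cloudera \
      --query "DESCRIBE orders"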

Make sure the table name and HDFS directory names are correct. Understand the difference between warehouse-dir and target-dir: with warehouse-dir, Sqoop creates a subdirectory named after the table under the given path, whereas target-dir is the exact output directory

Make sure the file formats, delimiters, and compression codecs (from core-site.xml) are as asked in the question
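
For instance, a sketch of an import that pins these down explicitly (paths, credentials, delimiter, and codec are placeholders for whatever the question specifies):

    sqoop import \
      --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
      --username retail_user \
      --password cloudera \
      --table orders \
      --target-dir /user/`whoami`/problem1/solution \
      --fields-terminated-by '\t' \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec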

  • File Formats and Delimiters — there are only a handful of file formats and APIs one needs to be familiar with

Standard File Formats — avro, parquet, orc, text

If it is a Sqoop-related question, go to the documentation and use the appropriate control argument, such as --as-avrodatafile
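
For example, switching the output to avro is a single control argument (connection details are placeholders as before):

    sqoop import \
      --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
      --username retail_user \
      --password cloudera \
      --table orders \
      --as-avrodatafile \
      --target-dir /user/`whoami`/problem2/orders_avro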

If it is a Spark-related question, use the sqlContext.read API. It has methods such as parquet, orc, etc. For avro it is a bit different; make sure you understand the process of reading avro data.
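
A minimal Scala sketch in spark-shell, assuming the external spark-avro package is available on the classpath (paths are placeholders):

    // parquet has a dedicated reader method
    val ordersParquet = sqlContext.read.parquet("/user/cloudera/problem2/orders_parquet")

    // avro goes through the external spark-avro data source instead
    val ordersAvro = sqlContext.read.
      format("com.databricks.spark.avro").
      load("/user/cloudera/problem2/orders_avro")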

Once a Data Frame is created, one can use the df.write API. It has methods such as parquet, orc, etc. For avro it is a bit different; make sure you understand writing data in avro format.
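
Continuing the sketch above, writing mirrors reading (paths are placeholders):

    // dedicated writer method for parquet
    ordersParquet.write.parquet("/user/cloudera/problem2/solution_parquet")

    // avro again goes through the external data source
    ordersAvro.write.
      format("com.databricks.spark.avro").
      save("/user/cloudera/problem2/solution_avro")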

For text format, we have to make sure we use the right delimiters. Most likely you will have to use a map function before writing to the file system using the saveAsTextFile method in Spark.
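
A sketch, assuming the question asks for pipe-delimited text (path and delimiter are placeholders):

    // read the input, convert each Row to a delimited line,
    // then save as plain text
    sqlContext.read.parquet("/user/cloudera/problem3/orders").
      rdd.
      map(row => row.mkString("|")).
      saveAsTextFile("/user/cloudera/problem3/solution")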

  • Dropping the directories — recently some people shared their experience of accidentally deleting directories.

Make sure you are careful while running the hadoop fs -rm -R command.

Always delete only the solution directory, not the base directory for the problem statement

Even if you delete the base directory for the problem statement, it will go to trash. You can copy the directory back from trash; it will be in /user/`whoami`/.Trash.

When a directory is deleted, the command output actually shows the Trash location it was moved to. Use it to copy the directory back.
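
As an illustration, with a hypothetical problem directory:

    # delete only the solution directory, never the base directory
    hadoop fs -rm -R /user/`whoami`/problem4/solution

    # if the wrong directory was deleted, copy it back from trash;
    # trash preserves the original absolute path under .Trash/Current
    hadoop fs -cp /user/`whoami`/.Trash/Current/user/`whoami`/problem4 /user/`whoami`/problem4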

Conclusion

If you follow our content and practice on our labs, the probability of passing is very high. If you also learn from the mistakes of others, it will be even higher.

You can visit our success stories category to learn from or share the experience of taking the CCA 175 exam.


Durga Gadiraju
Founder of ITVersity and Technology Evangelist