User:LI AR/Books/Cracking the DataScience Interview: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 16:02, 28 February 2017

Cracking The Data Science Interview

Basic Stuff To Know

Generic pages: Glossaire_de_l'exploration_de_données; Big_data

Inspired by books such as:
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II"
- "120 real data science interview questions"

Tips

DataScience is (very) experimental (Andrew Ng): https://pbs.twimg.com/media/CBXshmjWgAAgLKa.jpg

Bias–variance_tradeoff

Correlation_does_not_imply_causation

Competitions

Datasets

IDEs

Data Manipulation

https://github.com/Quartz/bad-data-guide
https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
Essay: "Why Most Published Research Findings Are False"
- http://robotics.cs.tamu.edu/RSS2015NegativeResults/pmed.0020124.pdf
"A Few Useful Things to Know about Machine Learning"
- https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

Maths (Stats / Algebra)

Inspiration for this section: https://github.com/soulmachine/machine-learning-cheat-sheet

Glossary_of_probability_and_statistics

Curse_of_dimensionality

Mode_(statistics)

Entropy_in_thermodynamics_and_information_theory

Likelihood_function

Cumulative_distribution_function

Probability_mass_function

Probability_density_function

Pareto_efficiency

Taxicab_geometry

Norm_(mathematics)#Euclidean_norm

Norm_(mathematics)

Trace_(linear_algebra)

Eigenvalues_and_eigenvectors

Hadamard_product_(matrices)

Kernel_(statistics)

Radial_basis_function

Latent_variable

Statistical_inference

Inductive_reasoning

Deduction_and_induction

Distributions
- https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Discrete_uniform_distribution

Normal_distribution

Bernoulli_distribution

Binomial_distribution

Poisson_distribution

Chi-squared_distribution

Log-normal_distribution

Weibull_distribution

Gamma_distribution

Beta_distribution

Hypergeometric_distribution

Neural Nets

Softmax_function

Sigmoid_function

Hyperbolic_function#Tanh

Evaluation: Mean_absolute_percentage_error; Mean_absolute_scaled_error; Symmetric_mean_absolute_percentage_error; Regression-kriging

Information_gain_ratio

Kullback–Leibler_divergence

Gini_coefficient

Akaike_information_criterion

Bayesian_information_criterion

Precision_and_recall

Sensitivity_and_specificity

Receiver_operating_characteristic

Receiver_operating_characteristic#Area_under_the_curve

Cross-validation_(statistics)

Errors_and_residuals

If the residuals are consistently > 0 or < 0 over a range of the training set, the model has failed to capture something in the data, or the wrong type of model is being used (e.g. linear regression on parabolic data; see DataSkeptic/Heteroskedasticity).
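
The residual check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a one-feature ordinary least-squares fit and synthetic parabolic data (y = x^2), and the helper names `fit_line` and `residuals` are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def residuals(xs, ys, a, b):
    """Residual = observed value minus fitted value."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


# Synthetic parabolic data (an assumption for illustration): y = x^2 on [-5, 5].
xs = [i / 2 for i in range(-10, 11)]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
res = residuals(xs, ys, a, b)

# Split the inputs into three ranges and look at the sign of the mean residual.
third = len(xs) // 3
left_mean = sum(res[:third]) / third
middle_mean = sum(res[third:2 * third]) / third
right_mean = sum(res[2 * third:]) / (len(res) - 2 * third)
print(left_mean, middle_mean, right_mean)
```

On this data the mean residual is positive on the outer ranges and negative in the middle, which is the kind of range-wise sign pattern that suggests a straight line is the wrong model type.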

Heteroscedasticity

Clustering

See also the Calinski-Harabasz Index: http://stats.stackexchange.com/questions/97429/intuition-behind-the-calinski-harabasz-index

Silhouette_(clustering)

Working with Text: Tf–idf; Okapi_BM25

See also Mr Gomez's page on Weka: http://www.esp.uem.es/jmgomez/tmweka/

Sentiment_analysis

Named-entity_recognition

Conditional_random_field

Latent_Dirichlet_allocation

Visualization: Data_visualization; Exploratory_data_analysis; Statistical_graphics; Visual_perception; Heat_map; Misleading_graph; Pareto_chart

Feature/Attribute Selection / Dimensionality Reduction: Principal_component_analysis; Independent_component_analysis; Singular_value_decomposition; T-distributed_stochastic_neighbor_embedding; Autoencoder; Deep_learning#Stacked_.28de-noising.29_auto-encoders

Statistical tests

Evaluating a hypothesis

Statistical_power

Statistical_hypothesis_testing

Student's_t-test

Type_I_and_type_II_errors

Detecting abrupt changes in time series

Structural_break

Kruskal–Wallis_one-way_analysis_of_variance

Pairwise_summation

MOSUM: https://cran.r-project.org/web/packages/strucchange/vignettes/strucchange-intro.pdf

Chaos: Lyapunov_exponent

Techniques: Statistical_classification; Cluster_analysis; Regression_analysis; Linear_regression; Logistic_regression; Association_rule_learning; Survival_analysis; Monte_Carlo_method; Multinomial_logistic_regression; Lasso_(statistics); Expectation–maximization_algorithm; Hidden_Markov_Models; Viterbi_algorithm; Latent_semantic_analysis; Evolutionary_algorithm; Genetic_algorithm; Voronoi_diagram; Local_outlier_factor; Ordered_weighted_averaging_aggregation_operator

Neural Nets

Boltzmann_machine

The various types of NN as a picture: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png

Ensemble Techniques: Ensemble_learning

Ensemble Learning = Boosting, Bagging or Stacking: http://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning#19053

Bootstrap_aggregating

Boosting_(machine_learning)
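
As a rough illustration of the bagging idea listed above, the sketch below averages a base learner over bootstrap resamples of the training set. The 1-nearest-neighbour base learner, the synthetic data (noisy samples of y = 2x), and the names `one_nn`, `bagged_predict`, and `mse` are all assumptions chosen for this example, not taken from the linked pages.

```python
import random


def one_nn(train, x):
    """High-variance base learner: target of the single nearest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def bagged_predict(train, x, n_models=50, seed=0):
    """Bootstrap aggregating: average base-learner predictions over
    bootstrap resamples (sampling with replacement) of the training set."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        preds.append(one_nn(sample, x))
    return sum(preds) / len(preds)


# Synthetic data (assumption): noisy samples of y = 2x.
rng = random.Random(1)
train = [(x / 10, 2 * x / 10 + rng.gauss(0, 0.5)) for x in range(0, 50)]
test_xs = [0.05 + i / 10 for i in range(0, 49)]


def mse(predict):
    """Mean squared error of `predict` against the noise-free target 2x."""
    return sum((predict(train, x) - 2 * x) ** 2 for x in test_xs) / len(test_xs)


single_mse = mse(one_nn)
bagged_mse = mse(bagged_predict)
print(single_mse, bagged_mse)
```

Comparing the two printed errors shows whether averaging over resamples helped on this particular data set. Boosting and stacking combine models differently and are not covered by this sketch.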

Experimentation framework

Goal: test various parameters on various algorithms to determine the best model(s)
Weka's "Experimenter" mode: http://weka.sourceforge.net/manuals/ExplorerGuide.pdf
AutoWeka: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
R::mlrMBO: https://github.com/mlr-org/mlrMBO
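
The goal stated above (trying several parameters on several algorithms and keeping the best model) can be sketched in a few lines. This is a hand-rolled illustration using only the Python standard library, not how Weka's Experimenter, AutoWeka, or mlrMBO work internally. The two toy "algorithms", the parameter grid, and the synthetic data are assumptions made for the example.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_predict(train, x, _param=None):
    """Baseline: always predict the mean training target."""
    return sum(y for _, y in train) / len(train)


def cv_mse(data, predict, param, folds=5):
    """Mean squared error over all held-out points across `folds` splits."""
    errors = []
    for f in range(folds):
        test = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        errors.extend((predict(train, x, param) - y) ** 2 for x, y in test)
    return sum(errors) / len(errors)


# Synthetic data (assumption): noisy samples of y = x^2.
rng = random.Random(0)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.1)) for x in range(-30, 31)]
rng.shuffle(data)

# The experiment grid: (algorithm name, function, parameter values to try).
grid = [
    ("mean-baseline", mean_predict, [None]),
    ("knn", knn_predict, [1, 3, 5, 9, 15]),
]

results = []
for name, predict, params in grid:
    for param in params:
        results.append((cv_mse(data, predict, param), name, param))

# Keep the configuration with the lowest cross-validated error.
best_score, best_name, best_param = min(results, key=lambda r: r[0])
print(best_name, best_param, round(best_score, 4))
```

Each (algorithm, parameter) pair is scored with the same cross-validation splits so the results are comparable. The tools listed above automate this loop and typically add smarter search strategies than an exhaustive grid.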

Coding / Exposing API to the rest of the application: Microservices

BigData: Star_schema; OLAP_cube; Solid-state_drive; MongoDB

Map-Reduce framework

Apache_Hadoop https://hadoop.apache.org/

Scraping

Apache_Flume http://flume.apache.org/

Storage

Apache_Hadoop#HDFS https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Apache_HBase http://hbase.apache.org/

Apache_Hive https://hive.apache.org/

Transfers - to/from RelationalDB

Sqoop http://sqoop.apache.org/

Transfers - serialization/streaming

Apache_Avro http://avro.apache.org/

Apache_Kafka https://kafka.apache.org/

Storage - In memory

Apache_Spark https://spark.apache.org/

Admin

Apache_ZooKeeper http://zookeeper.apache.org/

Apache_Cassandra https://cassandra.apache.org

Ambari http://ambari.apache.org/

Apache_Oozie http://oozie.apache.org/

Programming

Pig_(programming_tool) https://pig.apache.org/

ML

Apache_Mahout http://mahout.apache.org/

Apache_SystemML http://systemml.apache.org/

Working with text

Elasticsearch https://www.elastic.co/

Working with text - Data Viz

Kibana https://www.elastic.co/products/kibana

Resources

News/Blogs/RSS

Podcasts

MOOCs
