User:LI AR/Books/Cracking the DataScience Interview

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Liar666 (talk | contribs) at 15:41, 28 February 2017 (Cracking the DataScience Interview). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


Cracking the DataScience Interview

Basic Stuff To Know

Generic pages
Glossaire_de_l'exploration_de_données
Big_data
  • Inspired by books such as:
    • "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II"
    • "120 real data science interview questions"
Tips
Bias–variance_tradeoff
Correlation_does_not_imply_causation
Competitions
Datasets
IDEs
 https://cran.r-project.org/
 https://cran.r-project.org/web/views/
 https://cran.r-project.org/web/views/MachineLearning.html
 https://cran.r-project.org/web/views/Bayesian.html
 https://cran.r-project.org/web/views/Cluster.html
 https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
 https://cran.r-project.org/web/views/Survival.html
 https://cran.r-project.org/web/views/TimeSeries.html
  • Python/SciKit-Learn
 https://www.python.org/
 http://scikit-learn.org/stable/
Data Manipulation
 http://robotics.cs.tamu.edu/RSS2015NegativeResults/pmed.0020124.pdf
  • "A Few Useful Things to Know about Machine Learning"
 https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Maths (Stats / Algebra)
Glossary_of_probability_and_statistics
Mode_(statistics)
Variance
Covariance
Entropy_in_thermodynamics_and_information_theory
Expected_value
Likelihood_function
Cumulative_distribution_function
Probability_mass_function
Probability_density_function
Pareto_efficiency
Tensor_product
Taxicab_geometry
Norm_(mathematics)#Euclidean_norm
Lp_space
Norm_(mathematics)
Determinant
Trace_(linear_algebra)
Eigenvalues_and_eigenvectors
Convolution
Hadamard_product_(matrices)
Kernel_(statistics)
Radial_basis_function
Logit
Latent_variable
Inference
Statistical_inference
Inductive_reasoning
Deduction_and_induction
  • Distributions

https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Discrete_uniform_distribution
Normal_distribution
Bernoulli_distribution
Binomial_distribution
Poisson_distribution
Chi-squared_distribution
Log-normal_distribution
Weibull_distribution
Gamma_distribution
Beta_distribution
Hypergeometric_distribution
  • Neural Nets
Softmax_function
Sigmoid_function
Hyperbolic_function#Tanh


Evaluation
Mean_absolute_percentage_error
Mean_absolute_scaled_error
Symmetric_mean_absolute_percentage_error
Regression-kriging
Information_gain_ratio
Kullback–Leibler_divergence
Gini_coefficient
Akaike_information_criterion
Precision_and_recall
Sensitivity_and_specificity
Receiver_operating_characteristic
Receiver_operating_characteristic#Area_under_the_curve
Cross-validation_(statistics)
Errors_and_residuals
  • If the residuals are consistently > 0 or < 0 over a range of the training set, the model has failed to capture some structure in the data, or the wrong type of model is being used (e.g. linear regression on parabolic data; see DataSkeptic/Heteroskedasticity)
Heteroscedasticity
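The residual check noted under Errors_and_residuals can be illustrated with a small sketch. It assumes noiseless parabolic data and a plain least-squares line, and splits the x-axis into three equal-width ranges; the helper names `fit_line` and `mean_residual_by_range` are hypothetical and written only for this example, using nothing beyond the Python standard library.

```python
# Sketch: fit a straight line to parabolic data and inspect residual signs per x-range.
# fit_line and mean_residual_by_range are illustrative helpers, not library functions.

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mean_residual_by_range(xs, residuals, n_bins=3):
    """Average residual inside n_bins equal-width ranges of x."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, r in zip(xs, residuals):
        i = min(int((x - lo) / width), n_bins - 1)  # clamp the right edge into the last bin
        sums[i] += r
        counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

xs = [i / 10 for i in range(-30, 31)]  # -3.0 ... 3.0
ys = [x ** 2 for x in xs]              # assumed data: a noiseless parabola
a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]  # observed minus predicted
bin_means = mean_residual_by_range(xs, residuals)
print(bin_means)  # outer ranges are positive, the middle range is negative
```

The overall mean residual is close to zero, as least squares with an intercept guarantees, yet each range taken on its own shows a consistent sign. That pattern is the hint that a linear model is the wrong type for this data.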
  • Clustering
Dunn_index
Rand_index
Jaccard_index
Silhouette_(clustering)
Working with Text
Tf–idf
Okapi_BM25
Sentiment_analysis
Named-entity_recognition
Conditional_random_field
Latent_Dirichlet_allocation
Apache_Lucene
Visualization
Data_visualization
Exploratory_data_analysis
Statistical_graphics
Visual_perception
Heat_map
Misleading_graph
Pareto_chart
Feature/Attribute Selection / Dimensionality Reduction
Principal_component_analysis
Independent_component_analysis
Singular_value_decomposition
T-distributed_stochastic_neighbor_embedding
Autoencoder
Deep_learning#Stacked_.28de-noising.29_auto-encoders
Statistical tests
  • Evaluating a hypothesis
Statistical_power
Statistical_hypothesis_testing
P-value
Student's_t-test
Type_I_and_type_II_errors
  • Detecting abrupt changes in time series
Structural_break
Chow_test
Kruskal–Wallis_one-way_analysis_of_variance
F-test
F-statistics
Pairwise_summation
CUSUM
Chaos
Lyapunov_exponent
Techniques
Statistical_classification
Cluster_analysis
Regression_analysis
Linear_regression
Logistic_regression
Association_rule_learning
Survival_analysis
Monte_Carlo_method
Multinomial_logistic_regression
Lasso_(statistics)
Expectation–maximization_algorithm
Latent_semantic_analysis
Evolutionary_algorithm
Genetic_algorithm
Voronoi_diagram
Hidden_Markov_model
Local_outlier_factor
Ordered_weighted_averaging_aggregation_operator
  • Neural Nets
Boltzmann_machine
Ensemble Techniques
Ensemble_learning
Bootstrap_aggregating
Boosting_(machine_learning)
Experimentation framework
Coding / Exposing API to the rest of the application
Microservices
BigData
Star_schema
OLAP_cube
Solid-state_drive
MongoDB
  • Map-Reduce framework
Apache_Hadoop https://hadoop.apache.org/
  • Scraping
Apache_Flume http://flume.apache.org/
  • Storage
Apache_Hadoop#HDFS https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Apache_HBase http://hbase.apache.org/
Apache_Hive https://hive.apache.org/
  • Transfers - to/from RelationalDB
Sqoop http://sqoop.apache.org/
  • Transfers - serialization/streaming
Apache_Avro http://avro.apache.org/
Apache_Kafka https://kafka.apache.org/
  • Storage - In memory
Apache_Spark https://spark.apache.org/
  • Admin
Apache_ZooKeeper http://zookeeper.apache.org/
Apache_Cassandra https://cassandra.apache.org
Ambari http://ambari.apache.org/
Apache_Oozie http://oozie.apache.org/
  • Programming
Pig_(programming_tool) https://pig.apache.org/
  • ML
Apache_Mahout http://mahout.apache.org/
Apache_SystemML http://systemml.apache.org/
  • Working with text
Elasticsearch https://www.elastic.co/
  • Working with text - Data Viz
Kibana https://www.elastic.co/products/kibana
Resources
News/Blogs/RSS
Podcasts
MOOCs
  • Weka
 - http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
 - http://www.cs.waikato.ac.nz/ml/weka/mooc/moredataminingwithweka/
 - http://www.cs.waikato.ac.nz/ml/weka/mooc/advanceddataminingwithweka/
  • Andrew Ng
 - https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLJ_CMbwA6bT-n1W0mgOlYwccZ-j6gBXqE
  • Yann Lecun
 - https://www.college-de-france.fr/site/yann-lecun/course-2015-2016.htm
  • From renowned universities
 https://www.coursera.org/specializations/jhu-data-science
 https://www.coursera.org/specializations/machine-learning
 https://www.coursera.org/specializations/data-science-python
 https://www.coursera.org/specializations/big-data
 https://www.coursera.org/learn/machine-learning
 https://www.coursera.org/learn/r-programming
 https://www.coursera.org/learn/data-scientists-tools
 https://www.coursera.org/learn/python-data-analysis
  • DataSchool
 http://www.dataschool.io/learn/