User:LI AR/Books/Cracking the DataScience Interview

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Liar666 (talk | contribs) at 09:26, 19 June 2017 (Cracking the DataScience Interview). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


Cracking the DataScience Interview

Basic Stuff To Know

Generic pages
Glossaire_de_l'exploration_de_données
Big_data
  • Inspired by books such as:
    • "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II"
    • "120 real data science interview questions"
Tips
Overfitting
Bias–variance_tradeoff
Concept_drift
Correlation_does_not_imply_causation
Curse_of_dimensionality
Vanishing_gradient_problem
Machine Learning definition and types
Artificial_intelligence
List_of_machine_learning_concepts
Machine_learning
Data_mining
Knowledge_extraction
Knowledge_extraction#Knowledge_discovery
Pattern_recognition
Signal_processing
Supervised_learning
Semi-supervised_learning
Unsupervised_learning
Reinforcement_learning
Online_machine_learning
Incremental_learning
Q-learning
Feature_learning
Learning_to_rank
Similarity_learning
Biclustering
Natural_language_processing
Biomimetics
Collective_intelligence
Data_stream_mining
Sequential_pattern_mining
Clickstream
Semantics
Semantic_Web
Competitions
Datasets
List_of_datasets_for_machine_learning_research
Usages
Inpainting
Software
Data Manipulation
Data_pre-processing
Data_cleansing
Data_reduction
Data_wrangling
Data_scrubbing
Data_editing
Data_scraping
Data_curation
Data_fusion
Data_integration
Data_binning
Sanitization_(classified_information)
Extract,_transform,_load
Imputation_(statistics)
Interpolation
Outlier
Local_case-control_sampling#Imbalanced_datasets
Sampling_(statistics)
Sampling_(statistics)#Stratified_sampling
Stratified_sampling
Jackknife_resampling
Oversampling_and_undersampling_in_data_analysis
Oversampling_and_undersampling_in_data_analysis#SMOTE
AdaBoost
Unicode_equivalence#Normalization
URL_normalization
Text_segmentation
N-gram
Tokenization_(lexical_analysis)
Stemming
Word2vec https://www.tensorflow.org/tutorials/word2vec
Spatial_data
Trend_surface_analysis
Variogram
Geary's_C
Moran's_I
Spatial_descriptive_statistics#Ripley.27s_K_and_L_functions
  • Signal processing
Dynamic_time_warping
  • Signal processing - Images
Normalization_(image_processing)
Normalized_frequency_(unit)
Image_segmentation


Techniques for Feature/Attribute Selection/Dimensionality Reduction
High-dimensional_statistics
Dimensionality_reduction
Factor_analysis
Principal_component_analysis
Independent_component_analysis
Singular_value_decomposition
Multidimensional_scaling
T-distributed_stochastic_neighbor_embedding
Autoencoder
Deep_learning#Stacked_.28de-noising.29_auto-encoders
Elastic_map
Linear_discriminant_analysis
  • Signal processing
Compressed_sensing
  • Working with spatial data
Spatial_analysis
Spatial_analysis#Spatial_dependency_or_auto-correlation
Maths (Stats / Algebra)
Pseudo-random_number_sampling
Glossary_of_probability_and_statistics
Bijection,_injection_and_surjection
Mean
Harmonic_mean
Median
Mode_(statistics)
Range_(mathematics)
Quartile
Interquartile_range
Variance
Covariance
Standard_deviation
Collinearity#Usage_in_statistics_and_econometrics
ANOVA
ANCOVA
MANOVA
ANORVA
Moving_average
EWMA_chart
Exponential_smoothing
Autoregressive_model
Autoregressive–moving-average_model
Autoregressive_integrated_moving_average
Autocorrelation
Cross-correlation
Entropy_in_thermodynamics_and_information_theory
Moment_(mathematics)
Residual
Expected_value
Likelihood_function
Cumulative_distribution_function
Probability
Probability_mass_function
Probability_density_function
Prior_probability
Prior_knowledge_for_pattern_recognition
Dependent_and_independent_variables
Independence_(probability_theory)
Hoeffding's_inequality
Pareto_efficiency
Nash_equilibrium
Pareto_principle
Tensor_product
Taxicab_geometry
Norm_(mathematics)#Euclidean_norm
Lp_space
Norm_(mathematics)
Determinant
Trace_(linear_algebra)
Eigenvalues_and_eigenvectors
Projection_(mathematics)
Curvature
Convolution
Hadamard_product_(matrices)
Kernel_(statistics)
Radial_basis_function
Logit
Latent_variable
Inference
Statistical_inference
Inductive_reasoning
Deduction_and_induction
Transduction_(machine_learning)
Stochastic
Stochastic_process
Probability_theory
Posterior_probability
Statistic
Statistics
Gaussian_noise
Bayesian_inference
Bayes_rule
Bayes'_theorem
Bayesian_network
Naive_Bayes_spam_filtering
Naive_Bayes_classifier
Loss_function
Regularization_(mathematics)
Normalization_(statistics)
Quantile_normalization
Nyström_method (+PCA)
Preference_(economics)
Delaunay_triangulation
Neighbourhood_(mathematics)
  • Genetic Algorithms
Mutation_(genetic_algorithm)
Crossover_(genetic_algorithm)
Selection_(genetic_algorithm)
Fitness_function
Utility#Utility_functions
  • SVM
Kernel_method
Kernel_(image_processing)
Kernel_(statistics)
  • Neural Networks
Rectifier_(neural_networks)
Backpropagation
Gradient
Gradient_descent
Stochastic_gradient_descent
Gradient_boosting
Softmax_function
    • Softmax is a "discriminant learning metric": examples from every class other than i still help to learn class i, because the outputs over all classes are forced to sum to 1 (the method thereby couples the scores of the classes)
Sigmoid_function
Hyperbolic_function#Tanh
Dropout_(neural_networks)
Radial_basis_function
Hebbian_theory
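The softmax note in this list can be illustrated with a minimal sketch in plain Python. The function name `softmax` and the example scores are assumptions chosen for illustration; they do not come from any of the linked articles. The point shown is only the coupling: raising the raw score of one class lowers the probability of every other class, because the outputs must sum to 1.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
before = softmax([2.0, 1.0, 0.1])
# Only the score of class 1 is raised; classes 0 and 2 are untouched.
after = softmax([2.0, 3.0, 0.1])

print(before)
print(after)
```

In this sketch the probabilities of classes 0 and 2 drop even though their raw scores did not change, which is the link between classes that the note describes.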
  • Signal processing
Signal_processing
Low-pass_filter
High-pass_filter
Energy_(signal_processing)
Fast_Fourier_transform
Wavelet
Discrete_wavelet_transform
Coherence_(signal_processing)
Kalman_filter
  • Time Series
Time_series
Decomposition_of_time_series
Seasonal_adjustment
Seasonality
Frequency_domain
Time_domain
Spectral_density
  • Games
Game_theory
A*_search_algorithm
Minimax
Multi-armed_bandit
Zero-sum_game


Distances
Distance
Euclidean_distance [dim1]
Edit_distance
Hamming_distance
Manhattan_distance [dim1]
Levenshtein_distance
Minkowski_distance [dim n == generalization]
Mahalanobis_distance
Canberra_distance
Distance_correlation
Angular_distance
String_metric
Jaro–Winkler_distance
Jaccard_index
Kendall_tau_distance
Chebyshev_distance
Tf–idf
Neural_coding
Hausdorff_distance [between clouds of points, a point and a cloud]
Distance#Distances_between_sets_and_between_a_point_and_a_set
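The list above marks the Minkowski distance as a generalization. A minimal sketch of that relationship, assuming plain Python tuples as vectors (the helper names `minkowski` and `chebyshev` are hypothetical): order p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and a large p approaches the Chebyshev distance.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p >= 1 between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def chebyshev(u, v):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))


u, v = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(u, v, 1)   # |3| + |4| = 7.0
euclidean = minkowski(u, v, 2)   # sqrt(9 + 16) = 5.0
large_p = minkowski(u, v, 50)    # close to the Chebyshev distance, 4.0

print(manhattan, euclidean, large_p, chebyshev(u, v))
```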


Distributions
Discrete_uniform_distribution
Normal_distribution
Bernoulli_distribution
Binomial_distribution
Poisson_distribution
Chi-squared_distribution
Log-normal_distribution
Pareto_distribution
Gibbs_distribution
Weibull_distribution
Gamma_distribution
Beta_distribution
Hypergeometric_distribution
Dirac_delta_function
Evaluation
Performance_indicator
Mean_absolute_percentage_error
Mean_absolute_scaled_error
Symmetric_mean_absolute_percentage_error
Regression-kriging
Information_gain_ratio
Kullback–Leibler_divergence
Gini_coefficient
Pearson_correlation_coefficient
Entropy

http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/node15.html

Akaike_information_criterion
Bayesian_information_criterion
Structural_similarity
Type_I_and_type_II_errors
False_positive_rate
False_coverage_rate
False_discovery_rate
Confusion_matrix
Accuracy_and_precision
Precision_and_recall
F1_score
Sensitivity_and_specificity
Receiver_operating_characteristic
Receiver_operating_characteristic#Area_under_the_curve
Discounted_cumulative_gain
Cross-validation_(statistics)
Errors_and_residuals
  • If the residual is consistently > 0 or < 0 over a range of the training set, the model has failed to capture something in the data, or the wrong type of model is being used (e.g. linear regression on parabolic data; see DataSkeptic/Heteroskedasticity)
Heteroscedasticity
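The residual tip above can be checked with a small sketch, assuming a closed-form least-squares line fitted in plain Python to noiseless parabolic data (the helper `fit_line` and the data are illustrative assumptions). The residuals keep the same sign over whole ranges of x, which is the symptom the tip describes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, using the closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical training set: a parabola sampled at integer points.
xs = [float(x) for x in range(-5, 6)]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

# One character per training point: '+' where the line underestimates.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # ++-------++
```

A well-specified model would show signs that alternate without a visible pattern; long runs like these suggest trying a different model family (here, a quadratic term).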
  • Clustering
Dunn_index
Rand_index
Jaccard_index
Silhouette_(clustering)
Working with Text
Semantic_similarity
Tf–idf
Cosine_similarity
Okapi_BM25
Named-entity_recognition
Conditional_random_field
Latent_Dirichlet_allocation
Sentiment_analysis
Web_mining
Web_crawler
Text_mining
Document_classification
Automatic_summarization
Working with Images
Visualization
Data_visualization
Exploratory_data_analysis
List_of_graphical_methods
Statistical_graphics
Visual_perception
Heat_map
Misleading_graph
Pareto_chart
(Statistical) tests
A/B_testing
  • Evaluating a hypothesis
Statistical_power
Statistical_hypothesis_testing
P-value
Student's_t-test
Chi-squared_test
Type_I_and_type_II_errors
  • Detecting abrupt changes in time series
Stationary_process
Structural_break
Chow_test
Kruskal–Wallis_one-way_analysis_of_variance
F-test
F-statistics
Pairwise_summation
CUSUM
Lyapunov_exponent
Kolmogorov_complexity
Machine Learning Techniques
Statistical_classification
One-class_classification
Binary_classification
Multiclass_classification
Multi-label_classification
Structured_prediction
Cluster_analysis
Elbow_method_(clustering)
Nearest_neighbor_search#Approximate_nearest_neighbor
Regression_analysis
Linear_regression
Logistic_regression
Ridge_regression
Kriging
Multivariate_adaptive_regression_splines
Association_rule_learning
Apriori_algorithm
Survival_analysis
Monte_Carlo_method
Monte_Carlo_algorithm
Multinomial_logistic_regression
Lasso_(statistics)
Expectation–maximization_algorithm
Markov_chain_Monte_Carlo
Hidden_Markov_Models
Viterbi_algorithm
CART
Decision_tree_learning
Decision_tree
Pruning_(decision_trees)
ID3_algorithm
C4.5_algorithm
Random_forest
Support_vector_machine
Support_vector_machine#Support_vector_clustering_.28SVC.29
Support_vector_machine#Regression
Conditional_random_field
Latent_semantic_analysis
Genetic_algorithm
Evolutionary_algorithm
Evolutionary_computation
Voronoi_diagram
Local_outlier_factor
Ordered_weighted_averaging_aggregation_operator
Types_of_artificial_neural_networks
Comparison_of_deep_learning_software/Resources
Artificial_neural_network
Perceptron
Feedforward_neural_network
Multilayer_perceptron
Radial_basis_function_network
Long_short-term_memory
SNNS
Time_delay_neural_network
Recursive_neural_network
Recurrent_neural_network
Hopfield_network
Content-addressable_memory
Boltzmann_machine
Self-organizing_map
Learning_vector_quantization
Liquid_state_machine
Autoassociative_memory
Convolutional_neural_network
Autoencoder
Neuroevolution
Neuroevolution_of_augmenting_topologies
Deep_learning
Deep_learning#Deep_neural_network_architectures
Deep_belief_network
Generative_adversarial_networks
Neural_Turing_machine
Early_stopping
ADALINE
Memristor
Instantaneously_trained_neural_networks
Spiking_neural_network
  • Signal Processing
Optical_character_recognition
  • Fuzzy Logic
Fuzzy_logic
Inference_engine
Type-2_fuzzy_sets_and_systems
T-norm_fuzzy_logics
Adaptive_neuro_fuzzy_inference_system
Fuzzy_control_system
  • Working with spatial data
Spatial_association


Ensemble Techniques
Ensemble_learning
Ensembles_of_classifiers
Ensemble_learning#Implementations_in_statistics_packages
Bootstrap_aggregating
Boosting_(machine_learning)
Gradient_boosting
Committee_machine
Applications
Bayesian_spam_filtering
Experimentation framework
Coding / Exposing API to the rest of the application
Microservices
BigData
Data_lake
Streaming_algorithm
Star_schema
OLAP_cube
Solid-state_drive
MongoDB
  • Map-Reduce framework
Apache_Hadoop https://hadoop.apache.org/
  • Scraping
Apache_Flume http://flume.apache.org/
  • Storage
Apache_Hadoop#HDFS https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Apache_HBase http://hbase.apache.org/
Apache_Hive https://hive.apache.org/
  • Transfers - to/from RelationalDB
Sqoop http://sqoop.apache.org/
  • Transfers - serialization/streaming
Apache_Avro http://avro.apache.org/
Apache_Kafka https://kafka.apache.org/
  • Storage - In memory
Apache_Spark https://spark.apache.org/
Apache_Flink http://flink.apache.org/
  • Admin
Apache_ZooKeeper http://zookeeper.apache.org/
Apache_Cassandra https://cassandra.apache.org
Ambari http://ambari.apache.org/
Apache_Oozie http://oozie.apache.org/
  • Programming
Pig_(programming_tool) https://pig.apache.org/
  • ML
Apache_Mahout http://mahout.apache.org/
Apache_SystemML http://systemml.apache.org/
  • Working with text
Apache_Lucene
Elasticsearch https://www.elastic.co/
  • Working with text - Data Viz
Kibana https://www.elastic.co/products/kibana
Small_data


Multi-Agent Systems
Agent-based_model
Multi-agent_system
Agent-oriented_software_engineering

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.7968&rep=rep1&type=pdf [YDemazeau: Vowels Methodology]

Ant_colony_optimization_algorithms


Quantum Machine Learning
Quantum_machine_learning
Quantum_tunnelling
Quantum_annealing
Adiabatic_quantum_computation


Resources
Books
  • "Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms", Jeff Heaton, 2013, ISBN: 978-1493682225
  • "Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms", Jeff Heaton, 2014, ISBN: 978-1499720570
  • "Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks", Jeff Heaton, 2015, ISBN: 978-1505714340
  • "Introduction to Machine Learning (Adaptive Computation and Machine Learning)", E. Alpaydin, MIT Press, 2004, ISBN: 978-0262012430
  • "Machine Learning: An Artificial Intelligence Approach", R.S. Michalski, J.G. Carbonell, T.M. Mitchell, Symbolic Computation, 1983, ISBN: 978-3540132981
  • "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II", Antonio Gulli, CreateSpace, 2015, ISBN: 978-1517216719
  • "Artificial Intelligence: A Modern Approach", Stuart Russell and Peter Norvig, Prentice Hall, 1995, ISBN: 978-0131038059
  • "An Introduction to MultiAgent Systems", Michael Wooldridge, John Wiley & Sons, 2009 (2nd ed.), ISBN: 978-0470519462
  • "Data Mining: Practical Machine Learning Tools and Techniques", Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal, Morgan Kaufmann, ISBN: 978-0128042915
  • "Agent Intelligence Through Data Mining", Andreas L. Symeonidis, Pericles A. Mitkas, Springer/Apress, ISBN: 978-0387257570
  • "Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence", Gerhard Weiss, 2000, ISBN: 978-0262232036
  • "Data Science at the Command Line", Janssens, O'Reilly
  • Also look for titles on Machine Learning, Deep Learning, Spark, Mahout, R, Python, scikit-learn, Data/Text Mining, Elasticsearch, Natural Language, and Statistics from O'Reilly, Packt, Manning (In Action), and Head First
News/Blogs/RSS
Podcasts
MOOCs
Jobs