User:LI AR/Books/Cracking the DataScience Interview
Cracking the DataScience Interview
Basic Stuff To Know
- Generic pages
- Glossaire_de_l'exploration_de_données
- Big_data
- Inspired from books like:
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II"
- "120 real data science interview questions"
- Tips
- DataScience is (very) experimental (Andrew Ng): https://pbs.twimg.com/media/CBXshmjWgAAgLKa.jpg
- Overfitting
- Bias–variance_tradeoff
- Concept_drift
- Correlation_does_not_imply_causation
- Curse_of_dimensionality
- Machine Learning definition and types
- Artificial_intelligence
- List_of_machine_learning_concepts
- Machine_learning
- Data_mining
- Knowledge_extraction
- Knowledge_extraction#Knowledge_discovery
- Pattern_recognition
- Signal_processing
- Supervised_learning
- Semi-supervised_learning
- Unsupervised_learning
- Reinforcement_learning
- Online_machine_learning
- Incremental_learning
- Q-learning
- Feature_learning
- Learning_to_rank
- Similarity_learning
- Biclustering
- Natural_language_processing
- Biomimetics
- Collective_intelligence
- Data_stream_mining
- Sequential_pattern_mining
- Clickstream
- Semantics
- Semantic_Web
- Competitions
- https://www.testdome.com/tests/data-analysis-test/65
- https://www.kaggle.com/
- https://www.datascience.net/fr/home/
- http://www.kdnuggets.com/?s=datasets
- http://www.kdnuggets.com/datasets/index.html
- https://aws.amazon.com/public-datasets/
- http://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html
- Usages
- Inpainting
- Software
- http://www.databaseetl.com/data-mining-tools/
- IDEs
- R/Packages
- https://cran.r-project.org/
- https://cran.r-project.org/web/views/
- https://cran.r-project.org/web/views/MachineLearning.html
- https://cran.r-project.org/web/views/Bayesian.html
- https://cran.r-project.org/web/views/Cluster.html
- https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
- https://cran.r-project.org/web/views/Survival.html
- https://cran.r-project.org/web/views/TimeSeries.html
- Python
- C++
- Alteryx
- https://www.alteryx.com/ [Commercial]
- Comparison
- DeepLearning
- GANs (Generative Adversarial Networks)
- DataViz
- https://matplotlib.org/
- https://plot.ly/
- GGobi http://www.ggobi.org/
- http://ggplot2.org/
- http://ggvis.rstudio.com/
- https://d3js.org/
- https://datascienceplus.com/creating-graphs-with-python-and-goopycharts/
- https://www.tableau.com/ [Commercial]
- http://bokeh.pydata.org/en/latest/ [Python]
- http://pyqtgraph.org/ [Python]
- http://rawgraphs.io/
- http://scidavis.sourceforge.net/
- http://home.gna.org/veusz/
- http://jwork.org/dmelt/
- Graphs
- GUI
- Data Manipulation
- Data_pre-processing
- Data_cleansing
- Data_reduction
- Data_wrangling
- Data_scrubbing
- Data_editing
- Data_scraping
- Data_curation
- Data_pre-processing
- Data_fusion
- Data_integration
- Data_binning
- Sanitization_(classified_information)
- Extract,_transform,_load
- Imputation_(statistics)
- Interpolation
- Outlier
- https://github.com/Quartz/bad-data-guide
- https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
- Local_case-control_sampling#Imbalanced_datasets
- Sampling_(statistics)
- Sampling_(statistics)#Stratified_sampling
- Stratified_sampling
- Jackknife_resampling
- Oversampling_and_undersampling_in_data_analysis
- Oversampling_and_undersampling_in_data_analysis#SMOTE
- AdaBoost
- Essay: "Why Most Published Research Findings Are False"
- "A Few Useful Things to Know about Machine Learning"
- Working with text
- Unicode_equivalence#Normalization
- URL_normalization
- Text_segmentation
- N-gram
- Tokenization_(lexical_analysis)
- Stemming
- Word2vec https://www.tensorflow.org/tutorials/word2vec
- https://google.github.io/seq2seq/
- Working with spatial data
- Spatial_data
- Trend_surface_analysis
- Variogram
- Geary's_C
- Moran's_I
- Spatial_descriptive_statistics#Ripley.27s_K_and_L_functions
- Signal processing
- Signal processing - Images
- Techniques for Feature/Attribute Selection/Dimensionality Reduction
- High-dimensional_statistics
- Dimensionality_reduction
- Factor_analysis
- Principal_component_analysis
- Independent_component_analysis
- Singular_value_decomposition
- Multidimensional_scaling
- T-distributed_stochastic_neighbor_embedding
- Autoencoder
- Deep_learning#Stacked_.28de-noising.29_auto-encoders
- Elastic_map
- Linear_discriminant_analysis
- Signal processing
- Working with spatial data
- Maths (Stats / Algebra)
- Inspiration for this section: https://github.com/soulmachine/machine-learning-cheat-sheet
- Pseudo-random_number_sampling
- Glossary_of_probability_and_statistics
- Bijection,_injection_and_surjection
- Mean
- Harmonic_mean
- Median
- Mode_(statistics)
- Range_(mathematics)
- Quartile
- Interquartile_range
- Variance
- Covariance
- Standard_deviation
- Collinearity#Usage_in_statistics_and_econometrics
- ANOVA
- ANCOVA
- MANOVA
- ANORVA
- Moving_average
- EWMA_chart
- Exponential_smoothing
- Autoregressive_model
- Autoregressive–moving-average_model
- Autoregressive_integrated_moving_average
- Autocorrelation
- Cross-correlation
- Entropy_in_thermodynamics_and_information_theory
- Moment_(mathematics)
- Residual
- Expected_value
- Likelihood_function
- Cumulative_distribution_function
- Probability
- Probability_mass_function
- Probability_density_function
- Prior_probability
- Prior_knowledge_for_pattern_recognition
- Dependent_and_independent_variables
- Independence_(probability_theory)
- Hoeffding's_inequality
- Pareto_efficiency
- Nash_equilibrium
- Pareto_principle
- Tensor_product
- Taxicab_geometry
- Norm_(mathematics)#Euclidean_norm
- Lp_space
- Norm_(mathematics)
- Determinant
- Trace_(linear_algebra)
- Eigenvalues_and_eigenvectors
- Projection_(mathematics)
- Curvature
- Convolution
- Hadamard_product_(matrices)
- Kernel_(statistics)
- Radial_basis_function
- Logit
- Latent_variable
- Inference
- Statistical_inference
- Inductive_reasoning
- Deduction_and_induction
- Transduction_(machine_learning)
- Stochastic
- Stochastic_process
- Probability_theory
- Probability
- Posterior_probability
- Statistic
- Statistics
- Gaussian_noise
- Bayesian_inference
- Bayes_rule
- Bayes'_theorem
- Bayesian_network
- Naive_Bayes_spam_filtering
- Naive_Bayes_classifier
- Loss_function
- Regularization_(mathematics)
- Normalization_(statistics)
- Quantile_normalization
- Nyström_method (+PCA)
- Preference_(economics)
- Delaunay_triangulation
- Neighbourhood_(mathematics)
- Genetic Algorithms
- Mutation_(genetic_algorithm)
- Crossover_(genetic_algorithm)
- Selection_(genetic_algorithm)
- Fitness_function
- Utility#Utility_functions
- SVM
- Neural Networks
- Rectifier_(neural_networks)
- Backpropagation
- Gradient
- Gradient_descent
- Stochastic_gradient_descent
- Gradient_boosting
- http://www.wildml.com/deep-learning-glossary/#gradient-clipping
- http://www.wildml.com/deep-learning-glossary/#batch-normalization
- http://www.wildml.com/deep-learning-glossary/#backpropagation
- http://www.wildml.com/deep-learning-glossary/#momentym
- http://www.wildml.com/deep-learning-glossary/#sgd
- https://visualstudiomagazine.com/articles/2015/07/01/variation-on-back-propagation.aspx
- Softmax acts as a "discriminant learning metric": examples from every class other than {i} also help learn class {i}, because the outputs are forced to sum to 1, which couples the scores of all classes.
- Sigmoid_function
- Hyperbolic_function#Tanh
- Dropout_(neural_networks)
- Radial_basis_function
- Hebbian_theory
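The softmax note in the list above can be illustrated with a minimal sketch. It assumes a cross-entropy loss, which the list does not name, and the helper names `softmax` and `cross_entropy_grad` are hypothetical. The point is only that one labelled example produces a non-zero gradient on every class score, because the outputs must sum to 1.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of the cross-entropy loss w.r.t. each logit: p_k - 1[k == target]."""
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

logits = [2.0, 1.0, 0.1]          # illustrative scores for three classes
probs = softmax(logits)
grad = cross_entropy_grad(logits, target=0)  # one example labelled class 0

print("probabilities:", probs)
print("gradient per class score:", grad)
```

With gradient descent, the score of the labelled class is pushed up while the scores of the other two classes are pushed down, so an example of class 0 also carries information about classes 1 and 2.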
- Signal processing
- Signal_processing
- Low-pass_filter
- High-pass_filter
- Energy_(signal_processing)
- Fast_Fourier_transform
- Wavelet
- Discrete_wavelet_transform
- Coherence_(signal_processing)
- Kalman_filter
- Time Series
- Time_series
- Decomposition_of_time_series
- Seasonal_adjustment
- Seasonality
- Frequency_domain
- Time_domain
- Spectral_density
- Games
- Distances
- Distance
- Euclidean_distance [dim1]
- Edit_distance
- Hamming_distance
- Manhattan_distance [dim1]
- Levenshtein_distance
- Minkowski_distance [dim n == generalization]
- Mahalanobis_distance
- Canberra_distance
- Distance_correlation
- Angular_distance
- String_metric
- Jaro–Winkler_distance
- Jaccard_index
- Kendall_tau_distance
- Chebyshev_distance
- Tf–idf
- Neural_coding
- For graphs: http://blog.smola.org/post/33412570425
- https://fr.wikipedia.org/wiki/Algorithme_de_Needleman-Wunsch
- Clouds
- Hausdorff_distance [between clouds of points, a point and a cloud]
- Distance#Distances_between_sets_and_between_a_point_and_a_set
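The bracketed note on Minkowski_distance calls it a generalization. A minimal sketch of that relationship, using the standard definition with an order parameter `p` (the function name `minkowski` and the sample points are illustrative assumptions): `p = 1` gives the Manhattan distance, `p = 2` the Euclidean distance, and the limit `p → ∞` the Chebyshev distance.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if p == float("inf"):
        # Limiting case: the largest coordinate-wise difference (Chebyshev).
        return max(abs(a - b) for a, b in zip(u, v))
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

u, v = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(u, v, 1)             # |3| + |4|
euclidean = minkowski(u, v, 2)             # sqrt(3^2 + 4^2)
chebyshev = minkowski(u, v, float("inf"))  # max(|3|, |4|)

print(manhattan, euclidean, chebyshev)
```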
- Distributions
- Discrete_uniform_distribution
- Normal_distribution
- Bernoulli_distribution
- Binomial_distribution
- Poisson_distribution
- Chi-squared_distribution
- Log-normal_distribution
- Pareto_distribution
- Gibbs_distribution
- Weibull_distribution
- Gamma_distribution
- Beta_distribution
- Hypergeometric_distribution
- Dirac_delta_function
- https://ercim-news.ercim.eu/en107/special/robust-and-adaptive-methods-for-sequential-decision-making [Characterization of the simplicity of a distribution: BernsteinExponent+TsybakovMarginCondition]
- Evaluation
- Performance_indicator
- Mean_absolute_percentage_error
- Mean_absolute_scaled_error
- Symmetric_mean_absolute_percentage_error
- Regression-kriging
- https://www.kaggle.com/wiki/RootMeanSquaredLogarithmicError
- http://weka.sourceforge.net/packageMetaData/percentageErrorMetrics/index.html
- http://weka.sourceforge.net/packageMetaData/logarithmicErrorMetrics/index.html
- Information_gain_ratio
- Kullback–Leibler_divergence
- Gini_coefficient
- Pearson_correlation_coefficient
- Entropy
- http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/node15.html
- Akaike_information_criterion
- Bayesian_information_criterion
- Structural_similarity
- Type_I_and_type_II_errors
- False_positive_rate
- False_coverage_rate
- False_discovery_rate
- Confusion_matrix
- Accuracy_and_precision
- Precision_and_recall
- F1_score
- Sensitivity_and_specificity
- Receiver_operating_characteristic
- Receiver_operating_characteristic#Area_under_the_curve
- Discounted_cumulative_gain
- Cross-validation_(statistics)
- Errors_and_residuals
- If the residual is consistently > 0 or < 0 over a range of the training set, the model has failed to capture something in the data, or the wrong type of model is being used (e.g. linear regression on parabolic data; see DataSkeptic/Heteroskedasticity)
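The residual check described above can be sketched with the same example the note gives, a straight line fitted to parabolic data. The data set is synthetic and noise-free, and `fit_line` is a hypothetical helper implementing ordinary least squares; neither comes from the linked pages.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

xs = [x / 2 for x in range(-10, 11)]   # -5.0, -4.5, ..., 5.0
ys = [x ** 2 for x in xs]              # parabolic target (assumed, no noise)

a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
signs = "".join("+" if r > 0 else "-" for r in residuals)

print(signs)  # long runs of one sign instead of a random mix
```

Residuals from a well-specified model should change sign more or less at random along the input range. Here they stay positive at both ends and negative in the middle, which is the signature of a missing quadratic term.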
- Clustering
- See also the Calinski-Harabasz Index: http://stats.stackexchange.com/questions/97429/intuition-behind-the-calinski-harabasz-index
- Working with Text
- Semantic_similarity
- Tf–idf
- Cosine_similarity
- Okapi_BM25
- See also Mr Gomez's page on Weka: http://www.esp.uem.es/jmgomez/tmweka/
- Named-entity_recognition
- Conditional_random_field
- Latent_Dirichlet_allocation
- Sentiment_analysis
- Web_mining
- Web_crawler
- Text_mining
- Document_classification
- Automatic_summarization
- Working with Images
- http://mirror.imagej.net/plugins/mexican-hat/index.html
- If your model seeks to penalize near misses, the Mexican hat function is a good choice.
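A small sketch of the Mexican hat (Ricker) function may help with the note above. It uses the unnormalized form and an assumed width parameter `sigma`; reading the negative lobe as a penalty on near misses is one interpretation of the note, not something the linked plugin page states.

```python
import math

def mexican_hat(t, sigma=1.0):
    """Unnormalized Mexican hat (Ricker) function: (1 - u) * exp(-u / 2), u = (t / sigma)^2."""
    u = (t / sigma) ** 2
    return (1.0 - u) * math.exp(-u / 2.0)

centre = mexican_hat(0.0)              # exact hit: maximum weight
zero_crossing = mexican_hat(1.0)       # weight vanishes at |t| = sigma
trough = mexican_hat(math.sqrt(3.0))   # most negative weight, at |t| = sqrt(3) * sigma
far_away = mexican_hat(6.0)            # distant points are essentially ignored

print(centre, zero_crossing, trough, far_away)
```

Points just outside the central peak receive a negative weight, while points far from the centre contribute almost nothing, which is why the shape suits objectives that treat a near miss as worse than a clear miss.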
- Visualization
- Data_visualization
- Exploratory_data_analysis
- List_of_graphical_methods
- Statistical_graphics
- Visual_perception
- Heat_map
- Misleading_graph
- Pareto_chart
- (Statistical) tests
- A/B_testing
- Evaluating a hypothesis
- Statistical_power
- Statistical_hypothesis_testing
- P-value
- Student's_t-test
- Chi-squared_test
- Type_I_and_type_II_errors
- Detecting abrupt changes in time series
- Stationary_process
- Structural_break
- Chow_test
- Kruskal–Wallis_one-way_analysis_of_variance
- F-test
- F-statistics
- Pairwise_summation
- CUSUM
- MOSUM: https://cran.r-project.org/web/packages/strucchange/vignettes/strucchange-intro.pdf
- Time series / Chaos
- Machine Learning Techniques
- Statistical_classification
- One-class_classification
- Binary_classification
- Multiclass_classification
- Multi-label_classification
- Structured_prediction
- Cluster_analysis
- Elbow_method_(clustering)
- Nearest_neighbor_search#Approximate_nearest_neighbor
- Regression_analysis
- Linear_regression
- Logistic_regression
- Ridge_regression
- Kriging
- Multivariate_adaptive_regression_splines
- Association_rule_learning
- Apriori_algorithm
- Survival_analysis
- Monte_Carlo_method
- Monte_Carlo_algorithm
- Multinomial_logistic_regression
- Lasso_(statistics)
- Expectation–maximization_algorithm
- Markov_chain_Monte_Carlo
- Hidden_Markov_Models
- Viterbi_algorithm
- CART
- Decision_tree_learning
- Decision_tree
- Pruning_(decision_trees)
- ID3_algorithm
- C4.5_algorithm
- Random_forest
- Support_vector_machine
- Support_vector_machine#Support_vector_clustering_.28SVC.29
- Support_vector_machine#Regression
- Conditional_random_field
- Latent_semantic_analysis
- Genetic_algorithm
- Evolutionary_algorithm
- Evolutionary_computation
- Voronoi_diagram
- Local_outlier_factor
- Ordered_weighted_averaging_aggregation_operator
- Support_vector_machine
- Neural Networks
- History: http://www.chronicle.com/article/The-Believers/190147/
- The various types of NN as a picture: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png
- Types_of_artificial_neural_networks
- Comparison_of_deep_learning_software/Resources
- Artificial_neural_network
- Perceptron
- Feedforward_neural_network
- Multilayer_perceptron
- Radial_basis_function_network
- Long_short-term_memory
- SNNS
- Time_delay_neural_network
- Recursive_neural_network
- Recurrent_neural_network
- Hopfield_network
- Content-addressable_memory
- Boltzmann_machine
- Self-organizing_map
- Learning_vector_quantization
- Long_short-term_memory
- Liquid_state_machine
- Autoassociative_memory
- Convolutional_neural_network
- Autoencoder
- Neuroevolution
- Neuroevolution_of_augmenting_topologies
- Deep_learning
- Deep_learning#Deep_neural_network_architectures
- Deep_belief_network
- Generative_adversarial_networks
- Signal Processing
- Fuzzy Logic
- Fuzzy_logic
- Inference_engine
- Fuzzy_logic
- Type-2_fuzzy_sets_and_systems
- T-norm_fuzzy_logics
- Adaptive_neuro_fuzzy_inference_system
- Fuzzy_control_system
- Working with spatial data
- Ensemble Techniques
- Ensemble Learning = Boosting, Bagging or Stacking: http://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning#19053
- Applying Bagging should help reduce variance and overfitting.
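The variance claim above can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not a reference implementation: the base learner is a 1-nearest-neighbour regressor (chosen because it has high variance), the data are synthetic noisy samples of a sine curve, and all function names are hypothetical. It compares how much the prediction at one query point varies across many training sets, with and without bagging.

```python
import math
import random
import statistics

def make_training_set(rng, n=30, noise=0.5):
    """Draw n noisy samples of sin(2*pi*x) on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, noise)) for x in xs]

def predict_1nn(train, x0):
    """Predict with the target value of the single nearest training point."""
    return min(train, key=lambda pt: abs(pt[0] - x0))[1]

def predict_bagged(train, x0, rng, n_models=50):
    """Average 1-NN predictions over bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]  # sample with replacement
        preds.append(predict_1nn(boot, x0))
    return sum(preds) / n_models

rng = random.Random(0)
x0 = 0.5
single, bagged = [], []
for _ in range(200):                   # 200 independent training sets
    train = make_training_set(rng)
    single.append(predict_1nn(train, x0))
    bagged.append(predict_bagged(train, x0, rng))

var_single = statistics.variance(single)
var_bagged = statistics.variance(bagged)
print("variance of single model:", var_single)
print("variance of bagged model:", var_bagged)
```

Averaging over bootstrap resamples spreads each prediction over several nearby points instead of one, so the noise of any single training point matters less.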
- Applications
- Bayesian_spam_filtering
- Experimentation framework
- Goal: test various parameters on various algorithms to determine the best model(s)
- Weka's "Experimenter" mode: http://weka.sourceforge.net/manuals/ExplorerGuide.pdf
- AutoWeka: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
- R::mlrMBO: https://github.com/mlr-org/mlrMBO
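The goal stated above, trying several parameter settings on several algorithms and ranking the resulting models, can be sketched as a plain grid search. Everything in this example is an illustrative assumption: the two toy classifiers, their parameter grids, the tiny hand-made data set, and the use of validation accuracy as the score. The tools listed above automate the same loop at a much larger scale.

```python
from itertools import product

def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    neighbours = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

def threshold_predict(train, x, cut):
    """Fixed decision threshold; ignores the training data."""
    return 1 if x > cut else 0

def accuracy(predict, train, valid, **params):
    """Fraction of validation points classified correctly."""
    hits = sum(predict(train, x, **params) == y for x, y in valid)
    return hits / len(valid)

# Hypothetical search space: algorithm name -> (predict function, parameter grid).
SEARCH_SPACE = {
    "knn": (knn_predict, {"k": [1, 3, 5]}),
    "threshold": (threshold_predict, {"cut": [0.3, 0.5, 0.7]}),
}

def run_experiments(train, valid):
    """Evaluate every algorithm/parameter combination, best score first."""
    results = []
    for name, (predict, grid) in SEARCH_SPACE.items():
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            params = dict(zip(keys, values))
            score = accuracy(predict, train, valid, **params)
            results.append((score, name, params))
    return sorted(results, key=lambda r: r[0], reverse=True)

train = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 0),
         (0.45, 1), (0.6, 1), (0.8, 1), (0.9, 1)]
valid = [(0.15, 0), (0.3, 0), (0.42, 0), (0.55, 1), (0.7, 1), (0.95, 1)]

ranking = run_experiments(train, valid)
best_score, best_name, best_params = ranking[0]
for score, name, params in ranking:
    print(f"{score:.2f}  {name}  {params}")
```

Because the winning configuration is chosen on the validation set, its score there is optimistic; a separate test set or cross-validation is needed for an honest final estimate.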
- Coding / Exposing API to the rest of the application
- Microservices
- Map-Reduce framework
- Scraping
- Storage
- Apache_Hadoop#HDFS https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
- Apache_HBase http://hbase.apache.org/
- Apache_Hive https://hive.apache.org/
- Transfers - to/from RelationalDB
- Transfers - serialization/streaming
- Storage - In memory
- Admin
- Apache_ZooKeeper http://zookeeper.apache.org/
- Apache_Cassandra https://cassandra.apache.org
- Ambari http://ambari.apache.org/
- Apache_Oozie http://oozie.apache.org/
- Programming
- ML
- Working with text
- Working with text - Data Viz
- Small/Micro Data
- Multi-Agent Systems
- Agent-based_model
- Multi-agent_system
- Agent-oriented_software_engineering
- https://www.researchgate.net/publication/266182243_Agent_Groupe_Role_et_Service_Un_modele_organisationnel_pour_les_systemes_multi-agents_ouverts [JFerber: AGR Methodology]
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.7968&rep=rep1&type=pdf [YDemazeau: Vowels Methodology]
- Quantum Machine Learning
- Quantum_machine_learning
- Quantum_tunnelling
- Quantum_annealing
- Adiabatic_quantum_computation
- Resources
- http://www.wildml.com/deep-learning-glossary/
- http://deeplearning.net
- https://www.datacamp.com
- http://www.learnpython.org
- https://www.codecademy.com/learn/python
- http://www.dataschool.io/how-to-get-better-at-data-science/
- http://simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/
- Social network for DataScientists
- Books
- "Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms", Jeff Heaton, 2013, ISBN:9781493682225
- "Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms", Jeff Heaton, 2014, ISBN: 978-1499720570
- "Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks", Jeff Heaton, 2015, ISBN: 978-1505714340
- "Introduction to Machine Learning (Adaptive Computation and Machine Learning)", E. Alpaydin, MIT Press, 2004, ISBN: 978-0262012430
- "Machine Learning: An Artificial Intelligence Approach", R.S. Michalski, J.G. Carbonell, T.M. Mitchell, Symbolic Computation, 1983, ISBN:978-3540132981
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II", Antonio Gulli, CreateSpace, 2015, ISBN:978-1517216719
- "Artificial Intelligence a Modern Approach", Stuart Russell and Peter Norvig, Prentice Hall, 1995, ISBN:978-0131038059
- "An Introduction to MultiAgent Systems", Michael Wooldridge, John Wiley & Sons, 2009 (2nd ed), ISBN:978-0470519462
- "Data Mining: Practical Machine Learning Tools and Techniques", Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher J. Pal, Morgan Kaufmann, ISBN:978-0128042915
- "Agent Intelligence Through Data Mining", Andreas L. Symeonidis, Pericles A. Mitkas, Springer/Apress, ISBN:978-0387257570
- "Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence", Gerhard Weiss, 2000, ISBN:978-0262232036
- "Data science at the command line", Janssens, O'Reilly.
- Also look for MachineLearning, DeepLearning, Spark, Mahout, R, Python, SciKit-Learn, Data/Text Mining, ElasticSearch, Natural Language, Statistics @ O'Reilly, Packt, Manning/In Action, HeadFirst
- News/Blogs/RSS
- https://www.reddit.com/r/machinelearning
- https://www.reddit.com/r/statistics
- https://www.reddit.com/r/datascience
- https://www.reddit.com/r/bigdata
- http://www.kdnuggets.com/
- http://www.becomingadatascientist.com/
- https://rdatamining.wordpress.com/
- http://www.r-bloggers.com/
- https://dataaspirant.com/
- http://www.joyofdata.de/blog/
- https://www.dataiku.com/blog/
- https://www.datacamp.com/community/
- http://beautifuldata.net/
- http://www.datatau.com/news
- http://dataelixir.com/
- http://www.oreilly.com/data/newsletter.html
- http://blog.kaggle.com/
- http://blog.yhathq.com/
- http://simplystatistics.org/
- http://fastml.com/
- http://www.win-vector.com/blog/
- http://fivethirtyeight.com/
- http://www.dataschool.io/
- https://research.facebook.com/blog/datascience/
- http://deeplearning.net/feed/
- http://learningwithdata.com/
- http://blog.plot.ly/
- https://datasciencelab.wordpress.com/
- https://shapeofdata.wordpress.com/
- http://datalab.lu/
- http://www.pythonweekly.com/
- http://pbpython.com/
- https://plus.google.com/communities/105141578068503684401 ( https://plus.google.com/+JaanaNystr%C3%B6m/posts/MKCV3vNsn1g )
- http://blog.revolutionanalytics.com/2012/12/the-most-influential-data-scientists-on-twitter.html
- http://www.kdnuggets.com/2012/12/most-influential-data-scientists-on-twitter.html
- Podcasts
- http://www.learningmachines101.com/
- http://www.thetalkingmachines.com/
- http://dataskeptic.com/
- http://www.partiallyderivative.com/
- http://www.ocdqblog.com/podcast/
- http://blog.pivotal.io/podcasts-pivotal
- https://www.udacity.com/podcasts/linear-digressions
- http://datastori.es/
- http://radar.oreilly.com/tag/oreilly-data-show-podcast
- http://freakonomics.com/radio/freakonomics-radio-podcast-archive/
- http://simplystatistics.org/category/podcast/
- http://data-informed.com/multimedia/podcasts/
- http://www.bbc.co.uk/programmes/p02nrss1
- MOOCs
- Generic
- Weka
- Andrew Ng
- Yann Lecun
- Hans Rosling (visualization)
- From renowned universities
- https://www.coursera.org/specializations/jhu-data-science
- https://www.coursera.org/specializations/machine-learning
- https://www.coursera.org/specializations/data-science-python
- https://www.coursera.org/specializations/big-data
- https://www.coursera.org/learn/machine-learning
- https://www.coursera.org/learn/r-programming
- https://www.coursera.org/learn/data-scientists-tools
- https://www.coursera.org/learn/python-data-analysis
- http://www.holehouse.org/mlclass/
- http://online.stanford.edu/course/statistical-learning
- http://work.caltech.edu/telecourse.html
- https://www.udacity.com/course/data-analyst-nanodegree--nd002
- https://www.thinkful.com/courses/learn-data-science-online/
- https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x7
- https://www.coursetalk.com/
- https://github.com/justmarkham/DAT7#bonus-resources
- http://datasciencemasters.org/
- http://www.wolfram.com/broadcast/c?c=99
- http://www.wolfram.com/broadcast/c?c=97
- http://www.wolfram.com/broadcast/c?c=397
- DataSchool
- Jobs