User:LI AR/Books/Cracking the DataScience Interview
This user book is a user-generated collection of Wikipedia articles.
Cracking the DataScience Interview
Basic Stuff To Know
- Generic pages
- Glossaire_de_l'exploration_de_données
- Big_data
- Inspired by books such as:
- "A collection of Data Science Interview Questions Solved in Python and Spark vol I & II"
- "120 real data science interview questions"
- Tips
- DataScience is (very) experimental (Andrew Ng): https://pbs.twimg.com/media/CBXshmjWgAAgLKa.jpg
- Competitions
- https://www.testdome.com/tests/data-analysis-test/65
- https://www.kaggle.com/
- https://www.datascience.net/fr/home/
- Datasets
- http://www.kdnuggets.com/?s=datasets
- http://www.kdnuggets.com/datasets/index.html
- https://aws.amazon.com/public-datasets/
- http://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html
- IDEs
- https://cran.r-project.org/
- https://cran.r-project.org/web/views/
- https://cran.r-project.org/web/views/MachineLearning.html
- https://cran.r-project.org/web/views/Bayesian.html
- https://cran.r-project.org/web/views/Cluster.html
- https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
- https://cran.r-project.org/web/views/Survival.html
- https://cran.r-project.org/web/views/TimeSeries.html
- Python/SciKit-Learn
- https://www.python.org/
- http://scikit-learn.org/stable/
- Data Manipulation
- https://github.com/Quartz/bad-data-guide
- https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
- Essay: "Why Most Published Research Findings Are False"
http://robotics.cs.tamu.edu/RSS2015NegativeResults/pmed.0020124.pdf
- "A Few Useful Things to Know about Machine Learning"
https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
- Maths (Stats / Algebra)
- Inspiration for this section: https://github.com/soulmachine/machine-learning-cheat-sheet
- Glossary_of_probability_and_statistics
- Mode_(statistics)
- Variance
- Covariance
- Entropy_in_thermodynamics_and_information_theory
- Expected_value
- Likelihood_function
- Cumulative_distribution_function
- Probability_mass_function
- Probability_density_function
- Pareto_efficiency
- Tensor_product
- Taxicab_geometry
- Norm_(mathematics)#Euclidean_norm
- Lp_space
- Norm_(mathematics)
- Determinant
- Trace_(linear_algebra)
- Eigenvalues_and_eigenvectors
- Convolution
- Hadamard_product_(matrices)
- Kernel_(statistics)
- Radial_basis_function
- Logit
- Latent_variable
- Inference
- Statistical_inference
- Inductive_reasoning
- Deduction_and_induction
- Distributions
- Discrete_uniform_distribution
- Normal_distribution
- Bernoulli_distribution
- Binomial_distribution
- Poisson_distribution
- Chi-squared_distribution
- Log-normal_distribution
- Weibull_distribution
- Gamma_distribution
- Beta_distribution
- Hypergeometric_distribution
- Neural Nets
- Evaluation
- Mean_absolute_percentage_error
- Mean_absolute_scaled_error
- Symmetric_mean_absolute_percentage_error
- Regression-kriging
- https://www.kaggle.com/wiki/RootMeanSquaredLogarithmicError
- http://weka.sourceforge.net/packageMetaData/percentageErrorMetrics/index.html
- http://weka.sourceforge.net/packageMetaData/logarithmicErrorMetrics/index.html
- Information_gain_ratio
- Kullback–Leibler_divergence
- Gini_coefficient
- Akaike_information_criterion
- Precision_and_recall
- Sensitivity_and_specificity
- Receiver_operating_characteristic
- Receiver_operating_characteristic#Area_under_the_curve
- Cross-validation_(statistics)
- Errors_and_residuals
- If the residuals are consistently > 0 or < 0 over a range of the training set, the model has failed to capture something in the data, or the wrong type of model is being used (e.g. linear regression on parabolic data; DataSkeptic/Heteroskedasticity)
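The residual check above can be illustrated with a minimal sketch, assuming a synthetic parabolic dataset and a straight-line least-squares fit. The helper names `fit_line`, `residuals` and `sign_runs` are hypothetical and written in plain Python for illustration; they do not come from any library listed in this book.

```python
# Sketch: fit a straight line to parabolic data, then look at the sign of
# the residuals along x. Long same-sign runs suggest the wrong model type.

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(xs, ys, intercept, slope):
    """Observed value minus fitted value for every point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def sign_runs(values):
    """Group consecutive values by sign; zero is counted as '-' here."""
    runs = []
    for v in values:
        sign = "+" if v > 0 else "-"
        if runs and runs[-1][0] == sign:
            runs[-1] = (sign, runs[-1][1] + 1)
        else:
            runs.append((sign, 1))
    return runs

# Assumed toy data: y = x^2 sampled on [-5, 5] in steps of 0.5.
xs = [x / 2 for x in range(-10, 11)]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
res = residuals(xs, ys, intercept, slope)
runs = sign_runs(res)
print(runs)
```

On this toy data the residuals are positive at both ends of the range and negative in the middle, which is the pattern described above: the line cannot follow the curvature. Residuals from a well-specified model would instead change sign irregularly along x.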
- Clustering
- See also the Calinski-Harabasz Index: http://stats.stackexchange.com/questions/97429/intuition-behind-the-calinski-harabasz-index
- Working with Text
- Tf–idf
- Okapi_BM25
- See also Mr Gomez's page on Weka: http://www.esp.uem.es/jmgomez/tmweka/
- Sentiment_analysis
- Named-entity_recognition
- Conditional_random_field
- Latent_Dirichlet_allocation
- Apache_Lucene
- Visualization
- Data_visualization
- Exploratory_data_analysis
- Statistical_graphics
- Visual_perception
- Heat_map
- Misleading_graph
- Pareto_chart
- Feature/Attribute Selection / Dimensionality Reduction
- Principal_component_analysis
- Independent_component_analysis
- Singular_value_decomposition
- T-distributed_stochastic_neighbor_embedding
- Autoencoder
- Deep_learning#Stacked_.28de-noising.29_auto-encoders
- Statistical tests
- Evaluating a hypothesis
- Detecting abrupt changes in time series
- Structural_break
- Chow_test
- Kruskal–Wallis_one-way_analysis_of_variance
- F-test
- F-statistics
- Pairwise_summation
- CUSUM
- Chaos
- Lyapunov_exponent
- Techniques
- Statistical_classification
- Cluster_analysis
- Regression_analysis
- Linear_regression
- Logistic_regression
- Association_rule_learning
- Survival_analysis
- Monte_Carlo_method
- Multinomial_logistic_regression
- Lasso_(statistics)
- Expectation–maximization_algorithm
- Latent_semantic_analysis
- Evolutionary_algorithm
- Genetic_algorithm
- Voronoi_diagram
- Hidden_Markov_model
- Local_outlier_factor
- Ordered_weighted_averaging_aggregation_operator
- Neural Nets
- The various types of NN as a picture: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png
- Ensemble Techniques
- Ensemble_learning
- Ensemble Learning = Boosting, Bagging or Stacking: http://stats.stackexchange.com/questions/18891/bagging-boosting-and-stacking-in-machine-learning#19053
- Experimentation framework
- Goal: test various parameters on various algorithms to determine the best model(s)
- Weka's "Experimenter" mode: http://weka.sourceforge.net/manuals/ExplorerGuide.pdf
- AutoWeka: http://www.cs.ubc.ca/labs/beta/Projects/autoweka/
- R::mlrMBO: https://github.com/mlr-org/mlrMBO
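The goal stated for this section (trying several parameter settings on several algorithms and keeping the best model) can be sketched in a few lines. This is a minimal illustration in plain Python, not how Weka's Experimenter, AutoWeka or mlrMBO work internally. The two toy learners, the `GRID` dictionary, the fold assignment and the synthetic data are all assumptions made for the example.

```python
# Sketch: exhaustive search over (algorithm, parameters) pairs, scored by
# k-fold cross-validated mean squared error on a 1-D regression task.

def knn_fit(xs, ys, k):
    """k-nearest-neighbour regression: average the k closest targets."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def ridge_fit(xs, ys, lam):
    """1-D ridge regression with an unpenalised intercept (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def cv_mse(fit, params, xs, ys, n_folds=5):
    """Mean squared error over n_folds folds (fold = index modulo n_folds)."""
    total, count = 0.0, 0
    points = list(zip(xs, ys))
    for fold in range(n_folds):
        train = [p for i, p in enumerate(points) if i % n_folds != fold]
        test = [p for i, p in enumerate(points) if i % n_folds == fold]
        predict = fit([x for x, _ in train], [y for _, y in train], **params)
        total += sum((predict(x) - y) ** 2 for x, y in test)
        count += len(test)
    return total / count

# Hypothetical search space: algorithm name -> (fit function, parameter sets).
GRID = {
    "knn": (knn_fit, [{"k": 1}, {"k": 3}, {"k": 7}]),
    "ridge": (ridge_fit, [{"lam": 0.0}, {"lam": 1.0}, {"lam": 10.0}]),
}

def run_experiment(xs, ys):
    """Score every configuration and return them sorted, best first."""
    results = []
    for name, (fit, param_list) in GRID.items():
        for params in param_list:
            results.append((cv_mse(fit, params, xs, ys), name, params))
    results.sort(key=lambda r: r[0])
    return results

# Assumed toy data: a noise-free line, so the unpenalised linear fit should win.
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 for x in xs]

results = run_experiment(xs, ys)
best_score, best_name, best_params = results[0]
print(best_name, best_params, best_score)
```

Scoring on held-out folds rather than on the training data is the important design choice here: 1-nearest-neighbour reproduces its training set perfectly, so ranking by training error would always favour it.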
- Coding / Exposing API to the rest of the application
- Microservices
- BigData
- Star_schema
- OLAP_cube
- Solid-state_drive
- MongoDB
- Map-Reduce framework
- Scraping
- Storage
- Apache_Hadoop#HDFS https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
- Apache_HBase http://hbase.apache.org/
- Apache_Hive https://hive.apache.org/
- Transfers - to/from RelationalDB
- Transfers - serialization/streaming
- Storage - In memory
- Admin
- Apache_ZooKeeper http://zookeeper.apache.org/
- Apache_Cassandra https://cassandra.apache.org
- Ambari http://ambari.apache.org/
- Apache_Oozie http://oozie.apache.org/
- Programming
- ML
- Working with text
- Working with text - Data Viz
- Resources
- http://deeplearning.net
- https://www.datacamp.com
- http://www.learnpython.org/
- https://www.codecademy.com/learn/python
- News/Blogs/RSS
- https://www.reddit.com/r/machinelearning
- https://www.reddit.com/r/statistics
- https://www.reddit.com/r/datascience
- https://www.reddit.com/r/bigdata
- http://www.kdnuggets.com/
- http://www.becomingadatascientist.com/
- https://rdatamining.wordpress.com/
- http://www.r-bloggers.com/
- https://dataaspirant.com/
- http://www.joyofdata.de/blog/
- https://www.dataiku.com/blog/
- https://www.datacamp.com/community/
- http://beautifuldata.net/
- http://www.dataschool.io/
- https://research.facebook.com/blog/datascience/
- http://deeplearning.net/feed/
- http://learningwithdata.com/
- http://blog.kaggle.com/
- http://blog.plot.ly/
- https://datasciencelab.wordpress.com/
- https://shapeofdata.wordpress.com/
- http://datalab.lu/
- Podcasts
- http://www.learningmachines101.com/
- http://www.thetalkingmachines.com/
- http://dataskeptic.com/
- http://www.partiallyderivative.com/
- http://www.ocdqblog.com/podcast/
- http://blog.pivotal.io/podcasts-pivotal
- https://www.udacity.com/podcasts/linear-digressions
- http://datastori.es/
- http://radar.oreilly.com/tag/oreilly-data-show-podcast
- http://freakonomics.com/radio/freakonomics-radio-podcast-archive/
- http://simplystatistics.org/category/podcast/
- http://data-informed.com/multimedia/podcasts/
- http://www.bbc.co.uk/programmes/p02nrss1
- MOOCs
- Weka
- http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
- http://www.cs.waikato.ac.nz/ml/weka/mooc/moredataminingwithweka/
- http://www.cs.waikato.ac.nz/ml/weka/mooc/advanceddataminingwithweka/
- Andrew Ng
- https://www.youtube.com/watch?v=UzxYlbK2c7E&list=PLJ_CMbwA6bT-n1W0mgOlYwccZ-j6gBXqE
- Yann Lecun
- https://www.college-de-france.fr/site/yann-lecun/course-2015-2016.htm
- From renowned universities
- https://www.coursera.org/specializations/jhu-data-science
- https://www.coursera.org/specializations/machine-learning
- https://www.coursera.org/specializations/data-science-python
- https://www.coursera.org/specializations/big-data
- https://www.coursera.org/learn/machine-learning
- https://www.coursera.org/learn/r-programming
- https://www.coursera.org/learn/data-scientists-tools
- https://www.coursera.org/learn/python-data-analysis
- DataSchool
http://www.dataschool.io/learn/