Keras and Tensorflow Basics in Python – Simple Example

Below is a small example showing how to utilize Keras/Tensorflow 2.0 to predict a value utilizing a small dataset. More explanations follow in the Jupyter notebook below...

Keras / Tensorflow Basics - A Simple Example

The dataset utilized here is fake, for the sake of example use only. It contains a price and two "features". We're assuming the dataset is a price listing of gemstones, and based on the features we can predict what the price of a new gemstone added to the list may be. The data can be found here.

In [1]:
#Imports
import pandas as pd
import numpy as np
import seaborn as sns

Data

In [2]:
df = pd.read_csv('Keras/fake_reg.csv')

In [3]:
df.head()

Out[3]:
        price     feature1     feature2
0  461.527929   999.787558   999.766096
1  548.130011   998.861615  1001.042403
2  410.297162  1000.070267   998.844015
3  540.382220   999.952251  1000.440940
4  546.024553  1000.446011  1000.338531

In [4]:
sns.pairplot(df)

Out[4]:
<seaborn.axisgrid.PairGrid at 0x18188b92e48>

This is a very simple dataset, but the pairplot can show us how the two features may correlate to pricing.

Training the Model

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
#We need .values because it's best to pass in numpy arrays due to how tensorflow works
X = df[['feature1', 'feature2']].values
y = df['price'].values

In [7]:
#Split into test/train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
#Scale data to be...
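The excerpt trails off at the scaling step. As a rough sketch of how the rest might look - scaling the features and fitting a small Keras Sequential regression model - see below. The layer sizes, optimizer, and epoch count here are illustrative assumptions, not necessarily the values used in the full notebook.

from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Scale features to the 0-1 range; fit on the training data only to avoid leakage
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# A small fully-connected regression network (illustrative layer sizes)
model = Sequential([
    Dense(4, activation='relu'),
    Dense(4, activation='relu'),
    Dense(4, activation='relu'),
    Dense(1)  # single output: the predicted price
])
model.compile(optimizer='rmsprop', loss='mse')

# Illustrative epoch count
model.fit(x=X_train, y=y_train, epochs=250, verbose=0)

# Evaluate the mean squared error on the held-out test set
print(model.evaluate(X_test, y_test, verbose=0))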
Read More

Kaggle Submission: Titanic

I've already briefly done some work on this dataset in my tutorial for Logistic Regression - but never in its entirety. I decided to re-evaluate utilizing Random Forest and submit to Kaggle. In this dataset, we're utilizing a testing/training dataset of passengers on the Titanic in which we need to predict if passengers survived or not (1 or 0).

Titanic Dataset - Kaggle Submission

You can view my Kaggle submissions here.

Initial Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data

In [2]:
#Import training data
df_test = pd.read_csv('test.csv')
df_train = pd.read_csv('train.csv')

In [3]:
df_test.head()

Out[3]:
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S

In [4]:
df_train.head()

Out[4]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Cleanse Missing Data

In [5]:
#We have some missing data here, so we're going to map out where NaN exists
#We might be able to take care of age, but cabin is probably too bad to save
sns.heatmap(df_train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Out[5]:
<matplotlib.axes._subplots.AxesSubplot at...
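The excerpt stops at mapping out missing data. As a minimal sketch of where the Random Forest part could go after cleansing, something like the following would work. The specific imputation and column choices here are illustrative assumptions, not the exact steps from the full notebook.

from sklearn.ensemble import RandomForestClassifier

# Illustrative cleanup: fill missing ages with the median and drop the sparse Cabin column
train = df_train.copy()
train['Age'] = train['Age'].fillna(train['Age'].median())
train = train.drop(['Cabin', 'Name', 'Ticket'], axis=1)
train = train.dropna(subset=['Embarked'])

# One-hot encode the categorical columns
train = pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)

X = train.drop(['Survived', 'PassengerId'], axis=1)
y = train['Survived']

# Fit a Random Forest and check the training accuracy
rfc = RandomForestClassifier(n_estimators=200, random_state=42)
rfc.fit(X, y)
print(rfc.score(X, y))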
Read More

Grand Totals and Subtotals in SQL Server

Just one of those things you may not know existed - the ROLLUP clause. It's very useful for totaling. Are you used to doing something like this?

SELECT customer, sum(sales) from table
UNION
SELECT 'Total', sum(sales) from table

The problem here is that on larger datasets, this can really be a pain in terms of performance. ROLLUP can be extremely quick, but the syntax may not be straightforward. Here's an example of summing sales in AdventureWorks2017 with a Grand Total.

USE AdventureWorks2017

SELECT SalesOrderID, sum(LineTotal)
FROM Sales.SalesOrderDetail
where SalesOrderID in ('43659', '43660', '43661', '43662', '43663')
GROUP BY SalesOrderID WITH ROLLUP

Notice I've added a WITH ROLLUP at the end of my GROUP BY. This creates the grand total. However, I'm left with a NULL for the SalesOrderID. To solve this I wouldn't want to use a coalesce, because a NULL might be valid elsewhere. Instead we should use the GROUPING function.

USE AdventureWorks2017

SELECT
    CASE WHEN GROUPING(SalesOrderID) = 1 THEN 'Total'
         ELSE cast(SalesOrderID as nvarchar(10))
    END as SalesOrderID,
    sum(LineTotal)
FROM Sales.SalesOrderDetail
where SalesOrderID in ('43659', '43660', '43661', '43662', '43663')
GROUP BY SalesOrderID WITH ROLLUP

You can extend this even further...
Read More

Natural Language Processing in Python – Simple Example

NLP is a huge deal recently, and quite "easy" to do on a basic level within Python. In this exercise, we'll show how to classify Yelp reviews both with and without text pre-processing. It's interesting to note that the pre-processing actually didn't help us here.

Natural Language Processing Project - Simple Example

In this NLP project we will be attempting to classify Yelp Reviews into 1 star or 5 star categories based off the text content in the reviews. We will use the Yelp Review Data Set from Kaggle.

Each observation in this dataset is a review of a particular business by a particular user. The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars are better.) In other words, it is the rating of the business by the person who wrote the review. The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool"...
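As a minimal sketch of the "without pre-processing" approach described above - bag-of-words features fed to a Naive Bayes classifier - something like the following would work. The 'yelp.csv' file name, split sizes, and random seed are assumptions for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Assumed file name; keep only the 1-star and 5-star reviews
yelp = pd.read_csv('yelp.csv')
yelp_class = yelp[(yelp['stars'] == 1) | (yelp['stars'] == 5)]

X = yelp_class['text']
y = yelp_class['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Bag-of-words features with no additional text pre-processing
cv = CountVectorizer()
X_train_counts = cv.fit_transform(X_train)
X_test_counts = cv.transform(X_test)

# Fit Naive Bayes and report precision/recall per star rating
nb = MultinomialNB()
nb.fit(X_train_counts, y_train)
print(classification_report(y_test, nb.predict(X_test_counts)))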
Read More

Search all tables and views for column name in SQL Server

A nifty little bit of code I've used for years and years, very useful if you don't have a third party add-on installed...

USE AdventureWorks2017

SELECT *
FROM (
    SELECT t.name AS name, 'table' as type, SCHEMA_NAME(schema_id) AS schema_name, c.name AS column_name
    FROM sys.tables AS t
    INNER JOIN sys.columns c ON t.OBJECT_ID = c.OBJECT_ID
    WHERE c.name LIKE '%Employee%'
    UNION
    SELECT v.name AS name, 'view' as type, SCHEMA_NAME(schema_id) AS schema_name, c.name AS column_name
    FROM sys.views AS v
    INNER JOIN sys.columns c ON v.OBJECT_ID = c.OBJECT_ID
    WHERE c.name LIKE '%Employee%'
) a
ORDER BY schema_name, name
...
Read More

Error Catching – Stored Procedure inside of SSIS

One fun thing I found out about SSIS is that it doesn't always error out a package if you execute stored procedures as SQL tasks instead of utilizing the SQL itself inside of the package. In other words - it may produce an error if you were to execute it in SSMS, but if you execute it within SSIS it'll fly by successfully without you knowing anything went wrong. The best little trick I've found is to just surround the entire procedure with a TRY/CATCH block. This seems to raise the error up to SSIS and fails the package for you.

CREATE PROCEDURE usp_TestProc
as
BEGIN TRY
    --SQL goes here...
END TRY
BEGIN CATCH
    PRINT 'Error ' + CONVERT(varchar(50), ERROR_NUMBER()) +
          ' at line ' + CONVERT(varchar(50), ERROR_LINE()) +
          ': ' + ERROR_MESSAGE();
END CATCH;
GO
...
Read More

Including Jupyter Notebooks on WordPress – Part II

An update to a post I wrote in July of 2020 about embedding Jupyter Notebooks on WordPress. The original post is located here. I continued along utilizing the nbconvert shortcode for a while and encountered a huge number of problems with https://nbviewer.jupyter.org/. Essentially - I believe there is some sort of bug with the system. Normally it takes 1-2 days for nbviewer to "recognize" something that is dropped into GitHub - which isn't a huge deal, but it is a big annoyance. In addition - I've had multiple workbooks go 1-2 weeks without being recognized, and a few that were fine and then dropped into a 404 error on the site. Either way - I'm looking at using a great little utility called "nb2wp", located on GitHub here. All it really does is perform the same conversion, but it drops the result locally to a .html file with CSS inline, along with saving all of the images (and embeds the links to those images as...
Read More

Recommender System in Python – Simple Example

It's actually very easy to build a simple recommendation system in Python. I'll show you how to do it utilizing a movie dataset with various user ratings. In this case we're just comparing two movies against all others to recommend what a user might like if they were to enjoy Star Wars or Liar Liar.

Recommender System - A Simple Example

This example utilizes a dataset of movie ratings by user. We'll attempt to recommend other movies a user might like based upon a movie they already enjoy. We have 'u.data' and 'Movie_Id_Titles' files to read in. u.data is located here. Movie_Id_Titles is located here.

In [ ]:
#Initial Imports
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

Data Imports

In [81]:
#Set column names
column_names = ['user_id','item_id','rating','timestamp']
#Read in data
df = pd.read_csv('u.data',sep='\t',names=column_names)

In [82]:
df.head()

Out[82]:
   user_id  item_id  rating  timestamp
0        0       50       5  881250949
1        0      172       5  881250949
2        0      133       1  881250949
3      196      242       3  881250949
4      186      302       3  891717742

In [83]:
#Read in data
movie_titles = pd.read_csv('Movie_Id_Titles')

In [84]:
movie_titles.head()

Out[84]:
   item_id              title
0        1   Toy Story (1995)
1        2   GoldenEye (1995)
2        3  Four Rooms (1995)
3        4  Get Shorty (1995)
4        5     Copycat (1995)

In [85]:
#Merge these two datasets on the item id
df = pd.merge(df,movie_titles,on='item_id')

In [86]:
df.head()

Out[86]:
   user_id  item_id  rating  timestamp             title
0        0       50       5  881250949  Star Wars (1977)
1      290       50       5  880473582  Star Wars (1977)
2       79       50       4  891271545  Star Wars (1977)
3        2       50       5  888552084  Star Wars (1977)
4        8       50       5  879362124  Star Wars (1977)

Data Manipulation

First we want to create a ratings dataframe to hold the...
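The excerpt cuts off at building the ratings dataframe. A minimal sketch of one common way to finish this kind of item-based comparison - a user-by-title ratings matrix correlated against Star Wars (1977) - is below. The minimum rating count of 100 is an illustrative assumption, not necessarily the threshold used in the full notebook.

# Average rating and rating count per title
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings['num of ratings'] = df.groupby('title')['rating'].count()

# User-by-title matrix of ratings (NaN where a user has not rated a title)
moviemat = df.pivot_table(index='user_id', columns='title', values='rating')

# Correlate every other movie's ratings with Star Wars (1977)
starwars_user_ratings = moviemat['Star Wars (1977)']
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)

corr_starwars = pd.DataFrame(similar_to_starwars, columns=['Correlation']).dropna()
corr_starwars = corr_starwars.join(ratings['num of ratings'])

# Filter out rarely rated titles, then sort by correlation to get the recommendations
print(corr_starwars[corr_starwars['num of ratings'] > 100]
      .sort_values('Correlation', ascending=False).head())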
Read More

Principal Component Analysis in Python – Simple Example

In PCA, the first principal component is the axis along which the data shows the greatest variance; the second component, orthogonal (perpendicular) to the first, captures the next greatest variance, and so on. This allows us to reduce the number of variables used in an analysis. Taking this a step further - we can expand to higher numbers of dimensions, expressed as "components". If we utilize a dataset with a large number of variables, this helps us condense most of the variation into a small number of components - but these can be tough to interpret. A much more detailed walk-through on the theory can be found here. I'm going to show how this analysis can be done utilizing scikit-learn in Python. The dataset we're going to be utilizing can be loaded directly within sklearn as shown below.

Principal Component Analysis - Simple Example

We're going to be utilizing a cancer dataset that is found within sklearn. We're going to try to find what components are most important (show the most...
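As a minimal sketch of the approach, the breast cancer dataset bundled with sklearn can be standardized and reduced with PCA as shown below. The choice of two components is an illustrative assumption.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the cancer dataset that ships with sklearn
cancer = load_breast_cancer()
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])

# PCA is sensitive to scale, so standardize the 30 features first
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Reduce to two components (illustrative choice) and inspect the variance they explain
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)              # (569, 2)
print(pca.explained_variance_ratio_)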
Read More

Refresh all views in SQL Server database

One problem I've had recently is a SQL Server instance in which the views were constantly cached and needed to be refreshed. Normally to do this you would run the built-in procedure:

exec sp_refreshview MyView

However - I had the need to run this on all my views daily, and we constantly had new views. You can use the stored proc below to simply loop through all the views and run it as needed.

create procedure usp_refreshview
as
declare @sqlcmd nvarchar(max) = ''

select @sqlcmd = @sqlcmd + 'exec sp_refreshview ' + '''' + name + '''; '
from sys.objects as so
where so.type = 'V'

print @sqlcmd

if len(@sqlcmd) > 0
    exec(@sqlcmd)

To run, simply exec:

exec usp_refreshview
...
Read More