GitHub - tannishk/data-profiling: Basic task to perform data profiling

dataset -- the Pandas data frame
colname -- the column name

In split.py :

correlation(x,y): x and y are two data series of float or integer type. Returns the Pearson correlation coefficient between them and tells what type of correlation it is
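As a rough illustration of what correlation(x,y) computes, the following is a minimal sketch that works on plain Python lists rather than Pandas series. The names pearson and describe_correlation, and the labels "positive", "negative" and "none", are assumptions made for this example and are not taken from split.py.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def describe_correlation(r, tol=1e-9):
    """Hypothetical labels; the wording used in the repository may differ."""
    if r > tol:
        return "positive"
    if r < -tol:
        return "negative"
    return "none"


r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(r, describe_correlation(r))
```

With Pandas series, the same coefficient should be obtainable from the series' built-in corr method; the hand-written version is shown only to make the formula explicit.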

uniqueoccur(dataset,name): "Frequency of each unique item in dataset[name]"

is_date(string): "Checks whether the string is of date datatype or not"

rollingmean(dataset,name,win): Calculates the rolling mean for a time series. win: size of the window

rollingstd(dataset,name,win): Calculates the rolling standard deviation for a time series. win: size of the window
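The rolling statistics above can be sketched without any dependencies as follows. This is an illustration under assumptions: the helper name rolling is hypothetical, the window is taken to be trailing, and positions before the first full window are reported as None. The repository functions operate on a Pandas data frame column instead.

```python
import statistics


def rolling(values, win, func):
    """Apply func to each full trailing window of size win.

    Positions that do not yet have win values are filled with None.
    """
    if win < 1:
        raise ValueError("win must be at least 1")
    out = []
    for i in range(len(values)):
        if i + 1 < win:
            out.append(None)
        else:
            out.append(func(values[i + 1 - win : i + 1]))
    return out


series = [1.0, 2.0, 3.0, 4.0, 5.0]
means = rolling(series, 3, statistics.mean)
# statistics.stdev is the sample standard deviation and needs win >= 2.
stds = rolling(series, 3, statistics.stdev)
print(means)
print(stds)
```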

RangeCol(dataset,colname): Gives the range of a column or the range of a given time stamp

type1(dataset,name): "Gives the data type of a column: float, integer, boolean, string or date"

Cal(dataset,colname): Displays the sum, average, mean, median, standard deviation, total number of values and total number of unique values

Split(dataset,colname): Splits the data into training and testing data and creates test.csv and train.csv

desc(df,pos): Similar to describe in Pandas, with type and plot functionality

In replace.py :

rep(dataframe,col,choice,value): "Reports all the rows that have an empty value for a particular column, with an option to add / alter the values that indicate empty values (e.g. NULL, NaN, unknown), and also allows a person to replace a given value." choice tells whether to replace the value or not; value is the value you want to replace it with

In featureselection.py :

Kbest(dataset,features,class1,arr): Selects the arr best features based on a chi-squared test between features and class1, which indicates the class to be predicted. Here df is the dataframe used

PCA(dataset,numberofcomponenets): Applies principal component analysis to the given dataset, transforming its features into data with numberofcomponenets components

varthres(dataset,threshold): "Does feature selection based on variance by passing the threshold value"

Ica(dataset): "Does feature transformation such that all features are independent of one another"
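To make the variance-based selection described for varthres concrete, here is a small dependency-free sketch. It assumes a table given as a dict of column name to list of numbers, uses the population variance, and keeps columns whose variance is strictly above the threshold. The name variance_threshold and these choices are assumptions for illustration, not the repository's implementation.

```python
import statistics


def variance_threshold(table, threshold=0.0):
    """Return the names of columns whose population variance exceeds threshold."""
    return [
        name
        for name, values in table.items()
        if statistics.pvariance(values) > threshold
    ]


table = {
    "constant": [1, 1, 1],      # variance 0, carries no information
    "spread": [1, 2, 3],        # variance 2/3
    "mostly_zero": [0, 0, 1],   # variance 2/9
}
print(variance_threshold(table))        # drops only the constant column
print(variance_threshold(table, 0.5))   # keeps only the widely spread column
```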

In distribution.py :

distri(dataset,name): Tells whether the name column of the dataset follows a continuous or a discrete distribution

outliers1(dataset,name): Finds univariate outliers in the name column of the dataset based on a Z test

hisplot1(dataset,name): Creates a histogram for the name column

lineplot(dataset,name): "Does autoscaling and plots lines for a group of data passed as a list of columns"

threed(dataset,name): "Pass exactly 3 data columns as a list in name to draw a 3D plot"

norm(dataset,name): Normality test for a normal distribution; returns the test statistic and p value

welisberg(dataset,name): "Weibull continuous distribution; returns the KS test statistic (D, D+ or D-) and p value"

exponential(dataset,name): "Exponential continuous distribution; returns the KS test statistic (D, D+ or D-) and p value"

logistic(dataset,name): "Logistic continuous distribution; returns the KS test statistic (D, D+ or D-) and p value"

typedis(dataset,name,dis): "Tests against any type of distribution. dis takes the code name of the distribution; see http://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html#module-scipy.stats for reference. Returns the KS test statistic (D, D+ or D-) and p value"
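The Z test used for univariate outliers (see outliers1 above) can be sketched as follows. This is an assumption-laden illustration: the function name zscore_outliers, the use of the population standard deviation and the threshold value are all choices made for the example, and the repository may use different ones.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score is greater than threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column has no outliers by this definition.
        return [False] * len(values)
    return [abs((v - mean) / std) > threshold for v in values]


data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
# With only 10 points, a z-score computed this way cannot exceed 3,
# so a lower threshold is used for this small sample.
flags = zscore_outliers(data, threshold=2.5)
print(flags)
```

The threshold is a tuning choice: on small samples the extreme value itself inflates the standard deviation, which is why a cut-off of 3 would flag nothing here.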

In muloutlier.py:

transform(dataset,col): Maps all text data in dataset[col] to numeric data and returns the altered dataframe

mulout(df): Performs the preprocessing and uses a one-class SVM to find outliers in a multivariate environment

timeoutlier(dataset, col, window): Outlier detection for time series; returns a list with True where a value is an outlier and False where it is not

In similarity.py:

createdict(a,b): Converts a and b to numeric vectors

Euclids(x,y): To find Euclidean similarity

Manhattan(x,y): To find Manhattan similarity

Cos(x,y): To find cosine similarity

MahalanobisDist(x, y): To find the Mahalanobis distance
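For reference, the first three measures listed above can be written in a few lines each. The sketch below uses plain numeric lists and hypothetical function names. Note that the Euclidean and Manhattan measures are distances (0 means identical), while cosine is a similarity (1 means the same direction); how the repository turns distances into similarities, if it does, is not stated here.

```python
import math


def _check(x, y):
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")


def euclidean_distance(x, y):
    _check(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    _check(x, y)
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    _check(x, y)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


print(euclidean_distance([0, 0], [3, 4]))
print(manhattan_distance([0, 0], [3, 4]))
print(cosine_similarity([1, 2], [2, 4]))
```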

converttfidf(x,dicte): Converts x to a tf-idf vector

compare(a,b): "Finds the cosine similarity between 2 columns of the a and b dataframes. The code can be altered to include other similarity measures"

Main function to find similarity among columns and tell which columns can be used to combine 2 tables

In skewkurt.py

skewness(dataset,col): "Calculates the skewness of the given data. For normally distributed data, skewness = 0. Skewness > 0 means more weight in the right tail and less weight in the left tail"

kurtosis(dataset,col,ty): "Calculates the kurtosis of a data set. ty can be fisher or pearson"
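The two statistics above can be computed from central moments, as in this minimal sketch. It assumes the biased (population) moment estimators and plain lists; the function names and the string values accepted by ty mirror the description above but are otherwise hypothetical. Pearson kurtosis is 3 for a normal distribution, and Fisher kurtosis subtracts 3 so that a normal distribution gives 0.

```python
def central_moment(values, k):
    """k-th central moment using the population (biased) estimator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return central_moment(values, 3) / m2 ** 1.5


def kurtosis(values, ty="fisher"):
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    pearson = central_moment(values, 4) / m2 ** 2
    if ty == "pearson":
        return pearson
    if ty == "fisher":
        return pearson - 3.0
    raise ValueError("ty must be 'fisher' or 'pearson'")


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 10]
print(skewness(symmetric), skewness(right_tailed))
print(kurtosis(symmetric, "pearson"), kurtosis(symmetric, "fisher"))
```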

In timeseries.py:

decompose(df,col,freq): "Plots the decomposition graphs"

freq(df,col,max1): "Finds the required freq for the decomposition"

lmtestcheck(df,col,max1): "Performs an LM test for autocorrelation and finds significant lags. 1 marks a significant lag and 0 an insignificant lag"

checkdb(df,col): "Tells whether the data is serially correlated or not"

check(df,col): "Checks whether a given series is periodic or not using the autocorrelation function"
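A periodicity check based on the autocorrelation function, as described for check(df,col), could look roughly like the sketch below. The function names, the candidate lag range and the idea of picking the lag with the highest autocorrelation are assumptions for illustration; the repository's actual criterion is not documented here.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of values at the given positive lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return num / denom


def strongest_lag(values, max_lag):
    """Lag in 1..max_lag with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(values, lag))


# A series that repeats every 4 steps.
series = [1, 2, 3, 4] * 6
best = strongest_lag(series, 8)
print(best, autocorrelation(series, best))
```

A high autocorrelation at the strongest lag suggests a period of that length; what counts as "high" is a threshold the caller has to choose.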

In primarykey.py:

primarykey(dataset): "Finds a primary key in a given dataset"
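One common way to look for a primary key is to search for columns whose values are all present and all distinct. The sketch below shows that idea on a dict of column name to list of values; it is an assumption about the approach, not a description of what primarykey.py does, and the name candidate_keys is hypothetical.

```python
def candidate_keys(table):
    """Return names of columns with no missing values and no duplicates.

    table maps each column name to a list of hashable values.
    None and the empty string are treated as missing.
    """
    keys = []
    for name, values in table.items():
        has_missing = any(v is None or v == "" for v in values)
        if not has_missing and len(set(values)) == len(values):
            keys.append(name)
    return keys


table = {
    "id": [1, 2, 3, 4],
    "name": ["ann", "bob", "ann", "cy"],      # duplicate value
    "email": ["a@x", "b@x", None, "d@x"],     # missing value
}
print(candidate_keys(table))
```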

In preparation.py:

timestamp(dataset,col): "Finds the range of time stamps of any given column"

Repository files (2 commits):

.idea
.ipynb_checkpoints
.DS_Store
Basic Time Series Analysis.ipynb
Distrubtions and Plot Library.ipynb
Feature Selection.ipynb
Outliers.ipynb
PrimaryKey.ipynb
README.md
Similarity.ipynb
Test.ipynb
Timeseries.ipynb
Untitled.ipynb
Untitled.rtf
Untitled1.ipynb
__init__.py
as
auto-mpg.csv
ballon.csv
bank.csv
cars.csv
combine.py
combine.pyc
distribution.py
distribution.pyc
featureselection.py
featureselection.pyc
iris.csv
muloutlier.py
muloutlier.pyc
preparation.py
preparation.pyc
primarykey.py
primarykey.pyc
rec
replace.py
replace.pyc
sample data.csv
similairty.pyc
similarity.py
similarity.pyc
skewkurt.ipynb
skewkurt.py
skewkurt.pyc
split.py
split.pyc
splitReplacepythonFile.ipynb
test.csv
time series sample.csv
timeseries.py
timeseries.pyc
train.csv
untitled
user accounts.csv
values.py
