Genesis #19

Open · wants to merge 5 commits into master
2 changes: 0 additions & 2 deletions .author

This file was deleted.

108 changes: 45 additions & 63 deletions README.md
@@ -1,80 +1,62 @@
# Media Description Generator
-When AI turns journalist
## Purpose
# AI_Hackathon

* Making information more accessible to people with disabilities.
*Use case:* Providing descriptive facts from ambient sounds to the hearing-impaired.

* Providing unbiased reporting, which becomes more relevant amid the trend of [microtargeting](https://en.wikipedia.org/wiki/Microtargeting).
*Use case:* A standard repository of facts that forms the basis for all news outlets.

* Richer subtitle generation for non-speech sounds.
*Use case:* Improved subtitles on streaming services.

* Obtaining relevant facts from large swaths of data, which is otherwise resource-intensive for humans.
*Use case:* Detecting unusual activity in lengthy surveillance media to strengthen security.

* Archiving hefty media in an extremely compact format (see the [Wayback Machine](https://archive.org/web/)).
*Use case:* Keeping a record of current events that serves as a time capsule of humanity.

## Approach
<img src="https://github.com/tejasvi/AI_Hackathon/raw/master/overview.svg?sanitize=true">
@@ So submit a pull request within the stipulated time along with a readme file that states the method/techniques used

* The audio is first converted to WAV, which is an uncompressed format.
2. Then find a relevant dataset (it may be small, but that is where transfer learning becomes important)
3. Use a suitable model and apply the learnings

* The audio is then converted into a spectrogram, a visual representation of sound frequencies over time.
## Marking_Scheme
1. Novelty 40%
2. Dataset_used 20%
3. DL techniques 20%
4. Code Quality 20%

* The spectrograms can then be treated as images, where we can benefit from transfer learning with pretrained models (like ResNet).

* The spectrograms are fed separately to the classification and speech-recognition networks, which generate an appropriate keyword description.

* The obtained keywords can further be used to generate readable sentences using [NLG](https://en.wikipedia.org/wiki/Natural-language_generation); however, this proved too tedious to prototype. A rough sketch of the overall pipeline is shown below.
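The following is a minimal skeleton of that flow. The helpers are illustrative stubs, not this repository's actual functions; the Implementation section below sketches concrete versions of each stage.

```python
from typing import List

# Illustrative subset; the real routing covers every human-voice label.
HUMAN_VOICE_LABELS = {"Male singing"}

def audio_to_spectrogram(wav_path: str):
    """Convert a WAV file to a 128-band spectrogram image (see Data and preprocessing)."""
    raise NotImplementedError

def classify_spectrogram(spectrogram) -> List[str]:
    """Multi-label sound tagging over 80 classes (see Architecture)."""
    raise NotImplementedError

def transcribe_speech(wav_path: str) -> str:
    """Speech recognition with DeepSpeech (see Architecture)."""
    raise NotImplementedError

def describe_media(wav_path: str) -> List[str]:
    spectrogram = audio_to_spectrogram(wav_path)
    keywords = classify_spectrogram(spectrogram)
    # Clips tagged as human voice are additionally transcribed.
    if any(label in HUMAN_VOICE_LABELS for label in keywords):
        keywords.append(transcribe_speech(wav_path))
    return keywords  # optionally turned into a sentence with NLG
```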
# Team_Name
## Members

<img src="https://github.com/tejasvi/AI_Hackathon/raw/master/planned.svg?sanitize=true">

## XYZ
- [a](https://github.com/a)
- [b](https://github.com/b)
- [c](https://github.com/c)

* The problem scope can be extended to object-level reasoning in images. This would provide more context to the description by using video along with audio.

## Aztecs
- [Shridhar Hegde](https://github.com/shridharrhegde)
- [Sayan Sarkar](https://github.com/alpha99991)

## Implementation

## Team Schwifty
- [Abhilash Reddy](https://github.com/abhilashreddys)
- [Kousik Rajesh](https://github.com/kousikr26)
- [Rashi](https://github.com/kousikr26)

### Data and preprocessing
## Genesis
- [Sivaramakrishnan SK](https://github.com/sk124)
- [Tejasvi S Tomar](https://github.com/tejasvi)

The dataset used in this project is [Freesound Audio Tagging](https://arxiv.org/pdf/1906.02975), which contains snippets of audio labeled with sound types.
## SMS
- [sai krishna](https://github.com/themendu)
- [Manoj](https://github.com/manojpaidimarri21)
- [Sravya](https://github.com/sravya27082001)

The sounds are further divided into a ~11 hr *curated* set and a ~80 hr *noisy* set. We used only the curated set for training, due to resource constraints and to demonstrate transfer learning more strongly.
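For reference, a minimal sketch of loading the curated subset's labels; the `train_curated.csv` file name and its `fname`/`labels` columns follow the Kaggle release of the dataset, and the multi-hot encoding here is an assumption rather than the notebook's exact code.

```python
import pandas as pd

# The curated subset ships as train_curated.csv with columns
# `fname` and `labels` (comma-separated when a clip has several labels).
df = pd.read_csv("train_curated.csv")
df["labels"] = df["labels"].str.split(",")

# Build a multi-hot target matrix over the 80 classes.
classes = sorted({label for labels in df["labels"] for label in labels})
targets = pd.DataFrame(
    [[int(c in labels) for c in classes] for labels in df["labels"]],
    columns=classes,
)
print(len(classes), "classes,", len(df), "curated clips")
```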
## GEEKZONE
- [Bishal Kumar Shaw](https://github.com/bishal1212)
- [Vishal Goyal](https://github.com/b)
- [Pratyush Roy](https://github.com/c)

For spectrogram generation, the widely used librosa library is used. Each audio clip of length *n* seconds is converted into a 128×128*n* grayscale image.
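A minimal sketch of this step with librosa; the sample rate, hop length, and use of a log-mel spectrogram are assumptions chosen so that 128 mel bands yield roughly 128 frames per second, matching the 128×128*n* shape above.

```python
import librosa
import numpy as np

def audio_to_logmel(wav_path, sr=44100, n_mels=128):
    y, sr = librosa.load(wav_path, sr=sr)
    # hop_length ~ sr/128 gives about 128 frames per second of audio,
    # so an n-second clip becomes a 128 x ~128n matrix.
    hop_length = sr // 128
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    logmel = librosa.power_to_db(mel, ref=np.max)
    # Rescale to 0-255 so the result can be saved and treated as a grayscale image.
    img = 255 * (logmel - logmel.min()) / (logmel.max() - logmel.min() + 1e-8)
    return img.astype(np.uint8)
```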
## Intelliqo
- [Eshwar Nukala](https://github.com/eshwar28)
- [Neeraja J](https://github.com/jayne228)
- [Shrey Jani](https://github.com/jani-boop)

## Matrix Agents
- [Aadi](https://github.com/aadig15)
- [Animesh](https://github.com/animeshrdso)
- [Dibyakanti](https://github.com/Dibyakanti)

There are 80 possible labels, including *Gasp, Printer, Gong, Bark, Male singing*, etc. The audio clips falling into the human-voice categories are fed to the speech recognition model.

### Architecture

For classification, a ResNet18 architecture pre-trained on ImageNet is used. The evaluation metric is [LwLRAP](https://www.kaggle.com/pkmahan/understanding-lwlrap), a label-weighted label-ranking average precision commonly used for multi-label audio tagging.
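A minimal sketch of the classifier setup. The Resources section notes that FastAI was used for training; this sketch uses plain PyTorch/torchvision for illustration, and the input handling and loss choice are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 80

# ResNet18 pre-trained on ImageNet; all layers are left trainable
# (see the Transfer learning section).
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Multi-label tagging: one sigmoid/BCE output per class rather than a softmax.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ImageNet models expect 3-channel input, so grayscale spectrograms are
# repeated across channels (or the first conv layer could be replaced).
x = torch.randn(8, 3, 128, 128)                      # batch of spectrogram "images"
y = torch.randint(0, 2, (8, NUM_CLASSES)).float()    # multi-hot targets
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```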

Speech recognition uses the [DeepSpeech](https://github.com/mozilla/DeepSpeech) architecture published by Mozilla.
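A minimal sketch of running a released DeepSpeech model on a clip. The model and scorer file names correspond to Mozilla's 0.9.3 release and are an assumption about which version is used; DeepSpeech expects 16 kHz, 16-bit mono PCM input.

```python
import wave
import numpy as np
from deepspeech import Model  # pip install deepspeech

ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional language model

# DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
with wave.open("voice_clip_16k.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```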

### Transfer learning

The parameters of ResNet18 are initialized from ImageNet weights instead of randomly. Random initialization demonstrably converged slower than initialization to ImageNet weights.
#### Random Initialization
<img src="https://github.com/tejasvi/AI_Hackathon/raw/master/random_init.JPG">

#### ImageNet Initialization
<img src="https://github.com/tejasvi/AI_Hackathon/raw/master/img_init.JPG">

However, freezing the inner layers did not give good results; instead we had to unfreeze all layers to get fast convergence. This may be because ImageNet images have quite different features, so the inner layers also require training.
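A minimal sketch of the two strategies compared above, written in plain PyTorch for illustration (FastAI exposes the same idea through `learn.freeze()` / `learn.unfreeze()`); it is not the repository's exact code.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)      # ImageNet initialization
model.fc = nn.Linear(model.fc.in_features, 80)

# Variant 1: freeze the pre-trained backbone and train only the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# Variant 2 (what worked better here): leave every layer trainable so the
# inner layers can adapt to spectrograms, which look unlike ImageNet photos.
for param in model.parameters():
    param.requires_grad = True
```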


## Execution

(See [`main.ipynb`](https://github.com/tejasvi/AI_Hackathon/blob/master/main.ipynb).) The trained model was exported after training and is loaded during inference. As a prototype, audio files are currently transcribed in batches. However, it is possible to run classification and speech recognition in parallel for real-time processing; for simplicity, they currently run in sequence.
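A sketch of how the two models could run concurrently instead of sequentially; `classify` and `transcribe` are placeholders for the classification and speech-recognition calls, not functions from `main.ipynb`.

```python
from concurrent.futures import ThreadPoolExecutor

def classify(wav_path):      # placeholder for the ResNet18 tagging step
    ...

def transcribe(wav_path):    # placeholder for the DeepSpeech step
    ...

def describe(wav_path):
    # Run both models on the same clip at the same time instead of in sequence.
    with ThreadPoolExecutor(max_workers=2) as pool:
        tags_future = pool.submit(classify, wav_path)
        text_future = pool.submit(transcribe, wav_path)
        return tags_future.result(), text_future.result()
```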

## Resources

* The implementation of the LwLRAP metric is taken from [Dan Ellis](https://colab.research.google.com/drive/1AgPdhSp7ttY18O3fEoHOQKlt_3HJDLi8); a simplified sketch is given after this list.
* For data profiling, [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling) is used to obtain a general overview of the data.
* The [FastAI](http://fast.ai) library is used to structure and train the model.
* Other resources include StackOverflow, Medium, and the rest.
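A simplified sketch of the LwLRAP computation referenced above: it averages, over every true (clip, label) pair, the precision of the ranked predictions down to that label. It follows the metric's definition rather than Dan Ellis's exact code.

```python
import numpy as np

def lwlrap(truth, scores):
    """truth, scores: (n_clips, n_labels) arrays; truth is a 0/1 multi-hot matrix."""
    precisions = []
    for t, s in zip(truth, scores):
        ranking = np.argsort(-s)              # label indices, best score first
        ranked_truth = t[ranking] > 0         # which ranked labels are correct
        hits = np.cumsum(ranked_truth)        # correct labels seen so far
        ranks = np.arange(1, len(s) + 1)
        # Precision at the rank of each true label of this clip.
        precisions.extend((hits / ranks)[ranked_truth])
    return float(np.mean(precisions))
```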
## Kuch_bhi
- [Abhishek Kumar](https://github.com/abhishek18f)
- [Rahul D](https://github.com/chindimaga)