Course-Content-Vectorizer

Dataview:

list from [[]] and !outgoing([[]])

This project is done as I envisioned it.

I have my current iteration hosted here: diegotyner/CanvasResourceSemanticSearch
However, I want to do better with processing, and make it more general so that I can throw in research papers and the like.
To that end, I’ll be continuing on here: Text-Extractor-Database

🧲 Published

GitHub:

diegotyner/CanvasResourceSemanticSearch
The README on GitHub explains the technical implementation and the technologies used in more depth.
I use:
Selenium
Requests
Postgres
Transformers
Check the GitHub repo for a more rigorous summary of how I went about it, or reach out to chat!

I might pick it back up to do something fun, like a UMAP of course transcripts.

🧾 Project Description

Blurt

This project will center around RAG and word vectorization. I’m very interested in using it to analyze large chunks of text, especially to get insight from them. This could be expanded to a number of domains, potentially the symposium proceedings? We’ll see!

On top of that, I routinely have to catch up on a large batch of content (e.g. after skipping class for 2 weeks)! It would be great to have hints about where to start my studying, and which lectures are the most informative / content-rich.

Brainstorming Deepseek Chat - Link

It’s officially on the way!

The scraping is live on GitHub, and the first attempt at semantic search is done now!

🎯 Objective

📂 Project Logs

Scraping

CanvasScraper.ipynb - Colab


508262252 - 1_83t7iz4h - PID 1770401.txt
506997062 - 1_mbz6ul4h - PID 1770401.txt
506273622 - 1_tkz9ulng - PID 1770401.txt
How to tell if a page needs JavaScript to load? Fix the no-endpoint bug.

Do I have to learn Selenium 😢
- The answer was: sort of. The easier approach was hitting the Canvas API directly, but the lecture transcripts did need JavaScript to activate a button.
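
One rough way to answer the "does this page need JavaScript" question is to look at the HTML that comes back without a browser and check whether the text you expect is already there. Below is a minimal sketch of that idea, not what the scraper in the repo actually does. The names `VisibleText` and `needs_javascript` are made up for illustration, and the check is only a heuristic: it assumes you already know a snippet of text that should appear on the fully loaded page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text found in the raw HTML, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def needs_javascript(raw_html: str, expected_text: str) -> bool:
    """Return True if expected_text is missing from the static markup.

    Heuristic only: a True result suggests the content is filled in by
    JavaScript (or sits behind a button), so a plain HTTP request won't see it.
    """
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    # Normalize whitespace so line breaks in the markup don't cause misses.
    visible = " ".join(" ".join(parser.parts).split())
    return expected_text.lower() not in visible.lower()


# Hypothetical pages: one serves the transcript directly, one hides it in a script.
static_page = "<html><body><p>Transcript: today we cover vectors</p></body></html>"
dynamic_page = (
    "<html><body><button id='show'>Show transcript</button>"
    "<script>var t = 'today we cover vectors';</script></body></html>"
)
print(needs_javascript(static_page, "today we cover vectors"))
print(needs_javascript(dynamic_page, "today we cover vectors"))
```

If the check comes back True, that is the case where a browser driver like Selenium is probably needed; if it comes back False, a plain request is likely enough.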

Automated pushing to Google Drive. Lectures should be hosted there, not on the VPS.

postgres=# CREATE TABLE lectures (
    lecture_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    class TEXT,  -- URL/filepath
    created_at TIMESTAMP DEFAULT NOW(),
    metadata JSONB  -- author, date, tags, etc.
);
postgres=# CREATE TABLE chunks (
    chunk_id SERIAL PRIMARY KEY,
    lecture_id INT NOT NULL REFERENCES lectures(lecture_id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    position INT,  -- Original order in lecture
    metadata JSONB,  -- page numbers, timestamps, etc.
    created_at TIMESTAMP DEFAULT NOW()
);
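
To show how the two tables might be used together, here is a hedged sketch of splitting a transcript into rows for `chunks` and of a nearest-neighbour query over `embedding`. It assumes the `VECTOR` type comes from the pgvector extension (which has to be enabled in the database) and that queries go through a psycopg-style driver with `%s` placeholders. `chunk_transcript`, `to_pgvector`, and `SEARCH_SQL` are hypothetical names, the window sizes are arbitrary, and the embedding step itself (a 384-dimension MiniLM-L6-v2 vector per chunk) is not shown.

```python
def chunk_transcript(text: str, max_words: int = 200, overlap: int = 40):
    """Split text into overlapping word windows.

    Returns a list of dicts with the `position` and `content` fields
    that the `chunks` table expects.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    rows = []
    for position, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        rows.append({"position": position, "content": " ".join(window)})
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return rows


def to_pgvector(values) -> str:
    """Format a sequence of floats as a pgvector text literal, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


# `<=>` is pgvector's cosine-distance operator; smaller means more similar.
# Parameters: (query embedding as a pgvector literal, number of results).
SEARCH_SQL = """
SELECT l.title, c.position, c.content
FROM chunks AS c
JOIN lectures AS l USING (lecture_id)
ORDER BY c.embedding <=> %s::vector
LIMIT %s;
"""

# Tiny demo with a 10-word "transcript" and small windows.
demo = chunk_transcript(" ".join(f"w{i}" for i in range(10)), max_words=4, overlap=1)
for row in demo:
    print(row["position"], row["content"])
```

The overlap is there so that a sentence cut at a window boundary still appears whole in one of the two neighbouring chunks. `position` keeps the original order so hits can be traced back to where they sit in the lecture.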

https://www.reddit.com/r/LangChain/comments/1g1cm9n/generating_embeddings_for_a_large_document_10/
https://www.youtube.com/watch?v=Hj7PuK1bMZU
