Dataview:

list from [[]] and !outgoing([[]])

This project is done as I envisioned it.

I have my current iteration hosted here: diegotyner/CanvasResourceSemanticSearch
However, I want to do better with processing, and make it more general so that I can throw in research papers and the like.
To that end, I'll be continuing on here: Text-Extractor-Database

🧲 Published

GitHub:

  • diegotyner/CanvasResourceSemanticSearch
    In the GitHub README, I explain the technical implementation and the technologies used in more depth.
    I use:
  • Selenium
  • Requests
  • Postgres
  • Transformers
    Check the GitHub repo for a more rigorous summary of how I went about it, or reach out to chat!

I might pick it back up to do something fun, like a UMAP on course transcripts.

🧾 Project Description

Blurt

This project will center around RAG and word vectorization. I'm very interested in using it to analyze large chunks of text, especially to get insights from them. This could be expanded to a number of domains, potentially the symposium proceedings. We'll see!

On top of that, I routinely have to catch up on a large batch of content (for example, after skipping class for 2 weeks). It would be great to have hints about where to start studying and which lectures are the most informative / content-rich.

Brainstorming Deepseek Chat - Link

It's officially on the way!

  • The scraping is live on GitHub, and the first attempt at semantic search is done!
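
For reference, the core of a semantic search is ranking stored chunk vectors by how similar they are to a query vector. The sketch below uses hand-made 3-dimensional vectors as stand-ins for real embeddings; the `cosine` and `top_k` helpers are hypothetical names for illustration, and the actual repo may rank differently (for example, inside Postgres).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=3):
    """chunks is a list of (content, vector) pairs.

    Returns the k best (score, content) pairs, highest similarity first.
    """
    scored = [(cosine(query_vec, vec), content) for content, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

In a real run, the query vector and the chunk vectors would come from the same embedding model, so that their similarity scores are comparable.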

🎯 Objective

📂 Project Logs

Scraping

CanvasScraper.ipynb - Colab

508262252 - 1_83t7iz4h - PID 1770401.txt
506997062 - 1_mbz6ul4h - PID 1770401.txt
506273622 - 1_tkz9ulng - PID 1770401.txt
How to tell if a page needs JavaScript to load? Fix the no-endpoint bug.
  • Do I have to learn Selenium 😢
    • The answer was: sort of. The easier approach was hitting the Canvas API directly, but the lecture transcripts did need JavaScript to activate a button.
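
One rough way to answer the "does this page need JavaScript?" question is to fetch the raw HTML without a browser and check whether the text you expect is already in it. The sketch below is a heuristic under that assumption; `visible_text`, `needs_javascript`, and the marker string are hypothetical names, not something taken from the repo.

```python
import re


def visible_text(html: str) -> str:
    """Approximate what a client without JavaScript would see.

    Drops <script> and <style> blocks, strips the remaining tags,
    and collapses whitespace.
    """
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def needs_javascript(html: str, expected_marker: str) -> bool:
    """True if the expected text is missing from the static HTML."""
    return expected_marker.lower() not in visible_text(html).lower()
```

In practice the `html` argument could come from `requests.get(url).text`. If the check returns True for text you can see in a normal browser, the content is probably rendered client-side, and that page is a candidate for Selenium.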

Automated pushing to Google Drive. Lectures should be hosted there, not on the VPS.

-- The VECTOR type used below comes from the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE lectures (
    lecture_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    class TEXT,  -- URL/filepath
    created_at TIMESTAMP DEFAULT NOW(),
    metadata JSONB  -- author, date, tags, etc.
);

CREATE TABLE chunks (
    chunk_id SERIAL PRIMARY KEY,
    lecture_id INT NOT NULL REFERENCES lectures(lecture_id) ON DELETE CASCADE,
    content TEXT NOT NULL,
    embedding VECTOR(384),  -- Dimension matches MiniLM-L6-v2
    position INT,  -- Original order in lecture
    metadata JSONB,  -- page numbers, timestamps, etc.
    created_at TIMESTAMP DEFAULT NOW()
);
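
To fill the `chunks` table, a transcript has to be split into pieces first. Here is a minimal sketch of one way to do that, assuming fixed-size word windows with some overlap. The window sizes and the `chunk_transcript` name are illustrative assumptions, not the repo's actual settings.

```python
def chunk_transcript(text: str, max_words: int = 200, overlap: int = 40):
    """Split text into overlapping word windows.

    Returns a list of (position, content) pairs, where position
    corresponds to the chunks.position column (original order).
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for position, start in enumerate(range(0, len(words), step)):
        chunks.append((position, " ".join(words[start:start + max_words])))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each `content` string would then be embedded with a 384-dimension model and inserted together with its `lecture_id` and `position`. With pgvector, nearest chunks can be ranked in SQL using the `<=>` cosine-distance operator in an `ORDER BY` clause.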

https://www.reddit.com/r/LangChain/comments/1g1cm9n/generating_embeddings_for_a_large_document_10/
https://www.youtube.com/watch?v=Hj7PuK1bMZU

🎟 Features

Existing

Todo

Resources

  • Put useful links here

Connections

  • Link all related words