
¡hola! 👋🏼

my name is Sergio Sánchez Zavala but i go by chekos (/che/ like Guevara, /kos/ like 'costly', in lowercase) and i work as a data engineer. i'm a hip hop head, policy wonk, data nerd.

i always wanted to write but i never had the time or energy or focus to do it. that's changed so here are some words.

how to set up ffmpeg as a lambda layer

what i learned

how to add ffmpeg and ffprobe as a lambda layer to be used by lambda functions.

Getting ffmpeg

# ffmpeg
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz

# checksum
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz.md5

md5sum -c ffmpeg-release-amd64-static.tar.xz.md5

# extract
tar xvf ffmpeg-release-amd64-static.tar.xz

Side note: i had to brew install md5sha1sum and brew install wget on my laptop first (macOS doesn't ship with either).

Creating Lambda Layer

  1. create ffmpeg/bin/
  2. copy ffmpeg into it
  3. zip ffmpeg/
# Create bin/
mkdir -p ffmpeg/bin

# Copy ffmpeg
cp ffmpeg-6.0-amd64-static/ffmpeg ffmpeg/bin

# Zip directory
cd ffmpeg
zip -r ../ffmpeg.zip .
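
because lambda extracts every layer under /opt, the zip's root needs to contain bin/ffmpeg so that the binary ends up at /opt/bin/ffmpeg inside your function. a quick sanity check of the archive:

# bin/ffmpeg should appear at the root of the listing
unzip -l ffmpeg.zip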

Finally

Upload the zip file as a lambda layer.
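
you can do this through the console, or sketch it with the aws CLI like below (the layer name and runtime list are my assumptions):

# publish the zip as a new layer version
aws lambda publish-layer-version \
    --layer-name ffmpeg \
    --zip-file fileb://ffmpeg.zip \
    --compatible-runtimes python3.12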

Bonus

In my case I also included ffprobe since it's required by whisper.
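
ffprobe ships in the same static tarball, so it's just one more copy before zipping (same directory names as above):

# grab ffprobe from the same extracted build
cp ffmpeg-6.0-amd64-static/ffprobe ffmpeg/bin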

how to create an alias in the gh CLI

what i learned

you can create aliases in the GitHub CLI. i'm not super familiar with aliases. i've used them in the past to automate long commands. currently i'm using a couple at work to shorten dbt commands ever so slightly (from dbt run --target prod --select <models> to prod-run <selection query>).

however, i had only seen these as aliases one sets up at the profile level/scope. as in, we'd go to ~/.bash_profile or ~/.zshrc and add a new alias that's set every time we open a new terminal.
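
for example, the dbt shortcut above as a plain profile-level alias (a sketch; prod-run is just the name from the example above):

# in ~/.bash_profile or ~/.zshrc
alias prod-run='dbt run --target prod --select'
# now `prod-run <models>` expands to `dbt run --target prod --select <models>`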

this is the first time i've seen a cli offer this within the tool itself. i wonder if this is a common practice i've missed until now.

in the GitHub cli you can use the gh alias set command to set an alias (docs).

i usually have to google the full list of flags i would like to run when creating a repo via the gh-cli so i figured i'd save it as an alias now. this is what i ~~wish i remembered~~ would like to run most times:

gh repo create <name> \
--public \
--add-readme \
--clone \
--gitignore Python \
--license bsd-3-clause-clear

this simply creates a public repo named <name>, includes a README, a license, and a gitignore file, and finally clones it to the local directory.

i might add the --disable-wiki flag simply because i don't use the wikis.

from the docs:

The expansion may specify additional arguments and flags. If the expansion includes positional placeholders such as "$1", extra arguments that follow the alias will be inserted appropriately. Otherwise, extra arguments will be appended to the expanded command.

so what i did was run

gh alias set pyrepo 'repo create "$1" --public --add-readme --clone --gitignore=Python --license=bsd-3-clause-clear'

and if i choose to i can add a description by adding -d "my repo's description" right after gh pyrepo <name>.
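
for example, with a hypothetical repo name:

# expands to: gh repo create my-new-project --public --add-readme --clone --gitignore=Python --license=bsd-3-clause-clear -d "testing my new alias"
gh pyrepo my-new-project -d "testing my new alias"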

how to use gh-actions to produce example images of code

what i learned

I learned to chain a lot of small tools using GitHub Actions to produce ready-to-share images of code examples for social media (namely, instagram and twitter) from my phone. The steps, generally speaking, go as follows:

  1. Create a new page in a Notion database. I'll probably create a specific template for this, like I do with TILs, but it's not necessary.
  2. GitHub Action: Use my markdownify-notion python package to write the markdown version of this page and save it in a “quarto project” folder. This lets me use one general front-matter yaml file for all files rather than automate adding front matter to each file. I can still add specific front matter to files if I want to. (this TIL is an example of how this works - I’m writing it on Notion on my phone.)
  3. GitHub Action: Use Quarto to render this markdown file --to html and save it in an “output” directory. This will execute the code in the code cells and save the output inline.
  4. GitHub Action: Use shot-scraper to produce two files: a png screenshot and a pdf file. I’m using shot-scraper for the PDF as well rather than quarto because it’s easier and I don’t need to customize this pdf file at all just yet. I’m creating and saving it essentially just because I can, it’s easy, and I might find a use for it later.
  5. GitHub Action: Once there are new png or pdf files in the “output” directory, I use s3-credentials to put those objects in an S3 bucket I also created using s3-credentials. This tool is fantastic: s3-credentials.readthedocs.io (a rough sketch of steps 3-5 follows this list.)
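
here's a minimal sketch of steps 3-5 as shell commands, as they might run inside a workflow job (the file names and bucket name are my assumptions):

# render the markdown, executing code cells along the way
quarto render til.md --to html --output-dir output

# screenshot + pdf of the rendered page
shot-scraper output/til.html -o output/til.png
shot-scraper pdf output/til.html -o output/til.pdf

# push both artifacts to the bucket
s3-credentials put-object my-til-bucket til.png output/til.png
s3-credentials put-object my-til-bucket til.pdf output/til.pdf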

This is what the final image looks like:


how to solve permission error from airflow official docker image

what i learned

tl;dr: when you use the Airflow official docker image you need to make sure that the variable AIRFLOW_UID is set to match your UID (and AIRFLOW_GID=0, aka root) or you're going to get permission errors.

i was working on deploying Airflow on a VM at work this week and got a permission error (Errno 13) about the container's python logging config. when i first started working with this docker-compose.yml i used the suggested echo -e "AIRFLOW_UID=$(id -u)" > .env command, which took my user id from my local machine (let's say it's 506) and assigned it to the AIRFLOW_UID key. now that i'm working on the VM and have extended my .env file to include other information, i figured i could just use a copy of the same file. everything else works fine, except airflow cannot write logs because the user on this virtual machine with user id 506 does not have permission to write to the ./logs/ directory.

if you google this error you'll find, among a sea of almost-right answers, that most of the solutions online are variations of "change the logs folder's permissions to 777", meaning anyone can read, write, and execute the contents of the logs. that works. however, you don't really need everyone to be able to read and write, just this airflow user. updating the UID in the VM's .env file worked perfectly without having to mess with the permissions.
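
the takeaway: regenerate the .env file on whichever machine runs docker compose instead of copying it around. it's the same suggested command from before, just run on the VM:

# run this on the VM so AIRFLOW_UID matches the user starting the containers
echo -e "AIRFLOW_UID=$(id -u)" > .env
echo "AIRFLOW_GID=0" >> .env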

about jq [ ] syntax

what i learned

If you want to dump a list of objects you’re constructing from some other json, you need to wrap your entire jq expression in square brackets ([]). Otherwise jq writes each object out one at a time, and that’s not valid JSON. For example, running something like

jq '.[] | {id: .id, title: .title, created: .created }'

returns →

{
  "id": "123",
  "title": "page 1",
  "created": "2022-01-25T23:15:00.000Z"
}
{
  "id": "124",
  "title": "page 2",
  "created": "2022-01-26T13:18:15.000Z"
}
{
  "id": "125",
  "title": "page 3",
  "created": "2022-01-27T18:37:05.000Z"
}

This output is not valid JSON. However, if you wrap your entire expression in square brackets, jq will collect all of these into one list of objects instead of emitting each object on its own.

jq '[.[] | { id: .id, title: .title, created: .created }]'

returns →

[
  {
    "id": "123",
    "title": "page 1",
    "created": "2022-01-25T23:15:00.000Z"
  },
  {
    "id": "124",
    "title": "page 2",
    "created": "2022-01-26T13:18:15.000Z"
  },
  {
    "id": "125",
    "title": "page 3",
    "created": "2022-01-27T18:37:05.000Z"
  }
]
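
to see this end to end, here's a runnable one-liner with inline sample data (the values are made up):

echo '[{"id":"123","title":"page 1","created":"2022-01-25T23:15:00.000Z"}]' \
  | jq '[.[] | {id: .id, title: .title, created: .created}]'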

how to execute a shell script in the current shell

what i learned

when you execute a shell script, it defaults to creating a new shell, executing the script in that shell, and closing it. if you want to, for example, set environment variables, you would need to run the script in the current shell. let's say you have a short shell script called env_vars.sh that sets the database url as an environment variable.

#!/bin/bash
export DATABASE_URL="super_secret_url"

if you run

sh env_vars.sh

in your terminal, it would run said script in a new shell, so those environment variables would not be set in your current shell and would be unavailable to your other scripts.

to run that in your current shell you use the following syntax

. ./env_vars.sh

this way your environment variables are set in your current shell and you can use them as expected.
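
a quick way to check it worked (in bash and zsh, source is an equivalent, more readable spelling of the dot command):

. ./env_vars.sh
echo "$DATABASE_URL"    # prints super_secret_url
source ./env_vars.sh    # same effect as the line above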

Making open data more accessible with datasette

California recently released data on stops made by officers from the state's 8 largest agencies. The data covers July through December 2018. This was the first wave of a data-disclosure requirement that will phase in over the coming years. The data covered more than 1.8 million stops across the state. While this is a step in the right direction, a single .csv file of around 640 megabytes, with more than 1.8 million rows and more than 140 columns, can be intimidating for some of the people who would benefit from exploring this data: local leaders, journalists, activists, and organizers, to name a few.