spiffytech


ReadStuffLater uses emojis to tag content. It's simple, it's fun, and it affords basic content organization without encouraging users to spiral into reinvent-Dewey-Decimal territory.

screenshot of emoji tags
Yeah, the aesthetics need work

There's just one problem: data validation. When the client tells my server to tag a record, how can the server confirm the tag is actually an emoji? I mean, I shouldn't accept and store just anything in that field, right?

This is a much gnarlier problem than it has any right to be. If you want the TL;DR, see what I did and what I wish I'd done, and a more technical solution!

Failed idea #1: Use Regex character classes

My first thought was to google around for this, and everyone recommends regex! Everyone! Well that seemed easy.

There is a recent(?) extension to regex that lets you specifically ask, “is this an emoji?”

Except it's wrong. And also not available everywhere.

const regex = /^\p{Emoji}$/gu;
console.log("🙂".match(regex))
console.log("*️⃣".match(regex));
console.log("👨🏾".match(regex));

> Array ["🙂"]
> null
> null

I mean, it produces kinda-okay results if you ask “does this string contain any number of emojis”. But it fails hard when you ask “Is this string made of exactly one emoji, and nothing else?”.

Also, it seems Postgres regex doesn't support these special character classes, so validation would be strictly at the application layer.

EDIT: Someone showed how to patch the holes in this approach and make it work. Check it out below!

Why does the regex give the wrong answer?

I'm glad you asked! It turns out there isn't really such a thing as “an emoji”. You have code points, and code point modifiers, and code point combinations.

A great primer on this is Bear Plus Snowflake Equals Polar Bear.

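Here's the dealio: Let's say we want to display the emoji for a brown man, “👨🏾”. There isn't a code point for that. Instead we use two code points: “👨” followed by the skin tone modifier “🏾”. Other glyphs, like the woman pilot “👩‍✈️”, go a step further and glue their pieces together with ZWJ: “👩 ZWJ ✈️”.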

ZWJ is “zero-width joiner”. It's a Unicode code point (U+200D) that gets used in I guess the Indian Devanagari writing system? But it's also a fundamental building block for emojis.

Its job is “when a mommy code point loves a daddy code point very much, they come together and make a whole new glyph”.

Basically any emoji that includes at least 1 person who isn't a boring yellow person doing nothing is several code points stapled together, with a skin tone modifier, ZWJ, or both. Some other things work this way too.

Some examples include: 👨‍👩‍👦 (man + woman + boy), 👩‍✈️ (woman + airplane), and ❤️‍🔥 (heart + fire).

(And flags are multiple code points that aren't connected by ZWJ! ††)

(If your computer doesn't have current or exhaustive emoji fonts (thanks, Linux!), you might see what's supposed to be a single glyph instead displayed as several emojis side by side, like how my computer shows “Women With Bunny Ears Partying, Type-1-2” as “ 👯 🏻 ♀️”.)

So our regex can't just check if a string is an emoji: many things we want to identify are several emojis stapled together.
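To see the stapling for yourself, you can dump a string's code points. This is only a minimal sketch: `codePoints` is a hypothetical helper name, and the two strings are spelled with escapes so they survive fonts and editors that mangle emoji.

```typescript
// Hypothetical helper: list the Unicode code points in a string.
// Array.from iterates a string by code point, not by UTF-16 code unit.
function codePoints(s: string): string[] {
  return Array.from(
    s,
    (ch) => "U+" + ch.codePointAt(0)!.toString(16).toUpperCase()
  );
}

// Man + medium-dark skin tone modifier: two code points, one glyph.
console.log(codePoints("\u{1F468}\u{1F3FE}"));

// Woman + ZWJ + airplane + VS16: four code points, one glyph.
console.log(codePoints("\u{1F469}\u200D\u2708\uFE0F"));
```

Anything longer than one entry in that output is a glyph that a single-code-point regex will never match as a whole.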

(The way you want to think about your goal here is “graphemes” and “glyphs”, not “characters”.)

Fortunately, when I experimented, it looked like you have to join characters in a specific order, so when you add both skin tone and hair color (“👱🏿‍♂️”) you can count on it happening in exactly one canonical byte sequence. Otherwise, we'd have to dive into Unicode normalization (a good topic to understand anyway!).

Edit: Someone showed me how to make this work. Check it out below!

Failed idea #2: Use Regex character ranges

Alright, so we can't just use the special regex “match me some emoji” feature. What about a regex full of Unicode character ranges? StackOverflow sure loves those!

Well, they're all either too broad or too narrow.

You get stuff like “just capture anything that's a 'Unicode other symbol'” (/\p{So}+/gu). This fails for the same reasons as approach #1, and also for the bonus reason that this character class includes symbols that aren't emojis ('❤').

Ah, but some other StackOverflow answer says to just use a regex for Unicode code points! That also fails the same way as approach #1, plus, nobody includes exhaustive code point ranges in their SO answers.

Here's a partial list of valid emoji:

Two things to note:

1) There are quite a few code ranges that include emoji! Not just the handful that all the StackOverflow answers include. If you want zero false positives, you need (eyeballing it) a hundred code point ranges.

2) See all those grey empty spaces? That's non-emoji characters that are in those same code point ranges. You probably don't want to accept “ª”, “«”, etc. as emoji.

So you're either including a bajillion micro-ranges, or a handful of very wide ranges that will give false positives, or you're rejecting valid emoji.

And once you pick some ranges, I have no idea whether they'll include the new emoji the standard adds each year.

So validating code point ranges is a terrible approach. It's just plain wrong: emojis aren't individual code points in the first place, and you'll get a huge portion of false positives and false negatives.

Oh, don't forget that JavaScript strings are UTF-16, while most everything else uses UTF-8. If you're building a regex from hardcoded Unicode numbers without the u flag, anything above U+FFFF has to be written as a surrogate pair, so the numbers in your pattern will be different.
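Here's a small sketch of what the UTF-16 wrinkle looks like in practice. `codeUnits` is a made-up helper for illustration, and the range in the regex is just one emoji block, not a complete list.

```typescript
// One emoji, one code point, but two UTF-16 code units in JavaScript.
const smile = "\u{1F642}";

// Hypothetical helper: the UTF-16 code units of a string, as hex.
function codeUnits(s: string): string[] {
  const units: string[] = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i).toString(16));
  }
  return units;
}

console.log(smile.length);     // 2
console.log(codeUnits(smile)); // ["d83d", "de42"]

// With the `u` flag the regex works in code points, so the range can be
// written with the same numbers you would use in any other language.
const emoticonsBlock = /^[\u{1F600}-\u{1F64F}]$/u;
console.log(emoticonsBlock.test(smile)); // true
```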

Failed idea #3: just stuff all possible emojis into a regex

Alright, so what if I just get a list of EVERY POSSIBLE EMOJI, and build a regex out of them like /🙂|😢|😆/. It's exhaustive, it's accurate, and it'll match individual, whole glyphs.

Except... *️⃣ broke my regex, because it's not its own symbol: it's a regular asterisk followed by other stuff: “* + VS16 + COMBINING ENCLOSING KEYCAP”.

VS16 is the Unicode code point (U+FE0F) that says “Hey, this character can either look like text or like an emoji, please show it as emoji”.

Regex wasn't happy about that – all it saw was a random asterisk in my pattern and it threw a fit.

I mean, even the markdown engine for this blog post mistook that as “please make the rest of my post italic” until I put the emoji into a code block.
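For what it's worth, the asterisk problem on its own goes away if every entry is escaped before it lands in the pattern. A minimal sketch, assuming a plain array of emoji strings; `escapeRegex` and `buildEmojiRegex` are hypothetical names, and the three-item list is only a sample.

```typescript
// Escape regex metacharacters so a literal like "*" is safe inside a pattern.
function escapeRegex(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical builder: one anchored alternation over a list of emoji.
function buildEmojiRegex(emoji: string[]): RegExp {
  const alternatives = emoji.map(escapeRegex).join("|");
  return new RegExp(`^(?:${alternatives})$`, "u");
}

// Tiny sample list, not an exhaustive one. The last entry is the keycap
// asterisk: "*" + VS16 + COMBINING ENCLOSING KEYCAP.
const sampleEmoji = ["\u{1F642}", "\u{1F622}", "*\uFE0F\u20E3"];
const sampleRegex = buildEmojiRegex(sampleEmoji);

console.log(sampleRegex.test("*\uFE0F\u20E3")); // true
console.log(sampleRegex.test("*"));             // false
```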

But maybe I was on the right track trying to exhaustively match all emoji?

What worked for me

What I finally came up with was exhaustively validating emoji shortcodes instead of emojis themselves. Shortcodes are those things you type into GitHub or Slack to summon the emoji popup – e.g., “:winking_face:”.

The great part about shortcodes is they're strictly simple characters. Off the cuff, I think it's all a-z and _. Unsure about numbers.

That makes them super convenient to store or pattern match on. Not so convenient for other reasons (see the next section).

So when a user picks out an emoji, I find its shortcode and store that in the database. When I display an emoji, I convert the other way.

To build my allowlist, I found an NPM package that holds the same data as the emoji picker I'm using. I wrote a script to extract all the shortcodes, generate all the appropriate variants, and turned that into a SQL list of values I could copy/paste.

I stuffed that into a database table and foreign key'd my records to it. (I previously used a CHECK constraint using IN, but that made schemas very noisy.)

I wrote the output of that script to disk and checked it into source control. Now every time I build the app I generate the data again and compare against the oracle, so if the package's list of valid emoji gets updated, I'll get a build failure until I update my allowlist.
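The build-time comparison can be sketched roughly like this. It assumes both the freshly generated data and the checked-in snapshot are plain arrays of shortcodes; the function names are hypothetical, not the ones in my build.

```typescript
// Result of comparing freshly generated shortcodes with the checked-in list.
interface AllowlistDiff {
  added: string[];   // in the generated data, missing from the snapshot
  removed: string[]; // in the snapshot, no longer generated
}

// Hypothetical check: both arguments are plain arrays of shortcodes.
function diffAllowlist(generated: string[], snapshot: string[]): AllowlistDiff {
  const generatedSet = new Set(generated);
  const snapshotSet = new Set(snapshot);
  return {
    added: generated.filter((code) => !snapshotSet.has(code)),
    removed: snapshot.filter((code) => !generatedSet.has(code)),
  };
}

// A build step can call this and let the thrown error fail the build
// whenever the two lists drift apart.
function assertAllowlistCurrent(generated: string[], snapshot: string[]): void {
  const { added, removed } = diffAllowlist(generated, snapshot);
  if (added.length > 0 || removed.length > 0) {
    throw new Error(
      `Allowlist is stale. Added: ${added.join(", ")}. Removed: ${removed.join(", ")}.`
    );
  }
}
```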

Problem: solved ✅

What I wish I'd done instead

I should have done basically the same thing, except with the actual emoji. I used shortcodes because I got caught up in path dependence with the regex stuff. But if I'm already using a data structure of discrete strings, why not just use the emoji themselves?

There's a modest advantage in network / storage efficiency (why store lot byte when few byte do trick?), but the real advantage would be simplicity.

In the emoji dataset I have, an emoji like “:people_holding_hands:” 🧑‍🤝‍🧑 doesn't have different shortcodes for skin tone or hair color. It's just “:people_holding_hands:”. Checking Emojipedia, I get the uncertain impression that shortcodes might not be standardized, and I see some tools have different shortcodes for skin tones, while others don't.

I had to make up my own encoding for that, including noticing the emoji might have zero skin tones (yellow figures), or multiple skin tones (two figures of different races).

I also have to do a lookup every time I display an emoji. In an ideal world, I'd lazy-load the emoji picker JS so it only downloads when the user actually wants to select an emoji.

But because I have to convert shortcodes to emoji, I have to load the picker database on any page where I want to display an emoji, so I can figure out what glyph matches my stored data people_holding_hands:3:5.

If I were to revisit my implementation, I'd just store and validate straight-up emoji.
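A minimal sketch of what that could look like at the application layer, assuming the allowlist is generated from the picker's dataset the same way the shortcode list was. The three entries and the names `allowedEmoji` and `isAllowedEmoji` are purely illustrative.

```typescript
// Hypothetical allowlist. A real one would be generated from the picker's
// dataset; these three entries are only for illustration.
const allowedEmoji: ReadonlySet<string> = new Set([
  "\u{1F642}",          // slightly smiling face
  "*\uFE0F\u20E3",      // keycap asterisk
  "\u{1F468}\u{1F3FE}", // man, medium-dark skin tone
]);

// Exact string membership: no regex, no shortcode lookup.
function isAllowedEmoji(tag: string): boolean {
  return allowedEmoji.has(tag);
}

console.log(isAllowedEmoji("*\uFE0F\u20E3")); // true
console.log(isAllowedEmoji("*"));             // false
```

Exact membership only works if the client sends the same code point sequence the list contains, which is one more reason to generate the list from the picker's own data.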

A more technical solution

Over on Lobsters, user singpolyma pointed out how to test a string without needing an oracle.

You use your language's tools to detect if the string is a single grapheme, and then you check if it either passes the Emoji regex character class, or contains the Emoji variant selector code point.

Here's what you do:

const isEmoji = (e: string) => {
  const segmenter = new Intl.Segmenter();
  const regex = /\p{Emoji_Presentation}/u;
  const variantSelector = String.fromCodePoint(0xfe0f);

  return Boolean(
    Array.from(segmenter.segment(e)).length === 1 &&
      (e.match(regex) || e.includes(variantSelector))
  );
};

On my test data set, 229 out of 3,664 emoji fail the regex test by itself, such as ☺️, ☹️, ☠️, 👁️‍🗨️. But all of those contain the VS16 Emoji variant selector code point!

This means you use the grapheme count to tell “does this look like one glyph to the user?”, then follow up with “does this either show as an emoji by default, or get converted into one?”. All the safety, no oracles!

Well... mostly. It does mean any single grapheme containing VS16 will be accepted, which isn't the same thing as a valid emoji...
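Here's a quick sketch of that loophole. The check is the same logic as above, repeated under the hypothetical name `isEmojiCandidate` so the snippet runs on its own; it assumes a runtime that ships Intl.Segmenter.

```typescript
// Same check as above, repeated here so the snippet is self-contained.
const isEmojiCandidate = (e: string): boolean => {
  const segmenter = new Intl.Segmenter();
  const presentation = /\p{Emoji_Presentation}/u;
  const vs16 = String.fromCodePoint(0xfe0f);
  return (
    Array.from(segmenter.segment(e)).length === 1 &&
    (presentation.test(e) || e.includes(vs16))
  );
};

console.log(isEmojiCandidate("\u{1F642}"));    // true: emoji by default
console.log(isEmojiCandidate("\u263A\uFE0F")); // true: text symbol + VS16
console.log(isEmojiCandidate("a\uFE0F"));      // true, yet "a" is no emoji
console.log(isEmojiCandidate("ab"));           // false: two graphemes
```

VS16 attaches to the preceding character without starting a new grapheme, so a letter followed by VS16 still counts as one grapheme and slips through.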

When I wrote this, Intl.Segmenter was available everywhere except Firefox, which has since added it. And Postgres cannot count graphemes or use the Emoji regex character class, so you can only do application-level data validation. But you're free from managing an allowlist, so there's that.


Footnotes:

Your emoji picker includes country flags, but the Unicode Consortium doesn't want to take sides on whether Taiwan is a real country.

So they dodged the issue: if your text includes a 2-letter country abbreviation encoded as emoji letters, it might or might not display as a flag, depending on how your device feels.

So you're free to include “🇹 🇼” in your text, and if you just so happen to be in a country that doesn't find Taiwanese sovereignty objectionable, you'll see it displayed as 🇹🇼. Otherwise you'll just see 🇹 🇼.

EDIT: New info from Lobsters: it looks like my information is outdated! Or maybe wrong! I mean the part about how flags are rendered is correct, but the “they don't say which flags are valid” part might not be.

At some point the list of acceptable country flags got enumerated. That file dates back to 2015, and references a Consortium task seeking to clarify what “subtags” are valid. That Atlassian task is newer than the Github commits, so I guess its timestamp is a lie, leaving me unable to tell how early the enumeration took place.

However, “depictions of images for flags may be subject to constraints by the administration of that region.”

I would have learned this factoid sometime around the Unicode 6.0 release in 2010, so maybe they started enumerating country codes later, or maybe I just learned wrong in the first place.


Why emoji and not a normal tagging system with arbitrary text?

I want ReadStuffLater to be a very low-friction, low-cognitive-overhead experience. It's not a place to organize your second brain; it's just read-and-delete.

Simply making rich organization available can make people feel like they're supposed to use it. And once people think that's the kind of app this is, they'll start expecting features that are expressly out of scope.

Yet once a user saves hundreds of links, they need something besides one giant list. This is my attempt to split the difference. And for product positioning purposes, I want to signal “do not expect this to be the same as Instapaper”.

If a more second-brain-flavored reading list is what you need, I recommend Instapaper, Pocket, or Wallabag. They're a take on this problem with a stronger focus on long-term knowledge retention.

My app ReadStuffLater fundamentally revolves around scraping web pages with the Microlink API. Sometimes that goes wrong: the target web page has a problem. Or Microlink does. Or the target throws up a captcha or blocks data center IPs or something.

I thought I'd done an alright job of handling all the cases that could go wrong. API errors are retried, website errors retry or do something sensible based on common status codes.

Yet I still get links that intermittently fail to scrape for no apparent reason. There are a couple usual suspects where I thought I'd handled all the failure modes, but they keep going wrong. And random pages have problems sometimes, too.

My strategy has been “notice failed scrapes, rerun them while reading the logs, then fix”. This has problems:

  1. Detecting a legitimate failure is hard. Sometimes a scrape fails unrecoverably for reasons outside my control. False negatives are real.
  2. This doesn't let me diagnose transient failures, where everything is working again by the time I manually verify the problem.
  3. It's a pain since I don't have good tools to peek into the scrape process. I'm always setting up something ad-hoc like console.log in local dev.

I think the right call here is an audit log. Every scrape would get its raw result stored in the DB. Status codes, body, Microlink metadata. Everything.

I guess I'd need a way to look up the audit records for a given link. A CLI script on the prod box would probably be okay. Maybe reformat the data in a way that's convenient to munge with jq.
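As a rough sketch of what I have in mind, here's one possible record shape plus an in-memory stand-in for the table. Every field and class name below is hypothetical, not taken from the app or from Microlink's API, and a real version would write to the DB instead of an array.

```typescript
// Hypothetical shape for one audit record. Field names are illustrative.
interface ScrapeAuditRecord {
  linkId: string;
  attemptedAt: string;       // ISO 8601 timestamp
  statusCode: number | null; // null when no HTTP response came back
  body: string | null;
  metadata: unknown;         // raw scraper response, stored untouched
  error: string | null;
}

// In-memory stand-in for a database table.
class ScrapeAuditLog {
  private records: ScrapeAuditRecord[] = [];

  record(entry: ScrapeAuditRecord): void {
    this.records.push(entry);
  }

  // All attempts for one link, oldest first.
  forLink(linkId: string): ScrapeAuditRecord[] {
    return this.records
      .filter((r) => r.linkId === linkId)
      .sort((a, b) => a.attemptedAt.localeCompare(b.attemptedAt));
  }

  // One JSON object per line, which is convenient to pipe into jq.
  toJsonLines(linkId: string): string {
    return this.forLink(linkId)
      .map((r) => JSON.stringify(r))
      .join("\n");
  }
}
```

Storing the raw response untouched is the point: transient failures can be inspected after the fact instead of being reproduced by hand.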

The interesting question is: at what point should I have recognized that I needed an audit log?

Should I be logging ALL external API calls? What internal operations should I log? Do I just wait until I have problems and then start logging? I'm not sure.

The logical extreme of this is “just adopt event sourcing”. And yes, that would solve this problem. But what other problems would it cause? Maybe I should just adopt it piecemeal, and not for the whole system? But then I'm paying the implementation complexity cost for minimal benefit.

Idunno. All I know is right now I sure need an audit log for this one piece of the system 🙂

I was recently on my way out the door when I knocked over a glass of water, spilling it across my Framework laptop. I panicked and tried to dab it up, but saw that water had seeped under the keyboard and was leaking out the bottom of the laptop. The screen began to flicker.

I held down the power button and began disassembling the laptop. In minutes I had the laptop in pieces and ran a hair dryer over it. Water had gotten into many nooks and crannies; every time I tilted the unit a new direction water ran out from somewhere I had missed.

I dried up all the water I could spot, and left the unit open to air out for about 24 hours.

The next morning I put it back together and it works fine!

I don't think I'd have been so lucky with other laptops I've owned. They've all been difficult to open, or purposely designed to keep users out. I'd have been at the mercy of however well they drain, with little assurance of when (or if) it was safe to power them on again.

My Framework was trivial to open, even while stressed and anxious, and I had the comfort of knowing that if it did break, I'd probably only have to replace the mainboard, and not the whole laptop.

disassembled laptop

My app includes content areas that expand and collapse. A lot like accordions, except they take up the whole page and can be huge.

When you open one, whatever's open gets closed, and that makes whatever you just clicked on jump around as the previous content area stops taking up space on the page.

Here's a code snippet I put together that keeps whatever you just clicked at the same spot in the viewport after the layout shift:

/**
 * This ensures that the element is at the same position within the viewport
 * after a layout shift.
 *
 * MUST be called BEFORE triggering the layout shift.
 *
 * It can only do so much - if the layout shift cuts off enough content, the
 * element will still wind up positioned higher in the viewport than before.
 */
export const retainScrollPosition = (el: Element) => {
  const targetViewportPosition = el.getBoundingClientRect().top;

  requestAnimationFrame(() => {
    const newPagePosition = el.getBoundingClientRect().top + window.scrollY;
    window.scrollTo({ top: newPagePosition - targetViewportPosition });
  });
};

We have to pick a new health insurance plan this month, and we've had a tough time making the decision.

You can't just add up what you'll spend – what each thing costs depends on how much you've already spent!

And some things are inherently probabilistic – will I go through procedure X this year? How many visits will I need for condition Y? How many urgent care visits?

So complex and uncertain!

Inspired by vaguely recalling that I read Lucas F. Costa's blog post some time ago, I applied the Monte Carlo method to my health insurance decision.

I have a simplistic understanding of Monte Carlo simulations:

  1. Assign probabilities to everything that can happen in your scenario
  2. Randomly select outcomes for each possible event, then repeat the calculations a gazillion times
  3. Measure how things typically play out

It can get much fancier (hello, MCMC!) but I think that's the gist of it.

I put together a simple TypeScript file with some arithmetic operations and calls to Math.random() and ran it with Bun. I punched in all the reasons my wife and I will or might spend money on healthcare, added in the premiums, and took the average result.
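The shape of that file was roughly like the sketch below. This is a simplified reconstruction, not my actual script: the plan model (deductible, coinsurance, out-of-pocket max), the type names, and the injectable `rand` parameter are all assumptions for illustration.

```typescript
// A simplified plan model. Real plans have more moving parts.
interface Plan {
  annualPremium: number;
  deductible: number;
  coinsurance: number;    // share of costs paid after the deductible, 0..1
  outOfPocketMax: number; // cap on spending, premiums excluded
}

interface PossibleEvent {
  probability: number; // chance it happens this year, 0..1
  cost: number;        // billed cost if it does
}

// What the plan holder pays in a year with `billed` dollars of care.
function yearlyCost(plan: Plan, billed: number): number {
  const underDeductible = Math.min(billed, plan.deductible);
  const overDeductible = Math.max(billed - plan.deductible, 0);
  const outOfPocket = Math.min(
    underDeductible + overDeductible * plan.coinsurance,
    plan.outOfPocketMax
  );
  return plan.annualPremium + outOfPocket;
}

// Average yearly cost over many simulated years. `rand` is injectable so
// runs can be made repeatable; it defaults to Math.random.
function simulate(
  plan: Plan,
  events: PossibleEvent[],
  trials: number,
  rand: () => number = Math.random
): number {
  let total = 0;
  for (let i = 0; i < trials; i++) {
    let billed = 0;
    for (const event of events) {
      if (rand() < event.probability) billed += event.cost;
    }
    total += yearlyCost(plan, billed);
  }
  return total / trials;
}
```

Run `simulate` once per plan with the same event list and compare the averages. The cap and the deductible are why simply adding up expected costs gives the wrong answer.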

Surprisingly, the expensive plan will save us a couple thousand dollars this year, even accounting for the higher premiums.

I feel better about the decision since I did something resembling rigorous calculation of which plan is best. Usually I just guesstimate and anxiously hope for the best.

I give 100% effort. But I have no sense of moderation – I'm on or off.

I can do great work when I care about something. But if I don't, I can barely work at all.

I don't take credit for work that wasn't mine. But I can't feel work satisfaction just because I'm in the same team or company as the person who did something cool.

I can understand some things very clearly and deeply. But I can't believe something or change my mind just because someone insists I should.

I can do excellent work when I understand the assignment in detail. But when I don't, I can't even get started.

I have very broad interests, and can get excited about many subjects. But I can't stay focused on any one subject very long.

I aspire for my work to be excellent. But I'm inflexible, opinionated, stubborn, and the pickiest eater you'll ever meet.


Every neurodivergent trait that makes me stronger also makes me weaker. A coin with two sides. But sometimes people don't understand that; they imagine that their own capabilities are a baseline that I can just add my own strengths on top of.

My weaknesses are an integral part of me. You can't have half of the coin.

Dokku has a Let's Encrypt plugin which works behind Cloudflare. There's just a little bit of chicken-and-egg setup involved.

Let's Encrypt needs to connect back to your server to validate ownership of your domain. You can't have Cloudflare's “full” TLS mode enabled when you're doing first-time validation, because in “full” mode Cloudflare will error out, failing to establish a TLS connection to your not-yet-TLS backend server.

You could disable “full (strict)” TLS mode in Cloudflare, but then you'll take all your sites down: Dokku does HTTP –> HTTPS redirects on all sites configured with TLS, and will thus reject the non-TLS inbound connections from Cloudflare's networks. Or more accurately, it'll receive an inbound HTTP request from Cloudflare's servers, return a redirect to HTTPS, which Cloudflare will pass on to the client, but the client is already at an HTTPS URL, so the client will enter an infinite redirect loop.

You can get around all this during first-time setup by disabling Cloudflare's proxying behavior on your domain while you get Let's Encrypt set up on Dokku. After it's set up, you can turn Cloudflare proxying on, and cert renewals should work fine, since Let's Encrypt validation checks routed through Cloudflare can still establish end-to-end TLS while your certs remain valid.

While at work I was upgrading graphql-code-generator, and found that the generated code no longer exports the type for a query node. Instead, the type of the whole query response was exported. But I had code that relied on having the node type available.

To solve this, I created a new type by reading attributes from deep in the response type.

interface Foo {
    bar: {
        baz: {
            zap: {
                zoop: number;
            }[];
        };
    };
}

type zoop = Foo['bar']['baz']['zap'][number]['zoop'];

If some of your properties are optional, you can use Required<T>. If they're possibly undefined, use Exclude<T, undefined>.
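Here's a small sketch of both cases together. `QueryResult` is a made-up stand-in for a generated response type, with one optional property and one that is present but possibly undefined.

```typescript
// Hypothetical response type for illustration.
interface QueryResult {
  bar?: {
    // optional property
    baz: {
      // present, but possibly undefined
      zap: { zoop: number }[] | undefined;
    };
  };
}

// Required<T> strips the `?`, so `bar` can be indexed into.
type Baz = Required<QueryResult>["bar"]["baz"];

// Exclude<T, undefined> removes `undefined` from the union before indexing.
type Zap = Exclude<Baz["zap"], undefined>[number];

// Resolves to `number`.
type Zoop = Zap["zoop"];

const exampleZoop: Zoop = 42;
console.log(exampleZoop);
```

Without those wrappers, the compiler refuses to index into a type that might be undefined.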

Recently, while building a simple Reddit clone, I wanted to lazy-load images and comments. That is, rather than loading all of the images and comments the instant I added a component to the DOM, I wanted to wait until the component was actually visible. This spreads out the impact of loading a page, both for the client and the server.

Read more...

I needed to create a bottom nav (like what's used in many mobile apps) in CSS. Here's how I did it.

Read more...