Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[css-syntax] custom property names too permissive, require namespacing rules #7129

Open
aphillips opened this issue Mar 10, 2022 · 25 comments
Open
Labels
css-syntax-3 cssom-1 Current Work i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. Needs Testcase (WPT)

Comments

@aphillips
Copy link
Contributor

aphillips commented Mar 10, 2022

Originally raised on CSS Variables, but later discussion concluded the best fix is to change CSS Syntax. Original post below:

https://www.w3.org/TR/css-variables-1/#defining-variables

A custom property is any property whose name starts with two dashes (U+002D HYPHEN-MINUS), like --foo. The production corresponds to this: it’s defined as any (a valid identifier that starts with two dashes), except -- itself, which is reserved for future use by CSS. Custom properties are solely for use by authors and users; CSS will never give them a meaning beyond what is presented here.

The above text defines the custom property name as "any valid identifier". Tracing that definition back to CSS Values and thence to ident token, we find that the name can contain any Unicode code point > U+0080. This includes various special forms of whitespace as well as potential problem characters, such as bidi controls (such as might cause "Trojan Source" attacks). Namespacing is definitely a complicated problem: the I18N WG doesn't want groups to cherry-pick characters (thereby excluding certain languages from using the feature).

Most programming languages attempt to address this by adopting some form of restriction for variable names such as those found in UAX31 Unicode Identifier and Pattern Syntax. In JavaScript, for example, the definition looks like the one found here. CSS should make similar restrictions on property names (values can remain unrestricted).

@aphillips aphillips added css-variables-1 Current Work i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. labels Mar 10, 2022
@tabatkins
Copy link
Member

Yes, custom property names can contain literally any non-ASCII character.

If necessary, I'm happy to restrict this, but I question why it would be necessary to do this for property names, but okay for property values to still be fully unrestricted?

@aphillips
Copy link
Contributor Author

Property names are used in CSS "code" and have to be be parsed, matched, and otherwise referenced. Abusive names can cause spoofing problems (even though the underlying code point sequence is still just some integers and the parser may not care). For example, is --\0301 a variable reference? Or an error? (using U+0301 COMBINING ACUTE ACCENT as an example of a combining mark at the start of a name)

Property values are data and can include natural language text (as well as, well, any character data, including junk). While the value space might be limited by applications in different ways, there don't appear to be any requirements to do so here. In fact, your Spec goes out of its way to highlight this fact:

Because custom properties can contain anything, there is no general way to know how to interpret what’s inside of them (until they’re substituted into a known property with var()). Rather than have them partially resolve in some cases but not others, they’re left completely unresolved; they’re a bare stream of CSS tokens interspersed with var() functions.

@tabatkins
Copy link
Member

Property values are used in CSS just as much as property names, tho. We can't interpret custom property values in the custom property, but we do as soon as they're substituted into a non-custom property, or if the custom property is registered with a grammar.

Moreso, tho, the <custom-ident> value type, which operates on the same rules, is used in a number of places in CSS, such as counter-style names, font names, etc. which seem to be in a similar semantic space.

@aphillips
Copy link
Contributor Author

Yes, it occurred to me that this might turn out to be a gap in CSS Syntax (which might be a serious "ouch" and difficult to do something about).

Since one of the things the property value can contain is a string literal, one probably can't apply UAX31-like rules just generically to the value (i.e. in CSS Variables)

@tabatkins
Copy link
Member

Right, removing the potential from strings def seems out (at minimum, they def shouldn't be restricted to the Identifier production from Unicode), but I think I didn't make my original point clearly enough - this restriction should apply to all custom identifiers, not just property names, right?

@aphillips
Copy link
Contributor Author

That's right.

@tabatkins
Copy link
Member

Okay, I'm gonna retag this to Syntax, then, because we should handle the restriction at that level.

@tabatkins
Copy link
Member

Agenda+ to discuss restricting the allowed codepoints in an ident sequence (used in keywords, function names, dimension units, selectors, property names, etc).

Possible options:

Then there are subsequent questions. First, what should we do about characters so restricted that are used in an identifier sequence?

  • Treat them like we do lone surrogates, and replace with U+FFFD
  • Disallow them from the production at all, so an identifier valid today with a restricted character in the middle would instead become two identifiers with a DELIM token containing the restricted character. (I'm inclined toward this, as it would cause most usage of the restricted character to become invalid at parse-time, such as in custom property names, and thus would discourage its usage.)

Second, should we allow escapes to represent the restricted character in identifiers?

  • JS, as far as I can tell, doesn't allow it. Their restriction is rather broad, tho - it's intended to make it so that no identifier that would be illegal to write literally can be written with escapes (in other words, they disallow an ident that could only be written by using escapes) - this disallows things like escaping a dash or period, which CSS has historically allowed and probably can't restrict
  • Just let it work. I'm inclined to go this way.
@svgeesus
Copy link
Contributor

Since this is now being solved at the CSS Syntax level, which is the correct way to do it, untagging from CSS Variables

@svgeesus svgeesus removed the css-variables-1 Current Work label Mar 15, 2022
@svgeesus svgeesus changed the title [css-variables] custom property names too permissive, require namespacing rules Mar 15, 2022
@tabatkins
Copy link
Member

tabatkins commented Mar 16, 2022

Note that the latter two of the three identifiers in my popular tweet would be invalid under the JS rules:

.ಠ_ಠ { --(╯°□°)╯: ︵┻━┻; } is valid CSS.

I'f I'm reading correctly, under the HTML rules the middle one --(╯°□°)╯ would be invalid due to the degree symbols, but the other two are valid.

Not that this is required to be supported, just noting the effects. ^_^

@css-meeting-bot
Copy link
Member

The CSS Working Group just discussed Custom property names too permissive, and agreed to the following:

  • RESOLVED: Use HTML restrictions for custom idents
  • RESOLVED: illegal characters in an ident can be escaped
  • RESOLVED: Invalid ident characters are treated as DELIM tokens
The full IRC log of that discussion <fantasai> Topic: Custom property names too permissive
<fantasai> github: https://github.com//issues/7129
<fantasai> TabAtkins: i18nWG raised issue about custom idents, which allow any Unicode codepoint above a certain codepoint
<fantasai> TabAtkins: There are some concerns about e.g. bidi characters corrupting the display of the code
<fantasai> TabAtkins: Also argument for consistency in what characters allowed across languages
<fantasai> TabAtkins: JS follows UAX?? rules for characters allowed in idents
<fantasai> TabAtkins: HTML allows a different but largely compatible range of characters
<fantasai> TabAtkins: In one of my Tweets, I showed off using weird Unicode rules
<fantasai> TabAtkins: e.g. different emoji are valid or invalid
<fantasai> TabAtkins: I agree with i18n feedback, reasonable to partially restrict these
<fantasai> TabAtkins: e.g. no reason to allow bidi override chars in CSS idents
<fantasai> TabAtkins: so I suggest adopting either HTML rules or JS rules
<Rossen_> q?
<fantasai> TabAtkins: don't have a strong opinion on which to go for
<fantasai> TabAtkins: Otherwise I'd go with HTML rules by default
<emilio> Scribenick: emilio
<emilio> fantasai: I think this is fairly reasonable, but I don't know the differences between the rules so I don't have an opinion on those yet
<fantasai> TabAtkins: JS rules are a bit more strict, they disallow chars that look like punctuation
<fantasai> TabAtkins: HTML gives exact codepoint ranges
<fantasai> TabAtkins: Reason I'd go with HTML is to guarantee being able to write selectors for custom elements, without ever having to escape
<Rossen_> ack fantasai
<fantasai> fantasai: That sounds reasonable, let's go with that
<fantasai> Rossen_: Makes sense, any downsides to it?
<fantasai> TabAtkins: Any change to make more restrictive, could potentially make some stylesheets invalid
<fantasai> TabAtkins: potentially breaking code that works
<fantasai> Rossen_: And with HTML rules we'd have fewer breakage
<fantasai> Rossen_: seems like path of least destruction
<fantasai> Rossen_: Anyone would like to argue against the change entirely?
<fantasai> Rossen_: If not any objections?
<fantasai> Rossen_: Taking the silence as a no
<fantasai> RESOLVED: Use HTML restrictions for custom idents
<fantasai> TabAtkins: Got 2 sub-issues
<fantasai> TabAtkins: One is whether to allow illegal characters to be escaped in an identifier
<fantasai> TabAtkins: JS doesn't allow that, you can escape for readability but not to avoid the identifier restrictions
<fantasai> TabAtkins: but CSS has traditionally always allowed escapes for everything, so don't see a strong reason to disallow
<faceless> +1 from us too
<fantasai> TabAtkins: So I would prefer to go with illegal chars can be escaped
<fantasai> fantasai: I strongly agree with that
<fantasai> Rossen_: Any objections for allowing illegal characters to be escaped in an ident?
<fantasai> RESOLVED: illegal characters in an ident can be escaped
<fantasai> TabAtkins: Next question is how do we handle the illegal characters
<dbaron> That doesn't allow nulls in idents, does it?
<fantasai> TabAtkins: Do we censor them into e.g. U+FFFD
<fantasai> TabAtkins: or drop them entirely?
<fantasai> TabAtkins: I'd prefer to drop them, because it would more clearly result in invalid code
<fantasai> TabAtkins: so if we allow to work but censored it wouldn't prevent use in source text, which was the goal of i18n
<fantasai> TabAtkins: so would prefer to exclude from the ident production
<fantasai> <fantasai> +1
<tantek> +1 TabAtkins
<fantasai> Rossen_: [missed]
<fantasai> TabAtkins: No, would not be changing existing rules for censoring rules. Currently lone surrogates etc. do that
<fantasai> TabAtkins: Those are in there for UTF-8 well-formedness and C compatibility
<fantasai> TabAtkins: They have a reason to be censored out at technical low level
<fantasai> TabAtkins: these restrictions are for human reasons, so would restrict differently
<Rossen_> ack fantasai
<fantasai> fantasai: So should we resolve that they would make the production invalid? (That's what was proposed right?)
<TabAtkins> --(╯°□°)╯
<fantasai> TabAtkins: yes
<fantasai> TabAtkins: if you put this ^ as a custom property name, the degree sign is not a valid character
<fantasai> TabAtkins: so it would make an ident, a delim, a parenthesis, and a ???
<fantasai> TabAtkins: That's definitely not an ident, because it's multiple tokens not an ident token
<bmathwig> Is there a practical use case for doing something like that? Seems more like a developer having fun rather than good quality code.
<fantasai> TabAtkins: Proposed resolution is that it would break into multiple tokens
<fantasai> fantasai: What kind of token are these invalid characters going to be?
<fantasai> TabAtkins: DELIMs, one codepoint at a time
<fantasai> TabAtkins: Characters without a specific role are generally handled as DELIM
<fantasai> TabAtkins: and we only use certain DELIMs in certain places
<TabAtkins> the degree sign isn't a valid ident char under the HTML rules, so this would produce an ident, a delim containing the degree sign, an ident, a delim, and finally an ident
<fantasai> RESOLVED: Invalid ident characters are treated as DELIM tokens
<faceless> present-
@tabatkins
Copy link
Member

The HTML allowed chars are:

"-" | "." | [0-9] | "_" | [a-z] | #xB7 | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x203F-#x2040] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

We'd continue to disallow . and allow [A-Z], of course, but for all the characters >= 0x80 we'll match this list.

@svgeesus
Copy link
Contributor

@aphillips does that resolution address your concerns?

@aphillips
Copy link
Contributor Author

Programming languages such as JS and Java that allow non-ASCII variable names with character limits usually have different restrictions for the initial character. Most notably they forbid combining marks. They sometimes exclude other values (such as bidi controls, although those are excluded above). I think using the HTML restrictions is a realistic solution for CSS for the reasons given above.

We might need a note about combining mark handling at the start of a token. HTML handles this by requiring an alpha char ([a-zA-Z] in your case) at the start (HTML also treats combining characters as non-combining when parsing, e.g. class="&#x301;" contains a class name consisting solely of a combining mark. (I used an entity for visibility). As long as the processor can't be fooled, I think you're good to go?

@tabatkins
Copy link
Member

Ah, hm, indeed. HTML gets to avoid the first-character problem. CSS does have special rules for the first character of an ident as well, but they're only different in the ASCII range (preventing idents from being number-lookalikes).

But if the concern is just about combining characters at the start, that'll be fine mechanically; the CSS parser still just handles codepoints individually and doesn't care about combining in any way. So you could select the class you mention without worry, by putting that combining char after a period.

That said, I'm fine with further first-char restrictions if necessary. Unless you request otherwise tho, @aphillips , I'll assume that the existing rules should be fine and use the same non-ASCII allowed characters for both initial and non-initial chars.

@aphillips
Copy link
Contributor Author

As long as the CSS parser doesn't care, then things should work. Content authors will need to be advised about the spoof/abuse potential when viewing CSS files as text somewhere (you may even already have such a note??)

@tabatkins
Copy link
Member

Don't have such a note, but I'm look over some of the verbage used elsewhere for that issue and add one.

tabatkins added a commit that referenced this issue Mar 23, 2022
…s to the same list that HTML allows in custom element names. #7129
@tabatkins
Copy link
Member

Okay, restriction applied, and I added a significant note to that section.

I'll need to add and/or tweak some tests for this.

@mathiasbynens
Copy link
Contributor

CC’ing @markusicu @macchiati (Unicode “Trojan Source” working group) as FYI

@tabatkins
Copy link
Member

Note that I'm currently linking to a Rust-lang blog post about the trojan source thing; if there's a better "official source" about it from Unicode I'd be happy to switch the reference over.

@aphillips
Copy link
Contributor Author

Here's Unicode's announcement, which has a link to @macchiati et al's doc about the topic.

@cdoublev
Copy link
Collaborator

cdoublev commented Jun 7, 2022

Shouldn't serialize an identifier be modified accordingly?

- If the character is not handled by one of the above rules and is greater than or equal to U+0080, is "-" (U+002D) or "_" (U+005F), or is in one of the ranges [0-9] (U+0030 to U+0039), [A-Z] (U+0041 to U+005A), or \[a-z] (U+0061 to U+007A), then the character itself.
+ If the character is not handled by one of the above rules and is a [non-ASCII ident code point](https://drafts.csswg.org/css-syntax-3/#non-ascii-ident-code-point), "-" (U+002D), "_" (U+005F), or is in one of the ranges [0-9] (U+0030 to U+0039), [A-Z] (U+0041 to U+005A), or \[a-z] (U+0061 to U+007A), then the character itself.
@tabatkins
Copy link
Member

Yes, it should be.

@tabatkins tabatkins added the cssom-1 Current Work label Jun 17, 2022
@romainmenke
Copy link
Member

romainmenke commented Nov 5, 2022

Is it correct that the ranges are inclusive?
Other definitions are explicit about this.

lowercase letter has this definition :

A code point between U+0061 LATIN SMALL LETTER A (a) and U+007A LATIN SMALL LETTER Z (z) inclusive.

Can we change the current text to :

non-ASCII ident code point
A code point whose value is any of:
U+00B7
between U+00C0 and U+00D6 inclusive
between U+00D8 and U+00F6 inclusive
between U+00F8 and U+037D inclusive
between U+037F and U+1FFF inclusive
U+200C
U+200D
U+203F
U+2040
between U+2070 and U+218F inclusive
between U+2C00 and U+2FEF inclusive
between U+3001 and U+D7FF inclusive
between U+F900 and U+FDCF inclusive
between U+FDF0 and U+FFFD inclusive
greater than or equal to U+10000


Update :

I forgot I asked this here and asked it again in #8861 (comment)

This has been answered and needed edits were made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
css-syntax-3 cssom-1 Current Work i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. Needs Testcase (WPT)
9 participants