Evaluate CLDR as database for cultures #128
I know they don't love keeping it working (they do, it just is a hassle), but Dojo has scripts to take a lot of the CLDR info and turn it into usable JSON data and functions: http://bugs.dojotoolkit.org/browser/dojo/util/trunk/buildscripts/cldr It has some hefty prerequisites, but that's to be expected with XSLT and whatnot. I'd imagine you could do it a touch cleaner in Node these days, but the core logic would be the same. I'm definitely looking into doing this, but I'd love for CLDR to just give us this stuff natively. :D Adam Peller is the expert on this stuff and is/was active on that CLDR ticket.
Thanks Alex, appreciate your input! More puzzle pieces to work with.
There's some activity on the CLDR ticket: http://unicode.org/cldr/trac/ticket/2733#comment:22 Looks like for now our best bet is to just wait a few more weeks and hope for a release of those tools.
Btw. Tim Wood of Moment.js had general interest in using CLDR, but outlined some issues here: moment/moment#315 - he has now also closed that ticket, since he doesn't have time to work on it. If we ever want to add relative time functions to Globalize, Moment.js would be a good starting point.
There's a Ldml2JsonConverter in CLDR now: http://unicode.org/cldr/trac/changeset/7886 Need to try that out.
Google's Closure library uses CLDR as the data source. I haven't yet figured out how they import the data, but the data itself is in this file: http://docs.closure-library.googlecode.com/git/closure_goog_i18n_datetimesymbols.js.source.html The equivalent of our format method is in DateTimeFormat: http://docs.closure-library.googlecode.com/git/class_goog_i18n_DateTimeFormat.html The source lists the tokens it supports, which I can't find in the API document. The header says it's based on CLDR standards: "Datetime formatting functions following the pattern specification as defined in JDK, ICU and CLDR, with minor modification for typical usage in JS. Pattern specification: (Refer to JDK/ICU/CLDR)"
Twitter published a JavaScript port of their Ruby CLDR wrapper 8 months ago: https://github.com/twitter/twitter-cldr-js
Thanks @dilvie for those links. It looks like we could just load them directly. It's nice to see that there are properties for relative time formatting, e.g. "6 months ago". That's not something Globalize supports right now, but we could consider adding it.
@jzaefferer Agreed. There's a lot of data that is not necessary most of the time. Maybe somebody could create a custom build script similar to http://projects.jga.me/jquery-builder/, so we get only the features we really need. That would actually be really nice in Globalize -- especially if all you need are string translations and number formatting. Or string translations and relative time (gentle nudge to anybody with more time than I can spare at the moment).
It would be nice if we could just filter, as opposed to reformatting, the data. This would allow anyone who has full CLDR already available to just use that instead of duplicating some data with the Globalize files. We'll have to see if using the existing structure would become awkward.
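To make the filter-don't-reformat idea concrete, here's a minimal sketch. The `filterCldr` helper and the path list are hypothetical, not part of any existing API; it just copies the requested subtrees while preserving CLDR's original nesting, so anyone who already has the full data set could skip the filtering step entirely.

```javascript
// Sketch: keep only the CLDR subtrees a build actually needs, preserving
// the original structure for compatibility with full-CLDR users.
// `filterCldr` and the path syntax are hypothetical, not Globalize API.
function filterCldr(data, paths) {
  var result = {};
  paths.forEach(function (path) {
    var keys = path.split("/");
    var src = data;
    var dst = result;
    for (var i = 0; i < keys.length; i++) {
      if (src == null || !(keys[i] in src)) return; // path absent: skip it
      src = src[keys[i]];
      if (i === keys.length - 1) {
        dst[keys[i]] = src; // copy the leaf subtree as-is
      } else {
        dst = dst[keys[i]] = dst[keys[i]] || {};
      }
    }
  });
  return result;
}
```

For example, `filterCldr(cldr, ["main/en/numbers/symbols"])` would return an object with the same `main/en/numbers` nesting as the source, containing only the symbols subtree.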
Hey guys, I noticed this issue earlier when I was investigating how to convert the CLDR data to JSON format. You need the CLDR core.zip and tools.zip files, then you use the Ldml2JsonConverter from the tools to convert the data. You can use what I have so far as an example, in the tools folder at https://github.com/andyearnshaw/Intl.js.
@andyearnshaw Thanks for the tips!
What about coverage? Does the CLDR data cover all the culture data needed by Globalize? @jzaefferer, have you or anyone else looked into it already?
@rxaviers I don't think anyone did. Though Tim Wood, of moment.js, said that it lacks support for relative time, suggesting that everything else was there. We need to verify that either way.
I started mapping the languages. There are 79 missing languages/cultures in CLDR (that are present in Globalize). They are: https://gist.github.com/rxaviers/5933900#file-missing-globalize-cultures-in-cldr First question: are we OK with dropping those? Next step: content mapping. PS: Note that there are languages supported by CLDR that we don't currently support; they are the green ones in https://gist.github.com/rxaviers/5933900#file-globalize-vs-cldr-diff Update: By the time I made this comment, I wasn't aware of the LDML inheritance rules, so my conclusion of 79 missing languages is not correct. See the comments below for more accurate info.
Anything that's not in CLDR will be dropped. If someone wants us to support a new locale, they'll have to go through CLDR. We will no longer maintain our own data set. For content mapping, we'll be filtering the data, but we cannot change the structure. We want to be fully compatible with the full data set, so if someone already has the JSON files available from somewhere else, they should be able to use those with Globalize.
Yeah, now that CLDR has an official JSON format, we should build on top of that, and do the mapping internally. Might make sense to pick one very simple formatting task and port that to CLDR, to see if we can just rename a few references or have to do heavy refactorings.
You'll probably find the CLDR JSON to be a little bloated and require a […]. As far as I can tell at a quick glance, Globalize would lose […].
Great. So, language coverage is not an issue. My next comment is about the content.
Just mapped the content. My findings follow below. Some definitions are mappable, some are not. What are we going to do with those?

**Currency**

This is implemented in a different way on CLDR. Current Globalize associates one currency symbol per culture/locale; CLDR doesn't. CLDR defines a list of currencies per country code (which is more accurate IMHO). The closest you get using CLDR is: (a) get the list of territories of a language, then (b) get the currencies of those territories from the supplemental data. Note that I am not talking about number symbols (decimal symbol, group separator, plus and minus signs); these are defined on CLDR per locale (just like we do on Globalize).

**Calendar**

This is also implemented in a different way on CLDR. Current Globalize defines the calendar preference per culture. Analogous to the currency above, CLDR defines the calendar preference and the firstDay preference per territory (not per locale). Each calendar's definitions (e.g. Gregorian's names of the days) are defined on CLDR per locale, like Globalize does. Some definitions are missing (e.g. the separator between the parts of a time). Some have mappings, but not a simple, straightforward mapping.

**Number**

Some definitions are missing, e.g. negative pattern, decimals preference (except for currencies), and groupSizes. The full mapping is in my gist.
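To illustrate the territory-based currency lookup described above, here's a rough sketch. The data shape is a simplified stand-in for CLDR's supplemental currency data, and `currentCurrency` is a hypothetical helper, not Globalize or CLDR API.

```javascript
// Sketch of the territory-based currency lookup. The data below is a
// simplified stand-in for CLDR's supplemental currency-per-region table;
// real CLDR nests it deeper and includes historical entries.
var currencyData = {
  region: {
    BR: [{ BRL: { _from: "1994-07-01" } }],
    US: [{ USD: { _from: "1792-04-02" } }]
  }
};

function currentCurrency(territory) {
  var entries = currencyData.region[territory] || [];
  // Take the most recent entry that has no `_to` date, i.e. still in use.
  for (var i = entries.length - 1; i >= 0; i--) {
    var codes = Object.keys(entries[i]);
    for (var j = 0; j < codes.length; j++) {
      if (!entries[i][codes[j]]._to) return codes[j];
    }
  }
  return null;
}
```

So resolving a locale's currency would mean first mapping the locale to a territory, then calling something like `currentCurrency("BR")`.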
Negative patterns aren't missing; they just aren't included where the format is simply a "-" prefix on the positive pattern. See http://www.unicode.org/reports/tr35/tr35-numbers.html. Also, the patterns themselves define the group sizes. I'd recommend a thorough read of the relevant TR35 sections, because you will probably need the knowledge those documents provide, like where multiple inheritance is concerned (for example, with calendar properties).
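As a sketch of those TR35 defaulting rules: the negative subpattern defaults to the positive one prefixed with "-", and the group sizes fall out of the positions of "," in the pattern. This is a simplification (it ignores quoting, currency placeholders, and other edge cases) and `parsePattern` is a hypothetical name.

```javascript
// Sketch: derive the negative pattern and group sizes from a CLDR number
// pattern, per TR35's defaulting rules. Simplified: ignores quoting,
// currency symbols, and exotic patterns.
function parsePattern(pattern) {
  var parts = pattern.split(";");
  var positive = parts[0];
  // No explicit negative subpattern means "-" + positive pattern.
  var negative = parts[1] || "-" + positive;
  // Group sizes are the lengths of the integer-part groups after the
  // first "," with the primary (rightmost) size listed first.
  var integer = positive.split(".")[0];
  var groups = integer.split(",");
  var groupSizes = groups.slice(1)
    .map(function (g) { return g.length; })
    .reverse();
  return { positive: positive, negative: negative, groupSizes: groupSizes };
}
```

For example, `parsePattern("#,##0.00")` yields a negative pattern of `"-#,##0.00"` and group sizes `[3]`, while the Indian-digit pattern `"#,##,##0.###"` yields `[3, 2]`.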
Great, thank you. So, two fewer missing mappings.
Regarding currency, the current implementation in Globalize has its limitations anyway; there's some background about that in #66. We also have a bunch of issues labelled culture-bugs. Switching to CLDR should resolve, or at least help to resolve, those. As Andy commented, the negative pattern and groupSize are there, in the pattern, and the same seems to apply to the decimals field. So instead of having three properties for these, there's just one pattern that defines them all. I suppose negativeInfinity can be inferred just like the negative pattern can, though there might actually be locales which have a definition for that. Are there patterns for percent? If so, those should cover the pattern, decimals, and groupSizes properties, just like the regular number patterns. There's a bunch more stuff in your gist where I don't have a reply. It seems like we can just drop a few things, like the AM/PM fields.
Stick with the new CLDR patterns. It seems like we have to change enough stuff anyway that this needs a 2.0 release. I don't think we'll be able to provide backwards compatibility, for example for the currency changes.
I have updated my gist based on Jörn's and Andy's comments. As it points out, for some areas of the Globalize code the update won't be a simple matter of getting the locale data from somewhere else; it's going to be a full refactoring. But for all of those updates, it seems we won't lose any features. Actually, CLDR seems more complete and better structured.
Chatted with Rafael on IRC about this. He'll start with some prototyping to figure out what's needed to support those number or date patterns. We can then discuss API changes based on the prototyping results.
The CLDR data is one of those things that requires us to shift our thinking, and for well-vetted reasons. I've been working a lot with the data; here are a few things I've noticed:

1. Language vs. territory. A few versions back they moved language-irrelevant data into the supplemental files.

2. Inheritance. For those that don't know: CLDR uses a crazy custom multiple-inheritance scheme in the XML data. Although the extracted JSON data represents a flattened view of that inheritance, inheritance shouldn't be forgotten about.
If there is indeed something special about […]. If all that isn't enough, the documentation also has this: […]
NOTE: That example is not intended to "tear down" @rxaviers's data. I have no idea how many hoops he jumped through to compile that list; I just picked something to illustrate that CLDR is not straightforward. Another big thing not to miss:

3. Only 100%-approved data. There is a ton of "draft" data in the XML files that is not included in the JSON. This is probably desired, but it means that any changes submitted to CLDR may not be confirmed in the JSON for a few versions (versions come out every 6 months, I believe). CLDR is a standardization body that has the duty of analyzing the overall consequences of its changes before approving them. This delay is something for Globalize users to acknowledge.

4. Some data seems to be missing from the JSON. There are […].

OK, so hopefully it's clear that the CLDR data is complicated if you are to use it correctly. It is this way because that is the nature of globalization. I'm very glad to see @jzaefferer has been evangelizing it. It is a good way to go, but will require refactors, maybe BIG ones, to reap the benefits.

This last part is out of the scope of the jQuery Globalize project, but it's something to ponder: while it's cool to get all the data from the same source, the web as a whole could benefit if those chunks were framework-independent so they could be shared. For example, if someone is using Moment and Globalize, they could share the raw data (from CLDR) so it doesn't need to be downloaded twice. Using package managers like Bower and Component, developers could pick and choose which languages they want to include.

Sorry so long. Hope it helps.
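The inheritance point (2) can be illustrated with the simplest piece of CLDR's model, truncation-based locale fallback. This sketch is hypothetical and deliberately incomplete: real CLDR resolution also consults supplemental parentLocales overrides and alias elements, which are omitted here.

```javascript
// Sketch: truncation-based locale fallback, the simplest part of CLDR's
// inheritance model. Real resolution also handles parentLocales
// overrides (e.g. regional parents) and alias elements.
function lookup(data, locale, path) {
  var chain = [];
  // Build the fallback chain by repeatedly chopping the last subtag:
  // "pt-BR" -> "pt" -> "root".
  for (var l = locale; l; l = l.replace(/(^|[-_])[^-_]*$/, "")) {
    chain.push(l);
  }
  chain.push("root");
  for (var i = 0; i < chain.length; i++) {
    var node = data[chain[i]];
    var keys = path.split("/");
    for (var j = 0; node != null && j < keys.length; j++) {
      node = node[keys[j]];
    }
    if (node !== undefined) return node; // first hit in the chain wins
  }
  return undefined;
}
```

So a value missing in `pt-BR` would be picked up from `pt`, and failing that, from `root`.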
@williamwicks don't worry, you haven't, and thanks for your message. LDML has a more accurate definition: it distinguishes {language, region, script} in a more realistic way, whereas current Globalize keeps these three definitions kinda fuzzy (inherited from the standards it had initially chosen to follow). So, I agree with you that CLDR is one of those things that requires us to shift our thinking. By the way, thanks for pointing that out. Your module is part of the solution, and here's how I see it: https://github.com/rxaviers/globalize/wiki/Globalize-and-CLDR Ping me on IRC Freenode @rxaviers.
Note: the Date Format Patterns are not equivalent http://www.unicode.org/reports/tr35/tr35-dates.html#Date_Format_Patterns
Does anyone know the difference between "short day" (E..EEE) and "short name" (EEEEEE) in the Date_Field_Symbol_Table? What would each respective path be?
I would guess that EEEEEE maps to its own, shorter width of day name, the same way other fields get different widths as the number of letters grows.
Your suggestion makes sense, Scott. It's analogous to the order of era, year, and others.
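For reference, TR35's Date Field Symbol Table maps runs of E to day-name widths roughly as sketched below. `dayNamePath` is a hypothetical helper, and the path template mirrors the gregorian calendar layout in the CLDR JSON.

```javascript
// Sketch: map a run of "E" symbols in a date pattern to a CLDR day-name
// width, per the TR35 Date Field Symbol Table: E..EEE = abbreviated
// ("Tue"), EEEE = wide ("Tuesday"), EEEEE = narrow ("T"),
// EEEEEE = short ("Tu").
var E_WIDTHS = {
  1: "abbreviated", 2: "abbreviated", 3: "abbreviated",
  4: "wide", 5: "narrow", 6: "short"
};

function dayNameWidth(symbolRun) {
  return E_WIDTHS[symbolRun.length];
}

// Hypothetical path template following the gregorian CLDR JSON layout.
function dayNamePath(symbolRun) {
  return "dates/calendars/gregorian/days/format/" + dayNameWidth(symbolRun);
}
```

So "short day" (E..EEE) would read from the abbreviated names, and "short name" (EEEEEE) from the separate short names.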
Just stumbled upon this issue as part of figuring out why moment.js isn't already using ICU date formats :) You can use https://github.com/papandreou/node-cldr to extract data from the CLDR XML files. It takes care of resolving the crazy inheritance scheme and just gives you the resolved chunks of data as JavaScript objects. Ping me if you need assistance implementing it or if you need some CLDR data that it doesn't yet have an extraction method for.
My inter library has some helper functions for matching and adapting ICU formats: https://github.com/papandreou/inter/blob/master/lib/inter.js#L932-L981 -- could be useful. And if you want to format date intervals according to CLDR's locale-specific rules, it has support for that as well.
@papandreou, our goal is to provide a set of tools that leverage the official CLDR JSON data. Check our in-progress implementation in #172. We are using the library https://github.com/rxaviers/cldr to help with CLDR data access. That said, converting the XML data into the official JSON bindings is something we delegate to the official CLDR JSON tool. If you find any issues or want to help us with this process, you are welcome.
@rxaviers Yeah, I understood that. But since the JSON data is still incomplete, I just wanted to note that I've written a tool that extracts the data you need from the XML files.
What's missing?
The current http://www.unicode.org/Public/cldr/24/json.zip only has 39 locales as opposed to 650+ in the XML data. I don't know whether the data for the locales that are included is complete. It wasn't when I checked a few months ago.
The available JSON data for download has the top 20 languages they (the unicode.org CLDR staff) consider the "most used" languages. It contains the complete set of data per language, though, and the data has been fully resolved. You can use their official conversion tool (tools.zip) to generate the JSON representation of the languages not available in the ZIP. That ZIP contains a README with instructions on how to build the data; tools/scripts/CLDRWrapper may also be useful. Using the tool, you can opt to either generate resolved data, or unresolved data to save space (or bandwidth) (the -r false option of the conversion tool).
Oh, I missed that part. So json.zip is just a small sample of the real stuff. That's good news, I've been waiting for this to arrive :) Thanks for the info.
Yeap :). You are welcome. If you happen to find any flaws in the generated JSON, I would very much like to know too. So, please let us know.
@rxaviers do you have any insight on how to actually use the conversion tool? I tried to follow the README, but for someone who is not experienced with Java, it's just too vague. I also posted the question on SO: http://stackoverflow.com/questions/20046099/how-to-build-json-data-from-cldr-data-using-the-java-conversion-tool
@ragulka I completely understand your pain. I had the exact same issue as you. Their README's instructions are currently misleading and should be fixed according to http://unicode.org/cldr/trac/ticket/6726.
Closed by PR #172
We're currently having various issues with the culture files generated from .NET (see the label "cultures-bug"). We're considering moving to CLDR as the database for these files.
They're actually working on providing JSON files: http://unicode.org/cldr/trac/ticket/2733
We need to build a prototype to figure out if we can transform the CLDR data into something we can use here directly, or if we need to adapt the Globalize API to the CLDR data.
/cc @clarkbox @SlexAxton @Krinkle