UTF-8 character support
If I go to the "Text" tab when viewing a document that contains accents (like any French-language document), all the accented characters become question marks ("?"). This also affects the faceting on Analyze > View Entities: cities like "Montréal" show up as "Montr". This is a critical issue for my jurisdiction (Quebec), where the official language is French.
I know Solr can be configured to handle these characters - I don't know if you're using another library for text extraction before sending it to Solr.
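For anyone trying to reproduce the symptoms, here is a minimal Python sketch of how this kind of damage typically arises. It is only an illustration under assumptions: I don't know what DocumentCloud's pipeline actually does, and the function names below are hypothetical. It shows that forcing text through ASCII turns "é" into "?", that a naive ASCII-only tokenizer cuts "Montréal" down to "Montr", and that a UTF-8 round trip keeps the text intact.

```python
import re


def to_ascii_lossy(text: str) -> str:
    """Encode to ASCII, replacing every unrepresentable character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def first_ascii_token(text: str) -> str:
    """Return the leading run of ASCII letters, as a naive ASCII-only tokenizer might."""
    match = re.match(r"[A-Za-z]+", text)
    return match.group(0) if match else ""


def utf8_round_trip(text: str) -> str:
    """Encode to UTF-8 and decode again; no characters are lost."""
    return text.encode("utf-8").decode("utf-8")


# "\u00e9" is the precomposed "é", written as an escape so the example
# does not depend on the encoding of this source file.
city = "Montr\u00e9al"

print(to_ascii_lossy(city))           # Montr?al
print(first_ascii_token(city))        # Montr
print(utf8_round_trip(city) == city)  # True
```

If the real cause is similar, the fix would be to keep the text as UTF-8 from extraction through indexing rather than passing it through an ASCII-only step.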
This should be resolved now: we can process many non-English documents, and we encode the text files as UTF-8.
To properly process the docs in a non-English language, click the pen/edit icon on the upload modal and choose the language before uploading. For existing documents, you can change their Document Info to indicate another language and then reprocess the text. You can find more info in our help docs: https://www.documentcloud.org/help/modification
Ahh, we should touch base about that. We've tried to be very clear with folks who are likely to be uploading non-English-language documents that we explicitly do not support them, for a number of technical reasons that go well beyond character sets. At this stage in our beta we ask that you limit your use of DocumentCloud to English-language documents. Drop us a line at firstname.lastname@example.org to discuss further -- I can't tell from your handle what your user id is.