How can we make DocumentCloud better?

UTF-8 character support

If I go to the "Text" tab when viewing a document that contains accented characters (like any French-language document), all the accented characters become question marks: "?". This also affects the faceting on Analyze > View Entities; cities like "Montréal" show up as "Montr". This is a critical issue for my jurisdiction (Quebec), where the official language is French.

I know Solr can be configured to handle these characters - I don't know if you're using another library for text extraction before sending it to Solr.
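
To illustrate the symptoms, here is a minimal Python sketch of one way they could arise. It is only an assumption about the cause: nothing in this thread says how DocumentCloud extracts or indexes text, and the helper names `to_ascii_lossy` and `first_ascii_word` are hypothetical. The idea is that a lossy ASCII conversion turns each accented character into "?", and a tokenizer that only accepts ASCII letters then cuts the word off at that point.

```python
import re


def to_ascii_lossy(text: str) -> str:
    """Force text into ASCII, replacing each unencodable character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def first_ascii_word(text: str) -> str:
    """Return the leading run of ASCII letters, as a naive tokenizer might."""
    match = re.match(r"[A-Za-z]+", text)
    return match.group(0) if match else ""


city = "Montr\u00e9al"  # "Montréal", with a precomposed e-acute

garbled = to_ascii_lossy(city)          # the accent is lost at this step
truncated = first_ascii_word(garbled)   # the entity name is cut at the "?"

# A UTF-8 round trip keeps the accented character intact.
preserved = city.encode("utf-8").decode("utf-8")

print(garbled, truncated, preserved)
```

If something like this is happening, keeping the text as UTF-8 at every step (extraction, storage, indexing) would avoid both the question marks and the truncated entity names.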

13 votes
    oxford.tuxedo shared this idea
    completed  ·  Justin Reese (Admin, DocumentCloud) responded

    This should be resolved, as we can process many non-English documents and encode the text files as UTF-8.

    To properly process the docs in a non-English language, click the pen/edit icon on the upload modal and choose the language before uploading. For existing documents, you can change their Document Info to indicate another language and then reprocess the text. You can find more info in our help docs: https://www.documentcloud.org/help/modification

    1 comment

      • Amanda Hickman (Admin, DocumentCloud) commented

        Ahh, we should touch base about that. We've tried to be very clear with folks who are likely to be uploading non-English language documents that we explicitly do not support them for a number of different technical reasons well beyond just character sets. At this stage in our beta we ask that you limit your use of DocumentCloud to English language documents. Drop us a line at support@documentcloud.org to discuss further -- I can't tell from your handle what your user id is.
