{"id":309,"date":"2022-04-20T15:25:45","date_gmt":"2022-04-20T08:25:45","guid":{"rendered":"http:\/\/thnkandgrow.com\/?p=309"},"modified":"2022-04-24T18:52:36","modified_gmt":"2022-04-24T11:52:36","slug":"find-something-in-csv-with-ruby","status":"publish","type":"post","link":"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/","title":{"rendered":"Counting occurrences of user_name in a 13M-line CSV file"},"content":{"rendered":"\n

Problem:
Count how many times a user_name<\/code> appears in a CSV file.
The CSV file has more than 13 million rows of data (about 500 MB).<\/p>\n\n\n\n

# CSV file\n\n\"user_name\",\"suspect_flag\"\naadilab,suspect_conversations_flag\naadilab,suspect_searches_flag<\/code><\/pre>\n\n\n\n

I benchmarked a few approaches:<\/p>\n\n\n\n

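require 'benchmark'\nrequire 'csv'\n\n# Parse the file into an array of rows, then count rows whose first column matches.\n# The file is loaded before the timer starts, so only the select is measured.\ndef ben_csv\n  csv = CSV.read(\"db\/suspect_workers_flag_super_fake.csv\")\n  Benchmark.bmbm do |bm|\n    bm.report(\"select array:\") {\n      p csv.select { |user_name, _flag| user_name == \"zylafl\" }.count\n    }\n  end\nend\n\n# Read the file as one string and count regex matches.\n# The pattern is unanchored, so it also matches any user_name ending in zylafl.\ndef ben_file\n  csv_str = File.read(\"db\/suspect_workers_flag_super_fake.csv\")\n  Benchmark.bmbm do |bm|\n    bm.report(\"string regex:\") {\n      p csv_str.scan(\/zylafl\\,\/).count\n    }\n  end\nend\n\n# Shell out to grep; here the file is read inside the timed block.\ndef ben_shellscript\n  Benchmark.bmbm do |bm|\n    bm.report(\"shel script:\") {\n      p `grep -o 'zylafl,' db\/suspect_workers_flag_super_fake.csv | wc -l`\n    }\n  end\nend<\/code><\/pre>\n\n\n\n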

Result: <\/p>\n\n\n\n

Rehearsal -------------------------------------------------\nstring regex: 3\n  0.273001   0.001638   0.274639 (  0.275356)\n---------------------------------------- total: 0.274639sec\n\nRehearsal ------------------------------------------------\nshel script: \"       3\\n\"\n  3.452150   0.595941  13.451730 (  9.518582)\n-------------------------------------- total: 13.451730sec\n\nRehearsal -------------------------------------------------\nselect array: 3\n  4.287471   0.338820   4.626291 (  4.680256)\n---------------------------------------- total: 4.626291sec<\/code><\/pre>\n\n\n\n

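So reading the CSV file as a single string and scanning it with a regex is the fastest: about 0.28 s of real time, versus about 4.7 s for the parsed array and 9.5 s for grep. Note that the first two timings exclude loading the file into memory, while the grep timing includes reading it.
Some risks:<\/p>\n\n\n\n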

  • With many requests, the system loads the CSV file on each one; since there is no caching mechanism yet, this consumes a lot of resources.<\/li>
  • As the data grows, speed drops and RAM and CPU usage rise, which can leave the server Not responding<\/code>.<\/li><\/ul>\n\n\n\n
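One way to ease the memory risk above is to read the file line by line instead of loading all 500 MB at once. The following is only a minimal sketch, not something the benchmark measured: the method name count_user_name and the small temporary demo file are hypothetical, and it assumes user_name values are unquoted and contain no commas, as in the sample rows.

```ruby
require 'tempfile'

# Count rows whose first column equals user_name, reading one line at a time
# so the whole file never has to sit in memory.
# Assumption: user_name values are unquoted and contain no commas.
def count_user_name(path, user_name)
  prefix = user_name + ','
  File.foreach(path).count { |line| line.start_with?(prefix) }
end

# Small demo data standing in for the real 13M-line file (hypothetical rows).
SAMPLE = <<~CSV
  user_name,suspect_flag
  aadilab,suspect_conversations_flag
  aadilab,suspect_searches_flag
  zaadilab,suspect_searches_flag
CSV

Tempfile.create(['flags', '.csv']) do |file|
  file.write(SAMPLE)
  file.flush
  puts count_user_name(file.path, 'aadilab') # prints 2
end
```

Matching on the line prefix also avoids counting names that merely end with the search term, such as zaadilab here. It still rereads the file on every request, so it addresses RAM usage but not the missing cache.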


    Another solution from HaVS:
    Create a new CSV file that summarizes how many times each user_name appears.<\/p>\n\n\n\n

    # CSV file\n\n\"user_name\",\"count\"\nabc,3\nabd,2\nabf,2<\/code><\/pre>\n\n\n\n
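Building such a summary file could look roughly like the sketch below. It uses Ruby's standard csv library; the method names tally_user_names and write_summary, and the temporary demo files, are hypothetical and only for illustration. It assumes the source file has a header row with a user_name column.

```ruby
require 'csv'
require 'tempfile'

# Stream the source file row by row and tally the user_name column.
def tally_user_names(source_path)
  counts = Hash.new(0)
  CSV.foreach(source_path, headers: true) { |row| counts[row['user_name']] += 1 }
  counts
end

# Write the tallies as a two-column CSV: user_name,count.
def write_summary(counts, summary_path)
  CSV.open(summary_path, 'w') do |csv|
    csv << %w[user_name count]
    counts.each { |user_name, count| csv << [user_name, count] }
  end
end

# Demo rows standing in for the real source file (hypothetical data).
SOURCE_SAMPLE = <<~CSV
  user_name,suspect_flag
  abc,flag_a
  abc,flag_b
  abd,flag_a
  abc,flag_c
  abd,flag_b
CSV

Tempfile.create(['source', '.csv']) do |source|
  source.write(SOURCE_SAMPLE)
  source.flush
  Tempfile.create(['summary', '.csv']) do |summary|
    write_summary(tally_user_names(source.path), summary.path)
    puts File.read(summary.path)
  end
end
```

Parsing every row with the csv library is slow, as the benchmark shows, but this job runs in the background rather than per request, so the cost is paid once per update.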

    When a count is needed, read it from the new file.<\/p>\n\n\n\n
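Reading a value back might be sketched as follows, assuming the summary file has user_name and count columns as in the sample. The method name load_counts and the demo file are hypothetical. The idea is to load the small summary into a Hash once and answer lookups from memory.

```ruby
require 'csv'
require 'tempfile'

# Load the summary file into a Hash once; unknown names default to 0.
def load_counts(summary_path)
  counts = Hash.new(0)
  CSV.foreach(summary_path, headers: true) do |row|
    counts[row['user_name']] = row['count'].to_i
  end
  counts
end

# Demo summary matching the sample layout (hypothetical data).
SUMMARY_SAMPLE = <<~CSV
  user_name,count
  abc,3
  abd,2
  abf,2
CSV

Tempfile.create(['summary', '.csv']) do |file|
  file.write(SUMMARY_SAMPLE)
  file.flush
  counts = load_counts(file.path)
  puts counts['abc']     # prints 3
  puts counts['missing'] # prints 0
end
```

The Hash holds one entry per distinct user_name rather than one per row, so it should be far smaller than the 13M-line source.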

    A worker is needed to update the new file every night or on some other schedule.<\/p>\n\n\n\n

    Trigger a callback when a record is inserted into the db (this may be difficult if the data comes from a different database).<\/p>\n\n\n\n

    Risk: <\/p>\n\n\n\n

    • The data may not be up to date.<\/li><\/ul>\n\n\n\n

      Do you know a better approach?<\/p>\n\n\n\n

      <\/p>\n","protected":false},"excerpt":{"rendered":"

      Problem: count how many times a user_name appears in a CSV file. The CSV file has more than 13 million rows of data (about 500 MB). I benchmarked a few approaches: Result: So reading the CSV file as a single string and scanning it with a regex is the fastest. Some risks: With many requests, the system […]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[60,58],"yoast_head":"\nT\u00ecm s\u1ed1 l\u1ea7n xu\u1ea5t hi\u1ec7n c\u1ee7a user_name trong file CSV 13M line » Th?nk And Grow<\/title>\n<meta name=\"description\" content=\"Dive deep into the latest trends and insights in technology with our engaging articles. Stay informed and ahead of the curve with our expert analysis and in-depth coverage.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"T\u00ecm s\u1ed1 l\u1ea7n xu\u1ea5t hi\u1ec7n c\u1ee7a user_name trong file CSV 13M line » Th?nk And Grow\" \/>\n<meta property=\"og:description\" content=\"Dive deep into the latest trends and insights in technology with our engaging articles. 
Stay informed and ahead of the curve with our expert analysis and in-depth coverage.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/\" \/>\n<meta property=\"og:site_name\" content=\"Th?nk And Grow\" \/>\n<meta property=\"article:published_time\" content=\"2022-04-20T08:25:45+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-04-24T11:52:36+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/thnkandgrow.com\/#website\",\"url\":\"https:\/\/thnkandgrow.com\/\",\"name\":\"Th?nk And Grow\",\"description\":\"Just Do It!\",\"publisher\":{\"@id\":\"https:\/\/thnkandgrow.com\/#\/schema\/person\/4056838e18c94bc665494c1e8f9f2873\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/thnkandgrow.com\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#webpage\",\"url\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/\",\"name\":\"T\\u00ecm s\\u1ed1 l\\u1ea7n xu\\u1ea5t hi\\u1ec7n c\\u1ee7a user_name trong file CSV 13M line » Th?nk And Grow\",\"isPartOf\":{\"@id\":\"https:\/\/thnkandgrow.com\/#website\"},\"datePublished\":\"2022-04-20T08:25:45+00:00\",\"dateModified\":\"2022-04-24T11:52:36+00:00\",\"description\":\"Dive deep into the latest trends and insights in technology with our engaging articles. 
Stay informed and ahead of the curve with our expert analysis and in-depth coverage.\",\"breadcrumb\":{\"@id\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"item\":{\"@type\":\"WebPage\",\"@id\":\"https:\/\/thnkandgrow.com\/\",\"url\":\"https:\/\/thnkandgrow.com\/\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"position\":2,\"item\":{\"@id\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#webpage\"}}]},{\"@type\":\"Article\",\"@id\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#webpage\"},\"author\":{\"@id\":\"https:\/\/thnkandgrow.com\/#\/schema\/person\/4056838e18c94bc665494c1e8f9f2873\"},\"headline\":\"T\\u00ecm s\\u1ed1 l\\u1ea7n xu\\u1ea5t hi\\u1ec7n c\\u1ee7a user_name trong file CSV 13M 
line\",\"datePublished\":\"2022-04-20T08:25:45+00:00\",\"dateModified\":\"2022-04-24T11:52:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#webpage\"},\"wordCount\":274,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/thnkandgrow.com\/#\/schema\/person\/4056838e18c94bc665494c1e8f9f2873\"},\"keywords\":[\"csv\",\"ruby\"],\"articleSection\":[\"Technology\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/thnkandgrow.com\/blog\/2022\/04\/20\/find-something-in-csv-with-ruby\/#respond\"]}]},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/thnkandgrow.com\/#\/schema\/person\/4056838e18c94bc665494c1e8f9f2873\",\"name\":\"kokorolx\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/thnkandgrow.com\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/s3.amazonaws.com\/thnkandgrow.com-media\/wp-content\/uploads\/2023\/05\/13223538\/Amazon-EC2.jpg\",\"contentUrl\":\"https:\/\/s3.amazonaws.com\/thnkandgrow.com-media\/wp-content\/uploads\/2023\/05\/13223538\/Amazon-EC2.jpg\",\"width\":750,\"height\":375,\"caption\":\"kokorolx\"},\"logo\":{\"@id\":\"https:\/\/thnkandgrow.com\/#personlogo\"},\"sameAs\":[\"https:\/\/thnkandgrow.com\"],\"url\":\"https:\/\/thnkandgrow.com\/blog\/author\/kokoro-lehoanggmail-com\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/posts\/309"}],"collection":[{"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/comments?post=309"}],"version-history":[{"count":7,"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/posts\/309\/revisions"}],"predecessor-version":[{"id":321,"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/posts\/309\/revisions\/321"}],"wp:attachment":[{"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/media?parent=309"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/categories?post=309"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/thnkandgrow.com\/wp-json\/wp\/v2\/tags?post=309"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}